2 ■
Solutions to Case Studies and Exercises
Appendix A Solutions
(by Rui Ma and Gregory D. Peterson)
A.1 The first challenge of this exercise is to obtain the instruction mix. The instruction
frequencies in Figure A.27 must add to 100, although gap and gcc add to 100.2 and
99.5 percent, respectively, because of rounding error. Because each total must in
reality be 100, we should not attempt to scale the per instruction average frequen-
cies by the shown totals of 100.2 and 99.5. However, in computing the average fre-
quencies to one significant digit to the right of the decimal point, we should be
careful to use an unbiased rounding scheme so that the total of the averaged fre-
quencies is kept as close to 100 as possible. One such scheme is called round to
even, which makes the least significant digit always even. For example, 0.15 rounds
to 0.2, but 0.25 also rounds to 0.2. For a summation of terms, round to even will not
accumulate an error as would, for example, rounding up where 0.15 rounds to 0.2
and 0.25 rounds to 0.3.
For gap and gcc the average instruction frequencies are shown in Figure S.12.
Instruction
load
store
add
sub
mul
compare
load imm
cond branch
cond move
jump
call
return
shift
and
or
xor
other logical
Average of gap and gcc %
25.8
11.8
20.0
2.0
0.8
4.4
3.6
10.7
0.5
0.8
1.1
1.1
2.4
4.4
8.2
2.0
0.2
Figure S.12 MIPS dynamic instruction mix average for gap and gcc.
The exercise statement gives CPI information in terms of four major instruction
categories, with two subcategories for conditional branches. To compute the
Copyright © 2012 Elsevier, Inc. All rights reserved.
Appendix A Solutions
■ 3
average CPI we need to aggregate the instruction frequencies in Figure S.12 to
match these categories. This is the second challenge, because it is easy to mis-
categorize instructions. The four main categories are ALU, load/store, condi-
tional branch, and jumps. ALU instructions are any that take operands from the
set of registers and return a result to that set of registers. Load/store instructions
access memory. Conditional branch instructions must be able to set the program
counter to a new value based on a condition. Jump-type instructions set the pro-
gram counter to a new value no matter what.
With the above category definitions, the frequency of ALU instructions is the
sum of the frequencies of the add, sub, mul, compare, load imm (remember, this
instruction does not access memory, instead the value to load is encoded in a field
within the instruction itself), cond move (implemented as an OR instruction
between a controlling register and the register with the data to move to the desti-
nation register), shift, and, or, xor, and other logical for a total of 48.5%. The fre-
quency of load/store instructions is the sum of the frequencies of load and store
for a total of 37.6%. The frequency of conditional branches is 10.7%. Finally, the
frequency of jumps is the sum of the frequencies of the jump-type instructions,
namely jump, call, and return, for a total of 3.0%.
Now
∑=
Instruction category frequency Clock cycles for category
×
)
(
+
0.367
) 1.4(
)
+
(
0.107
) 0.6(
(
) 2.0(
)
1 0.6–(
) 1.5(
)
)
+
+
(
0.03
)
1.2(
)
+
Effective CPI
categories
(
) 1.0(
0.485
1.24
=
=
A.2 See the solution for A.1 (above) for a discussion regarding the solution methodol-
ogy for this exercise.
As with problem A.1, the frequency of ALU instructions is the sum of the fre-
quencies of the add, sub, mul, compare, load imm (remember, this instruction
does not access memory, instead the value to load is encoded in a field within the
instruction itself ), cond move (implemented as an OR instruction between a con-
trolling register and the register with the data to move to the destination register),
shift, and, or, xor, and other logical for a total of 51.1%. The frequency of load/
store instructions is the sum of the frequencies of load and store for a total of
35.0%. The frequency of conditional branches is 11.0%. Finally, the frequency of
jumps is the sum of the frequencies of the jump-type instructions, namely jump,
call, and return, for a total of 2.8%.
×
11.0% 60% 1.5
11.0% 40% 1.2
2.8% 1.23
35.0% 2.0
+
×
×
×
×
×
+
×
CPI
=
1.0
51.1% 1.4
+
+
=
Copyright © 2012 Elsevier, Inc. All rights reserved.
4 ■
Solutions to Case Studies and Exercises
Instruction
load
store
add
sub
mul
compare
load imm
cond branch
cond move
jump
call
return
shift
and
or
xor
other logical
Average of gzip and perlbmk %
24.4
10.6
21.8
3.8
–
5.2
1.6
11.0
1.5
1.2
0.8
0.8
1.3
5.3
6.8
3.6
0.2
Figure S.13 MIPS dynamic instruction mix average for gzip and perlbmk.
A.3 This exercise is similar to A.1 and A.2, but focuses on floating point intensive
programs.
Instruction
Load
Store
Add
Sub
Mul
Compare
load imm
cond branch
cond move
Jump
Call
Return
Shift
Figure S.14 Continued
Average of gzip and perlbmk %
9.8
2.4
17.8
3.0
0.6
–
5.6
1.0
–
–
–
–
1.0
Copyright © 2012 Elsevier, Inc. All rights reserved.
And
Or
Xor
other logical
load FP
store FP
add FP
sub FP
mul FP
div FP
move reg-reg FP
compare FP
cond mov FP
other FP
Appendix A Solutions
■ 5
0.9
4.1
–
–
16.5
11.6
8.6
6.2
8.2
0.2
1.4
0.4
0.4
0.8
Figure S.14 MIPS dynamic instruction mix for lucas and swim.
ALU instructions: (17.8% + 3.0% + 0.6% + 5.6% + 1.0% + 0.9% + 4.1%) = 33.0%
Load-stores: (9.4% + 2.4%) = 11.8%
Conditional branches: 1.0%
Jumps: 0%
FP add: (8.6% + 6.2%) = 14.8%
Load-store FP: (16.5% + 11.6%) = 28.1%
Other FP: (1.4% + 0.4% + 0.4% + 0.8%) = 3.0%
CPI = 1.0 × 33.0% + 1.4 × 11.8% + 2.0 × 1.0% × 60% + 1.5 × 1.0% × 40% + 6.0
× 8.2% + 4.0 × 14.8% + 20 × 0.2% + 1.5 × 28.1% + 2.0 × 3.0% = 2.12
A.4 This exercise is similar to A.3.
Instruction
Load
Store
Add
Sub
Mul
Compare
load imm
cond branch
cond move
Jump
Figure S.15 Continued
Average of applu and art %
16.0
1.4
30.2
1.2
1.2
3.7
6.8
7.0
0.2
–
Copyright © 2012 Elsevier, Inc. All rights reserved.
6 ■
Solutions to Case Studies and Exercises
Call
Return
Shift
And
Or
Xor
other logical
load FP
store FP
add FP
sub FP
mul FP
div FP
move reg-reg FP
compare FP
cond mov FP
other FP
–
–
0.4
–
1.0
1.6
–
11.7
4.4
3.4
1.4
6.4
0.4
0.8
0.4
0.3
–
Figure S.15 MIPS dynamic instruction mix for applu and art.
ALU instructions: (30.2% + 1.2% + 1.2% + 3.7% + 6.8% + 0.2% + 0.4% +
1.0% + 1.6%) = 46.3%
Load-stores: (16.0% + 1.4%) = 17.4%
Conditional branches: 7.0%
Jumps: 0%
FP add: (3.4% + 1.4%) = 4.8%
Load-store FP: (11.7% + 4.4%) = 16.1%
Other FP: (0.8% + 0.4% + 0.3%) = 1.5%
CPI = 1.0 × 46.3% + 1.4 × 17.4% + 2.0 × 7.0% × 60% + 1.5 × 7.0% × 40% +
6.0 × 6.4% + 4.0 × 4.8% + 20 × 0.4% + 1.5 × 16.1% + 2.0 × 1.5% = 1.76
A.5 Take the code sequence one line at a time.
1. A = B + C
2. B = A + C
= B + C + C
3. D = A – B
= (B + C) – (B + C + C)
= – C
;The operands here are given, not computed
by the code, so copy propagation will not
transform this statement.
;Here A is a computed value, so transform
the code by substituting
A = B + C to get
;Now no operand is computed
;Both operands are computed so substitute
for both to get
;Simplify algebraically to get
;This is a given, not computed, operand
Copyright © 2012 Elsevier, Inc. All rights reserved.
Appendix A Solutions
■ 7
Copy propagation has increased the work in statement 2 from one addition to
two. It has changed the work in statement 3 from subtraction to negation, possi-
bly a savings. The above suggests that writing optimizing compilers means incor-
porating sophisticated trade-off analysis capability to control any optimizing
steps, if the best results are to be achieved.
A.6 No solution provided.
A.7
a. The point of this exercise is to highlight the value of compiler optimizations.
In this exercise registers are not used to hold updated values; values are stored
to memory when updated and subsequently reloaded. Because all the
addresses of all the variables (including all array elements) can fit in 16 bits,
we can use immediate instructions to construct addresses. Figure S.16 shows
one possible translation of the given C code fragment.
ex_a_7:
loop:
DADD
SW
LD
DSLL
DADDI
LD
LD
DADD
LD
DSLL
DADDI
SD
LD
DADDI
SD
LD
DADDI
BNEZ
R1,R0,R0
7000(R0),R1
R1,7000(R0)
R2,R1,#3
R3,R2,#3000
R4,0(R3)
R5,5000(R0)
R6,R4,R5
R1,7000(R0)
R2,R1,#3
R7,R2,#1000
0(R7),R6
R1,7000(R0)
R1,R1,#1
7000(R0),R1
R1,7000(R0)
R8,R1,#-101
R8,loop
;R0 = 0, initialize i = 0
;store i
;get value of i
;R2 = word offset of B[i]
;add base address of B to R2
;load B[i]
;load C
;B[i] + C
;get value of i
;R2 = word offset of A[i]
;add base address of A to R2
;A[i] ← B[i] + C
;get value of i
;increment i
;store i
;get value of i
;is counter at 101?
;if not 101, repeat
Figure S.16 MIPS code to implement the C loop without using registers to hold
updated values for future use or to pass values to a subsequent loop iteration.
The number of instructions executed dynamically is the number of initializa-
tion instructions plus the number of instructions in the loop times the number
of iterations:
Instructions executed = 2 + (16 × 101) = 1618
The number of memory-data references is a count of the load and store
instructions executed:
Memory-data references executed = 0 + (8 × 101) = 808
Copyright © 2012 Elsevier, Inc. All rights reserved.
8 ■
Solutions to Case Studies and Exercises
Since MIPS instructions are 4 bytes in size, code size is the number of
instructions times 4:
Instruction bytes = 4 × 18 = 72
ex_b_7:
movq
movq
movq
$0x0,%rax
$0x0,%rbp
%rax,0x1b58(%rbp) # store i to location 7000
# rax = 0, initialize i = 0
# base pointer = 0
($1b58)
movq
loop:
movq
mov
shl
0x1388(%rbp),%rdx # load C from 5000 ($1388)
0x1b58(%rbp),%rax # get value of i
%rax,%rbx
$0x3,%rbx
# rbx gets copy of i
# rbx now has i * 8
movq 0x0bb8(%rbx),%rcx # load B[i] (3000 = $0bb8)
add
mov
shl
movq
to rcx
# B[i] + C
%rdx,%cdx
# rbx gets copy of i
%rax,%rbx
$0x3,%rbx
# rbx now has i * 8
%rcx,0x03e8(%rbx) # A[i] ← B[i] + C
(base address of A is 1000)
movq 0x1b58(%rbp),%rax # get value of i
add $0x1,%rax
movq %rax,0x1b58(%rbp) # save i
cmpq $0x0065,%rax
jae loop
# increment i
# is counter at 101 ($0065)?
# if not 101, repeat
Figure S.17 x86 code to implement the C loop without using registers to hold
updated values for future use or to pass values to a subsequent loop iteration.
b. This problem is similar to part (a), but with x86-64 instructions instead.
The number of instructions executed dynamically is the number of initializa-
tion instructions plus the number of instructions in the loop times the number
of iterations:
Instructions executed = 3 + (13 × 101) = 1316
A.8 This problem focuses on the challenges related to instruction set encoding. The
length of the instructions are 12 bits. There are 32 registers so the length of the
register field will be 5 bits for each register operand. We use addr[11] to addr[0] to
represent the 12 bits of the address as shown in the tables below.
a.
In the first case, we need to support 3 two-address instructions. These can be
encoded as follows:
3 two-address instr.
Other instructions
addr[11:10]
'00', '01', '10'
'11'
addr[9:5]
'00000' to '11111'
'00000' to '11111'
addr[4:0]
'00000' to '11111'
'00000' to '11111'
Copyright © 2012 Elsevier, Inc. All rights reserved.
Appendix A Solutions
■ 9
Hence, for the one-address and two-address instructions must be mapped to
the remaining 10 bits with the upper two bits encoded as ‘11’. The one-
address instructions are then encoded with the addr[9:5] field using ‘00000’
to ‘11101’ for the 30 instruction types, leaving the addr[4:0] field or the regis-
ter operand. This leaves the patterns with ‘11’ followed by ‘11110’ in the
upper seven bits and ‘00000’ to ‘11111’ in the lower five bits to encode 32 of
the zero-address instructions. The remaining zero-address instructions can be
encoded using ‘11’ followed by ‘11111’ in the upper seven bits and ‘0000’ to
‘01100’ in the lower five bits to encode the other 13 zero-address instructions.
3 two-address instr.
30 one-address instr.
45 zero-address instr.
addr[11:10]
'00', '01', '10'
'11'
'11'
'11'
addr[9:5]
'00000' to '11111'
'00000' to '11101'
'11110'
'11111'
addr[4:0]
'00000' to '11111'
'00000' to '11111'
'00000' to '11111'
'00000' to '01100'
Hence, it is possible to have the above instruction encodings.
c.
b. This scenario is similar to above, with the two-address instructions encoded
with the upper two bits as ‘00’ to ‘01’. The one-address instructions can be
encoded with the upper two bits as ‘11’ and using ‘00000’ to ‘11110’ to dif-
ferentiate the 31 one-address instructions. The pattern ‘11’ and ‘11111’ in the
upper seven bits is then used to encode the zero-address instructions, with the
lower five bits to differentiate them. We can only encode 32 of these patterns,
not the 35 that are required; hence, it is impossible to have these instruction
encodings.
In this part, we already have encoded three two-address instructions as above.
Moreover, we have 24 zero-address instructions encoded as below with
‘00000’ in the addr[9:5] field and ‘00000’ to ‘10111’ in the addr[4:0] field.
We would like to fit as many one-address instructions as we can. Note that we
cannot take advantage of any of the unused encodings with ‘11’ and ‘00000’
in the upper seven bits because we would need to use the entire addr[4:0]
field for the single address of the operand. Hence, we can use the encodings
with ‘00001’ to ‘11111’ in addr[9:5] for the one-address instructions and save
the last five bits for the register address. Because this includes 31 patterns, we
can support up to 31 one-address instructions. Note that we could also add up
to eight additional zero-address instructions if we wish as well.
3 two-address instr.
24 zero-address instr.
X one-address instr.
addr[11:10]
'00', '01', '10'
'11'
'11'
addr[9:5]
'00000' to '11111'
'00000'
'00001' to '11111'
addr[4:0]
'00000' to '11111'
'00000' to '10111'
'00000' to '11111'
Copyright © 2012 Elsevier, Inc. All rights reserved.