pipeline
 1 - predict => r_pc                  ^
 2 - icache  => r_data                | flow control by r_ops fill
 3 - decode  => r_ops                 v
 4 - rename  => r_stata (in issue)    ^
 5 - issue   => r_schedule0           |
 6 - regfile => r_mux_idx             | flow control by s_shift
 7 - bypass  => r_reg[ab]             |
 8 - execute => r_fault & fast=>reg   v
 9 - execute
10 - execute => slow=>reg

A branch first enters the schedule on cycle 4 and cannot leave until cycle 8
has shifted it out. Branches are therefore in the window for 5 cycles, and to
sustain full execution bandwidth the schedule must be at least 5*decoders
deep. Any dependent delay only adds to this, so 6*decoders is the practical
minimum. More is better. Consider a vector product:
  for (int i = 0; i < n; ++i) out[i] = a[i] * b[i];
The load and multiply add up to a latency of 6. The store stays in the window
until it has been checked for a segfault, which brings the total to 11 cycles.
This does not even account for an L1 cache miss, which would require still
more depth.
On an L1 miss, the following pipeline is executed:
  r_reg[ab]
  dcache regin
  r_miss         r_cyc/r_stb
  r_issued
  r_schedule0
  r_mux_idx
  r_reg[ab]      ack_i => dcache wen_i
  dcache regin
... so a cache miss will cost 6 cycles if dbus responds within 4 cycles.