pipeline

pipeline
	1 - predict	=> r_pc			^
	2 - icache	=> r_data		| flow control by r_ops fill
	3 - decode	=> r_ops		v
	4 - rename	=> r_stata (in issue)	^
	5 - issue	=> r_schedule0		|
	6 - regfile	=> r_mux_idx		| flow control by s_shift
	7 - bypass	=> r_reg[ab]		|
	8 - execute	=> r_fault & fast=>reg	v
	9 - execute
	10- execute	=> slow=>reg

A branch first enters the schedule on cycle 4 and cannot leave until after
cycle 8 shifts it out.  Thus, branches are in the window for 5 cycles and to
sustain full execution bandwidth, the schedule must be at least 5*decoders
deep.  Any dependent delay only adds to this, thus 6*decoders is the minimum.

Obviously, more is better. Consider a vector product:
	for (int i = 0; i < n; ++i) out[i] = a[i] * b[i];
The load and multiply add up to a latency of 6. Until the store has been
checked for segfault, that adds up to 11 cycles. This doesn't even consider
an L1 cache miss, in which case you would need even more depth.


On an L1 miss, the following pipeline is executed:
	r_reg[ab]
	dcache regin
	r_miss			r_cyc/r_stb
	r_issued
	r_schedule0
	r_mux_idx
	r_reg[ab]		ack_i => dcache wen_i
	dcache regin

... so a cache miss will cost 6 cycles if dbus responds within 4 cycles