
Generating highest-performance implementations with coherent TL ports #26

Closed
jerryz123 opened this issue Sep 14, 2024 · 40 comments

@jerryz123

Can you help me with the compile flags for VexiiRiscv to generate a system with cache-coherent TL ports? The default seems to build a "cacheless config".

In general, what is the best set of flags for maximum performance? It would be good to have these documented somewhere.

The set of configuration flags is very large (which is great), but it would be good to have a "recommended" set of options for maximum performance and capability.

Dolu1990 added a commit that referenced this issue Sep 14, 2024
@Dolu1990
Member

Hi,

I just pushed an additional generator. Now you can use:
vexiiriscv.GenerateTilelink
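(Invoked like the generator runs shown later in this thread, e.g. sbt "Test/runMain vexiiriscv.GenerateTilelink <flags>"; the exact sbt incantation is an assumption and may differ per setup.)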

Here is an example parameter list which should give good performance (Rocket-like / single issue):

--with-fetch-l1 --with-lsu-l1 --lsu-l1-coherency --fetch-l1-hardware-prefetch=nl --fetch-l1-refill-count=2 --lsu-software-prefetch --lsu-hardware-prefetch rpt --performance-counters 9 --regfile-async --lsu-l1-store-buffer-ops=32 --lsu-l1-refill-count 4 --lsu-l1-writeback-count 4 --lsu-l1-store-buffer-slots=4 --with-mul --with-div --allow-bypass-from=0 --with-lsu-bypass --with-rva --with-supervisor --fetch-l1-ways=4 --fetch-l1-mem-data-width-min=128 --lsu-l1-ways=4 --lsu-l1-mem-data-width-min=128 --xlen=64 --with-rvc --with-rvf --with-rvd --with-btb --with-ras --with-gshare

You can add late-ALU support with:
--with-late-alu

And dual issue with:
--decoders=2 --lanes=2 --with-dispatcher-buffer

but it would be good to have a "recommended" set of options for maximum performance and capability.

Yes, right, I need to document this :D

Keep in mind that this configuration has a specific memory region setup: boot at 0x80000000, with IO assumed at 0x10000000-0x1FFFFFFF.

@jerryz123
Author

Thanks. This looks great. I have some questions:

I see two groups of signals named FetchL1Plugin and LsuL1Plugin. These are the requests from the core's L1I$ and L1D$ to the outer memory system?

What are the LsuPlugin signals? These appear to be additional memory request signals from the core to the outer memory system. What are they for? And what do the fromHart and uopId signals mean?

Keep in mind that this configuration has a specific memory region setup: boot at 0x80000000, with IO assumed at 0x10000000-0x1FFFFFFF.

Does this mean the core will immediately start fetching from 0x80000000 after reset? Can the core fetch from any addresses < 0x80000000?

Is the address range 0x0 - 0xFFFFFFF used for anything? What happens when trying to fetch from or access these addresses?

Dolu1990 added a commit that referenced this issue Sep 14, 2024
@Dolu1990
Member

Dolu1990 commented Sep 14, 2024

Hi,

I just pushed a fix which now sets the proper memory region attributes.

I see two groups of signals named FetchL1Plugin and LsuL1Plugin. These are the requests from the core's L1I$ and L1D$ to the outer memory system?

Right. Note that IO accesses are done through another plugin, the LsuPlugin.

What are the LsuPlugin signals? These appear to be additional memory request signals from the core to the outer memory system. What are they for?

IO accesses are done from the LsuPlugin.
The LsuPlugin's jobs are:

  • Bind the MMU TLB lookup logic
  • Handle PMA checks
  • Handle load/store related traps
  • Handle atomic accesses (through memory-coherency cache line locking)
  • Generate cache requests and handle their responses
  • Implement the store buffer
  • Arbitrate cache requests (from AGU, MMU refill, prefetch, store buffer)

The LsuL1Plugin implements:

  • Tag/bank accesses and checks
  • Memory coherency / probes
  • Refill slots
  • Writeback slots

Does this mean the core will immediately start fetching from 0x80000000 after reset?

Yes

Can the core fetch from any addresses < 0x80000000?

The IO region is currently set as executable => 0x10000000-0x1FFFFFFF can be executed

Note that you can feed a custom region mapping, for instance:
--region base=80000000,size=80000000,main=1,exe=1 --region base=10000000,size=10000000,main=0,exe=1
which is equivalent to the default config (main=1 marks cacheable main memory, exe=1 marks the region as executable).

And what do the fromHart and uopId signals mean?

fromHart is used by the simulation lockstep checker to figure out if the access comes from a regular load/store or from the MMU refill.

uopId is used by the simulation lockstep checker to keep track of which instruction a given access is coming from (to check that it is fine).

More generally, VexiiRiscv is verified against RVLS (itself based on Spike) in a lockstep manner, so there are quite a few probes here and there, but mostly there is the WhiteboxerPlugin, which exposes a looooooot of VexiiRiscv behaviour to the SpinalHDL simulation API.
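For readers unfamiliar with that flow, here is a minimal sketch of what sampling such probes through the SpinalHDL simulation API can look like. The Dut component and its probe names below are hypothetical stand-ins, not VexiiRiscv's real hierarchy:

  import spinal.core._
  import spinal.core.sim._

  object ProbeDemo extends App {
    // Hypothetical DUT exposing a commit probe, standing in for a whiteboxed design.
    class Dut extends Component {
      val commitValid = out Bool()
      val commitPc    = out UInt(40 bits)
      commitValid := True
      commitPc    := U(0x80000000L, 40 bits)
    }

    SimConfig.compile(new Dut).doSim { dut =>
      dut.clockDomain.forkStimulus(period = 10)
      // Sample the probes on every clock sampling, as a lockstep checker would.
      dut.clockDomain.onSamplings {
        if (dut.commitValid.toBoolean) println(f"commit pc=0x${dut.commitPc.toLong}%x")
      }
      dut.clockDomain.waitSampling(100)
    }
  }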

Is the address range 0x0 - 0xFFFFFFF used for anything? What happens when trying to fetch from or access these addresses?

It should trap, in every case.

Let me know if anything goes wrong ^^

@jerryz123
Author

Ah ok, so there's enough flexibility that I can specify precisely what the memory map of the SoC is with the --region flags. That seems good enough for me.

If I'm not doing lockstep cosim, can I just ignore those signals?

Also, for multicore configs, you typically need a hartId wire into the core to differentiate their mhartid CSRs. How is this done with VexiiRiscv?

Dolu1990 added a commit that referenced this issue Sep 14, 2024
@Dolu1990
Member

Dolu1990 commented Sep 14, 2024

If I'm not doing lockstep cosim, can I just ignore those signals?

Yes. The thing is, with this GenerateTilelink, you get VexiiRiscvTilelink.v, which doesn't expose them in the io of the toplevel anymore.

Also, for multicore configs, you typically need a hartId wire into the core to differentiate their mhartid CSRs. How is this done with VexiiRiscv?

So far, this was an internal constant in the CPU (loaded through parameters).
I just pushed an option: you can now do --with-hart-id-input and it will give you a 32-bit input signal "hartId".

@jerryz123
Author

Is it possible to make the default boot address configurable? I typically boot the cores into a bootrom at 0x10000, where they hang until interrupted to jump to 0x80000000.

@Dolu1990
Member

Yes,
--reset-vector=0x10000

Just don't forget to specify all the --region flags, or else it will trap endlessly ^^

@Dolu1990
Member

Also, if you specify any --region, all regions need to be specified (specifying one loses the defaults).
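For example, an illustrative minimal set for the bootrom-at-0x10000 flow above (the region sizes here are assumptions, not a complete SoC map):

--reset-vector=0x10000 --region base=10000,size=10000,main=0,exe=1 --region base=80000000,size=80000000,main=1,exe=1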

@jerryz123
Author

Thanks. Much closer now. Is there a way to get trace log information in the generated Verilog? Through a printf or something?

Or to have any signal in the generated Verilog that corresponds to a PC trace?

@Dolu1990
Member

There are no Verilog printf traces; instead, the sims I run use SpinalSim probes in the DUT.
I'm travelling; once back I will check for a solution.

@jerryz123
Author

Thanks. If you expose IO for a PC trace, I can write the printf myself in the wrapper.

Dolu1990 added a commit that referenced this issue Sep 17, 2024
…xer-outputs

Add WhiteboxerPlugin_logic_commits_xxx to trace commits activity easily
@Dolu1990
Member

Dolu1990 commented Sep 17, 2024

Hi,

I added an option to get all the whitebox signals as outputs: --with-whiteboxer-outputs

So, with:

--with-fetch-l1 --with-lsu-l1 --lsu-l1-coherency --fetch-l1-hardware-prefetch=nl --fetch-l1-refill-count=2 --lsu-software-prefetch --lsu-hardware-prefetch rpt --performance-counters 9 --regfile-async --lsu-l1-store-buffer-ops=32 --lsu-l1-refill-count 4 --lsu-l1-writeback-count 4 --lsu-l1-store-buffer-slots=4 --with-mul --with-div --allow-bypass-from=0 --with-lsu-bypass --with-rva --with-supervisor --fetch-l1-ways=4 --fetch-l1-mem-data-width-min=128 --lsu-l1-ways=4 --lsu-l1-mem-data-width-min=128 --xlen=64 --with-rvc --with-rvf --with-rvd --with-btb --with-ras --with-gshare --with-late-alu --decoders=2 --lanes=2 --with-dispatcher-buffer --trace-all --dual-sim --with-whiteboxer-outputs

I get:

5.04 CoreMark/MHz

The outputs which should interest you are:

  output wire          WhiteboxerPlugin_logic_commits_ports_0_valid /* verilator public */ ,
  output wire [39:0]   WhiteboxerPlugin_logic_commits_ports_0_pc /* verilator public */ ,
  output wire [31:0]   WhiteboxerPlugin_logic_commits_ports_0_uop /* verilator public */ ,
  output wire          WhiteboxerPlugin_logic_commits_ports_1_valid /* verilator public */ ,
  output wire [39:0]   WhiteboxerPlugin_logic_commits_ports_1_pc /* verilator public */ ,
  output wire [31:0]   WhiteboxerPlugin_logic_commits_ports_1_uop /* verilator public */ ,

ports_0 and ports_1 are ordered (0 meaning the oldest instruction).
uop maps 1-to-1 to a RISC-V instruction.

Note that I just added those WhiteboxerPlugin_logic_commits_ports; they should normally work, but aren't tested much.
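For the printf-in-the-wrapper approach mentioned earlier, a minimal Chisel sketch of such a tracer could look like the following (the module and its port bundle are hypothetical; only the commented signal names come from the generated Verilog above):

  import chisel3._

  // Hypothetical tracer fed from the VexiiRiscvTilelink.v commit outputs.
  class VexiiCommitTracer extends Module {
    val io = IO(new Bundle {
      val valid = Input(Bool())      // WhiteboxerPlugin_logic_commits_ports_0_valid
      val pc    = Input(UInt(40.W))  // WhiteboxerPlugin_logic_commits_ports_0_pc
      val uop   = Input(UInt(32.W))  // WhiteboxerPlugin_logic_commits_ports_0_uop
    })
    // Print one line per committed instruction during simulation.
    when(io.valid) {
      printf("commit pc=0x%x uop=0x%x\n", io.pc, io.uop)
    }
  }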

@Dolu1990
Member

Keep in mind, I'm testing VexiiRiscv only against Verilator and hardware, not against x-prop-aware simulators.

If things don't work, you can send me a wave (VCD or FST, ideally FST) and the software you run (.asm or .elf).

@jerryz123
Author

jerryz123 commented Sep 18, 2024

I think it's close: the VexiiRiscvTile seems to generate a TL fetch to 0x10000, but something gets stuck afterwards, likely due to x-prop.
I've attached some waves here:
vexii.tar.gz

For reference, I'm using the flags:

Test/runMain vexiiriscv.GenerateTilelink --with-fetch-l1 --with-lsu-l1 --lsu-l1-coherency --fetch-l1-hardware-prefetch=nl --fetch-l1-refill-count=2 --lsu-software-prefetch --lsu-hardware-prefetch rpt --performance-counters 9 --regfile-async --lsu-l1-store-buffer-ops=32 --lsu-l1-refill-count 4 --lsu-l1-writeback-count 4 --lsu-l1-store-buffer-slots=4 --with-mul --with-div --allow-bypass-from=0 --with-lsu-bypass --with-rva --with-supervisor --fetch-l1-ways=4 --fetch-l1-mem-data-width-min=128 --lsu-l1-ways=4 --lsu-l1-mem-data-width-min=128 --xlen=64 --with-rvc --with-rvf --with-rvd --with-btb --with-ras --with-gshare --with-late-alu --decoders=2 --lanes=2 --with-dispatcher-buffer --with-hart-id-input --reset-vector=0x10000 --with-whiteboxer-outputs --region base=3000,size=1000,main=0,exe=1 --region base=2010000,size=1000,main=0,exe=1 --region base=1000,size=1000,main=0,exe=1 --region base=10020000,size=1000,main=0,exe=1 --region base=2000000,size=10000,main=0,exe=1 --region base=C000000,size=4000000,main=0,exe=1 --region base=0,size=1000,main=0,exe=1 --region base=10000,size=10000,main=0,exe=1 --region base=100000,size=1000,main=0,exe=1 --region base=110000,size=1000,main=0,exe=1 --region base=80000000,size=10000000,main=1,exe=1 --region base=8000000,size=10000,main=1,exe=1

@Dolu1990
Member

Hi,

Yes, right, x-prop hell XD
I have seen x-prop easily leak from the instruction cache data banks toward the RVC decompressor / aligner.

Can you try with an additional --with-boot-mem-init? This will add logic to initialize all the L1 data banks, as well as the prediction memories (branch + prefetch).

Let me know how it goes; also, send me a wave if it's not fixed. Don't forget to also send the software ELF or ASM files ^^

@jerryz123
Author

It still seems not to be committing instructions. I've attached a dump of the bootrom, and the binary.

sw.tar.gz
vexii.tar.gz

@Dolu1990
Member

I'm on it, I can see a few false x-props XD
Will update you when the fix is up.
I also have a way to run all my regressions using iverilog instead of Verilator. That is veeeeeeery slow, but it will help to fix all the false x-props.

Dolu1990 added a commit that referenced this issue Sep 20, 2024
Dolu1990 added a commit that referenced this issue Sep 20, 2024
@Dolu1990
Member

Hi,

I just pushed fixes. With those, I can run the VexiiRiscv regression with iverilog and the x-prop issues seem fully resolved.
It can enter the Linux boot process (but that is just too long to simulate; I didn't let it run through XD).

Let me know if you have any issues.

@jerryz123
Author

Getting further. I'm seeing some TileLink errors from the assertions in my environment.

I believe the E channel does not have its sink field exposed correctly. Can this be added?

Dolu1990 added a commit that referenced this issue Sep 20, 2024
@Dolu1990
Member

Hi,

Yes, I just added support for --tl-sink-width=X.
Note that only probe requests are supported on the TileLink B channel.
Also, the instruction cache isn't exposed to probes (fence.i will flush the D$ + I$).

@jerryz123
Author

jerryz123 commented Sep 20, 2024

I'm seeing X coming out of that port.

VCD link: https://drive.google.com/file/d/14Ca3gB--b7xCjn1nCZKDPJJ6z-Ivssyy/view?usp=sharing

@Dolu1990
Member

Seems it is due to mem_node_bus_d_payload_sink not being connected ('Z') on your end of the blackbox :D

@jerryz123
Author

Oops, sorry. I'm seeing a hang due to the core not being ready for the probe on B. It seems to get stuck.

https://drive.google.com/file/d/14Ca3gB--b7xCjn1nCZKDPJJ6z-Ivssyy/view?usp=sharing

@Dolu1990
Member

Hi,

No worries ^^
The reason for this hang is that source ID 0, used to probe, is currently targeting the fetch L1, not the LSU L1.

I pushed some code in GenerateTilelink which now reports, after generation, which memory agent owns which source IDs:

toplevel/FetchL1TileLinkPlugin
- [0x0 .. 0x1] G   
toplevel/LsuTileLinkPlugin
- [0x4 .. 0x4] GF
toplevel/LsuL1TileLinkPlugin
- [0x8 .. 0xb] TB

G => Get
F => Put Full
T => Acquire Trunk
B => Acquire Branch

Which means you need to specify to Chipyard that only source IDs 0x8 to 0xB should be used to probe the D$.
Other source IDs do not support probing.
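On the Chipyard side, this roughly corresponds to declaring the TileLink masters so that only the D$ ID range advertises probe support. A hedged rocket-chip diplomacy sketch (the names and the 64-byte transfer size are illustrative assumptions, not VexiiRiscv's actual integration code):

  import freechips.rocketchip.diplomacy._
  import freechips.rocketchip.tilelink._

  // One TLMasterParameters entry per source-ID range reported by GenerateTilelink.
  val vexiiMasters = TLMasterPortParameters.v1(Seq(
    TLMasterParameters.v1(name = "vexii-fetch-l1", sourceId = IdRange(0x0, 0x2)), // Get only
    TLMasterParameters.v1(name = "vexii-lsu-io",   sourceId = IdRange(0x4, 0x5)), // Get + PutFull
    TLMasterParameters.v1(name = "vexii-lsu-l1",   sourceId = IdRange(0x8, 0xC),  // Acquire Trunk/Branch
      supportsProbe = TransferSizes(64))  // only this range may receive probes on channel B
  ))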

@jerryz123
Author

Thanks! I have my hello world working now :). I'll run more benchmarks to test things out

Btw, it would be nice to change marchid to VexiiRiscv's unique marchid. Right now it reports marchid=0.

@Dolu1990
Member

Thanks! I have my hello world working now :). I'll run more benchmarks to test things out

Nice :D :D
Let me know if anything is weird ^^

Btw, it would be nice to change marchid to VexiiRiscv's unique marchid. Right now it reports marchid=0.

On its way ^^
riscv/riscv-isa-manual#1644

!! Note !!
I just pushed a change. Before this, the data width of the toplevel mem bus was tied to the xlen. Now, the mem bus width will follow the max of fetch-l1-mem-data-width-min and lsu-l1-mem-data-width-min => 128 bits in your config.

So, don't forget to update the data width of the blackbox on your side when you pull from git ^^

@Dolu1990
Member

Hi ^^

Thanks for running the tests #27 #28 :D
I did skip some of the riscv-tests, for a few reasons, one being that FPGA softcores generally do not aim at full compliance / cannot run all tests, for instance because of the lack of PMAs.
But overall I'm now aiming at as much compliance as possible (at least in an optional way).

I will check #27 #28 on Monday :)

Also, if you have a way for me to run the whole test framework you have for Chipyard / Rocket, but on VexiiRiscv, let me know ^^

@jerryz123
Author

Thanks. I'm working on a PR to add this to Chipyard. Almost done, just working through a few remaining test failures.

When simulating under Verilator, I find that a.payload.mask is 0, which is not legal. I'm not sure why this only appears in Verilator...

vexii.tar.gz

@Dolu1990
Member

Ahhh, right.
So far, I think I only considered that mask has meaning for put_partial / put_full, and no meaning in all other cases.
I'm really not sure why the TileLink spec added all those extra requirements:

On channels A and B, which carry a mask field, the mask must always be LOW for all inactive byte
lanes. Furthermore, for all messages other than PutPartialData, the bits of mask for all active
byte lanes must be HIGH. PutPartialData may lower individual bits of the mask and these bits
do not have to be contiguous.
The mask is also used for messages without a data payload. When the operation size is smaller
than the data bus, the mask should be generated identically to an operation which does carry a
data payload. For data-less operations which are larger than the data bus, all bits of the mask
should be HIGH, although the message remains a single-beat. See, for example, Figure 4.6.
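As a worked example of that rule, the expected mask for a full (non-PutPartialData) access can be computed as follows; this is a plain-Scala sketch of the quoted spec text, not VexiiRiscv's implementation:

  // size is log2 of the access size in bytes; busBytes is the data bus width in bytes.
  def tlFullMask(address: Long, size: Int, busBytes: Int): BigInt = {
    val accessBytes = 1L << size
    if (accessBytes >= busBytes) {
      (BigInt(1) << busBytes) - 1                    // access covers the whole bus: all lanes HIGH
    } else {
      val ones   = (BigInt(1) << accessBytes.toInt) - 1
      val offset = (address & (busBytes - 1)).toInt  // byte offset within the bus (size-aligned)
      ones << offset                                 // contiguous active lanes at that offset
    }
  }

  // e.g. a 4-byte Get at an address ending in 0x4 on a 16-byte bus => mask = 0xF0
  assert(tlFullMask(0x80000004L, size = 2, busBytes = 16) == BigInt(0xF0))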

@jerryz123
Author

Performance-wise, this implementation is quite aggressive: there are many bypass paths and a lot of flexibility for scheduling ops in the late ALU.

Were there any bypass paths/scheduling paths that you chose not to support due to physical constraints?

@Dolu1990
Member

Were there any bypass paths/scheduling paths that you chose not to support due to physical constraints?

Here are a few comments:

  • To save area, the FPU only starts to bypass from the LSU load response point (removing 2 bypass points)
  • To improve timing on FPGA, branches are resolved over 2 stages instead of one
  • Currently, the execute plugin implements some bypass logic which allows the store unit to use late results without waiting, but only for the integer register file (you can chain lw x1; sw x1 without penalty).
  • There is still some work to do on the decode / dispatch pipeline to split things over two cycles; currently this is all done in a single cycle. I don't know about ASIC, but on FPGA, that is tooo much XD

flexibility for scheduling ops in the late ALU.

Also, note that if one day you want a half-late / half-early ALU, it could be added just via parameters XD
For instance:

      val middle0 = new LaneLayer("middle0", lane0, priority = -1) // priority guess
      plugins += new SrcPlugin(middle0, executeAt = 1, relaxedRs = 1)
      plugins += new IntAluPlugin(middle0, aluAt = 1, formatAt = 1)
      plugins += shifter(middle0, shiftAt = 1, formatAt = 1)
      plugins += new BranchPlugin(middle0, aluAt = 1, jumpAt = 1, wbAt = 1)

Performance-wise, this implementation is quite aggressive

By aggressive, do you mean IPC or timing? Note I never tried ASIC tooling, only FPGA.
Did you get some numbers :D ?

@Dolu1990
Member

Note, the version you have is friendly to FPGAs with distributed RAM.
The register file emulates multiport writes using multiple 1r_1w memories.
This can be tuned for ASIC.

@jerryz123
Author

I haven't tried an ASIC flow with this. I'm just comparing the behavior to my own dual-issue design, which is much more conservative around bypass paths and late-ALU execution (I can't do late branches, for instance).

@Dolu1990
Member

Is your implementation public?

@jerryz123
Author

@jerryz123 jerryz123 reopened this Sep 24, 2024
@jerryz123
Author

jerryz123 commented Sep 24, 2024

@Dolu1990 sorry to poke, do you have a fix for this?

When simulating under Verilator, I find that a.payload.mask is 0, which is not legal. I'm not sure why this only appears in Verilator...

@Dolu1990
Member

sorry to poke, do you have a fix for this?

Ahhhh, I forgot about that one XD
I'm still unsure how I should fix it; for instance, the TileLink implementation in SpinalHDL does support read-only or write-only TileLink buses, meaning no data/mask signal on the A channel or on the D channel.
I'm thinking that maybe I can just patch it at the point before exposing TileLink to the toplevel io.

@Dolu1990
Member

Hi,
I pushed a fix for the channel A mask. The generated hardware "looks" good.
Let me know if you hit any issues.

@jerryz123
Author

Thanks for your help! Everything works now.
