
Generating highest-performance implementations with coherent TL ports #26

Closed
jerryz123 opened this issue Sep 14, 2024 · 40 comments

@jerryz123

Can you help me with the compile flags for VexiiRiscv to generate a system with cache-coherent TL ports? The default seems to build a "cacheless config".

In general, what is the best set of flags for maximum performance? It would be good to have these documented somewhere.

The set of configuration flags is very large (which is great), but it would be good to have a "recommended" set of options for maximum performance and capability.

Dolu1990 added a commit that referenced this issue Sep 14, 2024
@Dolu1990
Member

Hi,

I just pushed an additional generator. Now you can use:
vexiiriscv.GenerateTilelink
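(Invoked like the generator runs shown later in this thread, e.g. sbt "Test/runMain vexiiriscv.GenerateTilelink <flags>"; the exact sbt incantation is an assumption and may differ per setup.)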

Here is an example parameter list which should give good performance (Rocket-like / single issue):

--with-fetch-l1 --with-lsu-l1 --lsu-l1-coherency --fetch-l1-hardware-prefetch=nl --fetch-l1-refill-count=2 --lsu-software-prefetch --lsu-hardware-prefetch rpt --performance-counters 9 --regfile-async --lsu-l1-store-buffer-ops=32 --lsu-l1-refill-count 4 --lsu-l1-writeback-count 4 --lsu-l1-store-buffer-slots=4 --with-mul --with-div --allow-bypass-from=0 --with-lsu-bypass --with-rva --with-supervisor --fetch-l1-ways=4 --fetch-l1-mem-data-width-min=128 --lsu-l1-ways=4 --lsu-l1-mem-data-width-min=128 --xlen=64 --with-rvc --with-rvf --with-rvd --with-btb --with-ras --with-gshare

You can add late-ALU support with:
--with-late-alu

And dual issue with:
--decoders=2 --lanes=2 --with-dispatcher-buffer

but it would be good to have a "recommended" set of options for maximum performance and capability.

Yes, right, I need to document this :D

Keep in mind that this configuration has a specific memory region setup: boot at 0x80000000, with IO assumed at 0x10000000-0x1FFFFFFF.

@jerryz123
Author

Thanks. This looks great. I have some questions:

I see two groups of signals named FetchL1Plugin and LsuL1Plugin. These are the requests from the core's L1I$ and L1D$ to the outer memory system?

What are the LsuPlugin signals? These appear to be additional memory request signals from the core to the outer memory system. What are they for? And what do the fromHart and uopId signals mean?

Keep in mind that this configuration has a specific memory region setup: boot at 0x80000000, with IO assumed at 0x10000000-0x1FFFFFFF.

Does this mean the core will immediately start fetching from 0x80000000 after reset? Can the core fetch from any addresses < 0x80000000?

Is the address range 0x0 - 0xFFFFFFF used for anything? What happens when trying to fetch from or access these addresses?

Dolu1990 added a commit that referenced this issue Sep 14, 2024
@Dolu1990
Member

Dolu1990 commented Sep 14, 2024

Hi,

I just pushed a fix which now sets the proper memory region attributes.

I see two groups of signals named FetchL1Plugin and LsuL1Plugin. These are the requests from the core's L1I$ and L1D$ to the outer memory system?

Right. Note that IO accesses are done through another plugin, the LsuPlugin.

What are the LsuPlugin signals? These appear to be additional memory request signals from the core to the outer memory system. What are they for?

IO accesses are done from the LsuPlugin.
The LsuPlugin's jobs are:

  • Bind the MMU TLB lookup logic
  • Handle PMA checks
  • Handle load/store related traps
  • Handle atomic accesses (through memory-coherency cache line locking)
  • Generate cache requests and handle their responses
  • Implement the store buffer
  • Arbitrate cache requests (from AGU, MMU refill, prefetch, store buffer)

The LsuL1Plugin implements:

  • Tag/bank accesses and checks
  • Memory coherency / probes
  • Refill slots
  • Writeback slots

Does this mean the core will immediately start fetching from 0x80000000 after reset?

Yes

Can the core fetch from any addresses < 0x80000000?

The IO region is currently set as executable => 0x10000000-0x1FFFFFFF can be executed

Note that you can feed a custom region mapping, for instance:
--region base=80000000,size=80000000,main=1,exe=1 --region base=10000000,size=10000000,main=0,exe=1
which is equivalent to the default config (main=1 marks cacheable main memory, exe=1 marks the region as executable).

And what do the fromHart and uopId signals mean?

fromHart is used by the simulation lockstep checker to figure out if the access comes from a regular load/store or from the MMU refill.

uopId is used by the simulation lockstep checker to keep track of which instruction a given access is coming from (to check that it is fine).

More generally, VexiiRiscv is verified against RVLS (itself based on Spike) in a lockstep manner, so there are quite a few probes here and there, but mostly there is the WhiteboxerPlugin, which exposes a looooooot of VexiiRiscv behaviour to the SpinalHDL simulation API.
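For readers unfamiliar with that flow, here is a minimal sketch of what sampling such probes through the SpinalHDL simulation API can look like. The Dut component and its probe names below are hypothetical stand-ins, not VexiiRiscv's real hierarchy:

  import spinal.core._
  import spinal.core.sim._

  object ProbeDemo extends App {
    // Hypothetical DUT exposing a commit probe, standing in for a whiteboxed design.
    class Dut extends Component {
      val commitValid = out Bool()
      val commitPc    = out UInt(40 bits)
      commitValid := True
      commitPc    := U(0x80000000L, 40 bits)
    }

    SimConfig.compile(new Dut).doSim { dut =>
      dut.clockDomain.forkStimulus(period = 10)
      // Sample the probes on every clock sampling, as a lockstep checker would.
      dut.clockDomain.onSamplings {
        if (dut.commitValid.toBoolean) println(f"commit pc=0x${dut.commitPc.toLong}%x")
      }
      dut.clockDomain.waitSampling(100)
    }
  }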

Is the address range 0x0 - 0xFFFFFFF used for anything? What happens when trying to fetch from or access these addresses?

It should trap, in every case.

Let me know if anything goes wrong ^^

@jerryz123
Author

Ah ok, so there's enough flexibility that I can specify precisely what the memory map of the SoC is with the --region flags. That seems good enough for me.

If I'm not doing lockstep cosim, can I just ignore those signals?

Also, for multicore configs, you typically need a hartId wire into the core to differentiate their mhartid CSRs. How is this done with VexiiRiscv?

Dolu1990 added a commit that referenced this issue Sep 14, 2024
@Dolu1990
Member

Dolu1990 commented Sep 14, 2024

If I'm not doing lockstep cosim, can I just ignore those signals?

Yes. The thing is, with this GenerateTilelink, you get VexiiRiscvTilelink.v, which doesn't expose them in the io of the toplevel anymore.

Also, for multicore configs, you typically need a hartId wire into the core to differentiate their mhartid CSRs. How is this done with VexiiRiscv?

So far, this was an internal constant in the CPU (loaded through parameters).
I just pushed an option: you can now do --with-hart-id-input and it will give you a 32-bit input signal "hartId".

@jerryz123
Author

Is it possible to make the default boot address configurable? I typically boot the cores into a bootrom at 0x10000, where they hang until interrupted to jump to 0x80000000.

@Dolu1990
Member

Yes,
--reset-vector=0x10000

Just don't forget to specify all the --region flags, or else it will trap endlessly ^^

@Dolu1990
Member

Also, if you specify any --region, all regions need to be specified (specifying one loses the defaults).
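For example, an illustrative minimal set for the bootrom-at-0x10000 flow above (the region sizes here are assumptions, not a complete SoC map):

--reset-vector=0x10000 --region base=10000,size=10000,main=0,exe=1 --region base=80000000,size=80000000,main=1,exe=1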

@jerryz123
Author

Thanks. Much closer now. Is there a way to get trace log information in the generated Verilog? Through a printf or something?

Or to have any signal in the generated Verilog that corresponds to a PC trace?

@Dolu1990
Member

There are no Verilog printf traces; instead, the sims I run use SpinalSim probes in the DUT.
I'm travelling; once back I will check for a solution.

@jerryz123
Author

Thanks. If you expose IO for a PC trace, I can write the printf myself in the wrapper.

Dolu1990 added a commit that referenced this issue Sep 17, 2024
…xer-outputs

Add WhiteboxerPlugin_logic_commits_xxx to trace commits activity easily
@Dolu1990
Member

Dolu1990 commented Sep 17, 2024

Hi,

I added an option to get all the whitebox signals as outputs: --with-whiteboxer-outputs

So, with:

--with-fetch-l1 --with-lsu-l1 --lsu-l1-coherency --fetch-l1-hardware-prefetch=nl --fetch-l1-refill-count=2 --lsu-software-prefetch --lsu-hardware-prefetch rpt --performance-counters 9 --regfile-async --lsu-l1-store-buffer-ops=32 --lsu-l1-refill-count 4 --lsu-l1-writeback-count 4 --lsu-l1-store-buffer-slots=4 --with-mul --with-div --allow-bypass-from=0 --with-lsu-bypass --with-rva --with-supervisor --fetch-l1-ways=4 --fetch-l1-mem-data-width-min=128 --lsu-l1-ways=4 --lsu-l1-mem-data-width-min=128 --xlen=64 --with-rvc --with-rvf --with-rvd --with-btb --with-ras --with-gshare --with-late-alu --decoders=2 --lanes=2 --with-dispatcher-buffer --trace-all --dual-sim --with-whiteboxer-outputs

I get:

5.04 CoreMark/MHz

The outputs which should interest you are:

  output wire          WhiteboxerPlugin_logic_commits_ports_0_valid /* verilator public */ ,
  output wire [39:0]   WhiteboxerPlugin_logic_commits_ports_0_pc /* verilator public */ ,
  output wire [31:0]   WhiteboxerPlugin_logic_commits_ports_0_uop /* verilator public */ ,
  output wire          WhiteboxerPlugin_logic_commits_ports_1_valid /* verilator public */ ,
  output wire [39:0]   WhiteboxerPlugin_logic_commits_ports_1_pc /* verilator public */ ,
  output wire [31:0]   WhiteboxerPlugin_logic_commits_ports_1_uop /* verilator public */ ,

ports_0 and ports_1 are ordered (0 meaning the oldest instruction).
uop maps 1-to-1 to a RISC-V instruction.

Note that I just added those WhiteboxerPlugin_logic_commits_ports; they should normally work, but aren't tested much.
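For the printf-in-the-wrapper approach mentioned earlier, a minimal Chisel sketch of such a tracer could look like the following (the module and its port bundle are hypothetical; only the commented signal names come from the generated Verilog above):

  import chisel3._

  // Hypothetical tracer fed from the VexiiRiscvTilelink.v commit outputs.
  class VexiiCommitTracer extends Module {
    val io = IO(new Bundle {
      val valid = Input(Bool())      // WhiteboxerPlugin_logic_commits_ports_0_valid
      val pc    = Input(UInt(40.W))  // WhiteboxerPlugin_logic_commits_ports_0_pc
      val uop   = Input(UInt(32.W))  // WhiteboxerPlugin_logic_commits_ports_0_uop
    })
    // Print one line per committed instruction during simulation.
    when(io.valid) {
      printf("commit pc=0x%x uop=0x%x\n", io.pc, io.uop)
    }
  }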

@Dolu1990
Member

Keep in mind, I'm testing VexiiRiscv only against Verilator and hardware, not against x-prop-aware simulators.

If things don't work, you can send me a wave (VCD or FST, ideally FST) and the software you run (.asm or .elf).

@jerryz123
Author

jerryz123 commented Sep 18, 2024

I think it's close: the VexiiRiscvTile seems to generate a TL fetch to 0x10000, but something gets stuck afterwards, likely due to x-prop.
I've attached some waves here:
vexii.tar.gz

For reference, I'm using the flags:

Test/runMain vexiiriscv.GenerateTilelink --with-fetch-l1 --with-lsu-l1 --lsu-l1-coherency --fetch-l1-hardware-prefetch=nl --fetch-l1-refill-count=2 --lsu-software-prefetch --lsu-hardware-prefetch rpt --performance-counters 9 --regfile-async --lsu-l1-store-buffer-ops=32 --lsu-l1-refill-count 4 --lsu-l1-writeback-count 4 --lsu-l1-store-buffer-slots=4 --with-mul --with-div --allow-bypass-from=0 --with-lsu-bypass --with-rva --with-supervisor --fetch-l1-ways=4 --fetch-l1-mem-data-width-min=128 --lsu-l1-ways=4 --lsu-l1-mem-data-width-min=128 --xlen=64 --with-rvc --with-rvf --with-rvd --with-btb --with-ras --with-gshare --with-late-alu --decoders=2 --lanes=2 --with-dispatcher-buffer --with-hart-id-input --reset-vector=0x10000 --with-whiteboxer-outputs --region base=3000,size=1000,main=0,exe=1 --region base=2010000,size=1000,main=0,exe=1 --region base=1000,size=1000,main=0,exe=1 --region base=10020000,size=1000,main=0,exe=1 --region base=2000000,size=10000,main=0,exe=1 --region base=C000000,size=4000000,main=0,exe=1 --region base=0,size=1000,main=0,exe=1 --region base=10000,size=10000,main=0,exe=1 --region base=100000,size=1000,main=0,exe=1 --region base=110000,size=1000,main=0,exe=1 --region base=80000000,size=10000000,main=1,exe=1 --region base=8000000,size=10000,main=1,exe=1

@Dolu1990
Member

Hi,

Yes, right, x-prop hell XD
I have seen x-prop easily leak from the instruction cache data banks toward the RVC decompressor / aligner.

Can you try with an additional --with-boot-mem-init? This will add logic to initialize all the L1 data banks, as well as the prediction memories (branch + prefetch).

Let me know how it goes; also, send me a wave if it's not fixed. Don't forget to also send the software ELF or ASM files ^^

@jerryz123
Author

It still seems not to be committing instructions. I've attached a dump of the bootrom, and the binary.

sw.tar.gz
vexii.tar.gz

@Dolu1990
Member

I'm on it, I can see a few false x-props XD
Will update you when the fix is up.
I also have a way to run all my regressions using iverilog instead of Verilator. That is veeeeeeery slow, but it will help to fix all the false x-props.

Dolu1990 added a commit that referenced this issue Sep 20, 2024
Dolu1990 added a commit that referenced this issue Sep 20, 2024
@Dolu1990
Member

Hi,

I just pushed fixes. With those, I can run the VexiiRiscv regression with iverilog and the x-prop issues seem fully resolved.
It can enter the Linux boot process (but that is just too long to simulate; I didn't let it run through XD).

Let me know if you have any issues.

@jerryz123
Author

Getting further. I'm seeing some TileLink errors from the assertions in my environment.

I believe the E channel does not have its sink field exposed correctly. Can this be added?

Dolu1990 added a commit that referenced this issue Sep 20, 2024
@Dolu1990
Member

Hi,

Yes, I just added support for --tl-sink-width=X.
Note that only probe requests are supported on the TileLink B channel.
Also, the instruction cache isn't exposed to probes (fence.i will flush the D$ + I$).

@jerryz123
Author

jerryz123 commented Sep 20, 2024

I'm seeing X coming out of that port.

VCD link: https://drive.google.com/file/d/14Ca3gB--b7xCjn1nCZKDPJJ6z-Ivssyy/view?usp=sharing

@Dolu1990
Member

Seems it is due to mem_node_bus_d_payload_sink not being connected ('Z') on your end of the blackbox :D

@jerryz123
Author

Oops, sorry. I'm seeing a hang due to the core not being ready for the probe on B. It seems to get stuck.

https://drive.google.com/file/d/14Ca3gB--b7xCjn1nCZKDPJJ6z-Ivssyy/view?usp=sharing

@Dolu1990
Member

Hi,

No worries ^^
The reason for this hang is that source ID 0, used to probe, is currently targeting the fetch L1, not the LSU L1.

I pushed some code in GenerateTilelink which now reports, after generation, which memory agent owns which source IDs:

toplevel/FetchL1TileLinkPlugin
- [0x0 .. 0x1] G   
toplevel/LsuTileLinkPlugin
- [0x4 .. 0x4] GF
toplevel/LsuL1TileLinkPlugin
- [0x8 .. 0xb] TB

G => Get
F => Put Full
T => Acquire Trunk
B => Acquire Branch

Which means you need to specify to Chipyard that only source IDs 0x8 to 0xB should be used to probe the D$.
Other source IDs do not support probing.
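On the Chipyard side, this roughly corresponds to declaring the TileLink masters so that only the D$ ID range advertises probe support. A hedged rocket-chip diplomacy sketch (the names and the 64-byte transfer size are illustrative assumptions, not VexiiRiscv's actual integration code):

  import freechips.rocketchip.diplomacy._
  import freechips.rocketchip.tilelink._

  // One TLMasterParameters entry per source-ID range reported by GenerateTilelink.
  val vexiiMasters = TLMasterPortParameters.v1(Seq(
    TLMasterParameters.v1(name = "vexii-fetch-l1", sourceId = IdRange(0x0, 0x2)), // Get only
    TLMasterParameters.v1(name = "vexii-lsu-io",   sourceId = IdRange(0x4, 0x5)), // Get + PutFull
    TLMasterParameters.v1(name = "vexii-lsu-l1",   sourceId = IdRange(0x8, 0xC),  // Acquire Trunk/Branch
      supportsProbe = TransferSizes(64))  // only this range may receive probes on channel B
  ))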

@jerryz123
Author

Thanks! I have my hello world working now :). I'll run more benchmarks to test things out

Btw, it would be nice to change marchid to VexiiRiscv's unique marchid. Right now it reports marchid=0.

@Dolu1990
Member

Thanks! I have my hello world working now :). I'll run more benchmarks to test things out

Nice :D :D
Let me know if anything is weird ^^

Btw, it would be nice to change marchid to VexiiRiscv's unique marchid. Right now it reports marchid=0.

On its way ^^
riscv/riscv-isa-manual#1644

!! Note !!
I just pushed a change. Before this, the data width of the toplevel mem bus was tied to the xlen. Now, the mem bus width will follow the max of fetch-l1-mem-data-width-min and lsu-l1-mem-data-width-min => 128 bits in your config.

So, don't forget to update the data width of the blackbox on your side when you pull from git ^^

@Dolu1990
Member

Hi ^^

Thanks for running the tests #27 #28 :D
I did skip some of the riscv-tests, for a few reasons, one being that FPGA softcores generally do not aim at full compliance / cannot run all tests, for instance because of the lack of PMAs.
But overall I'm now aiming at as much compliance as possible (at least in an optional way).

I will check #27 #28 on Monday :)

Also, if you have a way for me to run the whole test framework you have for Chipyard / Rocket, but on VexiiRiscv, let me know ^^

@jerryz123
Author

Thanks. I'm working on a PR to add this to Chipyard. Almost done, just working through a few remaining test failures.

When simulating under Verilator, I find that a.payload.mask is 0, which is not legal. I'm not sure why this only appears in Verilator...

vexii.tar.gz

@Dolu1990
Member

Ahhh, right.
So far, I think I only considered that mask has meaning for put_partial / put_full, and no meaning in all other cases.
I'm really not sure why the TileLink spec added all those extra requirements:

On channels A and B, which carry a mask field, the mask must always be LOW for all inactive byte
lanes. Furthermore, for all messages other than PutPartialData, the bits of mask for all active
byte lanes must be HIGH. PutPartialData may lower individual bits of the mask and these bits
do not have to be contiguous.
The mask is also used for messages without a data payload. When the operation size is smaller
than the data bus, the mask should be generated identically to an operation which does carry a
data payload. For data-less operations which are larger than the data bus, all bits of the mask
should be HIGH, although the message remains a single-beat. See, for example, Figure 4.6.
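As a worked example of that rule, the expected mask for a full (non-PutPartialData) access can be computed as follows; this is a plain-Scala sketch of the quoted spec text, not VexiiRiscv's implementation:

  // size is log2 of the access size in bytes; busBytes is the data bus width in bytes.
  def tlFullMask(address: Long, size: Int, busBytes: Int): BigInt = {
    val accessBytes = 1L << size
    if (accessBytes >= busBytes) {
      (BigInt(1) << busBytes) - 1                    // access covers the whole bus: all lanes HIGH
    } else {
      val ones   = (BigInt(1) << accessBytes.toInt) - 1
      val offset = (address & (busBytes - 1)).toInt  // byte offset within the bus (size-aligned)
      ones << offset                                 // contiguous active lanes at that offset
    }
  }

  // e.g. a 4-byte Get at an address ending in 0x4 on a 16-byte bus => mask = 0xF0
  assert(tlFullMask(0x80000004L, size = 2, busBytes = 16) == BigInt(0xF0))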

@jerryz123
Author

Performance-wise, this implementation is quite aggressive: there are many bypass paths and a lot of flexibility for scheduling ops in the late ALU.

Were there any bypass paths/scheduling paths that you chose not to support due to physical constraints?

@Dolu1990
Member

Were there any bypass paths/scheduling paths that you chose not to support due to physical constraints?

Here are a few comments:

  • To save area, the FPU only starts to bypass from the LSU load response point (removing 2 bypass points)
  • To improve timing on FPGA, branches are resolved over 2 stages instead of one
  • Currently, the execute plugin implements some bypass logic which allows the store unit to use late results without waiting, but only for the integer register file (you can chain lw x1; sw x1 without penalty).
  • There is still some work to do on the decode / dispatch pipeline to split things over two cycles; currently this is all done in a single cycle. I don't know about ASIC, but on FPGA, that is tooo much XD

flexibility for scheduling ops in the late ALU.

Also, note that if one day you want a half-late / half-early ALU, it could be added just via parameters XD
For instance:

      val middle0 = new LaneLayer("middle0", lane0, priority = -1) // priority guess
      plugins += new SrcPlugin(middle0, executeAt = 1, relaxedRs = 1)
      plugins += new IntAluPlugin(middle0, aluAt = 1, formatAt = 1)
      plugins += shifter(middle0, shiftAt = 1, formatAt = 1)
      plugins += new BranchPlugin(middle0, aluAt = 1, jumpAt = 1, wbAt = 1)

Performance-wise, this implementation is quite aggressive

By aggressive, do you mean IPC or timing? Note I never tried ASIC tooling, only FPGA.
Did you get some numbers :D ?

@Dolu1990
Member

Note, the version you have is friendly to FPGAs with distributed RAM.
The register file emulates multiport writes using multiple 1r_1w memories.
This can be tuned for ASIC.

@jerryz123
Author

I haven't tried an ASIC flow with this. I'm just comparing the behavior to my own dual-issue design, which is much more conservative around bypass paths and late-ALU execution (I can't do late branches, for instance).

@Dolu1990
Member

Is your implementation public?

@jerryz123
Author

@jerryz123 jerryz123 reopened this Sep 24, 2024
@jerryz123
Author

jerryz123 commented Sep 24, 2024

@Dolu1990 sorry to poke, do you have a fix for this?

When simulating under Verilator, I find that a.payload.mask is 0, which is not legal. I'm not sure why this only appears in Verilator...

@Dolu1990
Member

sorry to poke, do you have a fix for this?

Ahhhh, I forgot about that one XD
I'm still unsure how I should fix it; for instance, the TileLink implementation in SpinalHDL does support read-only or write-only TileLink buses, meaning no data/mask signal on the A channel or on the D channel.
I'm thinking that maybe I can just patch it at the point before exposing TileLink to the toplevel io.

@Dolu1990
Member

Hi,
I pushed a fix for the channel A mask. The generated hardware "looks" good.
Let me know if you hit any issues.

@jerryz123
Author

Thanks for your help! Everything works now.
