Generating highest-performance implementations with coherent TL ports #26
Comments
Hi, I just pushed an additional generator. Now you can use :
Here is an example parameter list which should give good performance (Rocket-like / single issue):
--with-fetch-l1 --with-lsu-l1 --lsu-l1-coherency --fetch-l1-hardware-prefetch=nl --fetch-l1-refill-count=2 --lsu-software-prefetch --lsu-hardware-prefetch rpt --performance-counters 9 --regfile-async --lsu-l1-store-buffer-ops=32 --lsu-l1-refill-count 4 --lsu-l1-writeback-count 4 --lsu-l1-store-buffer-slots=4 --with-mul --with-div --allow-bypass-from=0 --with-lsu-bypass --with-rva --with-supervisor --fetch-l1-ways=4 --fetch-l1-mem-data-width-min=128 --lsu-l1-ways=4 --lsu-l1-mem-data-width-min=128 --xlen=64 --with-rvc --with-rvf --with-rvd --with-btb --with-ras --with-gshare
You can add late ALU support with :
And add dual issue with
Yes right, I need to document this :D Keep in mind that this configuration has a specific memory region setup: it boots at 0x80000000 and assumes that IO is at 0x10000000-0x1FFFFFFF |
Thanks. This looks great. I have some questions: I see two groups of signals named What are the
Does this mean the core will immediately start fetching from 0x80000000 after reset? Can the core fetch from any addresses < 0x80000000? Is the address range 0x0 - 0xFFFFFFF used for anything? What happens when trying to fetch from or access these addresses? |
Hi, I just pushed a fix which now gets the proper memory region attributes.
Right. Note that IO accesses are done through another plugin, the LsuPlugin.
IO accesses are done from the LsuPlugin
The LsuL1Plugin implements :
Yes
The IO region is currently set as executable => 0x10000000-0x1FFFFFFF can be executed. Note that you can feed a custom region mapping via for instance :
fromHart is used by the simulation lockstep checker to figure out whether the access comes from a regular load/store or from the MMU refill.
uopId is used by the simulation lockstep checker to keep track of which instruction a given access comes from (to check that it is fine).
More generally, VexiiRiscv is verified against RVLS (itself based on Spike) in a lockstep manner, so there are quite a few probes here and there, but mostly there is the WhiteboxerPlugin, which exposes a looooooot of VexiiRiscv behaviour to the SpinalHDL simulation API
It should trap in every case. Let me know if anything goes wrong ^^ |
Ah ok, so there's enough flexibility where I can specify precisely what the memory map of the SoC is with the If I'm not doing lockstep cosim, can I just ignore those signals? Also, for multicore configs, you typically need a hartId wire into the core to differentiate their |
Yes, thing is, with this GenerateTilelink, you get VexiiRiscvTilelink.v, which doesn't expose them anymore in the io of the toplevel.
So far, this was an internal constant in the CPU (loaded through parameters) |
Is it possible to add a configuration select for the default boot address? I typically boot the cores into a bootrom at 0x10000, where they hang until interrupted to jump to 0x80000000 |
Yes, just don't forget to specify all the --region entries, else it will trap endlessly ^^ |
Also, if you specify any --region, all regions need to be specified (the default ones are lost) |
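A minimal Python sketch of the region behaviour described above; it is not VexiiRiscv code. `Region`, `effective_regions` and `fetch` are hypothetical names, and the size of the default main-memory region is an assumption (the thread only gives its start, 0x80000000). It models two points from the thread: a fetch outside every region traps, and passing any region drops the defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical model of one memory region (not the real --region syntax)."""
    base: int
    size: int
    is_io: bool
    executable: bool

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


# Defaults as described in the thread: IO at 0x10000000-0x1FFFFFFF (currently
# executable) and main memory starting at 0x80000000.
DEFAULT_REGIONS = (
    Region(0x10000000, 0x10000000, is_io=True, executable=True),
    Region(0x80000000, 0x80000000, is_io=False, executable=True),  # size is an assumption
)


def effective_regions(user_regions):
    """If any region is given, the defaults are lost; otherwise use the defaults."""
    return tuple(user_regions) if user_regions else DEFAULT_REGIONS


def fetch(addr: int, regions) -> str:
    """Return "ok" if addr lies in an executable region, else "trap"."""
    for region in regions:
        if region.contains(addr) and region.executable:
            return "ok"
    return "trap"


# A bootrom at 0x10000 alone is not enough: main memory must be listed again.
bootrom_only = [Region(0x10000, 0x10000, is_io=False, executable=True)]
print(fetch(0x10000, effective_regions(bootrom_only)))      # bootrom is mapped
print(fetch(0x80000000, effective_regions(bootrom_only)))   # default region was dropped
```

With the bootrom as the only region, the jump to 0x80000000 traps, which is the "trap endlessly" case mentioned above.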
Thanks. Much closer now. Is there a way to get trace log information in the generated verilog? Through printf or something? Or have any signal in the generated verilog that corresponds to a PC trace? |
There are no Verilog printf traces; instead, the sims I run use SpinalSim probes in the DUT. |
Thanks. If you expose IO for a PC trace, I can write the printf myself in the wrapper. |
…xer-outputs Add WhiteboxerPlugin_logic_commits_xxx to trace commits activity easily
Hi, I added an option to get all the whitebox signals as outputs: --with-whiteboxer-outputs
So, with :
--with-fetch-l1 --with-lsu-l1 --lsu-l1-coherency --fetch-l1-hardware-prefetch=nl --fetch-l1-refill-count=2 --lsu-software-prefetch --lsu-hardware-prefetch rpt --performance-counters 9 --regfile-async --lsu-l1-store-buffer-ops=32 --lsu-l1-refill-count 4 --lsu-l1-writeback-count 4 --lsu-l1-store-buffer-slots=4 --with-mul --with-div --allow-bypass-from=0 --with-lsu-bypass --with-rva --with-supervisor --fetch-l1-ways=4 --fetch-l1-mem-data-width-min=128 --lsu-l1-ways=4 --lsu-l1-mem-data-width-min=128 --xlen=64 --with-rvc --with-rvf --with-rvd --with-btb --with-ras --with-gshare --with-late-alu --decoders=2 --lanes=2 --with-dispatcher-buffer --trace-all --dual-sim --with-whiteboxer-outputs
I get : 5.04 Coremark/MHz
The outputs which should interest you are :
ports_0 and ports_1 are ordered (0 meaning oldest instruction). Note that I just added those WhiteboxerPlugin_logic_commits_ports; they should normally work, but aren't tested much |
Keep in mind, I'm testing VexiiRiscv only against Verilator and hardware, not against x-prop aware simulators. If things don't work, you can send me a wave (vcd or fst, fst ideally) and the software you run (.asm or .elf) |
I think it's close, the VexiiRiscvTile seems to generate a TL fetch to 0x10000, but something gets stuck afterwards, likely due to x-prop. For reference, I'm using the flags:
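To illustrate the port ordering, here is a small Python sketch of how a wrapper or post-processing script might flatten per-cycle commit samples into a program-order PC trace. The `(valid, pc)` tuple layout and the `ordered_commits` name are assumptions made for illustration; the actual signal names under WhiteboxerPlugin_logic_commits_ports are not reproduced here.

```python
def ordered_commits(cycles):
    """Flatten per-cycle commit samples into a program-order list of PCs.

    `cycles` is an iterable of per-cycle samples; each sample is a list of
    (valid, pc) tuples indexed by port number (hypothetical layout).
    Port 0 holds the oldest instruction, so ports are visited in index order.
    """
    trace = []
    for ports in cycles:
        for valid, pc in ports:
            if valid:
                trace.append(pc)
    return trace


samples = [
    [(True, 0x10000), (True, 0x10004)],   # both ports commit in this cycle
    [(False, 0x0), (False, 0x0)],         # nothing commits
    [(True, 0x10008), (False, 0x0)],      # only port 0 commits
]
for pc in ordered_commits(samples):
    print(f"commit pc=0x{pc:08x}")
```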
|
Hi, Yes right, x-prop hell XD Can you try with an additional --with-boot-mem-init ? This will add logic to initialize all the L1 data banks, as well as the prediction memories (branch + prefetch). Let me know how it goes; also, send me a wave if it is not fixed. Don't forget to also send the software elf or asm files ^^ |
Still seems to be not committing instructions. I've attached a dump of the bootrom, and the binary. |
I'm on it, i can see a few false x-prop XD |
Hi, I just pushed fixes. With those, I can run the VexiiRiscv regression with iverilog and the x-prop seems fully resolved. Let me know if you have any issues |
Getting further. I'm seeing some tilelink errors from the assertions in my environment. I believe the E channel does not have |
Hi, Yes, I just added support for --tl-sink-width=X |
I'm seeing X coming out of that port. VCD link: https://drive.google.com/file/d/14Ca3gB--b7xCjn1nCZKDPJJ6z-Ivssyy/view?usp=sharing |
Seems it is due to mem_node_bus_d_payload_sink not being connected ('Z') on your end of the blackbox :D |
Oops, sorry. I'm seeing a hang due to the core not being ready for the probe on B. It seems to get stuck: https://drive.google.com/file/d/14Ca3gB--b7xCjn1nCZKDPJJ6z-Ivssyy/view?usp=sharing |
Hi, No worries ^^ I pushed some code in the GenerateTileLink which now reports, after the generation, which memory agent is using which source IDs :
G => Get
Which means you need to specify to Chipyard that only source IDs 0x8 to 0xB should be used to probe the D$. |
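As a rough illustration of that constraint, the Python sketch below filters a set of source IDs down to the ones that may be probed. The 0x8 to 0xB range comes from the comment above and is specific to that generated configuration; the function names are hypothetical, and this is not how Chipyard expresses the constraint.

```python
# Source IDs reported for the D$ in the comment above: 0x8 to 0xB inclusive.
# Other configurations may report a different range.
PROBE_SOURCE_IDS = range(0x8, 0xB + 1)


def can_probe(source_id: int) -> bool:
    """True if a probe may be sent to this source ID."""
    return source_id in PROBE_SOURCE_IDS


def probe_targets(all_source_ids):
    """Keep only the source IDs that belong to the coherent D$."""
    return [sid for sid in all_source_ids if can_probe(sid)]


print([hex(sid) for sid in probe_targets(range(0x0, 0x10))])
```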
Thanks! I have my hello world working now :). I'll run more benchmarks to test things out Btw, it would be nice to change marchid to vexiiriscv's unique marchid. Right now it reports marchid=0. |
Nice :D :D
On its way ^^ !! Note !! So, don't forget to update the data width on the blackbox on your side when you pull the git ^^ |
Hi ^^ Thanks for running the tests #27 #28 :D I will check #27 #28 Monday :) Also, if you have a way for me to run all the testframework you have on chipyard / rocket, but on vexiiriscv, let me know ^^ |
Thanks. I'm working on a PR to add this to chipyard. Almost done, just working through a few remaining test failures. When simulating under Verilator, I find that a.payload.mask is 0, which is not legal. I'm not sure why this only appears in Verilator... |
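For context, here is a hedged Python sketch of the kind of mask check a TileLink monitor performs on the A channel. It reflects my understanding of the usual rule (the mask must cover exactly the byte lanes of a naturally aligned access, with partial writes allowed to clear lanes); the thread itself only states that a mask of 0 is illegal. The function names and the `partial` flag are assumptions for illustration.

```python
def expected_mask(address: int, size_log2: int, beat_bytes: int) -> int:
    """Byte-lane mask for a naturally aligned access of 2**size_log2 bytes
    on a data bus that is beat_bytes wide."""
    access_bytes = 1 << size_log2
    if access_bytes >= beat_bytes:
        return (1 << beat_bytes) - 1          # every lane of the beat is active
    if address % access_bytes != 0:
        raise ValueError("access is not naturally aligned")
    offset = address % beat_bytes             # first active byte lane
    return ((1 << access_bytes) - 1) << offset


def mask_is_legal(mask, address, size_log2, beat_bytes, partial=False) -> bool:
    """Full accesses must match the expected mask exactly; partial writes
    (assumption) may clear lanes but never set lanes outside it."""
    full = expected_mask(address, size_log2, beat_bytes)
    if partial:
        return (mask & ~full) == 0
    return mask == full


# A 4-byte access at 0x80000004 on a 128-bit (16-byte) bus uses lanes 4..7.
print(hex(expected_mask(0x80000004, 2, 16)))
# A mask of 0 on a non-partial 8-byte access is rejected.
print(mask_is_legal(0x0, 0x80000000, 3, 16))
```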
Ahhh right.
|
Performance-wise this implementation is quite aggressive: very many bypass paths and flexibility for scheduling ops in the late ALU. Were there any bypass paths/scheduling paths that you chose not to support due to physical constraints? |
Here are a few comments :
Also, notice that if one day you want a half late/half early ALU, it could be added just via parameters XD
val middle0 = new LaneLayer("middle0", lane0, priority = -1) // priority guess
plugins += new SrcPlugin(middle0, executeAt = 1, relaxedRs = 1)
plugins += new IntAluPlugin(middle0, aluAt = 1, formatAt = 1)
plugins += shifter(middle0, shiftAt = 1, formatAt = 1)
plugins += new BranchPlugin(middle0, aluAt = 1, jumpAt = 1, wbAt = 1)
By aggressive, do you mean IPC or timings ? Note i never tried on ASIC tooling, only FPGA |
Note, the version you have is friendly for FPGA with distributed ram. |
I haven't tried an ASIC flow with this. I'm just comparing the behavior to my own dual-issue design, which is much more conservative around bypass paths and late-ALU execution (I can't do late-branch, for instance). |
Is your implementation public ? |
@Dolu1990 sorry to poke, do you have a fix for this?
|
Ahhhh i forgot about that one XD |
Hi, |
Thanks for your help! Everything works now |
Can you help me with the compile flags for VexiiRiscv to generate a system with cache-coherent TL ports? The default seems to build a "cacheless config".
In general, what are the best set of flags for the system for maximum performance? It would be good to leave these documented somewhere.
The number of configuration flags is very large (which is great), but it would be good to have a "recommended" set of options for maximum performance and capability.