Synthetic Experiment #1

Open
calebmkim opened this issue Mar 7, 2023 · 2 comments

calebmkim commented Mar 7, 2023

Experiment Process

The .futil file is here (it's named 32-500 because it uses 500 32-bit registers/adders). It essentially computes 20 + 10, writes the result to a register, writes that register to memory, and then repeats this sequentially (with a different register and adder each time) 500 times.

I then ran resource estimates across the following bounds:
adders: 1,8,32,128,Unbounded
registers: 1,8,32,128,Unbounded
This means 5*5=25 experiments total.
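
As a rough illustration of the sweep described above, the grid of settings can be enumerated like this. This is only a sketch: the names `BOUNDS`, `UNBOUNDED`, and `sweep` are made up for the example and are not part of the actual experiment scripts; the only detail borrowed from the table is that -1 stands for "no bound".

```python
from itertools import product

UNBOUNDED = -1  # assumption: mirrors the table labels, where -1 means "no bound"
BOUNDS = [1, 8, 32, 128, UNBOUNDED]

def sweep(bounds=BOUNDS):
    """Return every (adder_bound, register_bound) pair in the sweep."""
    return list(product(bounds, bounds))

settings = sweep()
print(len(settings))  # one entry per experiment
for adder_bound, register_bound in settings[:3]:
    print(adder_bound, register_bound)
```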

You can see the table summary here.

Interpreting the Table

The first column just gives the resource type (e.g., registers, luts, cell_lut1, etc.). I also included worst_slack.
For the remaining columns (e.g., default_8,-1,1): "default" just means I ran Calyx with the default compiler settings (although I did disable group2invoke and tdst). 8 is the sharing bound for adders. -1 means there is no bound on the sharing of registers. The third number (1 in this case) is not important for this table.
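
To make the column naming concrete, here is a small sketch of how a label such as default_8,-1,1 could be split into its parts. The `Setting` type and `parse_label` function are hypothetical helpers written for this example, not something the experiment tooling provides.

```python
from typing import NamedTuple, Optional

class Setting(NamedTuple):
    compiler_config: str           # e.g. "default"
    adder_bound: Optional[int]     # None means no bound (-1 in the label)
    register_bound: Optional[int]  # None means no bound (-1 in the label)
    third: int                     # not important for this table

def parse_label(label: str) -> Setting:
    """Split a column label like 'default_8,-1,1' into its fields."""
    config, _, numbers = label.rpartition("_")
    adders, registers, third = (int(n) for n in numbers.split(","))

    def unbound(n: int) -> Optional[int]:
        return None if n == -1 else n

    return Setting(config, unbound(adders), unbound(registers), third)

print(parse_label("default_8,-1,1"))
```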

Some Takeaways

  • predictably, registers go down the more you share them and are not affected by sharing of adders.
  • predictably, muxes are lowest when we don't share at all.
  • for most other resource types (luts, clbs, muxes), there seems to be some sort of cumulative effect, where sharing adders starts to help if you share registers too... but it doesn't help if you only share adders. Likewise, the improvement from sharing registers seems to become more dramatic once you share adders as well.
  • there's one really weird outlier result, when we don't share adders at all and we share registers bounded at 32: it has much better resource usage than similar settings.
  • the settings with better resource usage also seem to have better timing too (or at least, they seem to have a better "worst slack")
  • The last three points make me think that the synthesis tools are trying to do some sort of resource optimization on their own, but maybe only in certain situations? Or at least, it seems the synthesis tools have a "mind of their own" that we're not capturing.

Simulation

I still need to install verilator on Havarti.
But I did simulate some of the settings on my local machine, and they all gave the expected results. You can see the simulation results here.

@calebmkim

I ran the same experiment, except that this time (per Andrew's suggestion) I made the adders read their inputs from a memory instead of using constants (file here).

Results here

Takeaways

  • Register and Carry8 usage vary predictably
  • We're seeing a similar "cumulative effect", with register and adder sharing decreasing luts/clbs the most when both resources are shared
  • In fact, for luts/clbs at least, the clear best usage comes when we share both adders and registers >=128 times.
  • Like last time, the best area performers also had the best timing performance
  • Muxes were lowest when we didn't share anything. Interestingly, though, mux usage seemed to be better when we shared a lot compared to when we only shared a little (although both were still worse than no sharing).

Simulation

I randomly selected three settings to simulate, and they all performed as expected (results here).
Of course, I still need to install verilator on Havarti; then I'll be able to more easily simulate all possible designs.

calebmkim commented Mar 11, 2023

The .futil file is here.
It defines a component my_register that does exactly what a register would do: it's defined as a component just so that I can tell the compiler not to share it.
Memories write into (non-shared) my_register instances, the my_register instances write into an adder, the adder writes into an actual register, and that register then writes to memory.

One note: when I originally ran this, it didn't meet timing. I had to introduce the new_fsm attribute, which instantiates a new fsm to reduce fsm complexity. I should probably open an issue about how fast fsm complexity can blow up.

Table is here.

  • There is a higher overall baseline for register usage (since now we have these non-shareable my_register instances).
  • LUT usage is interesting: there seems to be a benefit to sharing both registers and adders. However, there also seem to be diminishing returns (at least to a certain extent).
  • Mux usage interestingly decreases as we share more. One possible factor could be this: when the registers write to memory (for example), without sharing, there are 500 different registers trying to write to memory, whereas with sharing only 1 register is trying to write to memory. Seems like this could be helping mux usage?
  • The trend of "better resource usage correlates with better timing performance" doesn't really hold anymore, although they're not inversely correlated either.
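
A back-of-the-envelope sketch of the mux hypothesis in the list above: it counts how many distinct registers would feed the memory write port under each sharing bound. It assumes that a bound of k lets each register serve at most k of the 500 sequential writes and that the compiler shares as much as the bound allows; both are simplifying assumptions for illustration, and `instances_needed` is a made-up helper, not a Calyx API.

```python
import math

NUM_OPS = 500  # the design performs 500 sequential add-and-store steps

def instances_needed(bound, num_ops=NUM_OPS):
    """Estimate distinct instances under a sharing bound (-1 means no bound).

    Assumption: sharing is maximal, so each instance serves up to `bound` uses.
    """
    if bound == -1:
        return 1
    return math.ceil(num_ops / bound)

# Each distinct register is one more source that can drive the memory's write data.
for bound in [1, 8, 32, 128, -1]:
    print(f"bound={bound}: about {instances_needed(bound)} registers writing to memory")
```

Under these assumptions the number of sources drops from 500 with no sharing to 1 with unbounded sharing, which is consistent with the hypothesis but does not by itself confirm what the synthesis tools do.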

Simulation

Randomly selected a few settings to do simulation on, and they all worked as expected.
