Synthetic Experiment #1

calebmkim · 2023-03-07T17:58:56Z

Experiment Process

The .futil file is here (it's named 32-500 because I'm using 500 32-bit registers/adders). It essentially adds 20 + 10, writes the result to a register, then writes the register to memory, and repeats this sequentially (with a different register and adder each time) 500 times.

I then ran resource estimates across the following bounds:
adders: 1,8,32,128,Unbounded
registers: 1,8,32,128,Unbounded
This means 5*5=25 experiments total.
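As a quick sanity check on the count, the grid of settings can be enumerated like this. This is only a sketch: the names `BOUNDS` and `experiment_grid` are made up for illustration, and using -1 for "Unbounded" is borrowed from the table's column labels rather than from the actual experiment scripts.

```python
from itertools import product

# Sharing bounds from the two lists above. -1 stands in for "Unbounded"
# (assumption: same convention as the table's column labels).
BOUNDS = [1, 8, 32, 128, -1]


def experiment_grid(bounds=BOUNDS):
    """Return every (adder_bound, register_bound) combination."""
    return list(product(bounds, repeat=2))


grid = experiment_grid()
print(len(grid))  # 25
```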

You can see the table summary here.

Interpreting the Table

The first column just gives the resource type (e.g., registers, luts, cell_lut1, etc.). I also included worst_slack.
For the following columns (e.g., default_8,-1,1): "default" just means I ran Calyx on the default compiler setting (although I did disable group2invoke and tdst). 8 is the number of times we share adders. -1 means no bound on the sharing of registers. The third number, 1 in this case, is not important for this table.
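If it helps when reading the table programmatically, a column label like default_8,-1,1 could be split apart roughly as follows. This is a hypothetical helper, not something from the experiment scripts: `Column` and `parse_column` are illustrative names, and it assumes -1 means "no bound" for adders as well as registers.

```python
from typing import NamedTuple, Optional


class Column(NamedTuple):
    setting: str                    # e.g. "default"
    adder_bound: Optional[int]      # None means no bound (assumed for adders too)
    register_bound: Optional[int]   # None means no bound
    third: int                      # not important for this table


def parse_column(label: str) -> Column:
    """Split a label such as 'default_8,-1,1' into its parts."""
    # Split on the last underscore so the setting name keeps any earlier ones.
    setting, _, bounds = label.rpartition("_")
    adders, registers, third = (int(x) for x in bounds.split(","))

    def to_bound(n: int) -> Optional[int]:
        return None if n == -1 else n

    return Column(setting, to_bound(adders), to_bound(registers), third)


print(parse_column("default_8,-1,1"))
# Column(setting='default', adder_bound=8, register_bound=None, third=1)
```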

Some Takeaways

Predictably, registers go down the more you share them and are not affected by sharing of adders.
Predictably, muxes are lowest when we don't share at all.
For most other resource types (luts, clbs, muxes), there seems to be some sort of cumulative effect, where sharing adders starts to help if you share registers too, but doesn't help if you only share adders. Likewise, the improvement from sharing registers seems to become more dramatic once you share adders as well.
There's one really weird outlier result: when we don't share adders at all and share registers with a bound of 32, resource usage is much better than for similar settings.
The settings with better resource usage also seem to have better timing (or at least, they seem to have a better "worst slack").
The last three points make me think that the synthesis tools are trying to do some sort of resource optimization on their own, but maybe only in certain situations. Or at least, the synthesis tools seem to have a "mind of their own" that we're not capturing.

Simulation

I still need to install Verilator on Havarti.
But I did simulate some of the settings on my local machine, and they all gave the expected results. You can see the simulation results here.


calebmkim · 2023-03-09T14:04:39Z

I ran the same experiment, except (per Andrew's suggestion) this time I made the adders read their inputs from a memory instead of using constants (file here).

Results here

Takeaways

Register and Carry8 usage vary predictably
We're seeing a similar "cumulative effect", with register and adder sharing decreasing luts/clbs the most when both resources are shared
In fact, for luts/clbs at least, the clear best usage comes when we share both adders and registers >=128 times.
Like last time, the best area performers also had the best timing performance
Muxes were lowest when we didn't share anything. Interestingly, though, mux usage seemed to be better when we shared a lot compared to when we only shared a little (although both were still worse than no sharing).

Simulation

I randomly selected three settings to simulate, and they all performed as expected (results here).
Of course, I still need to install Verilator on Havarti, and then I'll be able to more easily simulate all possible designs.

calebmkim · 2023-03-11T03:16:05Z

The .futil file is here.
It defines a component my_register that does exactly what a register would do: it's defined as a component just so that I can tell the compiler not to share it.
Memories write into (non-shared) my_register instances, the my_register instances write into an adder, the adder writes into an actual register, and that register then writes to memory.

One note: when I originally ran this, it didn't meet timing. I had to introduce the new_fsm attribute, which instantiates a new FSM to reduce FSM complexity. I should probably open an issue about how fast FSM complexity can blow up.

Table is here.

There is a higher overall baseline for register usage (since now we have these non-shareable my_register instances).
LUT usage is interesting: there seems to be a benefit to sharing both registers and adders. However, there seem to be (at least to a certain extent) diminishing returns.
Mux usage interestingly decreases as we share more. One possible factor could be this: when the registers write to memory (for example), without sharing, there are 500 different registers trying to write to memory, whereas with sharing only 1 register is trying to write to memory. Seems like this could be helping mux usage?
The trend of "better resource usage correlates with better timing performance" doesn't really hold anymore, although they're not inversely correlated either.
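To put a rough number on the mux intuition above (500 registers writing to memory vs. 1), here is a back-of-the-envelope model. It assumes an N-to-1 selection is built purely as a tree of 2:1 muxes, which needs N - 1 of them per bit. Real synthesis tools map muxes onto LUTs and dedicated mux primitives, so this is only an illustration, not a prediction of the reported counts; `two_to_one_muxes` is a made-up name.

```python
def two_to_one_muxes(num_sources: int, width: int = 32) -> int:
    """2:1 muxes needed to pick one of num_sources width-bit values.

    Assumes a plain tree of 2:1 muxes: (num_sources - 1) per bit.
    """
    if num_sources < 1:
        raise ValueError("need at least one source")
    return (num_sources - 1) * width


print(two_to_one_muxes(500))  # 15968: 500 distinct registers feeding the memory
print(two_to_one_muxes(1))    # 0: a single shared register needs no selection
```

This only models the write path into memory. Sharing also adds selection logic in front of the shared cells, which this sketch ignores.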

Simulation

Randomly selected a few settings to do simulation on, and they all worked as expected.
