With reference to https://www.corsix.org/content/fast-crc32c-4k, what I call `crc32_4k` is your option 12 ("8-byte Hardware-accelerated"), and what I call `crc32_4k_three_way` is your option 13 ("Golden"). The theoretical upper bound on option 13 is 64 bits/cycle, which your implementation gets close to, at 62 bits/cycle. What I realised is that:

- There's an inferior option, that I call `crc32_4k_pclmulqdq`, but you might call "Silver".
- Gold and silver use separate execution ports, and thus can be alloyed together, for a theoretical upper bound of 120.89 bits/cycle (this is 64+72 bytes every 9 cycles). I'm measuring 93 bits/cycle for this alloy, and I imagine that a well-tuned implementation could get closer to 120.89.
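For anyone checking the numbers, here is a minimal sketch of the arithmetic behind the 120.89 figure, written in Python purely as a calculator. The function and constant names are made up for illustration. Reading the 72 bytes as the "Golden" share (64 bits/cycle over 9 cycles) and the 64 bytes as the "Silver" share is an inference from the quoted rates, not something stated above.

```python
def bits_per_cycle(bytes_processed: int, cycles: int) -> float:
    """Throughput in bits/cycle for `bytes_processed` bytes over `cycles` cycles."""
    return bytes_processed * 8 / cycles


# Figures quoted in the issue: 64 + 72 bytes every 9 cycles.
CYCLES = 9
GOLD_BYTES = 72    # assumption: 64 bits/cycle * 9 cycles / 8 = 72 bytes
SILVER_BYTES = 64  # assumption: the remaining term is the pclmulqdq path

gold_bound = bits_per_cycle(GOLD_BYTES, CYCLES)
silver_bound = bits_per_cycle(SILVER_BYTES, CYCLES)
alloy_bound = bits_per_cycle(GOLD_BYTES + SILVER_BYTES, CYCLES)

# Measured rates quoted in the issue, in bits/cycle.
MEASURED_GOLD = 62.0
MEASURED_ALLOY = 93.0

print(f"gold bound:   {gold_bound:.2f} bits/cycle")
print(f"silver bound: {silver_bound:.2f} bits/cycle")
print(f"alloy bound:  {alloy_bound:.2f} bits/cycle")
print(f"gold reaches  {MEASURED_GOLD / gold_bound:.1%} of its bound")
print(f"alloy reaches {MEASURED_ALLOY / alloy_bound:.1%} of its bound")
```

Under that reading, the alloy bound is simply the sum of the two per-path bounds, which is why it only holds if the two paths really do not compete for the same ports.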
That's awesome. Alloyed, ha!
Given that there are bottlenecks other than execution ports, such as decode or the total number of uops scheduled/retired, I'm surprised it's possible to do anything with the remaining bandwidth in the processor. But I'll have to check this out!