NNUE: Use ranges to avoid read/write to memory? #5547
-
These changes are certainly possible, but they will make effectively no speed difference, given that most of the time spent in eval is in fc_0. I do think it would be neater to implement fc_0 + sqr_clipped_relu + clipped_relu as something like an fac_0 rather than the form you have here, though. I also have a patch that saves a store before fc_0, which makes the speedup a decent size. However, I've been unable to debug it (https://github.com/cj5716/Stockfish/tree/ill-just-git-gud/). Perhaps looking into this part of the code would be better? P.S. Feel free to test these speedups on fishtest! I wish you luck with your first contribution!
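To make the fusion idea concrete, here is a minimal sketch of applying both activations to the fc_0 outputs in a single pass, so that no separate buffer sits between the two activation steps. It is not Stockfish's code: the width `kWidth`, the shift `kShift`, the function name `fused_activations`, and the output layout (squared values first, then plain clipped values) are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical layer width, chosen only for illustration.
constexpr std::size_t kWidth = 4;
// Hypothetical fixed-point shift; not taken from Stockfish.
constexpr int kShift = 6;

// One pass over the fc_0 outputs that writes both activations into a single
// output array: squared clipped ReLU in the first half, clipped ReLU in the
// second half.
std::array<std::uint8_t, 2 * kWidth>
fused_activations(const std::array<std::int32_t, kWidth>& fc0_out) {
    std::array<std::uint8_t, 2 * kWidth> out{};
    for (std::size_t i = 0; i < kWidth; ++i) {
        const std::int64_t x = fc0_out[i];
        // Squared clipped ReLU: square, rescale, cap at 127 (never negative).
        const std::int64_t sq =
            std::min<std::int64_t>(127, (x * x) >> (2 * kShift));
        // Clipped ReLU: rescale, clamp to [0, 127].
        const std::int64_t cr = std::clamp<std::int64_t>(x >> kShift, 0, 127);
        out[i]          = static_cast<std::uint8_t>(sq);
        out[kWidth + i] = static_cast<std::uint8_t>(cr);
    }
    return out;
}
```

Each input is loaded once and used for both activations, which is the kind of saved load/store being discussed. Whether that is measurable next to the cost of fc_0 itself is exactly the open question.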
-
The NNUE code is designed to be flexible to NN architecture changes (different L1/L2/L3 sizes, or more/fewer layers). All layer buffers after the FT are likely to sit in L1 cache, so memory reads and writes are already fast enough. Even if the change is slightly faster (at a ridiculously short TC), whether the maintainers would accept it is another question.
-
There are probably a few stores that could be omitted (I assume loads are already omitted where possible at -O3), but ultimately the vast majority of the cost is in the first two layers, where the outputs are too large to fit in registers and are needed as a whole before evaluation of the next layer can proceed, because of the nature of fully connected layers.
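The dependency described above can be seen in a minimal sketch of a fully connected layer. This is a generic scalar illustration, not Stockfish's implementation; the function name `affine` and the `std::array` layout are assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Generic fully connected (affine) layer: output[o] = bias[o] + sum_i w[o][i] * input[i].
// Every output reads every input, so the whole input vector must exist
// before any single output can be finished.
template <std::size_t In, std::size_t Out>
std::array<std::int32_t, Out> affine(
    const std::array<std::int32_t, In>& input,
    const std::array<std::array<std::int32_t, In>, Out>& weights,
    const std::array<std::int32_t, Out>& biases) {
    std::array<std::int32_t, Out> output = biases;
    for (std::size_t o = 0; o < Out; ++o)
        for (std::size_t i = 0; i < In; ++i)
            output[o] += weights[o][i] * input[i];
    return output;
}
```

When two such layers are chained, the second call cannot start its first output until the first call has produced all of its outputs. If that intermediate vector is too wide for the register file, it has to be stored somewhere, which is why the buffer between large layers is hard to eliminate.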
-
A very naive question again...
There is a buffer for every propagate, right?
And so there are lots of reads and writes to these buffers...
Can we avoid some of these buffers and reads/writes?
Let's focus on these buffers.
Could we try to work with ranges instead of these buffers?
Has anyone tried this?