Currently, the sparse affine transform is used in the dense linear layers (L1, L2), with the NNZ masks computed right before.
However, I think it might be possible to do this NNZ computation earlier, right before the clipped ReLU and the element-wise multiply (which are done in the same loop in our code). That would let us skip both the clipping and the multiplication instructions for the lanes that end up zero. In addition, for our masks we can compare against `_vec_set1_epi32(1)` rather than 0, because we clamp the values to [0, 127] and shift right by 7 (divide by 128) after the multiplication, so a lane holding 1 still contributes nothing: (1 * 127) >> 7 == 0.
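For concreteness, a minimal sketch of the mask idea, assuming AVX2 and a 32-bit-lane mask granularity (the helper name and layout are hypothetical, not the engine's actual code): lanes that are 1 or less are treated as zero, since after clamping to [0, 127], multiplying by at most 127, and shifting right by 7 they contribute nothing.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical sketch: NNZ bitmask over eight 32-bit lanes, comparing against
// 1 instead of 0. A lane holding 1 is safe to drop because (1 * 127) >> 7 == 0.
inline std::uint32_t nnz_mask_gt1(__m256i v)
{
    const __m256i one = _mm256_set1_epi32(1);
    // lanes strictly greater than 1 -> all-ones; everything else (<= 1,
    // including negatives that would clamp to 0) -> all-zeros
    const __m256i gt1 = _mm256_cmpgt_epi32(v, one);
    // one bit per 32-bit lane, taken from each lane's sign bit
    return static_cast<std::uint32_t>(
        _mm256_movemask_ps(_mm256_castsi256_ps(gt1)));
}
```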
However, this would require double the mask computations, and I'm not sure it would result in a speedup. In addition, since I'm not familiar with the implementation details of our sparse affine transform, I don't know how this would fit in, or whether it is possible at all.
Edit: the number of masks we will need is not 2x but actually 4x more, because our accumulators are int16s and we shrink them to int8s during this clipping/clamping step.
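As a rough illustration of where the extra masks come from, again assuming AVX2 (the helper and its two-register layout are my own sketch, not the engine's code): a 256-bit register holds only 16 int16 accumulator lanes versus 32 int8 lanes after the shrink, so spanning the same stretch of outputs takes twice as many compares, and repeating that for the second operand of the element-wise multiply would double it again.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical sketch: byte-granularity mask from int16 accumulators. Two
// int16 registers are needed to cover the 32 values that a single int8
// register would hold after the shrink.
inline std::uint32_t nnz_mask_i16_gt1(__m256i acc_lo, __m256i acc_hi)
{
    const __m256i one = _mm256_set1_epi16(1);
    // 16-bit lanes strictly greater than 1 -> all-ones, else all-zeros
    const __m256i lo = _mm256_cmpgt_epi16(acc_lo, one);
    const __m256i hi = _mm256_cmpgt_epi16(acc_hi, one);
    // pack the compare results to bytes and take one bit per byte; note that
    // packs interleaves 128-bit halves, so the bit order is permuted relative
    // to memory order and would need fixing up in real use
    const __m256i packed = _mm256_packs_epi16(lo, hi);
    return static_cast<std::uint32_t>(_mm256_movemask_epi8(packed));
}
```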