Currently, the sparse affine transform is used in the dense linear layers (L1, L2), with the NNZ masks computed right before.
However, I think it might be possible to do this NNZ computation earlier, right before the clipped ReLU and the element-wise multiply (which are done in the same loop in our code). That would let us skip both the clipping and the multiplication instructions for the lanes that end up zero. In addition, for our masks we can compare against `_vec_set1_epi32(1)` rather than 0, because we clamp the values to [0, 127] and shift right by 7 (divide by 128) after the multiplication, so a lane holding 1 still contributes nothing: (1 * 127) >> 7 == 0.
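For concreteness, a minimal sketch of the mask idea, assuming AVX2 and a 32-bit-lane mask granularity (the helper name and layout are hypothetical, not the engine's actual code): lanes that are 1 or less are treated as zero, since after clamping to [0, 127], multiplying by at most 127, and shifting right by 7 they contribute nothing.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical sketch: NNZ bitmask over eight 32-bit lanes, comparing against
// 1 instead of 0. A lane holding 1 is safe to drop because (1 * 127) >> 7 == 0.
inline std::uint32_t nnz_mask_gt1(__m256i v)
{
    const __m256i one = _mm256_set1_epi32(1);
    // lanes strictly greater than 1 -> all-ones; everything else (<= 1,
    // including negatives that would clamp to 0) -> all-zeros
    const __m256i gt1 = _mm256_cmpgt_epi32(v, one);
    // one bit per 32-bit lane, taken from each lane's sign bit
    return static_cast<std::uint32_t>(
        _mm256_movemask_ps(_mm256_castsi256_ps(gt1)));
}
```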
However, this would require double the mask computations, and I'm not sure it would result in a speedup. In addition, since I'm not familiar with the implementation details of our sparse affine transform, I don't know how this would fit in, or whether it is possible at all.
Edit: the number of masks we will need is not 2x but actually 4x more, because our accumulators are int16s and we shrink them to int8s during this clipping/clamping step.
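As a rough illustration of where the extra masks come from, again assuming AVX2 (the helper and its two-register layout are my own sketch, not the engine's code): a 256-bit register holds only 16 int16 accumulator lanes versus 32 int8 lanes after the shrink, so spanning the same stretch of outputs takes twice as many compares, and repeating that for the second operand of the element-wise multiply would double it again.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical sketch: byte-granularity mask from int16 accumulators. Two
// int16 registers are needed to cover the 32 values that a single int8
// register would hold after the shrink.
inline std::uint32_t nnz_mask_i16_gt1(__m256i acc_lo, __m256i acc_hi)
{
    const __m256i one = _mm256_set1_epi16(1);
    // 16-bit lanes strictly greater than 1 -> all-ones, else all-zeros
    const __m256i lo = _mm256_cmpgt_epi16(acc_lo, one);
    const __m256i hi = _mm256_cmpgt_epi16(acc_hi, one);
    // pack the compare results to bytes and take one bit per byte; note that
    // packs interleaves 128-bit halves, so the bit order is permuted relative
    // to memory order and would need fixing up in real use
    const __m256i packed = _mm256_packs_epi16(lo, hi);
    return static_cast<std::uint32_t>(_mm256_movemask_epi8(packed));
}
```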