Hello, author.

I really like this paper and have recently been trying to train it on larger-scale models. However, I found that training becomes very slow because the existing FlashAttention kernels are not compatible with selective attention. Could you offer some suggestions so that I can train at roughly 7B parameters with an 8K context length?
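For context, here is a minimal reference sketch of what I mean (my own simplification, not the paper's exact selection rule; the threshold criterion is just a placeholder). Because the selection mask is derived from the attention scores themselves, the full score matrix has to be materialized, which is exactly what FlashAttention avoids:

```python
import math
import torch
import torch.nn.functional as F

def selective_attention_reference(q, k, v, selection_threshold=0.0):
    """Naive O(n^2)-memory attention with a data-dependent selection mask.

    q, k, v: (batch, heads, seq_len, head_dim). The thresholding below is a
    stand-in for the paper's selection function, used only to illustrate
    the data dependence.
    """
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (B, H, N, N)

    # Standard causal mask: token i attends only to tokens j <= i.
    n = q.size(-2)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))

    # Data-dependent selection: drop keys whose score falls below a
    # threshold. This step reads the materialized `scores`, which
    # FlashAttention never exposes -- hence the incompatibility.
    keep = scores > selection_threshold
    keep |= torch.eye(n, dtype=torch.bool, device=q.device)  # keep self-attn
    scores = scores.masked_fill(~keep, float("-inf"))

    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```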
Best,
Mingyu
Hi, I work on something similar to selective attention. Methods in this family do not noticeably accelerate training when the sequence length is under 16K, mainly because FlashAttention's cost does not grow linearly in short contexts. You can follow our paper, ZipVL. I am currently developing the training version of ZipVL, and it is a challenge for me as well.
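One direction worth trying (an untested assumption on my part, not something from either paper): compute the selection mask in a first pass, then run the main attention through PyTorch's FlexAttention (available since PyTorch 2.5), which compiles a custom block mask into a fused kernel and skips fully masked blocks:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, N, D = 1, 8, 8192, 64
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Hypothetical first pass: however the selection rule is computed, assume it
# yields a boolean keep/drop tensor. A dense (B, H, N, N) mask is ~0.5 GB per
# sample at 8K, so a block-sparse or per-token form would be needed at scale.
keep = torch.ones(B, H, N, N, dtype=torch.bool, device="cuda")

def selective_mask(b, h, q_idx, kv_idx):
    # Causal AND selected; FlexAttention skips blocks that are fully masked.
    return (q_idx >= kv_idx) & keep[b, h, q_idx, kv_idx]

block_mask = create_block_mask(selective_mask, B=B, H=H, Q_LEN=N, KV_LEN=N)

# Compiling fuses the mask into the attention kernel; without torch.compile
# it falls back to a slower eager path.
flex_attention = torch.compile(flex_attention)
out = flex_attention(q, k, v, block_mask=block_mask)
```

Whether the two-pass scheme pays off depends on how cheap the first pass is and how sparse the resulting mask gets; it only avoids materializing scores in the second pass, not the first.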