Hello, author.

I really like this paper and have recently been trying to train it on larger-scale models. However, I found that training becomes very slow because the existing FlashAttention kernels are not compatible with selective attention. Could you offer some suggestions so that I can train at roughly 7B parameters with an 8K context length?
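For context, here is a minimal reference sketch of what I mean (my own simplification, not the paper's exact selection rule; the threshold criterion is just a placeholder). Because the selection mask is derived from the attention scores themselves, the full score matrix has to be materialized, which is exactly what FlashAttention avoids:

```python
import math
import torch
import torch.nn.functional as F

def selective_attention_reference(q, k, v, selection_threshold=0.0):
    """Naive O(n^2)-memory attention with a data-dependent selection mask.

    q, k, v: (batch, heads, seq_len, head_dim). The thresholding below is a
    stand-in for the paper's selection function, used only to illustrate
    the data dependence.
    """
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (B, H, N, N)

    # Standard causal mask: token i attends only to tokens j <= i.
    n = q.size(-2)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))

    # Data-dependent selection: drop keys whose score falls below a
    # threshold. This step reads the materialized `scores`, which
    # FlashAttention never exposes -- hence the incompatibility.
    keep = scores > selection_threshold
    keep |= torch.eye(n, dtype=torch.bool, device=q.device)  # keep self-attn
    scores = scores.masked_fill(~keep, float("-inf"))

    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```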
Best,
Mingyu
Hi, I work on something similar to selective attention. Methods in this family do not noticeably accelerate training when the sequence length is under 16K, mainly because FlashAttention's cost does not grow linearly in short contexts. You can follow our paper, ZipVL. I am currently developing the training version of ZipVL, and it is a challenge for me as well.
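One direction worth trying (an untested assumption on my part, not something from either paper): compute the selection mask in a first pass, then run the main attention through PyTorch's FlexAttention (available since PyTorch 2.5), which compiles a custom block mask into a fused kernel and skips fully masked blocks:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, N, D = 1, 8, 8192, 64
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Hypothetical first pass: however the selection rule is computed, assume it
# yields a boolean keep/drop tensor. A dense (B, H, N, N) mask is ~0.5 GB per
# sample at 8K, so a block-sparse or per-token form would be needed at scale.
keep = torch.ones(B, H, N, N, dtype=torch.bool, device="cuda")

def selective_mask(b, h, q_idx, kv_idx):
    # Causal AND selected; FlexAttention skips blocks that are fully masked.
    return (q_idx >= kv_idx) & keep[b, h, q_idx, kv_idx]

block_mask = create_block_mask(selective_mask, B=B, H=H, Q_LEN=N, KV_LEN=N)

# Compiling fuses the mask into the attention kernel; without torch.compile
# it falls back to a slower eager path.
flex_attention = torch.compile(flex_attention)
out = flex_attention(q, k, v, block_mask=block_mask)
```

Whether the two-pass scheme pays off depends on how cheap the first pass is and how sparse the resulting mask gets; it only avoids materializing scores in the second pass, not the first.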