Efficient training implementation #2

Open
xumingyu2021 opened this issue Dec 13, 2024 · 1 comment

Comments

@xumingyu2021

Hello, author.
I really like this paper and have been trying to train it on larger-scale models recently, but I found that training becomes very slow because the existing FlashAttention kernels are not compatible with selective attention. Could you provide some suggestions so that I can train at roughly 7B parameters and 8K context length?
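For reference, here is a minimal sketch of the kind of fallback path I mean (the function and tensor names are placeholders I made up, not code from this repo): because the selection mask is data-dependent and per-head, it cannot be handed to a fused FlashAttention kernel, so the full mask gets materialized through PyTorch's scaled_dot_product_attention instead, which is what becomes expensive at 8K context.

```python
import torch
import torch.nn.functional as F

def selective_attn_fallback(q, k, v, selection_scores, threshold=0.0):
    """Placeholder sketch of the slow path.

    q, k, v: (batch, heads, seq_len, head_dim)
    selection_scores: (batch, heads, seq_len, seq_len) data-dependent scores
    used to drop tokens, in the spirit of selective attention (the exact
    scoring rule in the paper differs).
    """
    seq_len = q.shape[-2]
    causal = torch.ones(seq_len, seq_len, device=q.device).tril().bool()
    # Keep a position only if it is causally visible and not pruned by selection.
    keep = causal & (selection_scores <= threshold)
    # An arbitrary boolean mask rules out PyTorch's flash SDP backend (it only
    # supports is_causal), so this runs on a slower kernel and materializes the
    # full seq_len x seq_len mask -- the cost that blows up at 8K context.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=keep)
```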
Best,
Mingyu

@Chenfeng1271

Hi, I work on something similar to selective attention. Methods in this family do not noticeably accelerate training when the sequence length is below roughly 16K tokens, mainly because FlashAttention's cost at short contexts is not yet dominated by the quadratic term, so there is little for token selection to remove. You can refer to our paper, ZipVL. I am currently developing the training version of ZipVL, and it is a challenge for me as well.
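As a rough way to check this on your own hardware (an illustrative sketch, not code from ZipVL; the head count, head dimension, batch size, and bf16 dtype are placeholder assumptions for a ~7B-style model on a CUDA GPU), you can time PyTorch's FlashAttention SDP backend across context lengths and see how much headroom token selection actually has at 8K versus 16K and beyond:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def time_flash_attention(seq_len, batch=1, heads=32, head_dim=128, iters=20):
    """Average per-call latency (ms) of causal FlashAttention at a given context length."""
    q, k, v = (torch.randn(batch, heads, seq_len, head_dim,
                           device="cuda", dtype=torch.bfloat16) for _ in range(3))
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        for _ in range(3):                                   # warm-up
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        torch.cuda.synchronize()
        start, end = (torch.cuda.Event(enable_timing=True) for _ in range(2))
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

for n in (2048, 4096, 8192, 16384):
    print(f"{n:>6} tokens: {time_flash_attention(n):.2f} ms")
```

If the attention kernel is only a small fraction of the full step time at 8K, a selective or sparse variant cannot buy much end-to-end training speedup at that length.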

[image attachment]
