This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
In your research, it is hard to train a model with long sequences (e.g., 768 tokens) on a GPU.
However, I can't find any special technique for reducing GPU memory in your code.
I would like to know how you train a Vision Transformer on such long token sequences.
Hi @chagmgang, in our experiments we used Google Cloud TPUs for pretraining. For GPUs, I think one can use gradient accumulation (with `--accum_iter`) and activation checkpointing to reduce memory usage.
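For reference, here is a minimal sketch (not the authors' code) of the two techniques mentioned above in plain PyTorch: the tiny encoder, the random data, and `accum_iter = 4` are placeholder assumptions used only for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for a transformer encoder: a stack of blocks we can checkpoint.
blocks = nn.Sequential(*[nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
                         for _ in range(12)]).to(device)
head = nn.Linear(768, 10).to(device)
optimizer = torch.optim.AdamW(list(blocks.parameters()) + list(head.parameters()), lr=1e-4)
criterion = nn.CrossEntropyLoss()

accum_iter = 4            # mirrors the --accum_iter flag: gradients accumulate over 4 micro-batches
micro_batch, seq_len = 2, 768

optimizer.zero_grad()
for step in range(8):     # dummy training loop with random data
    x = torch.randn(micro_batch, seq_len, 768, device=device)
    target = torch.randint(0, 10, (micro_batch,), device=device)

    # Activation checkpointing: split the 12 blocks into 4 segments and
    # recompute their activations during backward instead of storing them all.
    feats = checkpoint_sequential(blocks, 4, x, use_reentrant=False)
    loss = criterion(head(feats.mean(dim=1)), target)

    # Gradient accumulation: scale the loss and step only every accum_iter iterations,
    # so the effective batch size is micro_batch * accum_iter.
    (loss / accum_iter).backward()
    if (step + 1) % accum_iter == 0:
        optimizer.step()
        optimizer.zero_grad()
```

The trade-off is compute for memory: checkpointing roughly adds one extra forward pass per backward, and accumulation keeps the per-step memory of a small micro-batch while matching the optimizer behavior of a larger batch.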