[tracker] Sharding huge models process and current status #16884
Comments
Both of your commits (https://huggingface.co/bigscience/T0/commit/858cd92e88c9548d194f61259af965d1d1e916b7 and https://huggingface.co/t5-11b/commit/82929bfe90cbfc4e9a3dedf38bb967650ddb6ac2) look good to me, @stas00. One thing I realize just now is that the |
OK, I will wait for a decision on |
Why do you use the threshold size of >11GB? I think we can do this only for >30GB models (30GB is the newly updated Cloudfront file size limit) |
Because our default shard size is 10GB, and there are quite a few models that are 10.5GB, so no need to bother with those. That's why it's >11GB and not >10GB. We need models to be sharded into smaller chunks not just due to Cloudfront limitations, but primarily because loading these large models is very expensive CPU-memory-wise, especially in the DDP situation. E.g. if you have 8 gpus and an unsharded model is 30GB, you will need at least 480GB of CPU RAM to load it with the normal setup. So here is the breakdown for HF Transformers `from_pretrained` loading:
Does my math make sense? We already have open Issues where users have difficulties loading the models because they don't have an insane amount of CPU memory available. Note that even on JeanZay the A100 80GB nodes have only 512GB, so it'd be impossible to load huge 60GB+ models on those nodes using HF Transformers w/o sharding, even though the GPUs are huge. There won't be enough CPU memory to do that. |
so here is what I did to move index files out of LFS:
courtesy of https://stackoverflow.com/a/54119191/9201239 |
OK, all 19 models on the list have been sharded and pushed - if there are more let me know. |
Examples of commits restoring index files to non-LFS:
Example of commit on Eleuther-owned model: |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Sorry to post it here... How could I shard a checkpoint myself and load it? Would this work?
|
That looks mostly correct, @edchengg. Just drop the `revision` argument when loading from the local filesystem; there is no revision there. |
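For anyone with the same question: the code in the question above isn't preserved in this copy of the thread, so here is a minimal sketch of the flow being discussed. The model name, output directory and shard size are placeholders, not values from this thread:

```python
from transformers import AutoModelForSeq2SeqLM

# load the original, unsharded checkpoint; this step still needs enough CPU RAM for the full model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-11b")

# write it back out as shards; max_shard_size controls the maximum size of each shard file
model.save_pretrained("t5-11b-sharded", max_shard_size="5GB")

# reload from the local sharded directory; no revision argument here,
# since revisions/branches only exist for repos on the hub
model = AutoModelForSeq2SeqLM.from_pretrained("t5-11b-sharded")
```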
Hi @stas00, thanks so much for this amazing work. Apologies for the following naive question; I'm trying to learn :). You mention (transformers/src/transformers/modeling_utils.py, line 2544 at ae54e3c):
Does `number_of_processes` here refer to multiple processes where a `from_pretrained` call may be made (such as when doing DDP)?
|
Yes, of course, @alexcoca. As you proposed: when you do DDP you run one process per GPU, and each of those processes calls `from_pretrained`. The only exception to this is Deepspeed ZeRO stage 3, which has a feature for loading the model sharded directly onto the GPUs, so no process ever needs to hold the full copy in CPU memory. Sometimes one has a lot of gpu memory but little cpu memory; in that case you could work around the issue by staggering the loading, so that, say, only one rank loads at a time, then moves the model onto the gpu and frees up the memory for the other ranks to load next. |
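A rough sketch of that staggered-loading workaround (not code from this thread; it assumes one process per GPU launched with `torchrun`, so `LOCAL_RANK` is set, and uses T0 only as an example model):

```python
import os

import torch.distributed as dist
from transformers import AutoModelForSeq2SeqLM

dist.init_process_group("nccl")
rank = dist.get_rank()
local_rank = int(os.environ["LOCAL_RANK"])

model = None
for loading_rank in range(dist.get_world_size()):
    if rank == loading_rank:
        # only this rank holds a CPU copy of the weights right now
        model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0")
        # moving to the GPU lets the CPU copy be freed before the next rank starts loading
        model.to(f"cuda:{local_rank}")
    dist.barrier()  # the other ranks wait for their turn
```

In practice one would probably stagger by local rank within each node rather than globally, since CPU RAM is a per-node resource, but the idea is the same.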
That's fantastic, thanks so much for your reply. I assume that this feature gets called whenever we use the Deepspeed ZeRO integration (stage 3) for training with the HF `Trainer`? |
With HF Trainer it's automatic, but if you want to use it w/ your own Trainer, it's 2 lines of code: |
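The linked snippet isn't preserved in this copy of the thread. Going by the DeepSpeed non-Trainer integration docs of that era, the two lines are presumably along these lines (a sketch, assuming you already have a ZeRO stage-3 `ds_config` dict and deepspeed installed):

```python
from transformers import AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = {
    "zero_optimization": {"stage": 3},
    # ... the rest of your usual DeepSpeed config
}

# must be instantiated *before* from_pretrained and kept alive,
# so that the weights get partitioned across the GPUs at load time
dschf = HfDeepSpeedConfig(ds_config)
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0")
```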
Hi @stas00, thank you for this work. Is there a sharded version of Flan-T5-XL? I see this one, but I'm unsure what the original source model was. |
As you can see, https://huggingface.co/google/flan-t5-xl is already sharded: https://huggingface.co/google/flan-t5-xl/tree/main. All new models that are being added are automatically sharded (unless the user overrides the default shard size). |
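In other words, a plain `save_pretrained`/`push_to_hub` already shards anything above the default 10GB limit, and `max_shard_size` is the knob to change it. A sketch, with a hypothetical target repo name:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# sharding happens automatically above the default max_shard_size of 10GB;
# pass a different max_shard_size to override it
model.push_to_hub("my-user/flan-t5-xl-resharded", max_shard_size="5GB")
```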
Thank you @stas00. Interestingly, Flan-T5-XL, as is, cannot be loaded on a free Colab T4 GPU (it runs out of RAM), while this sharded variant can. Comparing https://huggingface.co/google/flan-t5-xl/blob/main/pytorch_model.bin.index.json and https://huggingface.co/ybelkada/flan-t5-xl-sharded-bf16/blob/main/pytorch_model.bin.index.json, the former has
while the latter has
Does https://huggingface.co/google/flan-t5-xl/tree/main contain the correct "official" model weights from the original Flan-T5 checkpoints? And could one shard it so that it's loadable on a Colab T4? |
As you can probably see, one of them is saved in bf16 and the other in fp32, so the former is half the size of the latter.
I'd say open a new issue to discuss the specifics of this model. This issue is not the place to discuss it I think. |
Ok, thanks. |
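(As a side note on the bf16-vs-fp32 point above: the load-time dtype can also be chosen explicitly via `torch_dtype`. A rough sketch, not advice specific to this model:)

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# load the fp32 checkpoint but keep the weights in bf16,
# roughly halving the memory footprint of the loaded model
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
```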
This is an Issue to track which pre-existing huge models (>11GB) need sharding, which have been completed, and the code to do that.
Why shard huge checkpoints?
Because it takes much less CPU memory to load a huge model of say 42GB, especially if you're loading concurrently in multiple processes.
Here is the breakdown for HF Transformers `from_pretrained` model loading with DDP. The example in each case uses a model of 30GB and 8 DDP processes:

1. unsharded checkpoint: `2 * model size * number of processes`. Example: `2*30*8=480GB`
2. unsharded checkpoint with `low_cpu_mem_usage=True`: `model size * number of processes`. Example: `30*8=240GB` (but it's slower)
3. sharded checkpoint: `(size_of_largest_shard + model size) * number of processes`. Example: `(10+30)*8=320GB`
4. sharded checkpoint with `low_cpu_mem_usage=True`: `size_of_largest_shard * number of processes`. Example: `10*8=80GB`
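A minimal sketch of the cheapest combination above, i.e. a sharded checkpoint loaded with `low_cpu_mem_usage=True` (the checkpoint path is a placeholder for any sharded checkpoint, local or on the hub):

```python
from transformers import AutoModelForSeq2SeqLM

# with a sharded checkpoint, low_cpu_mem_usage=True keeps the peak CPU memory per process
# close to the size of the largest shard instead of 2x the full model size
model = AutoModelForSeq2SeqLM.from_pretrained(
    "path-to-sharded-checkpoint", low_cpu_mem_usage=True
)
```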
Using sharded models
Here is an example of how to get the 42GB T0 model via multi-part sharded branch (about 9GB per shard here):
Do note that I called these branches "sharded", but other users may call them anything they want, so check the model's available branches on the hub. E.g. here is the sharded branch of T0: https://huggingface.co/bigscience/T0/tree/sharded
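The original snippet isn't preserved in this copy of the thread, but the gist is to pass the branch name as `revision`; a sketch:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# pull the ~9GB shards from the "sharded" branch instead of the single 42GB file on main
tokenizer = AutoTokenizer.from_pretrained("bigscience/T0", revision="sharded")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0", revision="sharded")
```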
And you can further re-shard them into even smaller shards, e.g. 5GB shards:
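The original command is missing from this copy as well; presumably it boiled down to `save_pretrained` with a smaller `max_shard_size`, roughly (the output directory name is a placeholder):

```python
from transformers import AutoModelForSeq2SeqLM

# start from the ~9GB shards on the "sharded" branch...
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0", revision="sharded")

# ...and save them back out locally as ~5GB shards
model.save_pretrained("T0-5GB-shards", max_shard_size="5GB")
```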
Infrastructure decisions
Sharding progress
XXX: fill in more?
Here is how each was sharded:
bigscience/T0
The commands used for it are here, and the rest are below. Verified that it downloaded the right version and the example evals just fine, using:
--model_name_or_path bigscience/T0 --model_revision sharded
The rest of the command line instructions follow:
Click to expand!
git clone https://huggingface.co/allenai/unifiedqa-t5-11b
git clone https://huggingface.co/allenai/unifiedqa-v2-t5-11b-1363200
git clone https://huggingface.co/allenai/unifiedqa-v2-t5-11b-1251000
git clone https://huggingface.co/allenai/macaw-answer-11b
git clone https://huggingface.co/allenai/macaw-11b
git clone https://huggingface.co/facebook/xglm-7.5B
git clone https://huggingface.co/facebook/incoder-6B
git clone https://huggingface.co/facebook/m2m100-12B-last-ckpt
git clone https://huggingface.co/facebook/m2m100-12B-avg-10-ckpt
git clone https://huggingface.co/facebook/m2m100-12B-avg-5-ckpt
Autogenerate the code for the above models:
perl -le '$q=chr(39); print qq[
python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained($q./$_$q); \
model.save_pretrained($q$_-sharded$q)"
mv $_-sharded/pytorch_model* $_
mv $_-sharded/config.json $_
cd $_
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
] for @ARGV' unifiedqa-t5-11b unifiedqa-v2-t5-11b-1363200 unifiedqa-v2-t5-11b-1251000 macaw-answer-11b macaw-11b xglm-7.5B incoder-6B m2m100-12B-last-ckpt m2m100-12B-avg-10-ckpt m2m100-12B-avg-5-ckpt
python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./unifiedqa-t5-11b');
model.save_pretrained('unifiedqa-t5-11b-sharded')"
mv unifiedqa-t5-11b-sharded/pytorch_model* unifiedqa-t5-11b
mv unifiedqa-t5-11b-sharded/config.json unifiedqa-t5-11b
cd unifiedqa-t5-11b
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./unifiedqa-v2-t5-11b-1363200');
model.save_pretrained('unifiedqa-v2-t5-11b-1363200-sharded')"
mv unifiedqa-v2-t5-11b-1363200-sharded/pytorch_model* unifiedqa-v2-t5-11b-1363200
mv unifiedqa-v2-t5-11b-1363200-sharded/config.json unifiedqa-v2-t5-11b-1363200
cd unifiedqa-v2-t5-11b-1363200
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./unifiedqa-v2-t5-11b-1251000');
model.save_pretrained('unifiedqa-v2-t5-11b-1251000-sharded')"
mv unifiedqa-v2-t5-11b-1251000-sharded/pytorch_model* unifiedqa-v2-t5-11b-1251000
mv unifiedqa-v2-t5-11b-1251000-sharded/config.json unifiedqa-v2-t5-11b-1251000
cd unifiedqa-v2-t5-11b-1251000
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./macaw-answer-11b');
model.save_pretrained('macaw-answer-11b-sharded')"
mv macaw-answer-11b-sharded/pytorch_model* macaw-answer-11b
mv macaw-answer-11b-sharded/config.json macaw-answer-11b
cd macaw-answer-11b
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./macaw-11b');
model.save_pretrained('macaw-11b-sharded')"
mv macaw-11b-sharded/pytorch_model* macaw-11b
mv macaw-11b-sharded/config.json macaw-11b
cd macaw-11b
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
python -c "from transformers import AutoModelForCausalLM;
model = AutoModelForCausalLM.from_pretrained('./xglm-7.5B');
model.save_pretrained('xglm-7.5B-sharded')"
mv xglm-7.5B-sharded/pytorch_model* xglm-7.5B
mv xglm-7.5B-sharded/config.json xglm-7.5B
cd xglm-7.5B
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
python -c "from transformers import AutoModelForCausalLM;
model = AutoModelForCausalLM.from_pretrained('./incoder-6B');
model.save_pretrained('incoder-6B-sharded')"
mv incoder-6B-sharded/pytorch_model* incoder-6B
mv incoder-6B-sharded/config.json incoder-6B
cd incoder-6B
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./m2m100-12B-last-ckpt');
model.save_pretrained('m2m100-12B-last-ckpt-sharded')"
mv m2m100-12B-last-ckpt-sharded/pytorch_model* m2m100-12B-last-ckpt
mv m2m100-12B-last-ckpt-sharded/config.json m2m100-12B-last-ckpt
cd m2m100-12B-last-ckpt
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./m2m100-12B-avg-10-ckpt');
model.save_pretrained('m2m100-12B-avg-10-ckpt-sharded')"
mv m2m100-12B-avg-10-ckpt-sharded/pytorch_model* m2m100-12B-avg-10-ckpt
mv m2m100-12B-avg-10-ckpt-sharded/config.json m2m100-12B-avg-10-ckpt
cd m2m100-12B-avg-10-ckpt
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./m2m100-12B-avg-5-ckpt');
model.save_pretrained('m2m100-12B-avg-5-ckpt-sharded')"
mv m2m100-12B-avg-5-ckpt-sharded/pytorch_model* m2m100-12B-avg-5-ckpt
mv m2m100-12B-avg-5-ckpt-sharded/config.json m2m100-12B-avg-5-ckpt
cd m2m100-12B-avg-5-ckpt
huggingface-cli lfs-enable-largefiles .
git lfs untrack "*.bin.*"
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -