
[tracker] Sharding huge models process and current status #16884

Closed · 19 of 20 tasks
stas00 opened this issue Apr 21, 2022 · 19 comments

@stas00 (Contributor) commented Apr 21, 2022

This is an issue to track which pre-existing huge models (>11GB) need sharding, which ones have already been done, and the code used to do it.

Why shard huge checkpoints?

Because it takes much less CPU memory to load a huge model of, say, 42GB, especially if you're loading it concurrently in multiple processes.

Here is the CPU-memory breakdown for HF Transformers from_pretrained model loading with DDP. Each example assumes a 30GB model and 8 DDP processes (a small calculator sketch follows the list):

  • non-sharded model: 2 * model size * number of processes. Example: 2*30*8=480GB
  • non-sharded model + low_cpu_mem_usage=True: model size * number of processes. Example: 30*8=240GB (but it's slower)
  • sharded model: (size_of_largest_shard + model size) * number of processes. Example: (10+30)*8=320GB
  • sharded model + deepspeed zero 3: size_of_largest_shard * number of processes. Example: 10*8=80GB
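For convenience, here is a tiny calculator that reproduces the estimates above (a sketch of the stated formulas only, not a measurement tool; all sizes in GB):

def cpu_mem_estimate_gb(model_size, n_procs, largest_shard=10, mode="non_sharded"):
    # formulas taken verbatim from the list above
    if mode == "non_sharded":
        return 2 * model_size * n_procs
    if mode == "non_sharded_low_cpu":          # low_cpu_mem_usage=True
        return model_size * n_procs
    if mode == "sharded":
        return (largest_shard + model_size) * n_procs
    if mode == "sharded_zero3":                # sharded + deepspeed zero 3
        return largest_shard * n_procs
    raise ValueError(f"unknown mode: {mode}")

# the 30GB model / 8 DDP processes examples: prints 480, 240, 320, 80
for m in ("non_sharded", "non_sharded_low_cpu", "sharded", "sharded_zero3"):
    print(m, cpu_mem_estimate_gb(30, 8, largest_shard=10, mode=m))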

Using sharded models

Here is an example of how to get the 42GB T0 model via its multi-part "sharded" branch (about 9GB per shard here):

  • Directly:
AutoModel.from_pretrained("bigscience/T0", revision="sharded")
  • Via HF Trainer example scripts:
examples/pytorch/translation/run_translation.py \
--model_name_or_path bigscience/T0 --model_revision sharded ...

Do note that I called these branches "sharded", but other users may call them anything they want, so check the model's available branches on the Hub. For example, here is the sharded branch of T0: https://huggingface.co/bigscience/T0/tree/sharded
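If you want to check programmatically which branches a repo has, something like this should work (assumes a reasonably recent huggingface_hub; the repo id is just an example):

python -c 'from huggingface_hub import HfApi; \
refs = HfApi().list_repo_refs("bigscience/T0"); \
print([b.name for b in refs.branches])'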

And you can further re-shard a checkpoint into even smaller shards, e.g. 5GB each:

python -c 'from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B"); \
model.save_pretrained("t0-sharded", max_shard_size="5GB")'

Infrastructure decisions

  • need to decide how to tell the user about all these different branches. I proposed an automatic extraction in the "Use in Transformers" pop-up, e.g. it could say:
Other available branches: sharded, bf16, fp16 

Sharding progress

  • bigscience/T0
  • bigscience/T0_single_prompt
  • bigscience/T0p
  • bigscience/T0pp
  • t5-11b
  • google/byt5-xxl
  • google/mt5-xxl
  • google/t5-v1_1-xxl
  • allenai/unifiedqa-t5-11b
  • allenai/unifiedqa-v2-t5-11b-1363200
  • allenai/unifiedqa-v2-t5-11b-1251000
  • allenai/macaw-answer-11b
  • allenai/macaw-11b
  • EleutherAI/gpt-j-6B
  • facebook/xglm-7.5B
  • facebook/incoder-6B
  • facebook/m2m100-12B-last-ckpt
  • facebook/m2m100-12B-avg-10-ckpt
  • facebook/m2m100-12B-avg-5-ckpt

XXX: fill in more?


Here is how each model was sharded: bigscience/T0 here, and the rest below (an alternative using push_to_hub is sketched right after the T0 commands).

git lfs install

git clone https://huggingface.co/bigscience/T0

python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained('./T0'); \
model.save_pretrained('T0-sharded')"

mv T0-sharded/pytorch_model* T0
mv T0-sharded/config.json T0

cd T0
huggingface-cli lfs-enable-largefiles .
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
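As an aside, recent versions of transformers can push a sharded checkpoint to a branch directly, which would replace most of the git steps above. This is a hedged sketch (whether push_to_hub accepts the revision and max_shard_size arguments depends on your installed version, and pushing requires write access to the repo):

python -c 'from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0"); \
model.push_to_hub("bigscience/T0", revision="sharded", max_shard_size="9GB")'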

Verified that it downloads the right revision and the example evaluates just fine using --model_name_or_path bigscience/T0 --model_revision sharded:

export BS=1; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 deepspeed \
--num_gpus=1 examples/pytorch/translation/run_translation.py \
--model_name_or_path bigscience/T0 --model_revision sharded --output_dir \
output_dir --adam_eps 1e-06 --evaluation_strategy=steps --do_eval \
--label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step \
--logging_steps 500 --max_source_length 128 --max_target_length 128 \
--overwrite_output_dir --per_device_eval_batch_size 1 --predict_with_generate \
--sortish_sampler --source_lang en --target_lang ro --dataset_name wmt16 \
--dataset_config ro-en --source_prefix 'translate English to Romanian: ' \
--val_max_target_length 128 --warmup_steps 50 --max_eval_samples 50 \
--deepspeed tests/deepspeed/ds_config_zero3.json --fp16 --skip_memory_metrics 0

The rest of the command line instructions follow:



git clone https://huggingface.co/t5-11b

python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained('./t5-11b'); \
model.save_pretrained('t5-11b-sharded')"

mv t5-11b-sharded/pytorch_model* t5-11b
mv t5-11b-sharded/config.json t5-11b

cd t5-11b
huggingface-cli lfs-enable-largefiles .
git checkout -b sharded
git rm pytorch_model.bin
git rm tf_model.h5
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -

------------------

git clone https://huggingface.co/bigscience/T0_single_prompt

python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained('./T0_single_prompt'); \
model.save_pretrained('T0_single_prompt-sharded')"

mv T0_single_prompt-sharded/pytorch_model* T0_single_prompt
mv T0_single_prompt-sharded/config.json T0_single_prompt

cd T0_single_prompt
huggingface-cli lfs-enable-largefiles .
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
------------------


git clone https://huggingface.co/bigscience/T0p

python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained('./T0p'); \
model.save_pretrained('T0p-sharded')"

mv T0p-sharded/pytorch_model* T0p
mv T0p-sharded/config.json T0p

cd T0p
huggingface-cli lfs-enable-largefiles .
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -


------------------



git clone https://huggingface.co/bigscience/T0pp

python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained('./T0pp'); \
model.save_pretrained('T0pp-sharded')"

mv T0pp-sharded/pytorch_model* T0pp
mv T0pp-sharded/config.json T0pp

cd T0pp
huggingface-cli lfs-enable-largefiles .
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -

git clone https://huggingface.co/allenai/unifiedqa-t5-11b
git clone https://huggingface.co/allenai/unifiedqa-v2-t5-11b-1363200
git clone https://huggingface.co/allenai/unifiedqa-v2-t5-11b-1251000
git clone https://huggingface.co/allenai/macaw-answer-11b
git clone https://huggingface.co/allenai/macaw-11b

git clone https://huggingface.co/facebook/xglm-7.5B
git clone https://huggingface.co/facebook/incoder-6B
git clone https://huggingface.co/facebook/m2m100-12B-last-ckpt
git clone https://huggingface.co/facebook/m2m100-12B-avg-10-ckpt
git clone https://huggingface.co/facebook/m2m100-12B-avg-5-ckpt

Autogenerate the code for the above models (a Python alternative using huggingface_hub is sketched after the generated blocks):

perl -le '$q=chr(39); print qq[
python -c "from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained($q./$_$q); \
model.save_pretrained($q$_-sharded$q)"

mv $_-sharded/pytorch_model* $_
mv $_-sharded/config.json $_

cd $_
huggingface-cli lfs-enable-largefiles .
git lfs untrack $q*.bin.*$q
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -

] for @ARGV' unifiedqa-t5-11b unifiedqa-v2-t5-11b-1363200 unifiedqa-v2-t5-11b-1251000 macaw-answer-11b macaw-11b xglm-7.5B incoder-6B m2m100-12B-last-ckpt m2m100-12B-avg-10-ckpt m2m100-12B-avg-5-ckpt


python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./unifiedqa-t5-11b');
model.save_pretrained('unifiedqa-t5-11b-sharded')"

mv unifiedqa-t5-11b-sharded/pytorch_model* unifiedqa-t5-11b
mv unifiedqa-t5-11b-sharded/config.json unifiedqa-t5-11b

cd unifiedqa-t5-11b
huggingface-cli lfs-enable-largefiles .
git lfs untrack '*.bin.*'
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -


python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./unifiedqa-v2-t5-11b-1363200');
model.save_pretrained('unifiedqa-v2-t5-11b-1363200-sharded')"

mv unifiedqa-v2-t5-11b-1363200-sharded/pytorch_model* unifiedqa-v2-t5-11b-1363200
mv unifiedqa-v2-t5-11b-1363200-sharded/config.json unifiedqa-v2-t5-11b-1363200

cd unifiedqa-v2-t5-11b-1363200
huggingface-cli lfs-enable-largefiles .
git lfs untrack '*.bin.*'
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -


python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./unifiedqa-v2-t5-11b-1251000');
model.save_pretrained('unifiedqa-v2-t5-11b-1251000-sharded')"

mv unifiedqa-v2-t5-11b-1251000-sharded/pytorch_model* unifiedqa-v2-t5-11b-1251000
mv unifiedqa-v2-t5-11b-1251000-sharded/config.json unifiedqa-v2-t5-11b-1251000

cd unifiedqa-v2-t5-11b-1251000
huggingface-cli lfs-enable-largefiles .
git lfs untrack '*.bin.*'
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -


python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./macaw-answer-11b');
model.save_pretrained('macaw-answer-11b-sharded')"

mv macaw-answer-11b-sharded/pytorch_model* macaw-answer-11b
mv macaw-answer-11b-sharded/config.json macaw-answer-11b

cd macaw-answer-11b
huggingface-cli lfs-enable-largefiles .
git lfs untrack '*.bin.*'
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -


python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./macaw-11b');
model.save_pretrained('macaw-11b-sharded')"

mv macaw-11b-sharded/pytorch_model* macaw-11b
mv macaw-11b-sharded/config.json macaw-11b

cd macaw-11b
huggingface-cli lfs-enable-largefiles .
git lfs untrack '*.bin.*'
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -


python -c "from transformers import AutoModelForCausalLM;
model = AutoModelForCausalLM.from_pretrained('./xglm-7.5B');
model.save_pretrained('xglm-7.5B-sharded')"

mv xglm-7.5B-sharded/pytorch_model* xglm-7.5B
mv xglm-7.5B-sharded/config.json xglm-7.5B

cd xglm-7.5B
huggingface-cli lfs-enable-largefiles .
git lfs untrack '*.bin.*'
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -


python -c "from transformers import AutoModelForCausalLM;
model = AutoModelForCausalLM.from_pretrained('./incoder-6B');
model.save_pretrained('incoder-6B-sharded')"

mv incoder-6B-sharded/pytorch_model* incoder-6B
mv incoder-6B-sharded/config.json incoder-6B

cd incoder-6B
huggingface-cli lfs-enable-largefiles .
git lfs untrack '*.bin.*'
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -


python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./m2m100-12B-last-ckpt');
model.save_pretrained('m2m100-12B-last-ckpt-sharded')"

mv m2m100-12B-last-ckpt-sharded/pytorch_model* m2m100-12B-last-ckpt
mv m2m100-12B-last-ckpt-sharded/config.json m2m100-12B-last-ckpt

cd m2m100-12B-last-ckpt
huggingface-cli lfs-enable-largefiles .
git lfs untrack '*.bin.*'
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -


python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./m2m100-12B-avg-10-ckpt');
model.save_pretrained('m2m100-12B-avg-10-ckpt-sharded')"

mv m2m100-12B-avg-10-ckpt-sharded/pytorch_model* m2m100-12B-avg-10-ckpt
mv m2m100-12B-avg-10-ckpt-sharded/config.json m2m100-12B-avg-10-ckpt

cd m2m100-12B-avg-10-ckpt
huggingface-cli lfs-enable-largefiles .
git lfs untrack '*.bin.*'
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -


python -c "from transformers import AutoModelForSeq2SeqLM;
model = AutoModelForSeq2SeqLM.from_pretrained('./m2m100-12B-avg-5-ckpt');
model.save_pretrained('m2m100-12B-avg-5-ckpt-sharded')"

mv m2m100-12B-avg-5-ckpt-sharded/pytorch_model* m2m100-12B-avg-5-ckpt
mv m2m100-12B-avg-5-ckpt-sharded/config.json m2m100-12B-avg-5-ckpt

cd m2m100-12B-avg-5-ckpt
huggingface-cli lfs-enable-largefiles .
git lfs untrack '*.bin.*'
git checkout -b sharded
git rm pytorch_model.bin
git add pytorch_model*
git commit -am "add sharded checkpoint"
git push --set-upstream origin sharded
cd -
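For reference, roughly the same automation can be sketched in Python with huggingface_hub instead of perl + git (create_branch and upload_folder are real huggingface_hub APIs, but this exact loop is an illustration and is not what was actually run; the causal-LM repos such as xglm-7.5B and incoder-6B would need AutoModelForCausalLM instead):

from huggingface_hub import HfApi
from transformers import AutoModelForSeq2SeqLM

api = HfApi()
for repo_id in ("allenai/unifiedqa-t5-11b", "allenai/macaw-11b"):  # etc.
    local_dir = repo_id.split("/")[-1] + "-sharded"
    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
    model.save_pretrained(local_dir)  # writes sharded pytorch_model-*.bin files plus the index
    api.create_branch(repo_id, branch="sharded", exist_ok=True)
    api.upload_folder(
        repo_id=repo_id,
        folder_path=local_dir,
        revision="sharded",
        commit_message="add sharded checkpoint",
    )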

@julien-c (Member)

Both of your commits (https://huggingface.co/bigscience/T0/commit/858cd92e88c9548d194f61259af965d1d1e916b7 and https://huggingface.co/t5-11b/commit/82929bfe90cbfc4e9a3dedf38bb967650ddb6ac2) look good to me, @stas00.

One thing I realize just now is that the pytorch_model.bin.index.json index files are LFS-tracked even though they're small JSON files, because there is the *.bin.* pattern in .gitattributes (in all repos).
That's no huge deal though, cc @sgugger @LysandreJik @Pierrci - the only drawback is that we won't get nice diffs on them.

@stas00 (Contributor, Author) commented Apr 22, 2022

OK, I will wait for a decision on *.bin.* as it is being discussed on slack and then adjust accordingly.

@julien-c (Member)

Why do you use the threshold size of >11GB?

I think we can do this only for >30GB models (30GB is the newly updated CloudFront file-size limit).

@stas00 (Contributor, Author) commented Apr 22, 2022

Because our default shard size is 10GB, and there are quite a few models that are 10.5GB, so no need to bother with those. That's why it's >11GB and not >10GB.

We need models sharded into smaller chunks not just because of CloudFront limitations, but primarily because loading these large models is very expensive CPU-memory-wise, especially in the DDP situation.

E.g., if you have 8 GPUs and an unsharded model is 30GB, you will need at least 480GB of CPU RAM to load it with the normal setup (2*30*8).

So here is the breakdown for HF Transformers from_pretrained model loading with DDP. The example in each case uses a model of 30GB and 8 DDP processes:

  • non-sharded model: 2 * model size * number of processes. Example: 2*30*8=480GB
  • non-sharded model + low_cpu_mem_usage=True: model size * number of processes. Example: 30*8=240GB (but it's slower)
  • sharded model: (size_of_largest_shard + model size) * number of processes. Example: (10+30)*8=320GB
  • sharded model + deepspeed zero 3: size_of_largest_shard * number of processes. Example: 10*8=80GB

Does my math make sense?

We already have open Issues where users have difficulties loading the models because they don't have an insane amount of CPU memory available.

Note that even on JeanZay the A100 80GB nodes have only 512GB of CPU memory, so it'd be impossible to load huge 60GB+ models on those nodes using HF Transformers without sharding, even though the GPUs are huge. There would not be enough CPU memory to do it.

@stas00 (Contributor, Author) commented Apr 22, 2022

So here is what I did to move the index files out of LFS:

git lfs untrack '*.bin.*'
git add --renormalize .
git commit -am 'Restore file contents that were previously in LFS'

courtesy of https://stackoverflow.com/a/54119191/9201239

stas00 changed the title from "[tracker] Sharding huge models" to "[tracker] Sharding huge models process and current status" on Apr 24, 2022
@stas00 (Contributor, Author) commented Apr 24, 2022

OK, all 19 models on the list have been sharded and pushed - if there are more let me know.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@edchengg commented Jun 9, 2022

Sorry to post it here.

How could I shard a checkpoint myself and load it?

Would this work?

python -c 'from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained("save_dir/mT5_finetuned/"); \
model.save_pretrained("save_dir/mT5_finetuned_sharded/", max_shard_size="9GB")'

model2 = AutoModelForSeq2SeqLM.from_pretrained("save_dir/mT5_finetuned_sharded/", revision="sharded")  # load directly from the sharded folder

# move all config files to the sharded folder as well

@stas00 (Contributor, Author) commented Jun 9, 2022

That looks mostly correct, @edchengg.

Just drop revision="sharded" - you'd only use that if you upload the saved sharded checkpoint to the Hub, into a branch called "sharded" (instead of "main").

For the local filesystem, there is no revision.

@alexcoca commented Mar 2, 2023

  • (size_of_largest_shard + model size) * number of processes

Hi @stas00, thanks so much for this amazing work. Apologies for the following naive question, I'm trying to learn :). You mention number_of_processes in your calculation, and I had the curiosity to briefly skim through the from_pretrained call (down to _load_pretrained_model), yet I could not see anything to indicate that the loading happens in multiple processes. So I presume number_of_processes here refers to multiple processes where a from_pretrained call may be made (such as when doing DDP)?

@stas00 (Contributor, Author) commented Mar 2, 2023

Yes, of course, @alexcoca. As you surmised:

When you do DDP you run n_procs == n_gpus, so each of these processes calls from_pretrained and thus each of them needs to load a copy of the model. Hence you need model_size_in_bytes * n_procs of CPU memory to load the model.

The only exception to this is Deepspeed ZeRO stage 3, which has a feature called zero.Init that immediately shards the model across GPUs and frees up the memory on CPU. So it uses much less CPU memory during the loading process.

Sometimes one has a lot of GPU memory but little CPU memory; in that case you could work around the issue by staggering the loading, so that, say, only one rank loads at a time, then moves the model onto the GPU and frees up the memory for the other ranks to load next (sketched below).
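A minimal sketch of that staggered-loading idea (not an official transformers API, just an illustration; assumes the process group was already initialized, e.g. by torchrun, with one GPU per rank):

import os
from typing import Optional
import torch
import torch.distributed as dist
from transformers import AutoModelForSeq2SeqLM

def staggered_load(model_name: str, revision: Optional[str] = None):
    rank = dist.get_rank()
    device = torch.device("cuda", int(os.environ.get("LOCAL_RANK", 0)))
    model = None
    for turn in range(dist.get_world_size()):
        if rank == turn:
            # only this rank holds a full CPU copy of the weights right now
            model = AutoModelForSeq2SeqLM.from_pretrained(model_name, revision=revision)
            model.to(device)  # the CPU copy is released once the params live on the GPU
        dist.barrier()  # everyone waits before the next rank starts loading
    return model

# e.g. model = staggered_load("bigscience/T0", revision="sharded")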

@alexcoca commented Mar 2, 2023

That's fantastic, thanks so much for your reply. I assume that this feature gets called whenever we use the Deepspeed ZeRO integration (stage 3) for training with the Trainer?

@stas00 (Contributor, Author) commented Mar 3, 2023

With HF Trainer it's automatic, but if you want to use it w/ your own Trainer, it's 2 lines of code:
https://huggingface.co/docs/transformers/main/main_classes/deepspeed#nontrainer-deepspeed-integration
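For completeness, a minimal sketch of what those two lines look like, based on the linked docs (keep the HfDeepSpeedConfig object alive before calling from_pretrained so zero.Init is used and the weights are sharded across GPUs at load time; the config path and model name are just examples):

from transformers import AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig  # transformers.integrations.deepspeed in newer versions

ds_config = "tests/deepspeed/ds_config_zero3.json"  # any ZeRO stage-3 config (dict or path)
dschf = HfDeepSpeedConfig(ds_config)                # must stay referenced while the model is loaded
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0", revision="sharded")
# ...then hand the model to deepspeed.initialize(...) as usual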

@i-am-neo

Hi @stas00, thank you for this work. Is there a sharded version of Flan-T5-XL? I see this one, but I'm unsure what the original source model was.

@stas00 (Contributor, Author) commented May 30, 2023

As you can see, https://huggingface.co/google/flan-t5-xl is already sharded: https://huggingface.co/google/flan-t5-xl/tree/main

All new models that are being added are automatically sharded (unless the user overrides save_pretrained's defaults).

@i-am-neo

Thank you @stas00. Interestingly, Flan-T5-XL as-is cannot be loaded on a free Colab T4 GPU (it runs out of RAM), while this sharded variant can.

Comparing https://huggingface.co/google/flan-t5-xl/blob/main/pytorch_model.bin.index.json and https://huggingface.co/ybelkada/flan-t5-xl-sharded-bf16/blob/main/pytorch_model.bin.index.json,

the former has

"total_size": 11925413888

while the latter has

"total_size":  5962706944

Does https://huggingface.co/google/flan-t5-xl/tree/main contain the correct "official" model weights from the original Flan-T5 checkpoints?

And could one shard it so that it's loadable on a Colab T4?

@stas00 (Contributor, Author) commented May 30, 2023

As you can probably see, one of them is saved in bf16 and the other in fp32, so the former is half the size of the latter.
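To illustrate (a hedged sketch, not a recommendation for any particular hardware): loading the fp32 checkpoint directly in 16-bit roughly halves the memory needed, which is essentially what the pre-converted bf16 repo gives you:

python -c 'import torch; from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl", torch_dtype=torch.bfloat16)'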

Does https://huggingface.co/google/flan-t5-xl/tree/main contain the correct "official" model weights from the original Flan-T5 checkpoints?

I'd say open a new issue to discuss the specifics of this model. This issue is not the place to discuss it I think.

@i-am-neo

Ok, thanks.
