
[DeepSpeed] [success] trained t5-11b on 1x 40GB gpu #9996

Closed
stas00 opened this issue Feb 4, 2021 · 69 comments

stas00 commented Feb 4, 2021

Managed to train t5-11b on 1x 40GB gpu w/ Deepspeed (A100-SXM4-40GB)

Thank you, @PeterAJansen for letting me use your hardware!

Thank you, @jeffra and @samyam, for believing that it is indeed possible to train t5-11b on 1x 40GB gpu w/ Deepspeed, and for the support that led me to find a few bugs in the integration.

Sharing the details for those who need them.

If you want to try this at home, please make sure you use transformers master, as some bug fixes were just merged in.

Well, it's similar to the t5-3b on 24GB success reported here and here.
But this time it's t5-11b on 1x 40GB gpu (or 4x if you want things faster).

As someone asked me before: you need a huge amount of general RAM to use ZeRO-Offload for a huge model:

  • for t5-3b on 1x 24GB gpu: ~71GB RAM
  • for t5-11b on 1x 40GB gpu: ~234GB RAM

I was using the /usr/bin/time -v program to get the peak memory measurement - it's the "Maximum resident set size" entry in the final report.

Question: I don't think /usr/bin/time does the right thing for multi-process - I think it only measures the parent process. e.g. with 4x gpus it reported only 102GB RAM, but I clearly saw in top that it was around 240GB. If you have an easy way to measure peak memory that takes into account forked processes, I'm all ears.
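In case it's useful to others, here is a rough helper I'd use for that (my own sketch, not part of the repo): it wraps the launcher command and polls the whole process tree with psutil, reporting the peak summed RSS. Being a sampler it can miss very short spikes, and summing RSS across processes may double-count shared memory, so treat the number as an upper bound.

# peak_rss.py - hedged sketch: track peak resident memory of a command and all
# of its children by polling with psutil
import subprocess, sys, time
import psutil

def peak_rss(cmd, interval=0.5):
    proc = subprocess.Popen(cmd)
    main = psutil.Process(proc.pid)
    peak = 0
    while proc.poll() is None:
        try:
            tree = [main] + main.children(recursive=True)
            rss = sum(p.memory_info().rss for p in tree if p.is_running())
            peak = max(peak, rss)
        except psutil.NoSuchProcess:
            pass  # a process exited between listing and querying
        time.sleep(interval)
    return peak

if __name__ == "__main__":
    # e.g. python peak_rss.py deepspeed --num_gpus=4 ./finetune_trainer.py ...
    print(f"peak RSS: {peak_rss(sys.argv[1:]) / 2**30:.1f} GB")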

Batch sizes on one gpu:

  • with buffers of 5e8 I was able to run BS=2, which might be too small for training,
  • but with 2e8 I managed to squeeze in BS=10 for training, but OOMed on prediction

I'm referring to these buffer sizes in ds_config.json:

        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8,

And I tested with 2x and 4x DDP as well: BS=16 OOMed, BS=8 was good, so I used that - but one could probably squeeze in some more.

edit1: later tests show that my test was too short and the CPU Adam optimizer wasn't getting a chance to kick in, as it skips the first 20 or so steps because of the loss-scale overflow. Once it kicks in it takes more GPU memory, so the practical BS is much smaller - I think around 2 on this setup. So most likely you will need to use BS=2 for real work, until things get optimized even more.

edit2: things are getting re-shuffled in the tests, so the default ds_config.json file has moved in master to a new, hopefully permanent home. It's now at examples/tests/deepspeed/ds_config.json, so you will need to adjust the command line to reflect this new location or simply copy it over to where the old one used to be.

Here is the full benchmark:

# 1 gpu: 
# only training fits with this BS, eval needs a smaller BS

export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16

{'train_runtime': 31.0897, 'train_samples_per_second': 0.257, 'epoch': 1.0}

# 2 gpus:

export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=2 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16

{'train_runtime': 17.9026, 'train_samples_per_second': 0.223, 'epoch': 1.0}

# 4 gpus

export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16

{'train_runtime': 10.4404, 'train_samples_per_second': 0.192, 'epoch': 1.0}

Checkpointing should allow making even bigger batch sizes.

stas00 commented Feb 4, 2021

Well, I'm closing this right away, since it's not a bug, but feel free to comment or ask questions in the comments.

stas00 closed this as completed Feb 4, 2021
stas00 changed the title from "[deepspeed] [success] trained t5-11b on 1x 40GB gpu w/ Deepspeed" to "[DeepSpeed] [success] trained t5-11b on 1x 40GB gpu" Feb 4, 2021
stas00 self-assigned this Feb 4, 2021
@PeterAJansen

(I'm adding to this issue, even though it's closed, because it's directly related)

I am seeing OOM trying to get this to work: 1 GPU, SeqLength 128 (originally tried 256), buffers {2e8, 3e8, 5e8} (just changes the epoch of the OOM), BS=1.

@stas00 , I kept track of the GPU memory (as reported in nvidia-smi) to see if it's a progressive memory leak, but I don't think it is:

  • 23.2gb after loading model weights
  • 33.8gb @ epoch ~1
  • 33.8gb @ epoch 25
  • long pause at epoch 26, then dies with OOM

Runscript:
(Note I am using unifiedqa-t5-11b, which is just a fine-tuned t5-11b -- I don't think that should change anything)

export DATADIR=/home/pajansen/11b-data/ \
export SEQLEN=128 \
export OUTPUTDIR=output_dir \

export BS=1; rm -rf $OUTPUTDIR; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path allenai/unifiedqa-t5-11b --output_dir $OUTPUTDIR --adam_eps 1e-06 --data_dir $DATADIR \
--do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length $SEQLEN --max_target_length $SEQLEN --num_train_epochs 2 \
--overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS \
--predict_with_generate --sortish_sampler \
--test_max_target_length $SEQLEN --val_max_target_length $SEQLEN \
--warmup_steps 5 \
--deepspeed ds_config.json --fp16 \

Conda environment:

# Make new environment
conda create --name transformers-feb4-2020 python=3.8
conda activate transformers-feb4-2020

# Clone transformers
git clone https://github.com/huggingface/transformers.git
cd transformers

# Install nightly build of Pytorch
pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html -U

# Install seq2seq transformers requirements
pip install -r examples/seq2seq/requirements.txt

# Install transformers
pip install -e .

# Install DeepSpeed from source for the A100 support
cd ..
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed/
./install.sh
pip install .

The monster output:
oom-feb4-t5-11b.txt

Just the last bit of the output:
(the overflow errors are probably noteworthy?)

Using /home/pajansen/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005221366882324219 seconds
[INFO|trainer.py:837] 2021-02-04 15:05:54,964 >> ***** Running training *****
[INFO|trainer.py:838] 2021-02-04 15:05:54,964 >>   Num examples = 592
[INFO|trainer.py:839] 2021-02-04 15:05:54,964 >>   Num Epochs = 2
[INFO|trainer.py:840] 2021-02-04 15:05:54,964 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:841] 2021-02-04 15:05:54,964 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:842] 2021-02-04 15:05:54,964 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:843] 2021-02-04 15:05:54,964 >>   Total optimization steps = 1184
  0%|                                                                                                                                                                                                      | 0/1184 [00:00<?, ?it/s][2021-02-04 15:05:58,447] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
{'loss': inf, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                                   
  0%|▏                                                                                                                                                                                           | 1/1184 [00:03<1:08:20,  3.47s/it][2021-02-04 15:06:02,124] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
  0%|▎                                                                                                                                                                                           | 2/1184 [00:07<1:09:31,  3.53s/it][2021-02-04 15:06:05,853] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
  0%|▍                                                                                                                                                                                           | 3/1184 [00:10<1:10:38,  3.59s/it][2021-02-04 15:06:09,757] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
  0%|▋                                                                                                                                                                                           | 4/1184 [00:14<1:12:26,  3.68s/it][2021-02-04 15:06:13,120] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
  0%|▊                                                                                                                                                                                           | 5/1184 [00:18<1:10:29,  3.59s/it][2021-02-04 15:06:16,495] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
  1%|▉                                                                                                                                                                                           | 6/1184 [00:21<1:09:10,  3.52s/it][2021-02-04 15:06:19,825] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
  1%|█                                                                                                                                                                                           | 7/1184 [00:24<1:07:59,  3.47s/it][2021-02-04 15:06:23,182] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
  1%|█▎                                                                                                                                                                                          | 8/1184 [00:28<1:07:17,  3.43s/it][2021-02-04 15:06:26,854] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
  1%|█▍                                                                                                                                                                                          | 9/1184 [00:31<1:08:37,  3.50s/it][2021-02-04 15:06:30,436] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
  1%|█▌                                                                                                                                                                                         | 10/1184 [00:35<1:09:01,  3.53s/it][2021-02-04 15:06:33,801] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
  1%|█▋                                                                                                                                                                                         | 11/1184 [00:38<1:08:00,  3.48s/it][2021-02-04 15:06:37,147] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
  1%|█▉                                                                                                                                                                                         | 12/1184 [00:42<1:07:10,  3.44s/it][2021-02-04 15:06:40,510] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
  1%|██                                                                                                                                                                                         | 13/1184 [00:45<1:06:40,  3.42s/it][2021-02-04 15:06:43,887] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
  1%|██▏                                                                                                                                                                                        | 14/1184 [00:48<1:06:23,  3.40s/it][2021-02-04 15:06:47,250] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
  1%|██▎                                                                                                                                                                                        | 15/1184 [00:52<1:06:05,  3.39s/it][2021-02-04 15:06:50,615] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
  1%|██▌                                                                                                                                                                                        | 16/1184 [00:55<1:05:52,  3.38s/it][2021-02-04 15:06:53,976] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
  1%|██▋                                                                                                                                                                                        | 17/1184 [00:58<1:05:41,  3.38s/it][2021-02-04 15:06:57,313] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
  2%|██▊                                                                                                                                                                                        | 18/1184 [01:02<1:05:23,  3.36s/it][2021-02-04 15:07:00,672] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
  2%|███                                                                                                                                                                                        | 19/1184 [01:05<1:05:18,  3.36s/it][2021-02-04 15:07:04,003] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
  2%|███▏                                                                                                                                                                                       | 20/1184 [01:09<1:05:03,  3.35s/it][2021-02-04 15:07:07,382] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
  2%|███▎                                                                                                                                                                                       | 21/1184 [01:12<1:05:08,  3.36s/it][2021-02-04 15:07:10,753] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
  2%|███▍                                                                                                                                                                                       | 22/1184 [01:15<1:05:09,  3.36s/it][2021-02-04 15:07:14,118] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
  2%|███▋                                                                                                                                                                                       | 23/1184 [01:19<1:05:06,  3.36s/it][2021-02-04 15:07:17,475] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
  2%|███▊                                                                                                                                                                                       | 24/1184 [01:22<1:05:00,  3.36s/it][2021-02-04 15:07:20,816] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512.0, reducing to 256.0
  2%|███▉                                                                                                                                                                                       | 25/1184 [01:25<1:04:49,  3.36s/it][2021-02-04 15:07:24,174] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 256.0, reducing to 128.0
  2%|████                                                                                                                                                                                       | 26/1184 [01:29<1:04:46,  3.36s/it]Killing subprocess 3319579
Traceback (most recent call last):
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/pajansen/anaconda3/envs/transformers-feb4-2020/bin/python', '-u', './finetune_trainer.py', '--local_rank=0', '--model_name_or_path', 'allenai/unifiedqa-t5-11b', '--output_dir', 'output_dir_compexpl-feb4-epoch2-uqa-11b-wholetree-rev', '--adam_eps', '1e-06', '--data_dir', '/home/pajansen/github/compositional-expl/data/feb4-initialtest-q693/wholetree-rev/', '--do_eval', '--do_predict', '--do_train', '--evaluation_strategy=steps', '--freeze_embeds', '--label_smoothing', '0.1', '--learning_rate', '3e-5', '--logging_first_step', '--logging_steps', '1000', '--max_source_length', '128', '--max_target_length', '128', '--num_train_epochs', '2', '--overwrite_output_dir', '--per_device_eval_batch_size', '1', '--per_device_train_batch_size', '1', '--predict_with_generate', '--sortish_sampler', '--test_max_target_length', '128', '--val_max_target_length', '128', '--warmup_steps', '5', '--deepspeed', 'ds_config.json', '--fp16']' died with <Signals.SIGSEGV: 11>.
        Command being timed: "deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path allenai/unifiedqa-t5-11b --output_dir output_dir_compexpl-feb4-epoch2-uqa-11b-wholetree-rev --adam_eps 1e-06 --data_dir /home/pajansen/github/compositional-expl/data/feb4-initialtest-q693/wholetree-rev/ --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 2 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --sortish_sampler --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --deepspeed ds_config.json --fp16"
        User time (seconds): 1152.16
        System time (seconds): 746.75
        Percent of CPU this job got: 396%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 7:58.47
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 233292336
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 108071918
        Voluntary context switches: 38621
        Involuntary context switches: 588867
        Swaps: 0
        File system inputs: 0
        File system outputs: 48
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


stas00 commented Feb 4, 2021

Thank you for the report and the details, @PeterAJansen

In the future, let's try to have a dedicated issue for each unique problem, but since the OP wasn't really an issue, it is now ;) so all is good.

Let me see if I can reproduce the problem with your changes, perhaps my data sample was too short.

The other difference I see is that you're not using --task which then defaults to summarization - so we surely don't test the exact same thing.

The allenai/unifiedqa-t5-11b model looks to be identical in size to t5-11b, but let me download the former to make sure that I'm doing an exact reproduction.

Let me see

  1. if I can get it to OOM with the translation task that I have been testing with first,
  2. and if that fails, I will try one of the local summarization datasets,
  3. and if all still runs fine, I will need to see what's different about your dataset.

(the overflow errors are probably noteworthy?)

These are normal - not a problem.


stas00 commented Feb 5, 2021

OK, I'm able to reproduce it. The GPU memory usage grows slowly at some points and jumps up quickly by several GBs at others.

I used buffers of 1e8 and cmd:

export BS=2; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path allenai/unifiedqa-t5-11b  --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --deepspeed ds_config.json --fp16

Which means that either transformers (trainer or model) or DeepSpeed or both leak memory. I'm going to switch to a much smaller model size, as with this model it takes ages just to start - can't develop like this - and try to detect where the leak is coming from.

BTW, here is a tip. Currently transformers performs a silly thing - it inits the model, inits the weights, and then overwrites all this work with the pretrained weights. Which with this model takes like 10 minutes. You can shortcut it with:

--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -747,7 +747,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
         Initializes and prunes weights if needed.
         """
         # Initialize weights
-        self.apply(self._init_weights)
+        #self.apply(self._init_weights)

         # Prune heads if needed
         if self.config.pruned_heads:

which skips 90% of the pointless weight inits.

I'm trying to advocate for this to be a feature here: #9205

stas00 reopened this Feb 5, 2021

stas00 commented Feb 5, 2021

Heh, we were assuming it was OOM, but it got SIGSEGV - I didn't bother to look closer - so pytorch w/Deepspeed segfaults pretty much at step 22. Investigating...

No useful info in the core bt. Stripped binaries.

I eliminated the possibility that the issue could be with pytorch.

Most likely a regression in DS.

Downgrading via pip install deepspeed==0.3.10 solves the segfault.

I must have been using an old DS yesterday and that's why it was working for me.

Trying to locate the faulty commit in DS

And the reason it was always happening at step 22 was that AdamW wasn't running until that step - that's what all those "Skipping step" overflow reports are about:

[2021-02-04 22:40:47,424] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
  0%|                                                                                                                      | 23/60000 [01:18<55:05:44,  3.31s/it][2021-02-04 22:40:50,837] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
  0%|                                                                                                                      | 24/60000 [01:21<55:37:22,  3.34s/it][2021-02-04 22:40:54,255] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512.0, reducing to 256.0

As soon as it ran, it segfaulted.

Hopefully we will have a fix soon, but until then please use deepspeed==0.3.10

@PeterAJansen

Thanks @stas00 !

I have downgraded to deepspeed 0.3.10 and I'm going to leave Transformers running overnight on a proper training job to see if it crashes (it's currently about 20% completed, so that's promising). Though it does appear that the GPU memory usage periodically moves from ~34GB up to nearly the entire 40GB minus a few hundred MB, so it's a real nail biter watching it:

[screenshot: GPU memory usage in nvidia-smi]

Transformers+DeepSpeed really doesn't believe in wasting RAM... :)


stas00 commented Feb 5, 2021

Update: DeepSpeed yanked 0.3.11 from PyPI, so a normal pip install should now result in the good working 0.3.10 being installed, until this issue is fixed.

@PeterAJansen

Update on my end: with DeepSpeed 0.3.10 it did run successfully through the night on a full job, successfully training and generating the predictions. Amazing work @stas00 et al.

@PeterAJansen

@stas00 I'm not sure if this is a bug or if I'm just not doing it correctly given how fast most of this is moving, but I'm trying to evaluate/generate predictions post-training and getting not-on-device errors. I should note that it worked fine when I did the whole thing in one command (train/eval/predict) overnight, but now I'm trying to use the fine-tuned model to generate predictions on other data.

I have (a) just removed the --do_train flag from the call to finetune_trainer (and set the model path to the output path of the fine-tuned model), and this gives an error (below). I've also (b) tried CPU-based eval (--device cpu) with the official instructions in examples/seq2seq/, which gave a different error (but I've not done non-cuda eval before, so that might be my issue).

Here's the error from (A):

[2021-02-05 12:00:30,238] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-02-05 12:00:30,586] [INFO] [runner.py:355:main] cmd = /home/pajansen/anaconda3/envs/transformers-feb4-2020/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev --output_dir output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev-unannotated --adam_eps 1e-06 --data_dir /home/pajansen/github/compexpl/data/feb4-initialtest-q693/unannotated/ --do_eval --do_predict --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 256 --max_target_length 256 --num_train_epochs 3 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --sortish_sampler --test_max_target_length 256 --val_max_target_length 256 --warmup_steps 5 --deepspeed ds_config.json --fp16
[2021-02-05 12:00:31,464] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-02-05 12:00:31,464] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-02-05 12:00:31,464] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-02-05 12:00:31,464] [INFO] [launch.py:100:main] dist_world_size=4
[2021-02-05 12:00:31,464] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-02-05 12:00:33,681] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-05 12:00:33,788] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-05 12:00:33,908] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-05 12:00:34,042] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:447] 2021-02-05 12:00:34,625 >> loading configuration file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/config.json
[INFO|configuration_utils.py:485] 2021-02-05 12:00:34,626 >> Model config T5Config {
  "_name_or_path": "allenai/unifiedqa-t5-11b",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 65536,
  "d_kv": 128,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "early_stopping": true,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "length_penalty": 2.0,
  "max_length": 200,
  "min_length": 30,
  "model_type": "t5",
  "n_positions": 512,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "num_decoder_layers": 24,
  "num_heads": 128,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "prefix": "summarize: ",
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "transformers_version": "4.3.0.dev0",
  "use_cache": true,
  "vocab_size": 32128
}

[INFO|configuration_utils.py:447] 2021-02-05 12:00:34,626 >> loading configuration file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/config.json
[INFO|configuration_utils.py:485] 2021-02-05 12:00:34,627 >> Model config T5Config {
  "_name_or_path": "allenai/unifiedqa-t5-11b",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 65536,
  "d_kv": 128,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "early_stopping": true,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "length_penalty": 2.0,
  "max_length": 200,
  "min_length": 30,
  "model_type": "t5",
  "n_positions": 512,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "num_decoder_layers": 24,
  "num_heads": 128,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "prefix": "summarize: ",
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "transformers_version": "4.3.0.dev0",
  "use_cache": true,
  "vocab_size": 32128
}

[INFO|tokenization_utils_base.py:1685] 2021-02-05 12:00:34,627 >> Model name 'output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev' not found in model shortcut name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). Assuming 'output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1721] 2021-02-05 12:00:34,627 >> Didn't find file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1721] 2021-02-05 12:00:34,627 >> Didn't find file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1784] 2021-02-05 12:00:34,627 >> loading file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/spiece.model
[INFO|tokenization_utils_base.py:1784] 2021-02-05 12:00:34,627 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-02-05 12:00:34,627 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-02-05 12:00:34,627 >> loading file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/special_tokens_map.json
[INFO|tokenization_utils_base.py:1784] 2021-02-05 12:00:34,627 >> loading file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/tokenizer_config.json
WARNING:__main__:Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
WARNING:__main__:Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: True
WARNING:__main__:Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|modeling_utils.py:1025] 2021-02-05 12:00:34,753 >> loading weights file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/pytorch_model.bin
[INFO|modeling_utils.py:1143] 2021-02-05 12:04:48,021 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.

[INFO|modeling_utils.py:1151] 2021-02-05 12:04:48,034 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
[INFO|trainer.py:348] 2021-02-05 12:04:48,080 >> Using amp fp16 backend
[INFO|trainer.py:1600] 2021-02-05 12:04:48,080 >> ***** Running Evaluation *****
[INFO|trainer.py:1601] 2021-02-05 12:04:48,080 >>   Num examples = 1950
[INFO|trainer.py:1602] 2021-02-05 12:04:48,080 >>   Batch size = 1
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 327, in main
    metrics = trainer.evaluate(metric_key_prefix="val")
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py", line 1506, in evaluate
    output = self.prediction_loop(
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py", line 1630, in prediction_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/examples/seq2seq/seq2seq_trainer.py", line 220, in prediction_step
    generated_tokens = self.model.generate(
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/generation_utils.py", line 847, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/generation_utils.py", line 379, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/models/t5/modeling_t5.py", line 878, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 145, in forward
    return F.embedding(
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/functional.py", line 1921, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 327, in main
    metrics = trainer.evaluate(metric_key_prefix="val")
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py", line 1506, in evaluate
    output = self.prediction_loop(
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py", line 1630, in prediction_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/examples/seq2seq/seq2seq_trainer.py", line 220, in prediction_step
    generated_tokens = self.model.generate(
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/generation_utils.py", line 847, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/generation_utils.py", line 379, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/models/t5/modeling_t5.py", line 878, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 145, in forward
    return F.embedding(
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/functional.py", line 1921, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 327, in main
    metrics = trainer.evaluate(metric_key_prefix="val")
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py", line 1506, in evaluate
    output = self.prediction_loop(
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py", line 1630, in prediction_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/examples/seq2seq/seq2seq_trainer.py", line 220, in prediction_step
    generated_tokens = self.model.generate(
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/generation_utils.py", line 847, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/generation_utils.py", line 379, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/models/t5/modeling_t5.py", line 878, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 145, in forward
    return F.embedding(
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/functional.py", line 1921, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 327, in main
    metrics = trainer.evaluate(metric_key_prefix="val")
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py", line 1506, in evaluate
    output = self.prediction_loop(
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py", line 1630, in prediction_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/examples/seq2seq/seq2seq_trainer.py", line 220, in prediction_step
    generated_tokens = self.model.generate(
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/generation_utils.py", line 847, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/generation_utils.py", line 379, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/models/t5/modeling_t5.py", line 878, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 145, in forward
    return F.embedding(
  File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/functional.py", line 1921, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device


stas00 commented Feb 5, 2021

Are you on master and not by chance on my experimental t5-pipeline branch? If it's the latter then it's very likely that you'd hit that "not on the current device" error. Please make sure you're using the master transformers.

@PeterAJansen

Definitely on the master :)

@PeterAJansen

Update: I did figure out the CPU eval error -- I had --fp16 set (as in the example script), which currently throws an esoteric pytorch error on CPU ("threshold_cpu" not implemented for 'Half'). Removing this lets it run on CPU, but with 64 cores T5-11B is evaluating at 150 seconds per generation, instead of less than 1 sec with the GPU, so I think I'll kill that.


stas00 commented Feb 5, 2021

@PeterAJansen want to confirm with you one detail, is your setup with Intel or AMD cpu?

It's AMD.

I'm using Peter's machine for debugging this, so you can ask me anything.


@PeterAJansen, glad you sorted it out - let me see if I can reproduce that and we could ensure that we prevent the erroneous fp16/cpu combination in the first place.
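Roughly, the guard I have in mind would be something like this - a minimal sketch of the idea, not the actual trainer code - so the failure is an explicit message rather than pytorch's cryptic '"threshold_cpu" not implemented for Half':

# hedged sketch of an early sanity check on the training args
import torch

def check_fp16_args(fp16: bool, no_cuda: bool = False):
    if fp16 and (no_cuda or not torch.cuda.is_available()):
        raise ValueError(
            "--fp16 requires a CUDA device: mixed precision isn't supported "
            "on CPU, please drop --fp16 for CPU-only evaluation."
        )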


Update on DeepSpeed: it looks like the segfault over CPU ADAM problem is specific to AMD, which is the case on your computer, so the DeepSpeed team are working on figuring that out and hopefully will have a new release some time soon that will do the right thing on AMD and be fast too.


stas00 commented Feb 6, 2021

@PeterAJansen,

  • I have fixed the first bug where you went for inference without training - please use this PR branch if it's not merged: [trainer] deepspeed bug fixes and tests #10039
    Well basically we aren't using deepspeed at the moment at all if --do_train wasn't run - need to think about how to benefit from Deepspeed for pure inference. I will experiment with that.

  • wrt --device cpu, could you please explain how you managed to use it? It's not a valid flag for finetune_trainer.py, so if you could share the full cmd, that would help to reproduce the problem.

Thank you!


stas00 commented Feb 6, 2021

@PeterAJansen, for the future let's do this:

  • Try new things - if they fail assume it's 99% a bug in our code - things should either work or give a user-friendly message so that you know it's your error - if it's anything else we should be fixing it.
  • Please do file a new issue every time - while all these bugs are totally related, it is very difficult to track them when it's all one pile
  • Always paste the full cmd that you used
  • Ideally try to use generic datasets/models to make it easy to reproduce the problem

Then:

  1. I reproduce
  2. I write a new test
  3. I fix the bug
  4. You try new things
  5. Rinse and repeat

;)

@PeterAJansen

@PeterAJansen,

  • I have fixed the first bug where you went for inference without training - please use this PR branch if it's not merged: [trainer] deepspeed bug fixes and tests #10039
    Well basically we aren't using deepspeed at the moment at all if --do_train wasn't run - need to think about how to benefit from Deepspeed for pure inference. I will experiment with that.

Thanks!

  • wrt --device cpu, could you please explain how you managed to use it? It's not a valid flag for finetune_trainer.py, so if you could share the full cmd, that would help to reproduce the problem.

Thank you!

Apologies, I think in my exhilaration that it's running T5-11B on 40GB cards I forgot proper issue-submission procedures. The --fp16 error is submitted as issue #10040 :)


stas00 commented Feb 8, 2021

Both issues have been fixed: #10039 and #10041


stas00 commented Sep 22, 2021

I'd love to answer your question, @benathi, but I haven't had a chance to experiment with this feature yet. Perhaps asking at https://discuss.huggingface.co/?

The HF arsenal has several models that implement sparse attention natively: https://huggingface.co/blog/long-range-transformers

Deepspeed implements sparse attention, but I am not sure how we would plug it into HF Transformers. That is, it has this section of the config file, but I think it only works with some of their internal features. I don't know. It might be a good idea to ask at https://github.com/microsoft/DeepSpeed - I'd love to know the answer myself - whether we could integrate that into Transformers. If you'd like to take the lead on the research, I'd be happy to help with integrating it. If you ask, please tag me as well.

Thank you!


sbmaruf commented Oct 7, 2021

@stas00 I see that the ds_config.json uses "auto" casting. I cannot train a 13B multilingual mT5-xxl model on the 8x 40GB A100s of an aws p4d.24xlarge. I am using this config with "fp16": {"enabled": false}, as t5 is trained in bfloat16 and fp16 usually produces NaNs. My sequence lengths are src_input_length=1024 and target_input_length=256.
Do you have any suggestions? Should I move to fairscale for the fp16 issue?


stas00 commented Oct 7, 2021

"auto" just allows converting --fp16 to "true" if it's passed in the trainer args. You can absolutely hardcode it to what you need.

I made a possible workaround for t5/mt5 overflows which worked for some and not for others; you may want to try it:
#10956

Ideally, especially since you're using A100s, you should train in bf16 mixed precision; the work on it is being done here:
#13207

But deepspeed doesn't yet support bf16 - perhaps it'd be beneficial to ask Deepspeed about supporting bf16 by opening a feature request at https://github.com/microsoft/DeepSpeed/issues - if you feel inspired to do so?

Should I move to fairscale for fp16 issue?

If fairscale gives a working solution then by all means use it. Does it? I just don't know the answer.

Megatron-LM released a t5 model recently but it doesn't yet support pipeline parallelism, so if tensor parallelism is sufficient for your setup it might do the trick (transformers will have TP shortly as well). You can ping them asking when PP will be added; I doubt that if nobody asks it'll happen any time soon. Their bert/gpt2 have full dp/tp/pp support, but not yet t5.

Finally, try activating gradient checkpointing, which should help a lot to lower memory usage:
https://huggingface.co/transformers/performance.html#gradient-checkpointing
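For reference, a minimal sketch of turning it on for a seq2seq model - the exact method/flag names depend on your transformers version, so double-check against the doc above:

# hedged sketch: enable gradient checkpointing to trade compute for memory
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-xxl")
model.config.use_cache = False         # the generation cache is incompatible with checkpointing
model.gradient_checkpointing_enable()  # recent versions; older ones used config.gradient_checkpointing = True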


sbmaruf commented Oct 8, 2021

Thanks a lot @stas00 for your reply.
I have been working with your PR #10956 until now. Just to let you know, it works fine for me. Huge thanks to you for that PR.
But as far as I remember, Deepspeed doesn't support torch.cuda.amp.autocast(enabled=False):, so the ffn layer weights remain fp16 in deepspeed.
I've already tried gradient checkpointing with fp32 training (in deepspeed) for mT5-xxl-13B but got OOM.
Maybe in the coming days I will first try fairscale to be sure, since it supports torch.cuda.amp.autocast(enabled=False):.


stas00 commented Oct 8, 2021

Thanks a lot @stas00 for your reply. I have been working with your PR #10956 until now. Just to let you know, it works fine for me. Huge thanks to you for that PR.

Glad to hear that!

But as far as I remember, Deepspeed doesn't support torch.cuda.amp.autocast(enabled=False):, so the ffn layer weights remain fp16 in deepspeed. I've already tried gradient checkpointing with fp32 training (in deepspeed) for mT5-xxl-13B but got OOM.

DS uses its own mixed precision, which doesn't lend itself to being overridden by users. But it should be possible to add an if branch so that, when the code is running under deepspeed, we manually upcast to fp32 and then downcast back to fp16 for deepspeed. Let me know if you need help with that; this would require no deepspeed understanding, I believe. And I haven't tried it, so it's possible that my idea may or may not work.
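Roughly what I mean, sketched for a single projection (e.g. T5's wo in the feed-forward block). This is only an illustration of the idea, untested, and running_under_deepspeed is a placeholder flag you'd have to wire up yourself:

# hedged sketch of the upcast/downcast idea
import torch
import torch.nn.functional as F

def linear_fp32_under_deepspeed(layer, hidden_states, running_under_deepspeed=False):
    if running_under_deepspeed and hidden_states.dtype == torch.float16:
        # run this projection in fp32 to avoid fp16 overflow, then cast back
        bias = layer.bias.float() if layer.bias is not None else None
        out = F.linear(hidden_states.float(), layer.weight.float(), bias)
        return out.to(torch.float16)
    return layer(hidden_states)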

Maybe in the coming days I will first try fairscale to be sure, since it supports torch.cuda.amp.autocast(enabled=False):.

Do you mean the sharded DDP (ZeRO@fairscale)? Do let us know - I have no idea what the state of that project is nowadays.

@tuhinjubcse

@stas00 any idea about this? I keep getting overflow. I'm using version 0.5.3 of deepspeed due to torch restrictions.
I can't solve this even after several attempts.

[2021-11-13 19:22:08,401] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16.0, reducing to 8.0
0%| | 14/24128 [00:54<25:52:50, 3.86s/it]
[2021-11-13 19:22:12,194] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8.0, reducing to 4.0
0%| | 15/24128 [00:58<25:44:14, 3.84s/it]
[2021-11-13 19:22:15,963] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4.0, reducing to 2.0
0%| | 16/24128 [01:02<25:35:10, 3.82s/it]
[2021-11-13 19:22:19,775] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2.0, reducing to 1.0
0%| | 17/24128 [01:06<25:34:08, 3.82s/it]
[2021-11-13 19:22:23,570] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1.0, reducing to 1
0%| | 18/24128 [01:10<25:31:20, 3.81s/it]
[2021-11-13 19:22:27,338] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%| | 19/24128 [01:13<25:26:08, 3.80s/it]
[2021-11-13 19:22:31,100] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 20/24128 [01:17<25:21:41, 3.79s/it]
[2021-11-13 19:22:34,909] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 21/24128 [01:21<25:24:20, 3.79s/it]
[2021-11-13 19:22:38,715] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 22/24128 [01:25<25:25:39, 3.80s/it]
[2021-11-13 19:22:42,709] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 23/24128 [01:29<25:49:22, 3.86s/it]
[2021-11-13 19:22:46,705] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 24/24128 [01:33<26:06:45, 3.90s/it]
[2021-11-13 19:22:50,537] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 25/24128 [01:37<25:57:46, 3.88s/it]
[2021-11-13 19:22:54,437] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 26/24128 [01:40<26:00:36, 3.89s/it]
[2021-11-13 19:22:58,333] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 27/24128 [01:44<26:01:38, 3.89s/it]
[2021-11-13 19:23:02,162] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 28/24128 [01:48<25:54:33, 3.87s/it]
[2021-11-13 19:23:05,991] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 29/24128 [01:52<25:49:28, 3.86s/it]
[2021-11-13 19:23:09,884] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 30/24128 [01:56<25:53:38, 3.87s/it]
[2021-11-13 19:23:13,776] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
0%|▏ | 31/24128 [02:00<25:56:27, 3.88s/it]
[2021-11-13 19:23:17,659] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1

@stas00
Copy link
Contributor Author

stas00 commented Nov 14, 2021

This looks like an issue to report on the deepspeed side, @tuhinjubcse: https://github.com/microsoft/DeepSpeed/issues

@tuhinjubcse
Copy link

OK, @samyam helped me to figure out ZeRO-3 - getting a 3.5x larger BS than with zero2. The key was to lower:

"sub_group_size": 1e9,

from 1e14.

So, I'm able to train t5-11b on a single A100-SXM4-40GB with seq len 1024 with BS=14 with deepspeed ZeRO-3:

export BS=14; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-11b --output_dir output_dir \
--adam_eps 1e-06 --evaluation_strategy=steps --do_train --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 500 --max_source_length 1024 --max_target_length 1024 \
--num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS \
--predict_with_generate --sortish_sampler --source_lang en --target_lang ro --dataset_name wmt16 \
--dataset_config "ro-en" --source_prefix "translate English to Romanian: " --val_max_target_length \
128 --warmup_steps 50 --max_train_samples 2000 --max_eval_samples 50 --deepspeed \
tests/deepspeed/ds_config_zero3.json --fp16

Everything else is the same as in the zero-2 post above, and the config file also comes from transformers @ 61c5063, but ds_config_zero3.json needs to be changed as shown above.

@stas00 could you confirm your torch / deepspeed / apex / transformers versions?

@stas00
Copy link
Contributor Author

stas00 commented Nov 16, 2021

Please see: #9996 (comment)

@tuhinjubcse
Copy link

@stas00 Thanks so much!
May I also ask why you used LR = 3e-5 when the HF page itself notes:

T5 models need a slightly higher learning rate than the default one set in the Trainer when using the AdamW optimizer. Typically, 1e-4 and 3e-4 work well for most problems (classification, summarization, translation, question answering, question generation). Note that T5 was pre-trained using the AdaFactor optimizer.

I used LR = 1e-3 previously without DeepSpeed and it worked perfectly. I am doing generation, but now when using DeepSpeed the loss seems noisy. Is there anything you'd recommend?

{'loss': 5.4677, 'learning_rate': 0.0, 'epoch': 0.02}
{'loss': 0.9166, 'learning_rate': 0.0, 'epoch': 0.03}
{'loss': 0.6483, 'learning_rate': 0.0, 'epoch': 0.05}
6%|█████████▍ | 1999/32170 [2:21:21<35:31:11, 4.24s/it][2021-11-16 18:02:53,513] [INFO] [logging.py:68:log_dist] [Rank 0] step=2000, skipped=1999, lr=[0.0], mom=[[0.9, 0.999]]
[2021-11-16 18:02:53,513] [INFO] [timer.py:157:stop] 0/2000, SamplesPerSec=5.674303086219585
{'loss': 1.1347, 'learning_rate': 0.0, 'epoch': 0.06}
{'loss': 0.6642, 'learning_rate': 0.0, 'epoch': 0.08}
{'loss': 1.0864, 'learning_rate': 0.0, 'epoch': 0.09}
{'loss': 0.4922, 'learning_rate': 0.0, 'epoch': 0.11}
12%|██████████████████▉ | 3999/32170 [4:42:30<33:11:13, 4.24s/it][2021-11-16 20:24:02,592] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=3999, lr=[0.0], mom=[[0.9, 0.999]]
[2021-11-16 20:24:02,593] [INFO] [timer.py:157:stop] 0/4000, SamplesPerSec=5.679144072985121
{'loss': 1.6662, 'learning_rate': 0.0, 'epoch': 0.12}
{'loss': 1.4723, 'learning_rate': 0.0, 'epoch': 0.14}
{'loss': 0.5988, 'learning_rate': 0.0, 'epoch': 0.16}
{'loss': 1.1777, 'learning_rate': 0.0, 'epoch': 0.17}
19%|████████████████████████████▎ | 5999/32170 [7:03:38<30:45:21, 4.23s/it][2021-11-16 22:45:10,765] [INFO] [logging.py:68:log_dist] [Rank 0] step=6000, skipped=5999, lr=[0.0], mom=[[0.9, 0.999]]
[2021-11-16 22:45:10,765] [INFO] [timer.py:157:stop] 0/6000, SamplesPerSec=5.68092264980687
{'loss': 0.9843, 'learning_rate': 0.0, 'epoch': 0.19}
{'loss': 0.3419, 'learning_rate': 0.0, 'epoch': 0.2}
{'loss': 1.1882, 'learning_rate': 0.0, 'epoch': 0.22}

@stas00
Copy link
Contributor Author

stas00 commented Nov 17, 2021

May I also ask why you used LR = 3e-5 when the HF page itself notes:

Oh, that was a totally random setting which has no impact on what it was testing (memory usage). I use the same scripts to test many models, and most of the time I only care about the run working and/or fitting into memory when I do that particular type of work. I train them for like 50 iterations...

Of course, when training for real, I pay attention to the recommended hparam settings. So please don't take the lr-like hparams in my memory-fitting examples as recommendations for real training.

But let's not mix unrelated things in the same thread. If you'd like to discuss a different topic please kindly open a new issue and we can discuss it there.

@tuhinjubcse
Copy link

tuhinjubcse commented Nov 23, 2021

@stas00 Hopefully this is relevant. I know you had success on an A100 40GB GPU. I am using deepspeed on 4 gpus and I receive OOM after training for several hours. Any idea as to what I can do here?


  warnings.warn(formatted_warning, FutureWarning)
{'loss': 6.0737, 'learning_rate': 0.0, 'epoch': 0.02}                                                                                                                                                       
{'loss': 0.1926, 'learning_rate': 0.0, 'epoch': 0.04}                                                                                                                                                       
{'loss': 0.0399, 'learning_rate': 0.0, 'epoch': 0.06}                                                                                                                                                       
  8%|█████████████                                                                                                                                                | 1999/24128 [1:52:11<20:35:01,  3.35s/it][2021-11-22 19:51:55,198] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=1999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 19:51:55,199] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=9.546767962244255
{'loss': 0.0749, 'learning_rate': 0.0, 'epoch': 0.08}                                                                                                                                                       
{'loss': 0.408, 'learning_rate': 0.0, 'epoch': 0.1}                                                                                                                                                         
{'loss': 0.0354, 'learning_rate': 0.0, 'epoch': 0.12}                                                                                                                                                       
{'loss': 0.0341, 'learning_rate': 0.0, 'epoch': 0.15}                                                                                                                                                       
 17%|██████████████████████████                                                                                                                                   | 3999/24128 [3:43:57<18:47:06,  3.36s/it][2021-11-22 21:43:41,103] [INFO] [logging.py:69:log_dist] [Rank 0] step=4000, skipped=3999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 21:43:41,103] [INFO] [timer.py:181:stop] 0/4000, SamplesPerSec=9.564911481857864
{'loss': 0.0316, 'learning_rate': 0.0, 'epoch': 0.17}                                                                                                                                                       
{'loss': 0.0802, 'learning_rate': 0.0, 'epoch': 0.19}                                                                                                                                                       
{'loss': 0.035, 'learning_rate': 0.0, 'epoch': 0.21}                                                                                                                                                        
{'loss': 0.1423, 'learning_rate': 0.0, 'epoch': 0.23}                                                                                                                                                       
 25%|███████████████████████████████████████                                                                                                                      | 5999/24128 [5:35:43<16:52:01,  3.35s/it][2021-11-22 23:35:26,678] [INFO] [logging.py:69:log_dist] [Rank 0] step=6000, skipped=5999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 23:35:26,678] [INFO] [timer.py:181:stop] 0/6000, SamplesPerSec=9.571203445125207
{'loss': 0.1107, 'learning_rate': 0.0, 'epoch': 0.25}                                                                                                                                                       
{'loss': 0.0467, 'learning_rate': 0.0, 'epoch': 0.27}                                                                                                                                                       
{'loss': 0.0802, 'learning_rate': 0.0, 'epoch': 0.29}                                                                                                                                                       
{'loss': 0.0706, 'learning_rate': 0.0, 'epoch': 0.31}                                                                                                                                                       
 33%|████████████████████████████████████████████████████                                                                                                         | 7999/24128 [7:27:26<15:00:20,  3.35s/it][2021-11-23 01:27:10,465] [INFO] [logging.py:69:log_dist] [Rank 0] step=8000, skipped=7999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-23 01:27:10,465] [INFO] [timer.py:181:stop] 0/8000, SamplesPerSec=9.574953735862689
{'loss': 0.22, 'learning_rate': 0.0, 'epoch': 0.33}                                                                                                                                                         
{'loss': 0.0967, 'learning_rate': 0.0, 'epoch': 0.35}                                                                                                                                                       
{'loss': 0.0716, 'learning_rate': 0.0, 'epoch': 0.37}                                                                                                                                                       
{'loss': 0.1111, 'learning_rate': 0.0, 'epoch': 0.39}                                                                                                                                                       
 41%|█████████████████████████████████████████████████████████████████                                                                                            | 9999/24128 [9:19:10<13:10:15,  3.36s/it][2021-11-23 03:18:53,863] [INFO] [logging.py:69:log_dist] [Rank 0] step=10000, skipped=9999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-23 03:18:53,863] [INFO] [timer.py:181:stop] 0/10000, SamplesPerSec=9.577305314814142
{'loss': 0.2233, 'learning_rate': 0.0, 'epoch': 0.41}                                                                                                                                                       
 43%|███████████████████████████████████████████████████████████████████▏                                                                                        | 10397/24128 [9:41:24<12:47:24,  3.35s/it]Traceback (most recent call last):
  File "./finetune_trainer.py", line 368, in <module>
    main()
  File "./finetune_trainer.py", line 305, in main
    train_result = trainer.train(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1316, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1865, in training_step
    loss = self.deepspeed.backward(loss)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1708, in backward
    self.optimizer.backward(loss)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1880, in backward
    buf_1 = torch.empty(int(self.reduce_bucket_size),
RuntimeError: CUDA out of memory. Tried to allocate 382.00 MiB (GPU 1; 39.59 GiB total capacity; 36.01 GiB already allocated; 164.94 MiB free; 36.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My script

export BS=8;
PYTHONPATH=../../src
USE_TF=0

deepspeed --num_gpus=4 ./finetune_trainer.py \
 --data_dir /home/tuhin.chakr/gpt3/poetrynew \
 --output_dir /local/nlp/temp/poetryT5-11B_new \
 --model_name_or_path t5-11b \
 --do_train \
 --task translation \
 --max_source_length 64 \
 --max_target_length 64 \
 --save_strategy=epoch \
 --num_train_epochs 1 \
 --per_device_train_batch_size $BS \
 --adafactor \
 --learning_rate 1e-3 \
 --deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero2.json \
 --fp16

My config

json = {
    "fp16": {
        "enabled": true, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 0.001, 
            "warmup_num_steps": 0
        }
    }, 
    "zero_optimization": {
        "stage": 2, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 2.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 2.000000e+08, 
        "contiguous_gradients": true
    }, 
    "train_batch_size": 32, 
    "train_micro_batch_size_per_gpu": 8, 
    "gradient_clipping": 1.0, 
    "steps_per_print": 2.000000e+03, 
    "wall_clock_breakdown": false, 
    "zero_allow_untested_optimizer": true
}

@stas00
Copy link
Contributor Author

stas00 commented Nov 23, 2021

Are you monitoring the memory consumption over the duration of the training - is it borderline OOM from the get-go, or is the memory usage slowly creeping up?

But regardless, you're using only stage-2, and you want stage-3 in this situation: if you're not sharding the params, you get only 12 out of 18 bytes sharded per param. Stage-3 is slower than stage-2 since it has to do more work, but if you can't fit into your gpus, stage-3 is what you want.

Note that I'm using stage 3 here: #9996 (comment)
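
If it helps, here is one simple way to snapshot GPU memory during a long run (just a sketch, not part of the examples):

import torch

def log_gpu_memory(tag=""):
    # prints allocated/reserved CUDA memory per visible GPU, in GiB
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f"{tag} gpu{i}: allocated={alloc:.1f}GiB reserved={reserved:.1f}GiB")

# call it every N steps (e.g. from a TrainerCallback), or simply watch from a
# separate shell: watch -n 60 nvidia-smi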

@tuhinjubcse
Copy link

[screenshot: GPU memory usage]

Retraining again, and this is what my GPU looks like.

@stas00
Copy link
Contributor Author

stas00 commented Nov 23, 2021

So this is the state at the beginning of the training, right? Then check it, say, once every 30 min and note the differences - if your application is well written then memory usage shouldn't grow after, say, a few hundred iterations, assuming the longest seqlen with the widest batch size has been consumed already.

I'm also noticing that you're using a very old version of our examples - finetune_trainer.py is very old - so it'd be hard to debug this situation if indeed there is a gradual memory leak there. In that case I'd recommend migrating to the recent version of the software.

@tuhinjubcse
Copy link

The snapshot I sent you was after 5 hrs of training. I have 7M samples and I reduced the max seq len from 128 to 64, so I'm hoping it works this time. Last time it failed around 40% of the way through training; it's at 22% now.

Yes, if I still can't make it work I will switch to a recent version of the software.

@stas00
Copy link
Contributor Author

stas00 commented Nov 24, 2021

Right, I'm not sure my message is coming across - I'm suggesting to monitor the memory usage throughout the training.

And if it OOMs, you need to switch to ZeRO-3, and then you should be able to train with a much longer seqlen.

Enabling https://huggingface.co/transformers/performance.html#gradient-checkpointing is another technique to allow for much longer seqlen.

@tuhinjubcse
Copy link

@stas00 many thanks for your guidance. I was able to finetune for 1 epoch. I converted the model to fp32, looked at the output, and noticed it's generating garbled text. Now, of course, this could be because it's only 1 epoch, but I trained on 772073 samples. Just to be clear, I have a T5-3B model trained on the same data but with different code and it works perfectly, so I'm assuming my data is fine.

It generated something
**' thou sa wrt e the in thee wast the the of the world, a man of resea the earthe, the in the all the that of**

I am wondering what the reason could be. One thing I find suspicious is that the loss is near zero, as you can see below. I just wanted to see the generated text as a proof of concept, since it takes around 24 hours to train 1 epoch. Would you recommend finetuning for more epochs, or something else?

{'loss': 6.0737, 'learning_rate': 0.0, 'epoch': 0.02}                                                                                                                                                       
{'loss': 0.1926, 'learning_rate': 0.0, 'epoch': 0.04}                                                                                                                                                       
{'loss': 0.0399, 'learning_rate': 0.0, 'epoch': 0.06}                                                                                                                                                       
  8%|█████████████                                                                                                                                                | 1999/24128 [1:52:11<20:35:01,  3.35s/it][2021-11-22 19:51:55,198] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=1999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 19:51:55,199] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=9.546767962244255
{'loss': 0.0749, 'learning_rate': 0.0, 'epoch': 0.08}                                                                                                                                                       
{'loss': 0.408, 'learning_rate': 0.0, 'epoch': 0.1}                                                                                                                                                         
{'loss': 0.0354, 'learning_rate': 0.0, 'epoch': 0.12}                                                                                                                                                       
{'loss': 0.0341, 'learning_rate': 0.0, 'epoch': 0.15}                                                                                                                                                       
 17%|██████████████████████████                                                                                                                                   | 3999/24128 [3:43:57<18:47:06,  3.36s/it][2021-11-22 21:43:41,103] [INFO] [logging.py:69:log_dist] [Rank 0] step=4000, skipped=3999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 21:43:41,103] [INFO] [timer.py:181:stop] 0/4000, SamplesPerSec=9.564911481857864
{'loss': 0.0316, 'learning_rate': 0.0, 'epoch': 0.17}                                                                                                                                                       
{'loss': 0.0802, 'learning_rate': 0.0, 'epoch': 0.19}                                                                                                                                                       
{'loss': 0.035, 'learning_rate': 0.0, 'epoch': 0.21}                                                                                                                                                        
{'loss': 0.1423, 'learning_rate': 0.0, 'epoch': 0.23}                                                                                                                                                       
 25%|███████████████████████████████████████                                                                                                                      | 5999/24128 [5:35:43<16:52:01,  3.35s/it][2021-11-22 23:35:26,678] [INFO] [logging.py:69:log_dist] [Rank 0] step=6000, skipped=5999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 23:35:26,678] [INFO] [timer.py:181:stop] 0/6000, SamplesPerSec=9.571203445125207
{'loss': 0.1107, 'learning_rate': 0.0, 'epoch': 0.25}                                                                                                                                                       
{'loss': 0.0467, 'learning_rate': 0.0, 'epoch': 0.27}                                                                                                                                                       
{'loss': 0.0802, 'learning_rate': 0.0, 'epoch': 0.29}                                                                                                                                                       
{'loss': 0.0706, 'learning_rate': 0.0, 'epoch': 0.31}                                                                                                                                                       
 33%|████████████████████████████████████████████████████                                                                                                         | 7999/24128 [7:27:26<15:00:20,  3.35s/it][2021-11-23 01:27:10,465] [INFO] [logging.py:69:log_dist] [Rank 0] step=8000, skipped=7999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-23 01:27:10,465] [INFO] [timer.py:181:stop] 0/8000, SamplesPerSec=9.574953735862689
{'loss': 0.22, 'learning_rate': 0.0, 'epoch': 0.33}                                                                                                                                                         
{'loss': 0.0967, 'learning_rate': 0.0, 'epoch': 0.35}                                                                                                                                                       
{'loss': 0.0716, 'learning_rate': 0.0, 'epoch': 0.37}                                                                                                                                                       
{'loss': 0.1111, 'learning_rate': 0.0, 'epoch': 0.39} 

@stas00
Copy link
Contributor Author

stas00 commented Nov 24, 2021

Why is your 'learning_rate': 0.0?

@tuhinjubcse
Copy link

@stas00 that's something I don't understand either. As you can see in my script, I specified 1e-3.

My script from the transformers repo:

export BS=8;
PYTHONPATH=../../src
USE_TF=0

deepspeed --num_gpus=3 ./finetune_trainer.py \
 --data_dir /home/tuhin.chakr/gpt3/poetrynew \
 --output_dir /local/nlp/temp/poetryT5-11B_new \
 --model_name_or_path t5-11b \
 --do_train \
 --task translation \
 --max_source_length 128 \
 --max_target_length 128 \
 --save_strategy=epoch \
 --num_train_epochs 1 \
 --per_device_train_batch_size $BS \
 --adafactor \
 **--learning_rate 1e-3 \**
 --deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero2.json \
 --fp16
My deepspeed config

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "train_batch_size": 24,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

Someone here reported the same issue:
microsoft/DeepSpeed#1574

@stas00
Copy link
Contributor Author

stas00 commented Nov 24, 2021

I'd be happy to debug this with you, but let's first switch to the current example, which is https://github.com/huggingface/transformers/blob/master/examples/pytorch/translation/run_translation.py - it should be mostly the same, with some args renamed - see the README.md for details: https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation

e.g. my staple cmd that I use is:

export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --evaluation_strategy=steps --do_train --do_eval --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 500 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --per_device_eval_batch_size $BS --predict_with_generate --sortish_sampler --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" --source_prefix "translate English to Romanian: " --val_max_target_length 128 --warmup_steps 50 --max_train_samples 500 --max_eval_samples 50 --deepspeed tests/deepspeed/ds_config_zero3.json  --fp16 

Additionally, please open a new Issue since this discussion is now taking over this already closed issue, so let's give it a dedicated space. Just don't forget to tag me in the new Issue.

@sanxchep
Copy link

Update on my end: with DeepSpeed 0.3.10 it did run successfully through the night on a full job, successfully training and generating the predictions. Amazing work @stas00 et al.

How did you run inference? Did you get it working?

@NiushanDong
Copy link

Could you please tell me where I can find ds_config.json and finetune_trainer.py? Thank you!

@stas00
Copy link
Contributor Author

stas00 commented Feb 14, 2023

The examples have been renamed and re-organized since the time of this thread; you can find them all here:
https://github.com/huggingface/transformers/tree/main/examples/pytorch

e.g. the translation is now at examples/pytorch/translation/run_translation.py

For deepspeed please see:
https://huggingface.co/transformers/master/main_classes/deepspeed.html#deepspeed-trainer-integration

@alexey2baranov
Copy link

@stas00 sorry for such a question: do I understand correctly that each training example executed in about 5 seconds? If so, approximately how much time do you think it would take to train T5-11B from scratch on such hardware?

@stas00
Copy link
Contributor Author

stas00 commented Nov 9, 2023

Multiply the iteration time by how many batches you plan to feed the model and you will get the total time needed to train any model. As I wasn't part of the t5 training, I don't know what their numbers were.
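
As a quick back-of-the-envelope illustration, plugging in the ~3.86 s/it and 24128 total steps shown in one of the progress bars earlier in this thread:

sec_per_iter = 3.86   # from the tqdm output above
num_iters = 24128     # total steps of that particular run
print(f"~{sec_per_iter * num_iters / 3600:.1f} hours")  # ~25.9 hours, matching the ETA in the logs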
