Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auto_find_batch_size isn't yet supported with DeepSpeed/FSDP. Raise error accrodingly. #29058

Merged
merged 1 commit into from
Feb 16, 2024

Conversation

pacman100
Copy link
Contributor

What does this PR do?

  1. When examining if auto_find_batch_size issue with DeepSpeed is solved via Zach's previous PR as someone commented on the PR that issue is still there: Support DeepSpeed when using auto find batch size #28088 (comment)

When I try https://github.com/pacman100/DHS-LLM-Workshop/tree/main/chat_assistant/sft/training with following command:

accelerate launch --config_file "configs/deepspeed_config.yaml" train.py \
--seed 100 \
--model_name_or_path "mistralai/Mistral-7B-v0.1" \
--dataset_name "smangrul/code-chat-assistant-v1" \
--chat_template_format "none" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 2e-5 \
--lr_scheduler_type "cosine" \
--weight_decay 0.0 \
--warmup_ratio 0.1 \
--max_grad_norm 1.0 \
--output_dir "mistral-sft-ds" \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--dataset_text_field "content" \
--use_flash_attn True \
--auto_find_batch_size True

I get a different error:

File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1328, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1477, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1510, in _partition_param
    free_param(param)
  File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 285, in free_param
    assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 26, 'status': 'AVAILABLE', 'numel': 4096, 'ds_numel': 4096, 'shape': (4096,), 'ds_shape': (4096,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {44}, 'ds_tensor.shape': torch.Size([512])}
[2024-02-09 14:50:57,113] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 230181 closing signal SIGTERM
[2024-02-09 14:50:57,646] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 230182) of binary: /fsx/sourab/miniconda3/envs/hf/bin/python

As auto_find_batch_size is good to have feature and not a necessity, coupled with the obscure errors noticed with DeepSpeeed/FSDP, we don't want to spend more time around this at present. Hence, this PR to raise error when trying to use auto_find_batch_size with DeepSpeed/FSDP.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's annoying with this is at the time it was working and that test was passing, but if it's niche enough (and we def don't have the bandwidth right now) raising this err makes sense. If enough users vocalize concern we can look at reverting it.

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this!

Agreed with @muellerzr - raising an error makes sense and we can work on adding support if it's something which is requested a lot

@pacman100 pacman100 merged commit 4c18ddb into main Feb 16, 2024
21 checks passed
@pacman100 pacman100 deleted the smangrul/throw-error-auto-batch-size-for-ds-fsdp branch February 16, 2024 12:41
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Feb 19, 2024
itazap pushed a commit that referenced this pull request May 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants