Releases · huggingface/accelerate

01 Nov 15:30

muellerzr

v1.1.0

d0e80e5

v1.1.0: Python 3.9 minimum, torch dynamo deepspeed support, and bug fixes

Internals:

Allow for a data_seed argument in #3150
Trigger weights_only=True by default for all compatible objects when checkpointing and saving with torch.save in #3036
Handle negative values for dim input in pad_across_processes in #3114
Enable cpu bnb distributed lora finetune in #3159
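
As an illustration of the negative dim handling mentioned above for pad_across_processes, the sketch below shows the usual way a negative axis index is mapped onto a non-negative one. The helper name normalize_dim is hypothetical and this is not the library's code; it only assumes the common convention that -1 refers to the last axis.

```python
def normalize_dim(dim: int, ndim: int) -> int:
    """Map a possibly negative axis index onto the range [0, ndim)."""
    if not -ndim <= dim < ndim:
        raise IndexError(f"dim {dim} is out of range for a tensor with {ndim} dimensions")
    return dim % ndim


# For a 3-dimensional tensor, -1 is the last axis and -3 is the first.
last_axis = normalize_dim(-1, 3)
first_axis = normalize_dim(-3, 3)
```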

DeepSpeed

Support torch dynamo for deepspeed>=0.14.4 in #3069

Megatron

update Megatron-LM plugin code to version 0.8.0 or higher in #3174

Big Model Inference

New has_offloaded_params utility added in #3188

Examples

Florence2 distributed inference example in #3123

Full Changelog

Handle negative values for dim input in pad_across_processes by @mariusarvinte in #3114
Fixup DS issue with weakref by @muellerzr in #3143
Refactor scaler to util by @muellerzr in #3142
DS fix, continued by @muellerzr in #3145
Florence2 distributed inference example by @hlky in #3123
POC: Allow for a data_seed by @muellerzr in #3150
Adding multi gpu speech generation by @dame-cell in #3149
support torch dynamo for deepspeed>=0.14.4 by @oraluben in #3069
Fixup Zero3 + save_model by @muellerzr in #3146
Trigger weights_only=True by default for all compatible objects by @muellerzr in #3036
Remove broken dynamo test by @oraluben in #3155
fix version check bug in get_xpu_available_memory by @faaany in #3165
enable cpu bnb distributed lora finetune by @jiqing-feng in #3159
[Utils] has_offloaded_params by @kylesayrs in #3188
fix bnb by @eljandoubi in #3186
[docs] update neptune API by @faaany in #3181
docs: fix a wrong word in comment in src/accelerate/accelerate.py:1255 by @Rebornix-zero in #3183
[docs] use nn.module instead of tensor as model by @faaany in #3157
Fix typo by @kylesayrs in #3191
MLU devices: Checks if mlu is available via a cndev-based check which won't trigger the drivers and leave mlu by @huismiling in #3187
update Megatron-LM plugin code to version 0.8.0 or higher. by @eljandoubi in #3174
🚨 🚨 🚨 Goodbye Python 3.8! 🚨 🚨 🚨 by @muellerzr in #3194
Update transformers.deepspeed references from transformers 4.46.0 release by @loadams in #3196
eliminate dead code by @statelesshz in #3198
take torch.nn.Module model into account when moving to device by @faaany in #3167
[docs] add xpu part and fix bug in torchrun by @faaany in #3166
Models With Tied Weights Need Re-Tieing After FSDP Param Init by @fabianlim in #3154
add the missing xpu for local sgd by @faaany in #3163
typo fix in big_modeling.py by @a-r-r-o-w in #3207
[Utils] align_module_device by @kylesayrs in #3204

New Contributors

@mariusarvinte made their first contribution in #3114
@hlky made their first contribution in #3123
@dame-cell made their first contribution in #3149
@kylesayrs made their first contribution in #3188
@eljandoubi made their first contribution in #3186
@Rebornix-zero made their first contribution in #3183
@loadams made their first contribution in #3196

Full Changelog: v1.0.1...v1.1.0

Contributors

huismiling, oraluben, and 13 other contributors

12 Oct 03:01

muellerzr

v1.0.1

a427548

v1.0.1: Bugfix

Bugfixes

Fixes an issue where the auto values were no longer being parsed when using deepspeed
Fixes a broken test in the deepspeed tests related to the auto values

Full Changelog: v1.0.0...v1.0.1

07 Oct 15:42

muellerzr

v1.0.0

5d71646

Accelerate 1.0.0 is here!

🚀 Accelerate 1.0 🚀

With accelerate 1.0, we are officially stating that the core parts of the API are now "stable" and ready for the future of what the world of distributed training and PyTorch has to handle. With these release notes, we will focus first on the major breaking changes to get your code fixed, followed by what is new specifically between 0.34.0 and 1.0.

To read more, check out our official blog here

Migration assistance

Passing in dispatch_batches, split_batches, even_batches, and use_seedable_sampler to the Accelerator() should now be handled by creating an accelerate.utils.DataLoaderConfiguration() and passing this to the Accelerator() instead (Accelerator(dataloader_config=DataLoaderConfiguration(...)))
Accelerator().use_fp16 and AcceleratorState().use_fp16 have been removed; this should be replaced by checking accelerator.mixed_precision == "fp16"
Accelerator().autocast() no longer accepts a cache_enabled argument. Instead, an AutocastKwargs() instance should be used which handles this flag (among others) passing it to the Accelerator (Accelerator(kwargs_handlers=[AutocastKwargs(cache_enabled=True)]))
accelerate.utils.is_tpu_available should be replaced with accelerate.utils.is_torch_xla_available
accelerate.utils.modeling.shard_checkpoint should be replaced with split_torch_state_dict_into_shards from the huggingface_hub library
accelerate.tqdm.tqdm() no longer accepts True/False as the first argument, and instead, main_process_only should be passed in as a named argument
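
The first migration item above can be sketched as a small helper that separates the dataloader-related keyword arguments from everything else in an old-style call. The names split_legacy_kwargs and DATALOADER_KEYS are hypothetical and not part of accelerate. The commented lines show how the result would be passed on, assuming accelerate is installed.

```python
# Keyword arguments that moved from Accelerator() to DataLoaderConfiguration().
DATALOADER_KEYS = ("dispatch_batches", "split_batches", "even_batches", "use_seedable_sampler")


def split_legacy_kwargs(legacy_kwargs: dict) -> tuple[dict, dict]:
    """Return (dataloader_kwargs, remaining_accelerator_kwargs)."""
    dataloader_kwargs = {k: v for k, v in legacy_kwargs.items() if k in DATALOADER_KEYS}
    accelerator_kwargs = {k: v for k, v in legacy_kwargs.items() if k not in DATALOADER_KEYS}
    return dataloader_kwargs, accelerator_kwargs


# An old-style set of arguments, as it might have been passed to Accelerator().
old_call = {"split_batches": True, "even_batches": False, "mixed_precision": "fp16"}
dataloader_kwargs, accelerator_kwargs = split_legacy_kwargs(old_call)

# With accelerate installed, the new-style call would then be:
#   from accelerate import Accelerator
#   from accelerate.utils import DataLoaderConfiguration
#   accelerator = Accelerator(
#       dataloader_config=DataLoaderConfiguration(**dataloader_kwargs),
#       **accelerator_kwargs,
#   )
#   uses_fp16 = accelerator.mixed_precision == "fp16"  # replaces accelerator.use_fp16
```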

Multiple Model DeepSpeed Support

After many requests, we finally have multiple model DeepSpeed support in Accelerate! (though it is still quite early). Read the full tutorial here; in short:

When using multiple models, a DeepSpeed plugin should be created for each model (and as a result, a separate config). A few examples are below:

Knowledge distillation

(Where we train only one model, the student, with ZeRO-2, while another, the teacher, is used for inference with ZeRO-3)

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")

deepspeed_plugins = {"student": zero2_plugin, "teacher": zero3_plugin}


accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)

To then select which plugin is used at a certain time (i.e. when calling prepare), we call `accelerator.state.select_deepspeed_plugin("name")`, where the first plugin is active by default:

accelerator.state.select_deepspeed_plugin("student")
student_model, optimizer, scheduler = ...
student_model, optimizer, scheduler, train_dataloader = accelerator.prepare(student_model, optimizer, scheduler, train_dataloader)

accelerator.state.select_deepspeed_plugin("teacher") # This will automatically enable zero init
teacher_model = AutoModel.from_pretrained(...)
teacher_model = accelerator.prepare(teacher_model)

Multiple disjoint models

For disjoint models, separate accelerators should be used for each model, and their own .backward() should be called later:

for batch in dl:
    outputs1 = first_model(**batch)
    first_accelerator.backward(outputs1.loss)
    first_optimizer.step()
    first_scheduler.step()
    first_optimizer.zero_grad()
    
    outputs2 = second_model(**batch)
    second_accelerator.backward(outputs2.loss)
    second_optimizer.step()
    second_scheduler.step()
    second_optimizer.zero_grad()

FP8

We've enabled MS-AMP support, though not with FSDP. At this time we are not going forward with implementing FSDP support with MS-AMP, due to design issues between the two libraries that make it hard for them to interoperate.

FSDP

Fixed FSDP auto_wrap using characters instead of full str for layers
Re-enable setting state dict type manually

Big Modeling

Removed cpu restriction for bnb training

What's Changed

Fix FSDP auto_wrap using characters instead of full str for layers by @muellerzr in #3075
Allow DataLoaderAdapter subclasses to be pickled by implementing __reduce__ by @byi8220 in #3074
Fix three typos in src/accelerate/data_loader.py by @xiabingquan in #3082
Re-enable setting state dict type by @muellerzr in #3084
Support sequential cpu offloading with torchao quantized tensors by @a-r-r-o-w in #3085
fix bug in _get_named_modules by @faaany in #3052
use the correct available memory API for XPU by @faaany in #3076
fix skip_keys usage in forward hooks by @152334H in #3088
Update README.md to include distributed image generation gist by @sayakpaul in #3077
MAINT: Upgrade ruff to v0.6.4 by @BenjaminBossan in #3095
Revert "Enable Unwrapping for Model State Dicts (FSDP)" by @SunMarc in #3096
MS-AMP support (w/o FSDP) by @muellerzr in #3093
[docs] DataLoaderConfiguration docstring by @stevhliu in #3103
MAINT: Permission for GH token in stale.yml by @BenjaminBossan in #3102
[docs] Doc sprint by @stevhliu in #3099
Update image ref for docs by @muellerzr in #3105
No more t5 by @muellerzr in #3107
[docs] More docstrings by @stevhliu in #3108
🚨🚨🚨 The Great Deprecation 🚨🚨🚨 by @muellerzr in #3098
POC: multiple model/configuration DeepSpeed support by @muellerzr in #3097
Fixup test_sync w/ deprecated stuff by @muellerzr in #3109
Switch to XLA instead of TPU by @SunMarc in #3118
[tests] skip pippy tests for XPU by @faaany in #3119
Fixup multiple model DS tests by @muellerzr in #3131
remove cpu restriction for bnb training by @jiqing-feng in #3062
fix deprecated torch.cuda.amp.GradScaler FutureWarning for pytorch 2.4+ by @Mon-ius in #3132
🐛 [HotFix] Handle Profiler Activities Based on PyTorch Version by @yhna940 in #3136
only move model to device when model is in cpu and target device is xpu by @faaany in #3133
fix tip brackets typo by @davanstrien in #3129
typo of "scalar" instead of "scaler" by @tonyzhaozh in #3116
MNT Permission for PRs for GH token in stale.yml by @BenjaminBossan in #3112

New Contributors

@xiabingquan made their first contribution in #3082
@a-r-r-o-w made their first contribution in #3085
@152334H made their first contribution in #3088
@sayakpaul made their first contribution in #3077
@Mon-ius made their first contribution in #3132
@davanstrien made their first contribution in #3129
@tonyzhaozh made their first contribution in #3116

Full Changelog: v0.34.2...v1.0.0

Contributors

BenjaminBossan, muellerzr, and 13 other contributors

05 Sep 15:36

muellerzr

v0.34.1

beb4378

v0.34.1 Patchfix

Bug fixes

Fixes an issue where processed DataLoaders could no longer be pickled in #3074 thanks to @byi8220
Fixes an issue when using FSDP where default_transformers_cls_names_to_wrap would separate _no_split_modules by characters instead of keeping it as a list of layer names in #3075

Full Changelog: v0.34.0...v0.34.1

Contributors

byi8220

03 Sep 14:58

muellerzr

v0.34.0

159c0dd

v0.34.0: StatefulDataLoader Support, FP8 Improvements, and PyTorch Updates!

Dependency Changes

Updated Safetensors Requirement: The library now requires safetensors>=0.4.3.
Added support for Numpy 2.0: The library now fully supports numpy 2.0.0

Core

New Script Behavior Changes

Process Group Management: PyTorch now requires users to destroy process groups after training. The accelerate library will handle this automatically with accelerator.end_training(), or you can do it manually using PartialState().destroy_process_group().
MLU Device Support: Added support for saving and loading RNG states on MLU devices by @huismiling
NPU Support: Corrected backend and distributed settings when using transfer_to_npu, ensuring better performance and compatibility.

DataLoader Enhancements

Stateful DataLoader: We are excited to announce that early support has been added for the StatefulDataLoader from torchdata, allowing better handling of data loading states. Enable it by passing use_stateful_dataloader=True to the DataLoaderConfiguration; when calling load_state(), the DataLoader will then automatically resume from its last step, with no more need to iterate through past batches.
Decoupled Data Loader Preparation: The prepare_data_loader() function is now independent of the Accelerator, giving you more flexibility towards which API levels you would like to use.
XLA Compatibility: Added support for skipping initial batches when using XLA.
Improved State Management: Bug fixes and enhancements for saving/loading DataLoader states, ensuring smoother training sessions.
Epoch Setting: Introduced the set_epoch function for MpDeviceLoaderWrapper.
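
To make the idea of resuming a DataLoader from its last step concrete, here is a toy, standard-library-only sketch of an iterable that tracks its own position and can be restored from a saved state. The class ResumableBatches is hypothetical and is neither torchdata's StatefulDataLoader nor accelerate's wrapper; it only illustrates the state_dict / load_state_dict pattern under simplified assumptions (fixed order, no shuffling, no workers).

```python
class ResumableBatches:
    """Toy iterable that remembers how many batches it has already yielded."""

    def __init__(self, data, batch_size):
        self.data = list(data)
        self.batch_size = batch_size
        self.batches_yielded = 0

    def __iter__(self):
        # Start from the saved position instead of replaying earlier batches.
        start = self.batches_yielded * self.batch_size
        for i in range(start, len(self.data), self.batch_size):
            self.batches_yielded += 1
            yield self.data[i : i + self.batch_size]
        # The epoch finished, so the next iteration starts from the beginning.
        self.batches_yielded = 0

    def state_dict(self):
        return {"batches_yielded": self.batches_yielded}

    def load_state_dict(self, state):
        self.batches_yielded = state["batches_yielded"]


# Consume two batches, then checkpoint the position.
loader = ResumableBatches(range(10), batch_size=2)
iterator = iter(loader)
first, second = next(iterator), next(iterator)
checkpoint = loader.state_dict()

# A fresh loader restored from the checkpoint continues with the third batch.
resumed = ResumableBatches(range(10), batch_size=2)
resumed.load_state_dict(checkpoint)
remaining = list(resumed)
```

The resumed loader never touches the first two batches, which is the behaviour the note describes as no longer having to iterate through past batches.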

FP8 Training Improvements

Enhanced FP8 Training: Fully Sharded Data Parallelism (FSDP) and DeepSpeed support now work seamlessly with TransformerEngine FP8 training, including better defaults for the quantized FP8 weights.
Integration baseline: We've added a new suite of examples and benchmarks to ensure that our TransformerEngine integration works exactly as intended. These scripts run one half using 🤗 Accelerate's integration and the other with raw TransformerEngine, providing users with a nice example of what we do under the hood with accelerate, and a good sanity check to make sure nothing breaks down over time. Find them here
Import Fixes: Resolved issues with the import checks for TransformerEngine that caused downstream issues.
FP8 Docker Images: We've added new docker images for TransformerEngine and accelerate as well. Use docker pull huggingface/accelerate:gpu-fp8-transformerengine to quickly get an environment going.

`torchpippy` no more, long live `torch.distributed.pipelining`

With the latest PyTorch release, torchpippy is now fully integrated into torch core, and as a result we are exclusively supporting the PyTorch implementation from now on.
There are breaking changes, including to the examples, that come from this shift. Namely:
- Tracing of inputs is done with the shape each GPU will see, rather than the size of the total batch. So for 2 GPUs, one should pass in an input of [1, n, n] rather than [2, n, n] as before.
- We no longer support Encoder/Decoder models. PyTorch tracing for pipelining no longer supports encoder/decoder models, so the t5 example has been removed.
- Computer vision model support currently does not work: there are some tracing issues regarding ResNets that we are actively looking into.
If any of these changes are too breaking, we recommend pinning your accelerate version. If the lack of encoder/decoder model support is actively blocking your inference using pippy, please open an issue and let us know. We can potentially look into adding back the old torchpippy support if needed.
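
The first tracing change listed above comes down to a small piece of arithmetic on the example input shape. The sketch below assumes the batch dimension comes first and is split evenly across GPUs; the helper name per_gpu_example_shape is hypothetical and not part of accelerate or PyTorch.

```python
def per_gpu_example_shape(global_shape, num_gpus):
    """Return the shape a single GPU sees when the batch is split evenly."""
    batch, *rest = global_shape
    if batch % num_gpus != 0:
        raise ValueError(f"batch size {batch} is not divisible by {num_gpus} GPUs")
    return [batch // num_gpus, *rest]


n = 128
# Previously the example input had the total batch shape [2, n, n];
# now it should have the shape that each of the 2 GPUs will see.
example_shape = per_gpu_example_shape([2, n, n], num_gpus=2)
```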

Fully Sharded Data Parallelism (FSDP)

Environment Flexibility: Environment variables are now fully optional for FSDP, simplifying configuration. You can now fully create a FullyShardedDataParallelPlugin yourself manually with no need for environment patching:

from accelerate import FullyShardedDataParallelPlugin
fsdp_plugin = FullyShardedDataParallelPlugin(...)

FSDP RAM efficient loading: Added a utility to enable RAM-efficient model loading (by setting the proper environment variable). This is generally needed if you are not using accelerate launch and need to ensure the env variables are set up properly for model loading:

from accelerate.utils import enable_fsdp_ram_efficient_loading, disable_fsdp_ram_efficient_loading
enable_fsdp_ram_efficient_loading()
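
Since the utility works by setting an environment variable, the same effect can be scoped to a block of code with a small context manager. This is a generic sketch rather than accelerate's implementation: temporary_env is a hypothetical helper, and the variable name FSDP_CPU_RAM_EFFICIENT_LOADING is an assumption that is not stated in these notes and should be checked against the library source.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set an environment variable for the duration of a block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# ASSUMPTION: this variable name is not confirmed by the release notes.
ASSUMED_FLAG = "FSDP_CPU_RAM_EFFICIENT_LOADING"

with temporary_env(ASSUMED_FLAG, "True"):
    # Model loading that should see the flag would go here.
    value_inside_block = os.environ[ASSUMED_FLAG]
```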

Model State Dict Management: Enhanced support for unwrapping model state dicts in FSDP, making it easier to manage distributed models.

New Examples

Configuration and Models: Improved configuration handling and introduced a configuration zoo for easier experimentation. You can learn more here. This was largely inspired by the axolotl library, so very big kudos to their wonderful work
FSDP + SLURM Example: Added a minimal configuration example for running jobs with SLURM and using FSDP

Bug Fixes

Fix bug of clip_grad_norm_ for xla fsdp by @hanwen-sun in #2941
Explicit check for step when loading the state by @muellerzr in #2992
Fix find_tied_params for models with shared layers by @qubvel in #2986
clear memory after offload by @SunMarc in #2994
fix default value for rank size in cpu threads_per_process assignment logic by @rbrugaro in #3009
Fix batch_sampler maybe None error by @candlewill in #3025
Do not import transformer_engine on import by @oraluben in #3056
Fix torchvision to be compatible with torch version in CI by @SunMarc in #2982
Fix gated test by @muellerzr in #2993
Fix typo on warning str: "on the meta device device" -> "on the meta device" by @HeAndres in #2997
Fix deepspeed tests by @muellerzr in #3003
Fix torch version check by @muellerzr in #3024
Fix fp8 benchmark on single GPU by @muellerzr in #3032
Fix typo in comment by @zmoki688 in #3045
Speed up tests by shaving off subprocess when not needed by @muellerzr in #3042
Remove skip_first_batches support for StatefulDataloader and fix all the tests by @muellerzr in #3068

New Contributors

@byi8220 made their first contribution in #2957
@alex-jw-brooks made their first contribution in #2959
@XciD made their first contribution in #2981
@hanwen-sun made their first contribution in #2941
@HeAndres made their first contribution in #2997
@yitongh made their first contribution in #2966
@qubvel made their first contribution in #2986
@rbrugaro made their first contribution in #3009
@candlewill made their first contribution in #3025
@siddk made their first contribution in #3047
@oraluben made their first contribution in #3056
@tmm1 made their first contribution in #3055
@zmoki688 made their first contribution in #3045

Full Changelog:

Require safetensors>=0.4.3 by @byi8220 in #2957
Fix torchvision to be compatible with torch version in CI by @SunMarc in #2982
Enable Unwrapping for Model State Dicts (FSDP) by @alex-jw-brooks in #2959
chore: Update runs-on configuration for CI workflows by @XciD in #2981
add MLU devices for rng state saving and loading. by @huismiling in #2940
remove .md to allow proper linking by @nbroad1881 in #2977
Fix bug of clip_grad_norm_ for xla fsdp by @hanwen-sun in #2941
Fix gated test by @muellerzr in #2993
Explicit check for step when loading the state by @muellerzr in #2992
Fix typo on warning str: "on the meta device device" -> "on the meta device" by @HeAndres in #2997
Support skip_first_batches for XLA by @yitongh in #2966
clear memory aft...

Contributors

tmm1, siddk, and 16 other contributors

08 Aug 12:57

muellerzr

v0.33.0

28a3b98

v0.33.0: MUSA backend support and bugfixes

MUSA backend support and bugfixes

Small release this month, with key focuses on some added support for backends and bugs:

Support MUSA (Moore Threads GPU) backend in accelerate by @fmo-mt in #2917
Allow multiple process per device by @cifkao in #2916
Add torch.float8_e4m3fn format dtype_byte_size by @SunMarc in #2945
Properly handle Params4bit in set_module_tensor_to_device by @matthewdouglas in #2934

What's Changed

[tests] fix bug in torch_device by @faaany in #2909
Fix slowdown on init with device_map="auto" by @muellerzr in #2914
fix: bug where multi_gpu was being set and warning being printed even with num_processes=1 by @HarikrishnanBalagopal in #2921
Better error when a bad directory is given for weight merging by @muellerzr in #2852
add xpu device check before moving tensor directly to xpu device by @faaany in #2928
Add huggingface_hub version to setup.py by @nullquant in #2932
Correct loading of models with shared tensors when using accelerator.load_state() by @jkuntzer in #2875
Hotfix PyTorch Version Installation in CI Workflow for Minimum Version Matrix by @yhna940 in #2889
Fix import test by @muellerzr in #2931
Consider pynvml available when installed through the nvidia-ml-py distribution by @matthewdouglas in #2936
Improve test reliability for Accelerator.free_memory() by @matthewdouglas in #2935
delete CCL env var setting by @Liangliang-Ma in #2927
feat(ci): add pip caching in CI by @SauravMaheshkar in #2952

New Contributors

@HarikrishnanBalagopal made their first contribution in #2921
@fmo-mt made their first contribution in #2917
@nullquant made their first contribution in #2932
@cifkao made their first contribution in #2916
@jkuntzer made their first contribution in #2875
@matthewdouglas made their first contribution in #2936
@Liangliang-Ma made their first contribution in #2927
@SauravMaheshkar made their first contribution in #2952

Full Changelog: v0.32.1...v0.33.0

Contributors

muellerzr, cifkao, and 10 other contributors

03 Jul 17:44

muellerzr

v0.32.0

6d3324a

v0.32.0: Profilers, new hooks, speedups, and more!

Core

Utilize shard saving from the huggingface_hub rather than our own implementation (#2795)
Refactor logging to use logger in dispatch_model (#2855)
The Accelerator.step number is now restored when using save_state and load_state (#2765)
A new profiler has been added allowing users to collect performance metrics during model training and inference, including detailed analysis of execution time and memory consumption. The resulting traces can then be viewed in Chrome's tracing tool. Read more about it here (#2883)
Reduced import times for doing import accelerate and any other major core import by 68%, now should be only slightly longer than doing import torch (#2845)
Fixed a bug in get_backend and added a clear_device_cache utility (#2857)

Distributed Data Parallelism

Introduce DDP communication hooks to have more flexibility in how gradients are communicated across workers, overriding the standard allreduce. (#2841)
Make log_line_prefix_template optional in the notebook_launcher (#2888)

FSDP

If the output directory doesn't exist when using accelerate merge-weights, one will be automatically created (#2854)
When merging weights, the default is now .safetensors (#2853)

XPU

Migrate to pytorch's native XPU backend on torch>=2.4 (#2825)
Add @require_triton test decorator and enable test_dynamo work on xpu (#2878)
Fixed load_state_dict not working on xpu and refine xpu safetensors version check (#2879)

XLA

Added support for XLA Dynamo backends for both training and inference (#2892)

Examples

Added a new multi-cpu SLURM example using accelerate launch (#2902)

Full Changelog

Use shard saving from huggingface_hub by @SunMarc in #2795
doc: fix link by @imba-tjd in #2844
Revert "Slight rename" by @SunMarc in #2850
remove warning hook added during dispatch_model by @SunMarc in #2843
Remove underlines between badges by @novialriptide in #2851
Auto create dir when merging FSDP weights by @helloworld1 in #2854
Add DDP Communication Hooks by @yhna940 in #2841
Refactor logging to use logger in dispatch_model by @panjd123 in #2855
xpu: support xpu backend from stock pytorch (>=2.4) by @dvrogozh in #2825
Drop torch re-imports in npu and mlu paths by @dvrogozh in #2856
Default FSDP weights merge to safetensors by @helloworld1 in #2853
[tests] fix bug in test_tracking.ClearMLTest by @faaany in #2863
[tests] use torch_device instead of 0 for device check by @faaany in #2861
[tests] skip bnb-related tests instead of failing on xpu by @faaany in #2860
Potentially fix tests by @muellerzr in #2862
[tests] enable XPU backend for test_zero3_integration by @faaany in #2864
Support saving and loading of step while saving and loading state by @bipinKrishnan in #2765
Add Profiler Support for Performance Analysis by @yhna940 in #2883
Speed up imports and add a CI by @muellerzr in #2845
Make log_line_prefix_template Optional in Elastic Launcher for Backward Compatibility by @yhna940 in #2888
Add XLA Dynamo backends for training and inference by @johnsutor in #2892
Added a MultiCPU SLURM example using Accelerate Launch and MPIRun by @okhleif-IL in #2902
make more cuda-only tests device-agnostic by @faaany in #2876
fix mlu device longTensor bugs by @huismiling in #2887
add require_triton and enable test_dynamo work on xpu by @faaany in #2878
fix load_state_dict for xpu and refine xpu safetensor version check by @faaany in #2879
Fix get_backend bug and add clear_device_cache function by @NurmaU in #2857

New Contributors

@McPatate made their first contribution in #2836
@imba-tjd made their first contribution in #2844
@novialriptide made their first contribution in #2851
@panjd123 made their first contribution in #2855
@dvrogozh made their first contribution in #2825
@johnsutor made their first contribution in #2892
@okhleif-IL made their first contribution in #2902
@NurmaU made their first contribution in #2857

Full Changelog: v0.31.0...v0.32.0

Contributors

helloworld1, huismiling, and 13 other contributors

07 Jun 15:27

muellerzr

v0.31.0

66eefd7

v0.31.0: Better support for sharded state dict with FSDP and Bugfixes

Core

Set timeout default to PyTorch defaults based on backend by @muellerzr in #2758
fix duplicate elements in split_between_processes by @hkunzhe in #2781
Add Elastic Launch Support to notebook_launcher by @yhna940 in #2788
Fix Wrong use of sync_gradients used to implement sync_each_batch by @fabianlim in #2790
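
To show what splitting inputs between processes without duplicate elements looks like, here is a minimal sketch of one common scheme: contiguous slices where the first few ranks take one extra item when the length does not divide evenly. The function split_for_rank is hypothetical and is not the implementation of split_between_processes; it only illustrates the property the fix is about, namely that every item lands on exactly one rank.

```python
def split_for_rank(items, num_processes, rank):
    """Return the contiguous slice of `items` owned by `rank`."""
    per_process, remainder = divmod(len(items), num_processes)
    # Ranks below `remainder` each take one extra item.
    start = rank * per_process + min(rank, remainder)
    end = start + per_process + (1 if rank < remainder else 0)
    return items[start:end]


prompts = ["a", "b", "c", "d", "e"]
shards = [split_for_rank(prompts, num_processes=2, rank=r) for r in range(2)]

# Concatenating the shards gives back the original list, with nothing repeated.
recombined = [item for shard in shards for item in shard]
```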

FSDP

Introduce shard-merging util for FSDP by @muellerzr in #2772
Enable sharded state dict + offload to cpu resume by @muellerzr in #2762
Enable config for fsdp activation checkpointing by @helloworld1 in #2779

Megatron

Upgrade huggingface's megatron to nvidia's megatron when use MegatronLMPlugin by @zhangsheng377 in #2501

What's Changed

Add feature to allow redirecting std streams into log files when using torchrun as the launcher. by @lyuwen in #2740
Update modeling.py by adding try-catch section to skip the unavailable devices by @MeVeryHandsome in #2681
Fixed the problem of incorrect conditional judgment statement when configuring enable_cpu_affinity by @statelesshz in #2748
Fix stacklevel in logging to log the actual user call site (instead of the call site inside the logger wrapper) of log functions by @luowyang in #2730
LOMO / FIX: Support multiple optimizers by @younesbelkada in #2745
Fix max_memory assignment by @SunMarc in #2751
Fix duplicate environment variable check in multi-cpu condition by @yhna940 in #2752
Simplify CLI args validation and ensure CLI args take precedence over config file. by @Iain-S in #2757
Fix sagemaker config by @muellerzr in #2753
fix cpu omp num threads set by @jiqing-feng in #2755
Revert "Simplify CLI args validation and ensure CLI args take precedence over config file." by @muellerzr in #2763
Enable sharded cpu resume by @muellerzr in #2762
Sets default to PyTorch defaults based on backend by @muellerzr in #2758
optimize get_module_leaves speed by @BBuf in #2756
fix minor typo by @TemryL in #2767
Fix small edge case in get_module_leaves by @SunMarc in #2774
Skip deepspeed test by @SunMarc in #2776
Enable config for fsdp activation checkpointing by @helloworld1 in #2779
Add arg from CLI to fix failing test by @muellerzr in #2783
Skip tied weights disk offload test by @SunMarc in #2782
Introduce shard-merging util for FSDP by @muellerzr in #2772
FIX / FSDP : Guard fsdp utils for earlier PyTorch versions by @younesbelkada in #2794
Upgrade huggingface's megatron to nvidia's megatron when use MegatronLMPlugin by @zhangsheng377 in #2501
Fixup CLI test by @muellerzr in #2796
fix duplicate elements in split_between_processes by @hkunzhe in #2781
Add Elastic Launch Support to notebook_launcher by @yhna940 in #2788
Fix Wrong use of sync_gradients used to implement sync_each_batch by @fabianlim in #2790
Fix type in accelerator.py by @qgallouedec in #2800
fix comet ml test by @SunMarc in #2804
New template by @muellerzr in #2808
Fix access error for torch.mps when using torch==1.13.1 on macOS by @SunMarc in #2806
4-bit quantization meta device bias loading bug by @SunMarc in #2805
State dictionary retrieval from offloaded modules by @blbadger in #2619
add cuda dep for a test by @SunMarc in #2820
Remove out-dated xpu device check code in get_balanced_memory by @faaany in #2826
Fix DeepSpeed config validation error by changing stage3_prefetch_bucket_size value to an integer by @adk9 in #2814
Improve test speeds by up to 30% in multi-gpu settings by @muellerzr in #2830
monitor-interval, take 2 by @muellerzr in #2833
Optimize the megatron plugin by @zhangsheng377 in #2822
fix fstr format by @Jintao-Huang in #2810

New Contributors

@lyuwen made their first contribution in #2740
@MeVeryHandsome made their first contribution in #2681
@luowyang made their first contribution in #2730
@Iain-S made their first contribution in #2757
@BBuf made their first contribution in #2756
@TemryL made their first contribution in #2767
@helloworld1 made their first contribution in #2779
@hkunzhe made their first contribution in #2781
@adk9 made their first contribution in #2814
@Jintao-Huang made their first contribution in #2810

Full Changelog: v0.30.1...v0.31.0

Contributors

adk9, helloworld1, and 19 other contributors

10 May 17:47

muellerzr

v0.30.1

b52803d

v0.30.1: Bugfixes

Patchfix

Fix duplicate environment variable check in multi-cpu condition thanks to @yhna940 in #2752
Fix issue with missing values in the SageMaker config leading to not being able to launch in #2753
Fix CPU OMP num threads setting thanks to @jiqing-feng in #2755
Fix FSDP checkpoint unable to resume when using offloading and sharded weights due to CUDA OOM when loading the optimizer and model #2762
Fixed the problem of incorrect conditional judgment statement when configuring enable_cpu_affinity thanks to @statelesshz in #2748
Fix stacklevel in logging to log the actual user call site (instead of the call site inside the logger wrapper) of log functions thanks to @luowyang in #2730
Fix support for multiple optimizers when using LOMO thanks to @younesbelkada in #2745
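
The stacklevel fix listed above relies on a feature of Python's standard logging module. The sketch below uses only the standard library and hypothetical names (log_info, training_step) to show how passing stacklevel=2 from a wrapper makes the log record point at the wrapper's caller. It does not reproduce accelerate's logger.

```python
import logging

logger = logging.getLogger("stacklevel_demo")
logger.setLevel(logging.INFO)
logger.propagate = False


class CaptureHandler(logging.Handler):
    """Collect the function name recorded for each log call."""

    def __init__(self):
        super().__init__()
        self.function_names = []

    def emit(self, record):
        self.function_names.append(record.funcName)


capture = CaptureHandler()
logger.addHandler(capture)


def log_info(message):
    # stacklevel=2 attributes the record to the caller of log_info,
    # not to this wrapper function.
    logger.info(message, stacklevel=2)


def training_step():
    log_info("starting step")


training_step()
```

With the default stacklevel of 1, the recorded function name would be log_info instead, which is the kind of misleading call site the fix addresses.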

Full Changelog: v0.30.0...v0.30.1

Contributors

luowyang, statelesshz, and 3 other contributors

03 May 15:29

muellerzr

v0.30.0

989cc50

v0.30.0: Advanced optimizer support, MoE DeepSpeed support, add upcasting for FSDP, and more

Core

We've simplified the tqdm wrapper to make it a full passthrough: there is no need for tqdm(main_process_only, *args) anymore, it is now just tqdm(*args), and you can pass in main_process_only as a kwarg.
We've added support for advanced optimizer usage:
- Schedule free optimizer introduced by Meta by @muellerzr in #2631
- LOMO optimizer introduced by OpenLMLab by @younesbelkada in #2695
Enable BF16 autocast to everything during FP8 and enable FSDP by @muellerzr in #2655
Support dataloader send_to_device calls to use non-blocking by @drhead in #2685
allow gather_for_metrics to be more flexible by @SunMarc in #2710
Add cann version info to command accelerate env for NPU by @statelesshz in #2689
Add MLU rng state setter by @ArthurinRUC in #2664
device agnostic testing for hooks&utils&big_modeling by @statelesshz in #2602

Documentation

Through collaboration between @fabianlim (lead contributor), @stas00, @pacman100, and @muellerzr we have a new concept guide out for FSDP and DeepSpeed, explicitly detailing how the two interoperate and explaining fully and clearly how each of them works. This was a monumental effort by @fabianlim to ensure that everything is as accurate as possible for users. I highly recommend visiting this new documentation, available here
New distributed inference examples have been added thanks to @SunMarc in #2672
Fixed some docs for using internal trackers by @brentyi in #2650

DeepSpeed

Accelerate can now handle MoE models when using deepspeed, thanks to @pacman100 in #2662
Allow "auto" for gradient clipping in YAML by @regisss in #2649
Introduce a deepspeed-specific Docker image by @muellerzr in #2707. To use, pull the gpu-deepspeed tag docker pull huggingface/accelerate:cuda-deepspeed-nightly

Megatron

Megatron plugin can support NPU by @zhangsheng377 in #2667

Big Modeling

Add strict arg to load_checkpoint_and_dispatch by @SunMarc in #2641

Bug Fixes

Fix up state with xla + performance regression by @muellerzr in #2634
Parenthesis on xpu_available by @muellerzr in #2639
Fix is_train_batch_min type in DeepSpeedPlugin by @yhna940 in #2646
Fix backend check by @jiqing-feng in #2652
Fix the rng states of sampler's generator to be synchronized for correct sharding of dataset across GPUs by @pacman100 in #2694
Block AMP for MPS device by @SunMarc in #2699
Fixed issue when doing multi-gpu training with bnb when the first gpu is not used by @SunMarc in #2714
Fixup free_memory to deal with garbage collection by @muellerzr in #2716
Fix sampler serialization failing by @SunMarc in #2723
Fix deepspeed offload device type in the arguments to be more accurate by @yhna940 in #2717
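
The sampler fix above concerns keeping the shuffling generator in sync across processes. A minimal standard-library sketch of why this matters: if every rank shuffles the same index list with the same seed and then takes a strided slice, the shards are disjoint and together cover the dataset. The function shard_indices is hypothetical and is not how accelerate shards data internally.

```python
import random


def shard_indices(dataset_len, num_processes, rank, seed):
    """Shuffle all indices with `seed`, then keep every num_processes-th one."""
    indices = list(range(dataset_len))
    random.Random(seed).shuffle(indices)
    return indices[rank::num_processes]


# Every rank uses the same seed, so all ranks agree on the permutation.
shards = [shard_indices(8, num_processes=2, rank=r, seed=0) for r in range(2)]
covered = sorted(shards[0] + shards[1])
```

If each rank used a different seed, the permutations would differ and the shards would no longer be guaranteed to be disjoint, so some samples could be seen twice and others not at all.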

Full Changelog

Schedule free optimizer support by @muellerzr in #2631
Fix up state with xla + performance regression by @muellerzr in #2634
Parenthesis on xpu_available by @muellerzr in #2639
add third-party device prefix to execution_device by @faaany in #2612
add strict arg to load_checkpoint_and_dispatch by @SunMarc in #2641
device agnostic testing for hooks&utils&big_modeling by @statelesshz in #2602
Docs fix for using internal trackers by @brentyi in #2650
Allow "auto" for gradient clipping in YAML by @regisss in #2649
Fix is_train_batch_min type in DeepSpeedPlugin by @yhna940 in #2646
Don't use deprecated Repository anymore by @Wauplin in #2658
Fix test_from_pretrained_low_cpu_mem_usage_measured failure by @yuanwu2017 in #2644
Add MLU rng state setter by @ArthurinRUC in #2664
fix backend check by @jiqing-feng in #2652
Megatron plugin can support NPU by @zhangsheng377 in #2667
Revert "fix backend check" by @muellerzr in #2669
tqdm: *args should come ahead of main_process_only by @rb-synth in #2654
Handle MoE models with DeepSpeed by @pacman100 in #2662
Fix deepspeed moe test with version check by @pacman100 in #2677
Pin DS...again.. by @muellerzr in #2679
fix backend check by @jiqing-feng in #2670
Deprecate tqdm args + slight logic tweaks by @muellerzr in #2673
Enable BF16 autocast to everything during FP8 + some tweaks to enable FSDP by @muellerzr in #2655
Fix the rng states of sampler's generator to be synchronized for correct sharding of dataset across GPUs by @pacman100 in #2694
Simplify test logic by @pacman100 in #2697
Add source code for DataLoader Animation by @muellerzr in #2696
Block AMP for MPS device by @SunMarc in #2699
Do a pip freeze during workflows by @muellerzr in #2704
add cann version info to command accelerate env by @statelesshz in #2689
Add version checks for the import of DeepSpeed moe utils by @pacman100 in #2705
Change dataloader send_to_device calls to non-blocking by @drhead in #2685
add distributed examples by @SunMarc in #2672
Add diffusers to req by @muellerzr in #2711
fix bnb multi gpu training by @SunMarc in #2714
allow gather_for_metrics to be more flexible by @SunMarc in #2710
Add Upcasting for FSDP in Mixed Precision. Add Concept Guide for FSDP and DeepSpeed. by @fabianlim in #2674
Segment out a deepspeed docker image by @muellerzr in #2707
Fixup free_memory to deal with garbage collection by @muellerzr in #2716
fix sampler serialization by @SunMarc in #2723
Fix sampler failing test by @SunMarc in #2728
Docs: Fix build main documentation by @SunMarc in #2729
Fix Documentation in FSDP and DeepSpeed Concept Guide by @fabianlim in #2725
Fix deepspeed offload device type by @yhna940 in #2717
FEAT: Add LOMO optimizer by @younesbelkada in #2695
Fix tests on main by @muellerzr in #2739

New Contributors

@brentyi made their first contribution in #2650
@regisss made their first contribution in #2649
@yhna940 made their first contribution in #2646
@Wauplin made their first contribution in #2658
@ArthurinRUC made their first contribution in #2664
@jiqing-feng made their first contribution in #2652
@zhangsheng377 made their first contribution in #2667
@rb-synth made their first contribution in #2654
@drhead made their first contribution in #2685

Full Changelog: https://github.com/huggingface/acce...

Contributors

drhead, zhangsheng377, and 16 other contributors
