Fix deepspeed docs (#15346)
ngoquanghuy99 authored Jan 26, 2022
1 parent 96161ac commit 5d8b986
Showing 1 changed file with 12 additions and 12 deletions.
docs/source/main_classes/deepspeed.mdx
@@ -31,7 +31,7 @@ won't be possible on a single GPU.

🤗 Transformers integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options:

-1. Integration of the core DeepSpeed features via [`Trainer`]. This is everything done for you type
+1. Integration of the core DeepSpeed features via [`Trainer`]. This is everything done for your type
of integration - just supply your custom config file or use our template and you have nothing else to do. Most of
this document is focused on this feature.
2. If you don't use [`Trainer`] and want to use your own Trainer where you integrated DeepSpeed
@@ -97,7 +97,7 @@ TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--disable-pip-version-check 2>&1 | tee build.log
```

-If you intend to use NVMe offload you will need to also include `DS_BUILD_AIO=1` in the instructions above (and also
+If you intend to use NVMe offload you will also need to include `DS_BUILD_AIO=1` in the instructions above (and also
install *libaio-dev* system-wide).
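For example, a hedged sketch of that NVMe-enabled build might look as follows (the `8.6` arch value and the `apt` package manager are assumptions; adjust both for your setup):

```bash
# libaio development headers must be available system-wide (package name varies by distro)
sudo apt install libaio-dev

# same recipe as above, plus DS_BUILD_AIO=1 to prebuild the async-IO op used by NVMe offload;
# TORCH_CUDA_ARCH_LIST="8.6" is only an example value - set it to match your GPUs
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_AIO=1 DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--disable-pip-version-check 2>&1 | tee build.log
```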

Edit `TORCH_CUDA_ARCH_LIST` to insert the code for the architectures of the GPU cards you intend to use. Assuming all
@@ -134,7 +134,7 @@ You can check the archs pytorch was built with using:
python -c "import torch; print(torch.cuda.get_arch_list())"
```
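For instance (a purely illustrative run; the list depends on how your PyTorch binary was built):

```bash
python -c "import torch; print(torch.cuda.get_arch_list())"
# hypothetical output for a recent CUDA 11.x wheel - yours will differ:
# ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
```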

-Here is how to find out the arch for one of the installed GPU. For example, for GPU 0:
+Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:

```bash
CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
@@ -169,7 +169,7 @@ following:
2. add a new argument `--deepspeed ds_config.json`, where `ds_config.json` is the DeepSpeed configuration file as
documented [here](https://www.deepspeed.ai/docs/config-json/). The file naming is up to you.

-Therefore, if your original command line looked as following:
+Therefore, if your original command line looked as follows:

```bash
python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
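# If the same job is switched to the DeepSpeed launcher, a rough sketch of the
# equivalent call might be (flag values are assumptions - adjust to your setup):
#
#   deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json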
@@ -214,7 +214,7 @@ For some practical usage examples, please, see this [post](https://github.com/hu

### Deployment with one GPU

-To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as following:
+To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as follows:

```bash
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
@@ -560,7 +560,7 @@ Do note that some values, such as `scheduler.params.total_num_steps` are calcula
### ZeRO

[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It
-support 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
+supports 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity.
You will find more indepth information in the DeepSpeed documentation.

@@ -581,7 +581,7 @@ going to use.

#### ZeRO-2 Config

-The following is an example configuration for ZeRO stage 2:
+The following is an example of configuration for ZeRO stage 2:

```json
{
@@ -604,13 +604,13 @@ The following is an example configuration for ZeRO stage 2:
**Performance tuning:**

- enabling `offload_optimizer` should reduce GPU RAM usage (it requires `"stage": 2`)
-- `"overlap_comm": true` trades off increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
+- `"overlap_comm": true` trade offs increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
the `allgather_bucket_size` and `reduce_bucket_size` values. So if they are set to 5e8, this requires a 9GB
footprint (`5e8 x 2Bytes x 2 x 4.5`). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
OOM-errors you will need to reduce those parameters to about `2e8`, which would require 3.6GB. You will want to do
the same on larger capacity GPU as well, if you're starting to hit OOM.
-- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size,
-the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
+- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size is,
+the slower the communication gets, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
important, getting a slightly slower training time could be a good trade.
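As a rough sketch of that tuning (the file name and the minimal set of keys shown are assumptions, not a complete config), cutting the buffers from `5e8` to `2e8` brings the `overlap_comm` footprint from ~9GB down to ~3.6GB (`2e8 x 2Bytes x 2 x 4.5`):

```bash
# write a minimal ZeRO-2 fragment sized for a GPU with ~8GB of RAM; values are illustrative only
cat <<'EOF' > ds_config_zero2_small_gpu.json
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "overlap_comm": true,
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8
    }
}
EOF
```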


@@ -619,7 +619,7 @@ The following is an example configuration for ZeRO stage 2:

#### ZeRO-3 Config

-The following is an example configuration for ZeRO stage 3:
+The following is an example of configuration for ZeRO stage 3:

```json
{
@@ -662,7 +662,7 @@ and its typically accessed much faster than normal CPU memory.

If hitting OOM reduce `stage3_max_live_parameters` and `stage3_max_reuse_distance`. They should have minimal impact
on performance unless you are doing activation checkpointing. `1e9` would consume ~2GB. The memory is shared by
-`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so its not additive, its just 2GB total.
+`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so it's not additive, it's just 2GB total.
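As a hedged sketch of that mitigation (the key names are DeepSpeed's own, the file name and concrete values are assumptions), halving both knobs from `1e9` to `5e8` roughly halves that shared ~2GB budget:

```bash
# minimal ZeRO-3 fragment illustrating the OOM mitigation described above; values are examples only
cat <<'EOF' > ds_config_zero3_low_mem.json
{
    "zero_optimization": {
        "stage": 3,
        "stage3_max_live_parameters": 5e8,
        "stage3_max_reuse_distance": 5e8
    }
}
EOF
```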

`stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given
time. "reuse distance" is a metric we are using to figure out when will a parameter be used again in the future, and we
