Integrate Accelerate, Deepspeed #117

Merged 9 commits on Aug 23, 2024
275 changes: 275 additions & 0 deletions accelerate_and_env_note.md
@@ -0,0 +1,275 @@
# How to Start

``Caution: This file is temporary and will be used as a reference for our final README``

## On our Strangepork nodes

### Create venv
Source the Python environment built with bzip2 (it includes the Python development headers, so there won't be issues with missing ``Python.h`` files):
```
source /share/apps/source_files/python/python-3.11.5_bzip2.source
```

Then create a Python venv using

```
python -m venv /path/to/new/virtual/environment
```
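
To work inside the venv, activate it first. This is standard venv activation, assuming the path from the command above:

```
source /path/to/new/virtual/environment/bin/activate
```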


## Source CUDA, then install dependencies

### Source CUDA
Before installing any dependencies, we need to source CUDA into the environment. Otherwise, some required libraries may fail to compile.

On Strangepork, I recommend using CUDA 11.8; the source file path is as follows:
```
source /share/apps/source_files/cuda/cuda-11.8.source
```
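
As an optional sanity check, you can confirm that the right toolkit is active by querying the CUDA compiler version:

```
nvcc --version  # should report "release 11.8"
```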

### Install dependencies

Install all requirements: accelerate, deepspeed, torch (should be >= 2.3.1), transformers, and possibly a few others depending on the model used.

As a reference, this is a copy of the packages in Adam's venv:
```
accelerate==0.31.0
aiohttp==3.8.6
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.1.0
bitsandbytes==0.43.1
certifi==2023.7.22
charset-normalizer==3.3.0
click==8.1.7
contourpy==1.2.1
cycler==0.12.1
datasets==2.14.5
deepspeed==0.14.2
dill==0.3.7
docker-pycreds==0.4.0
filelock==3.12.4
fonttools==4.53.0
frozenlist==1.4.0
fsspec==2023.6.0
gitdb==4.0.11
GitPython==3.1.43
hjson==3.1.0
huggingface-hub==0.23.3
idna==3.4
Jinja2==3.1.3
joblib==1.4.2
kiwisolver==1.4.5
MarkupSafe==2.1.5
matplotlib==3.9.0
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
networkx==3.2.1
ninja==1.11.1.1
numpy==1.26.1
nvidia-cublas-cu11==11.11.3.6
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cudnn-cu11==8.7.0.84
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.3.0.86
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusparse-cu11==11.7.5.86
nvidia-nccl-cu11==2.20.5
nvidia-nvtx-cu11==11.8.86
packaging==23.2
pandas==2.1.1
peft==0.5.0
pillow==10.2.0
platformdirs==4.2.2
protobuf==5.27.1
psutil==5.9.6
py-cpuinfo==9.0.0
pyarrow==16.1.0
pydantic==2.7.3
pydantic_core==2.18.4
pynvml==11.5.0
pyparsing==3.1.2
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.13.1
sentencepiece==0.2.0
sentry-sdk==2.5.1
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
svgwrite==1.4.3
sympy==1.12
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.3.1+cu118
torchaudio==2.3.1+cu118
torchvision==0.18.1+cu118
tqdm==4.66.1
transformers==4.41.2
triton==2.3.1
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.7
wandb==0.17.1
xxhash==3.4.1
yarl==1.9.2
```
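
As a sketch, if the list above is saved to a file (``requirements.txt`` is an assumed name here), the whole set can be installed in one go. The ``+cu118`` wheels live on PyTorch's CUDA 11.8 package index, so an extra index URL is needed:

```
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118
```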

## Configure Accelerate and Deepspeed

``NOTE: Currently we use accelerate config to configure the environment. In later commits we will shift to a manual config file to gain more control over accelerate.``

### Configure as follows
Run ``accelerate config`` and answer the prompts as follows.

Select compute environment
```
In which compute environment are you running?
Please select a choice using the arrow or number keys, and selecting with enter
➔ This machine
AWS (Amazon SageMaker)
```

Select the machine type (``multi-GPU`` enables distributed training)
```
Which type of machine are you using?
Please select a choice using the arrow or number keys, and selecting with enter
No distributed training
multi-CPU
multi-XPU
➔ multi-GPU
multi-NPU
multi-MLU
TPU
```

Use the default values for the next three prompts (just press ``Enter``)
```
How many different machines will you use (use more than 1 for multi-node training)? [1]:

```

```
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
```

```
Do you wish to optimize your script with torch dynamo?[yes/NO]:
```

Use DeepSpeed
```
Do you want to use DeepSpeed? [yes/NO]: yes
```

Currently we do not specify a JSON file for the DeepSpeed config
```
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: NO
```

Use ``ZeRO`` stage ``2``
```
What should be your DeepSpeed's ZeRO optimization stage?
Please select a choice using the arrow or number keys, and selecting with enter
0
1
➔ 2
3
```

Offload both optimizer states and parameters to ``cpu``
```
Where to offload optimizer states?
Please select a choice using the arrow or number keys, and selecting with enter
none
➔ cpu
nvme
```
```
Where to offload parameters?
Please select a choice using the arrow or number keys, and selecting with enter
none
➔ cpu
nvme
```

Currently we don't want to touch DeepSpeed gradient accumulation yet. It could also be set manually in Python later, once we fix gradient accumulation in our code base.
```
How many gradient accumulation steps you're passing in your script? [1]:
```

Select gradient clipping
```
Do you want to use gradient clipping? [yes/NO]:
```
No ``deepspeed.zero.Init`` as we are only using ``ZeRO 2``.
```
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]
```
No ``MoE``
```
Do you want to enable Mixture-of-Experts training (MoE)? [yes/NO]:
```
Select the number of GPU(s) to use for distributed training. This can vary depending on our needs.
```
How many GPU(s) should be used for distributed training? [1]:
```
Choose ``bf16``, since the ``A100`` supports it.
```
Do you wish to use FP16 or BF16 (mixed precision)?
Please select a choice using the arrow or number keys, and selecting with enter
no
fp16
➔ bf16
fp8
```
Finally, the configuration program prints the location of the saved accelerate config file. For example:
```
accelerate configuration saved at /home/yadonliu/.cache/huggingface/accelerate/default_config.yaml
```
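
Given the answers above, the saved ``default_config.yaml`` should look roughly like the following. This is a sketch based on accelerate 0.31's config format; the exact fields and values on your machine may differ:

```
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
use_cpu: false
```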

## Example command to launch unlearning

Launch LLM unlearning as before, but now source the venv with DeepSpeed installed and use ``accelerate launch`` instead of ``python3`` as the launcher.

Example command
```
CUDA_VISIBLE_DEVICES=0 accelerate launch unlearn_harm_redo_accelerate.py --model_name meta-llama/Meta-Llama-3-8B --model_save_dir "/SAN/intelsys/llm/yadonliu/SNLP_GCW/snlp-unlearned-models/models/test_llama8b" --log_file "/SAN/intelsys/llm/yadonliu/SNLP_GCW/snlp-unlearned-models/logs/test_llama8b.log" --cache_dir "/home/yadonliu/huggingface_cache" --seed 42 --retaining_dataset rajpurkar/squad --max_bad_loss 10000 --sequential=-1 --num_epochs=1 --batch_size=1 --save_every=100 --lr 2e-6
```

You can modify which GPUs are visible to accelerate. For example, ``CUDA_VISIBLE_DEVICES=0,1`` makes ``GPU0`` and ``GPU1`` visible to the Python run.
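
For example, a two-GPU run of the same script might look like the following sketch, where ``--num_processes`` matches the number of visible GPUs and the remaining arguments (abbreviated as ``...``) are the same as in the command above:

```
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --num_processes 2 unlearn_harm_redo_accelerate.py --model_name meta-llama/Meta-Llama-3-8B ...
```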

## Troubleshooting

### If multiple DeepSpeed sessions are running on the same node

There may be more issues if multiple DeepSpeed sessions are running on the same node. You may need to tweak ``--main_process_port`` (see https://huggingface.co/docs/accelerate/en/package_reference/cli) and possibly the ``CUDA_VISIBLE_DEVICES`` environment variable.
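
For example, two concurrent sessions could be separated by giving each its own GPU and rendezvous port. The ports below are arbitrary free ports chosen for illustration, and ``...`` stands for the usual script arguments:

```
CUDA_VISIBLE_DEVICES=0 accelerate launch --main_process_port 29500 unlearn_harm_redo_accelerate.py ...
CUDA_VISIBLE_DEVICES=1 accelerate launch --main_process_port 29501 unlearn_harm_redo_accelerate.py ...
```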

### If you encounter an error while ``Building extension module cpu_adam...``

```
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/home/sduchnie/strangepork_venv/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
collect2: error: ld returned 1 exit status
```

It means the CUDA ``curand`` library is not visible to the linker (see [microsoft/DeepSpeed#3929](https://github.com/microsoft/DeepSpeed/issues/3929)) and needs to be linked in manually. You can do so as follows; a consolidated sketch follows the list:

1. ``cd`` into the torch lib directory: ``cd venv/lib/python3.11/site-packages/torch/lib`` (or wherever your venv lives)
2. Create a symbolic link to the missing library, which should be located at ``/usr/local/cuda/lib64/libcurand.so``: ``ln -s /usr/local/cuda/lib64/libcurand.so .``
3. Retry running the command.
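
As a consolidated sketch of the fix, with ``VENV`` as a placeholder for your own venv location:

```
VENV=/path/to/venv
cd "$VENV/lib/python3.11/site-packages/torch/lib"
ln -s /usr/local/cuda/lib64/libcurand.so .
```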


11 changes: 11 additions & 0 deletions llm_unlearn_ucl/data_utils.py
@@ -1,3 +1,14 @@
# Copyright (C) 2024 UCL CS SNLP Naturalnego 语言 Töötlus group
# - Szymon Duchniewicz
# - Yadong Liu
# - Andrzej Szablewski
# - Zhe Yu
#
# Adapted from https://github.com/kevinyaobytedance/llm_unlearn.
#
# This software is released under the MIT License.
# https://opensource.org/licenses/MIT

import os
import random
from json import dump