DeepSpeed integration for <10B models on strangeporks. #113

TheRootOf3 · 2024-07-01T14:30:28Z

Includes #105.

Willmish · 2024-07-10T14:45:51Z

Setting up the environment for deepspeed training

Create a venv after sourcing /share/apps/source_files/python/python-3.11.5_bzip2.source (this includes the Python developer headers, so there won't be issues with missing Python.h files).
Install all requirements: accelerate, deepspeed, torch (2.3.1 or newer), transformers, and possibly a few others depending on the model used.
Configure DeepSpeed using accelerate config. See this PDF for a ZeRO 2 config with RAM (CPU) parameter offloading:
NOTE_SNLP_DEEPSPEED.pdf
Launch LLM unlearning as before, but source the venv with DeepSpeed first and use accelerate launch instead of python3 as the launcher.
Further issues may arise if multiple DeepSpeed sessions run on the same node: you may need to set a different --main_process_port for each session (see https://huggingface.co/docs/accelerate/en/package_reference/cli) and possibly restrict each session's GPUs with the CUDA_VISIBLE_DEVICES environment variable.
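
The steps above can be sketched as a small shell script. This is only an illustration under stated assumptions: the venv name `venv`, the script name `unlearn.py`, the GPU ids, and the ports 29501/29502 are placeholders that do not come from this thread, and `torch>=2.3.1` reflects one reading of the version note above. The setup is wrapped in a function so nothing is installed until you call it, and `launch_cmd` only prints the command it would run.

```shell
#!/usr/bin/env bash
# Sketch of the setup and launch steps. Nothing is installed or launched
# until you call setup_deepspeed_env or run a printed command yourself.
# Placeholders (assumptions): "venv", "unlearn.py", GPU ids, ports 29501/29502.

PY_SOURCE=/share/apps/source_files/python/python-3.11.5_bzip2.source

# One-off environment setup on a node.
setup_deepspeed_env() {
    source "$PY_SOURCE"              # Python 3.11.5 with developer headers
    python3 -m venv venv
    source venv/bin/activate
    pip install "torch>=2.3.1" accelerate deepspeed transformers
    accelerate config                # interactive; select DeepSpeed here
}

# Print the launch command for one session, pinned to its own GPUs and port.
# Usage: launch_cmd <gpu_ids> <port> <script> [script args...]
launch_cmd() {
    local gpus="$1" port="$2"
    shift 2
    echo "CUDA_VISIBLE_DEVICES=${gpus} accelerate launch --main_process_port ${port} $*"
}

# Two sessions sharing a node: disjoint GPUs, distinct ports.
launch_cmd 0,1 29501 unlearn.py
launch_cmd 2,3 29502 unlearn.py
```

Giving each session its own port avoids two launchers competing for the same rendezvous port, and disjoint CUDA_VISIBLE_DEVICES values keep the sessions off each other's GPUs.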

If you encounter the following error while building the extension module cpu_adam:

FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/home/sduchnie/strangepork_venv/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
collect2: error: ld returned 1 exit status

This means the linker cannot see the CUDA library curand (see microsoft/DeepSpeed#3929), so it needs to be linked in manually. You can do so as follows:

Change into torch's lib directory: cd venv/lib/python3.11/site-packages/torch/lib (adjust the path to match your venv).
Create a symbolic link to the missing library (it should be located at /usr/local/cuda/lib64/libcurand.so): ln -s /usr/local/cuda/lib64/libcurand.so .
Retry the command that failed.
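
A minimal sketch of the same workaround as a reusable function, in case it has to be repeated for several venvs. The function name `link_curand` is hypothetical; the default library path is the one quoted above. It works because the failing link command passes -L<venv>/lib/python3.11/site-packages/torch/lib, so a link placed in that directory is on the linker's search path. Pass an absolute path for the library so the symlink resolves from inside the torch directory.

```shell
#!/usr/bin/env bash
# Sketch of the libcurand workaround. link_curand is a hypothetical helper;
# adjust both paths to your venv and CUDA install.

# Link libcurand.so into torch's lib directory so the cpu_adam link step finds it.
# Usage: link_curand <torch_lib_dir> [absolute_path_to_libcurand.so]
link_curand() {
    local torch_lib="$1"
    local curand="${2:-/usr/local/cuda/lib64/libcurand.so}"

    if [ ! -d "$torch_lib" ]; then
        echo "torch lib directory not found: $torch_lib" >&2
        return 1
    fi
    if [ ! -e "$curand" ]; then
        echo "libcurand.so not found at: $curand" >&2
        return 1
    fi
    # -f replaces a stale link left over from an earlier attempt.
    ln -sf "$curand" "$torch_lib/libcurand.so"
}

# Example call (not run here):
#   link_curand venv/lib/python3.11/site-packages/torch/lib
```

If the venv location is unknown, the torch lib directory can usually be found with `python3 -c 'import torch, os; print(os.path.join(os.path.dirname(torch.__file__), "lib"))'` run inside the activated venv.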

TheRootOf3 · 2024-08-23T11:33:57Z

@Willmish Useful stuff, please convert to a markdown and add to a documentation directory :)

TheRootOf3 assigned Willmish and Adamliu1 Jul 1, 2024

This was referenced Jul 6, 2024

Integrate Accelerate, Deepspeed and fix start_loc compute for Microsfot Phi #115

Closed

Integrate Accelerate, Deepspeed #117

Merged

Willmish added the documentation Improvements or additions to documentation label Aug 23, 2024
