DeepSpeed integration for <10B models on strangeporks. #113

Open
TheRootOf3 opened this issue Jul 1, 2024 · 2 comments

@TheRootOf3
Collaborator

TheRootOf3 commented Jul 1, 2024

Includes #105.

@Willmish
Collaborator

Willmish commented Jul 10, 2024

Setting up the environment for deepspeed training

  1. Create a venv after running source /share/apps/source_files/python/python-3.11.5_bzip2.source (this Python build includes the developer headers, so there won't be issues with missing Python.h files).
  2. Install all requirements: accelerate, deepspeed, torch (should be 2.3.1 or newer), transformers, and possibly a few others depending on the model used.
  3. Configure deepspeed using accelerate config. See this PDF for a ZeRO 2 config with RAM (CPU) parameter offloading:
    NOTE_SNLP_DEEPSPEED.pdf
  4. Launch LLM unlearning as before, but source the venv with deepspeed first, and use accelerate launch instead of python3 as the launcher.
  5. There may be further issues if multiple deepspeed sessions run on the same node: you may need to set a different --main_process_port for each session (see https://huggingface.co/docs/accelerate/en/package_reference/cli ) and potentially the CUDA_VISIBLE_DEVICES environment variable.
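
The steps above can be sketched as a small shell helper. This is only a rough sketch under stated assumptions: the venv location, the function names, and the training script name in the usage note are hypothetical, and it reads "2.3.1>" as "torch 2.3.1 or newer". The answers given to accelerate config still have to follow the PDF from step 3.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-5. Names marked "hypothetical" are illustrative only
# and do not come from the original notes.

PYTHON_SOURCE_FILE=/share/apps/source_files/python/python-3.11.5_bzip2.source
VENV_DIR="${VENV_DIR:-$HOME/deepspeed_venv}"   # hypothetical venv location

# Steps 1-3: create the venv, install the requirements, configure accelerate.
setup_deepspeed_env() {
    source "$PYTHON_SOURCE_FILE" || return 1
    python3 -m venv "$VENV_DIR" || return 1
    source "$VENV_DIR/bin/activate" || return 1
    # Assumption: "2.3.1>" in the notes means torch 2.3.1 or newer.
    pip install accelerate deepspeed "torch>=2.3.1" transformers || return 1
    # Interactive prompt; answer it according to the PDF referenced in step 3.
    accelerate config
}

# Steps 4-5: print the launch command for a GPU list, a port, and a script.
# Printing instead of executing lets you inspect the command before running it.
launch_cmd() {
    local gpus="$1" port="$2" script="$3"
    echo "CUDA_VISIBLE_DEVICES=$gpus accelerate launch --main_process_port $port $script"
}
```

For example, launch_cmd 0,1 29501 unlearn.py (unlearn.py being a placeholder for the actual training script) prints a command that pins the session to GPUs 0 and 1 and to port 29501, so a second session on the same node can be given different GPUs and a different port.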

If you encounter the following error during "Building extension module cpu_adam...":

FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/home/sduchnie/strangepork_venv/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
collect2: error: ld returned 1 exit status

This means the CUDA curand library is not visible to the linker (see microsoft/DeepSpeed#3929 ) and needs to be linked in manually. You can do so as follows:

  1. Change into the torch lib directory of your venv: cd venv/lib/python3.11/site-packages/torch/lib (adjust the path to match your venv name and Python version)
  2. Create a symbolic link to the missing lib (it should be located at /usr/local/cuda/lib64/libcurand.so): ln -s /usr/local/cuda/lib64/libcurand.so .
  3. Retry the command that failed
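
The same fix can be wrapped in a small function that checks both paths before linking. This is a sketch, not a definitive script: the function name link_curand is hypothetical, the default libcurand location is the one quoted above, and it uses ln -sf (rather than plain ln -s) so that re-running it replaces an existing link.

```shell
#!/usr/bin/env bash
# link_curand TORCH_LIB_DIR [CURAND_PATH]
# Hypothetical helper: symlinks libcurand.so into the torch lib directory
# so the linker can resolve -lcurand when building cpu_adam.
link_curand() {
    local torch_lib="$1"
    local curand="${2:-/usr/local/cuda/lib64/libcurand.so}"

    if [ ! -e "$curand" ]; then
        echo "libcurand not found at: $curand" >&2
        return 1
    fi
    if [ ! -d "$torch_lib" ]; then
        echo "torch lib directory not found: $torch_lib" >&2
        return 1
    fi

    # -f replaces an existing link, so the function can be re-run safely.
    ln -sf "$curand" "$torch_lib/libcurand.so"
}
```

Usage, assuming the venv directory is called venv: link_curand venv/lib/python3.11/site-packages/torch/lib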

@TheRootOf3
Collaborator Author

TheRootOf3 commented Aug 23, 2024

@Willmish Useful stuff, please convert to a markdown and add to a documentation directory :)

@Willmish Willmish added the documentation Improvements or additions to documentation label Aug 23, 2024