Remove apex deps #9

Merged
merged 16 commits into from
Apr 25, 2024

saforem2

  • Add DEBUG option to toggle set -euxo pipefail in train_llama_alcf.sh

  • Add option to switch between launching with deepspeed and mpiexec (default)

  • Remove generate_config.sh and move the logic for creating deepspeed_config.json directly into ALCF/helpers.sh

  • Remove hard dependency on NVIDIA/apex

    • Explicitly, changes to:

      • megatron/model/__init__.py
      • megatron/optimizer/clip_grads.py
      • megatron/optimizer/distrib_optimizer.py
      • megatron/utils.py
    • This is done by replacing explicit import apex (or similar) statements with a guarded import:

      # from deepspeed.accelerator import get_accelerator
      # if get_accelerator().device_name() == 'cuda':
      try:
          from apex.multi_tensor_apply import multi_tensor_applier
          import amp_C
          HAS_APEX = True
      except Exception:
          HAS_APEX = False

      and using the default fallback methods when HAS_APEX is False
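
A minimal sketch of how a call site might dispatch on HAS_APEX, using a gradient L2 norm as the example. The function names l2_norm_fused, l2_norm_fallback, and l2_norm are hypothetical and do not come from this PR. The fallback here is a plain-Python stand-in so the sketch runs without PyTorch or apex installed; the actual fallback code in megatron/optimizer/clip_grads.py is not reproduced here.

```python
import math

# Guarded import, as in the PR description: apex is optional.
try:
    from apex.multi_tensor_apply import multi_tensor_applier
    import amp_C
    HAS_APEX = True
except Exception:
    HAS_APEX = False


def l2_norm_fused(grads):
    """Fused path (hypothetical helper): expects a list of CUDA tensors.

    Only usable when HAS_APEX is True.
    """
    import torch  # imported lazily so the sketch loads without PyTorch

    overflow_buf = torch.zeros(1, dtype=torch.int, device=grads[0].device)
    norm, _ = multi_tensor_applier(
        amp_C.multi_tensor_l2norm, overflow_buf, [grads], False
    )
    return float(norm)


def l2_norm_fallback(grads):
    """Plain-Python stand-in for the non-apex path (assumption).

    Takes an iterable of flat sequences of floats.
    """
    return math.sqrt(sum(x * x for g in grads for x in g))


# Pick the implementation once, at import time.
l2_norm = l2_norm_fused if HAS_APEX else l2_norm_fallback
```

Selecting the implementation once at import time keeps the HAS_APEX check out of the hot path; checking the flag inside each function would work equally well.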

saforem2 merged commit 3145945 into main on Apr 25, 2024