Remove apex deps #9

saforem2 · 2024-04-24T20:53:25Z

- Add a `DEBUG` option to toggle `set -euxo pipefail` in `train_llama_alcf.sh`
- Add an option to switch between launching with `deepspeed` and `mpiexec` (default)
- Remove `generate_config.sh` and move the logic for creating `deepspeed_config.json` directly into `ALCF/helpers.sh`
- Remove the hard dependency on NVIDIA/apex
  - Explicitly, changes to:
    - `megatron/model/__init__.py`
    - `megatron/optimizer/clip_grads.py`
    - `megatron/optimizer/distrib_optimizer.py`
    - `megatron/utils.py`
  - This is done by replacing explicit `import apex` (or similar) calls with
```
# from deepspeed.accelerator import get_accelerator
# if get_accelerator().device_name() == 'cuda':
try:
    from apex.multi_tensor_apply import multi_tensor_applier
    import amp_C
    HAS_APEX = True
except Exception:
    HAS_APEX = False
```
  and using the default fallback methods when `HAS_APEX` is `False`.
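
A minimal sketch of how a call site might branch on `HAS_APEX`. The function `global_l2_norm` and its `fused_norm` parameter are hypothetical names used only for illustration. They are not taken from the files changed in this PR, and the pure-Python norm stands in for whatever non-fused code path the real modules fall back to.

```python
import math

# Guarded import, as in the PR: apex and its amp_C extension are optional.
try:
    from apex.multi_tensor_apply import multi_tensor_applier  # noqa: F401
    import amp_C  # noqa: F401
    HAS_APEX = True
except Exception:
    HAS_APEX = False


def global_l2_norm(grads, fused_norm=None):
    """L2 norm over a list of gradient vectors (hypothetical helper).

    `fused_norm` is an assumed, caller-supplied callable wrapping a fused
    kernel; it is only used when apex imported successfully.
    """
    if HAS_APEX and fused_norm is not None:
        return fused_norm(grads)
    # Fallback path: plain Python, no apex required.
    return math.sqrt(sum(x * x for g in grads for x in g))
```

With this shape, code that imports the module never fails on a machine without apex; only the fused path is skipped.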

Replace:

```bash
if python3 -c 'import ezpz; print(ezpz.__file__)' 2> '/dev/null'; then
```

with

```bash
if python3 -c "import sys; any(['ezpz' in s for s in sys.path])" 2> '/dev/null'; then
```

in `ezpz()` from `ALCF/helpers.sh`
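
One caveat about the replacement command: `python3 -c` discards the value of the `any(...)` expression, so the process exits with status 0 whether or not `ezpz` is found. The sketch below shows one way to make the exit status carry the answer. It uses the standard library's `importlib.util.find_spec`, which locates a top-level package without executing it; this is an alternative technique, not the check used in `ALCF/helpers.sh`. The helper names `is_installed`, `on_sys_path`, and `exit_code` are hypothetical.

```python
import importlib.util
import sys


def is_installed(name):
    """True if a top-level module `name` can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def on_sys_path(name):
    """The PR's heuristic: does any sys.path entry contain `name` as a substring?"""
    return any(name in entry for entry in sys.path)


def exit_code(name):
    """Shell-style status: 0 if `name` is installed, 1 otherwise."""
    return 0 if is_installed(name) else 1
```

A shell caller could then end a small script with `sys.exit(exit_code("ezpz"))` so that `if python3 ...; then` reflects the result.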

saforem2 · 2024-04-24T23:55:17Z

Add `ALCF/test_sunspot.sh` to run a simple test on Sunspot

saforem2 · 2024-04-25T00:18:30Z

Add `ALCF/test_sirius.sh`

saforem2 and others added 13 commits April 23, 2024 16:07

Remove apex deps from megatron/*

133f244

Move generate_config.sh logic into ALCF/helpers.sh

42a27fb

Add option to launch with mpiexec

3be7efc

Update train_llama_alcf.sh

a8a9a59

Update ALCF/helpers.sh

42140d7

Update ALCF/helpers.sh, train_llama_alcf.sh

fa0c5a6

Add ALCF/sunspot-env.sh

4b9c2f2

Update train_llama_alcf.sh, ALCF/helpers.sh

c2e9147

Update ALCF/helpers.sh

41a3f35

Much faster check if ezpz installed

71c725e

Add option to run in DEBUG mode (i.e. set -euxo pipefail)

ae0b4d8

Update ALCF/data-lists/sunspot/*.txt

2d6608a

Add ALCF/test_sunspot.sh

3648af5

saforem2 added 3 commits April 24, 2024 19:09

Add ALCF/data-lists/sirius/books.txt

9796eac

Add ALCF/test_sirius.sh

7b2ab6d

Update ALCF/test_sirius.sh

58cdcca

saforem2 merged commit 3145945 into main Apr 25, 2024
