Closed
Description
Need to figure out why CI fails with 4 GPUs but works fine with 2.
I've set it to 2 GPUs for now, until we sort this out: 391930b
FAILED tests/test_model.py::MyTestCase::test_gpt - Failed: Timeout >300.0s
FAILED tests/test_training.py::MegDSTestTraining::test_training_all_0_base - ...
FAILED tests/test_training.py::MegDSTestTraining::test_training_all_1_cl - Ru...
FAILED tests/test_training.py::MegDSTestTraining::test_training_prefix_lm_all
The first one hangs on startup; the others fail with the traceback below (stderr from two processes is interleaved, so frames repeat out of order):
stderr: Traceback (most recent call last):
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/pretrain_gpt.py", line 246, in <module>
stderr: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 165, in pretrain
stderr: Traceback (most recent call last):
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/pretrain_gpt.py", line 246, in <module>
stderr: iteration = train(forward_step_func,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 734, in train
stderr: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 165, in pretrain
stderr: train_step(forward_step_func,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 405, in train_step
stderr: iteration = train(forward_step_func,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 734, in train
stderr: loss = model[0].train_batch(data_iter=data_iterator)
stderr: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 329, in train_batch
stderr: self._exec_schedule(sched)
stderr: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1313, in _exec_schedule
stderr: train_step(forward_step_func,
stderr: File "/actions-runner/_work/Megatron-DeepSpeed/Megatron-DeepSpeed/megatron/training.py", line 405, in train_step
stderr: loss = model[0].train_batch(data_iter=data_iterator)
stderr: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 329, in train_batch
stderr: self._exec_instr(**cmd.kwargs)
stderr: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 723, in _exec_backward_pass
stderr: self._exec_schedule(sched)
stderr: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1313, in _exec_schedule
stderr: local_part=self.grad_layer[1],
stderr: IndexError: list index out of range
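
To make the last frame concrete: the traceback only shows that `self.grad_layer` was a list with fewer than two elements when `_exec_backward_pass` read index 1. Below is a minimal sketch of that failure mode in plain Python. The name `unpack_partitioned_grad` and the `(meta, local_part)` layout are assumptions for illustration, inferred from the `local_part=self.grad_layer[1]` line; this is not DeepSpeed code and does not explain why the list is short with 4 GPUs.

```python
def unpack_partitioned_grad(grad_layer):
    """Hypothetical helper: split grad_layer into (meta, local_part).

    Assumes the two-element layout suggested by the traceback and fails
    with a descriptive message instead of a bare IndexError.
    """
    if len(grad_layer) < 2:
        raise ValueError(
            f"expected (meta, local_part), got {len(grad_layer)} element(s)"
        )
    return grad_layer[0], grad_layer[1]


# Reproduce the reported error in isolation: a one-element list indexed at 1.
grad_layer = ["meta-only"]
try:
    local_part = grad_layer[1]
except IndexError as exc:
    error_message = str(exc)

# The guarded version reports what was actually received.
try:
    unpack_partitioned_grad(grad_layer)
except ValueError as exc:
    guarded_message = str(exc)
```

A check like this, placed where the buffer is populated, might help narrow down which rank ends up with a short `grad_layer` in the 4-GPU configuration.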