
The training gets stuck at self.p_train_step at a random step. #3

Open
baofff opened this issue Jun 28, 2022 · 2 comments

Comments

@baofff

baofff commented Jun 28, 2022

I trained on 8 A100s or 8 GeForce RTX 2080 Tis. On both sets of devices, training proceeds for a few steps and then gets stuck at a random step. Does anyone have the same issue?

@LuChengTHU

Same.

@pjh4993

pjh4993 commented Jul 3, 2022

I trained on 8 A100s or 8 GeForce RTX 2080 Tis. On both sets of devices, training proceeds for a few steps and then gets stuck at a random step. Does anyone have the same issue?

Do you have any error message?
Have you checked GPU utilization when this happens?

I also had a similar issue during training. In my case, the problem was the lock file used by the custom torch ops: because a previous training run had crashed, the torch extension's lock file was never deleted, so later runs ended up waiting on that lock before they could use the op.

You should make sure such locks are cleaned up before training.
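As a rough sketch (assuming the default extension build directory `~/.cache/torch_extensions` and that no other process is currently compiling an extension), something like this can clear stale lock files before launching training:

```python
import glob
import os

# Hypothetical helper: remove stale "lock" files left behind by crashed runs.
# Assumes torch extensions are built under the default ~/.cache/torch_extensions
# (or the directory pointed to by TORCH_EXTENSIONS_DIR) and that no other
# process is currently building an extension there.
def clean_torch_extension_locks():
    build_root = os.environ.get(
        "TORCH_EXTENSIONS_DIR",
        os.path.expanduser("~/.cache/torch_extensions"),
    )
    for lock_path in glob.glob(os.path.join(build_root, "**", "lock"), recursive=True):
        print(f"Removing stale lock: {lock_path}")
        os.remove(lock_path)

if __name__ == "__main__":
    clean_torch_extension_locks()
```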
