
The training gets stuck at self.p_train_step at a random step. #3

Open
baofff opened this issue Jun 28, 2022 · 2 comments

Comments

@baofff

baofff commented Jun 28, 2022

I trained on 8 A100s or 8 GeForce RTX 2080 Tis. On both sets of devices, training proceeds for a few steps and then gets stuck at a random step. Does anyone have the same issue?

@LuChengTHU

Same.

@pjh4993

pjh4993 commented Jul 3, 2022

I trained on 8 A100s or 8 GeForce RTX 2080 Tis. On both sets of devices, training proceeds for a few steps and then gets stuck at a random step. Does anyone have the same issue?

Do you have any error message?
Have you checked GPU utilization when this happens?

I also had a similar issue during training. In my case, the problem was the lock file used by the custom torch ops: because a previous training run had crashed, the torch extension's lock file was never deleted, so later runs ended up waiting on that lock before they could use the op.

You should make sure such locks are cleaned up before training.
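As a rough sketch (assuming the default extension build directory `~/.cache/torch_extensions` and that no other process is currently compiling an extension), something like this can clear stale lock files before launching training:

```python
import glob
import os

# Hypothetical helper: remove stale "lock" files left behind by crashed runs.
# Assumes torch extensions are built under the default ~/.cache/torch_extensions
# (or the directory pointed to by TORCH_EXTENSIONS_DIR) and that no other
# process is currently building an extension there.
def clean_torch_extension_locks():
    build_root = os.environ.get(
        "TORCH_EXTENSIONS_DIR",
        os.path.expanduser("~/.cache/torch_extensions"),
    )
    for lock_path in glob.glob(os.path.join(build_root, "**", "lock"), recursive=True):
        print(f"Removing stale lock: {lock_path}")
        os.remove(lock_path)

if __name__ == "__main__":
    clean_torch_extension_locks()
```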
