
Training loss is NaN now. #17

Open
Strive21 opened this issue Dec 6, 2024 · 12 comments

@Strive21

Strive21 commented Dec 6, 2024

When will the latest version of the code and data processing code be released?

@SHYuanBest
Member

Thanks for your interest. Is the training loss NaN from the very beginning, and what dataset did you use? The latest version of the code may not be released soon. We will prioritize releasing the data processing code and integrating ConsisID into diffusers.

@Strive21
Author

Strive21 commented Dec 6, 2024


I downloaded your dataset and processed it appropriately, using CogVideoX-5B-I2V to initialize the weights, with a batch size of 5 and a learning rate of 3e-7. The loss is normal at the start of training, but it becomes NaN after about 500 iterations. Could it be that I processed the data incorrectly? Also, the warning "fail to detect face using insightface, extract embedding on align face" appears during training.

@SHYuanBest
Member

Oh, I see. This may be a problem with MM-DiT. Training is very unstable because the activations in the middle layers can become very large, which leads to a NaN loss. You can try turning on EMA, using gradient accumulation, increasing the batch size, and reducing the learning rate. Another option is to add a regularization term to the output of the middle layers.
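
For illustration only, here is a minimal PyTorch sketch (not the ConsisID training code) of the last suggestion: penalizing large middle-layer activations via a forward hook, plus a common extra safeguard of skipping non-finite losses and clipping gradients. The names `transformer`, `dataloader`, `optimizer`, and `compute_diffusion_loss` are placeholders.

```python
import torch

mid_activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        # Transformer blocks may return tuples; keep only the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        mid_activations[name] = hidden
    return hook

# Hook a middle block of the (placeholder) MM-DiT backbone.
handle = transformer.transformer_blocks[20].register_forward_hook(save_activation("mid"))

for step, batch in enumerate(dataloader):
    loss = compute_diffusion_loss(transformer, batch)  # placeholder loss function
    # Small L2 penalty on the middle-layer output to discourage huge activations.
    loss = loss + 1e-4 * mid_activations["mid"].float().pow(2).mean()
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)  # skip the step instead of propagating NaN
        continue
    loss.backward()
    torch.nn.utils.clip_grad_norm_(transformer.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```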

@SHYuanBest
Member

The warning "fail to detect face using insightface, extract embedding on align face" cannot be avoided, because facexlib may not be able to detect the face; in that case the code automatically skips the training sample.

@SHYuanBest
Member

Or you can try training only a LoRA instead of all parameters.

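As a rough illustration, LoRA-only finetuning with `peft` might look like the sketch below; the target module names are assumptions and need to be matched to the actual attention layers of the MM-DiT backbone.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.0,
    # Assumed attention projection names; check the real module names in the backbone.
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

transformer.requires_grad_(False)           # freeze the full model
transformer = get_peft_model(transformer, lora_config)
transformer.print_trainable_parameters()    # only the LoRA weights remain trainable
```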

@SHYuanBest
Member

> When will the latest version of the code and data processing code be released?

We have released the data processing code; please refer to here for more details.

@Strive21
Author

Strive21 commented Dec 9, 2024


Thank you! I'll give it a try.

@glimmer16


Hi! Have you solved this problem? I am facing the same issue and wondering how to avoid the NaN loss.

@SHYuanBest
Member


You may need to construct a higher-quality dataset to continue finetuning ConsisID, or use a larger batch size. Since ConsisID was trained on a higher-quality internal dataset, continuing to train it on ConsisID-Preview-Data is likely to make it worse.
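
If GPU memory is the limiting factor, one way to approximate a larger batch size is gradient accumulation, for example with accelerate. The sketch below is illustrative only; `transformer`, `optimizer`, `dataloader`, and `compute_diffusion_loss` are placeholders.

```python
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=8)  # effective batch = 8 x per-device batch
transformer, optimizer, dataloader = accelerator.prepare(transformer, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(transformer):
        loss = compute_diffusion_loss(transformer, batch)  # placeholder loss function
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```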

@SHYuanBest
Member

Or you can load the checkpoint of CogVideoX-5B-I2V and train IPT2V from scratch (instead of loading ConsisID-Preview for continued finetuning).
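
For reference, loading the public CogVideoX-5B-I2V transformer weights with diffusers might look like the snippet below; the repo id and model class are the standard diffusers ones, and the surrounding ConsisID training setup is not shown.

```python
import torch
from diffusers import CogVideoXTransformer3DModel

# Initialize the backbone from the base I2V checkpoint rather than ConsisID-Preview.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
```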

@SHYuanBest
Member

Some possible solutions are discussed in #31.

@Strive21
Author


> Hi! Have you solved this problem? I am facing the same issue and wondering how to avoid the NaN loss.

I tried a larger batch size and that solved the problem.
