Training crashing for images that used to work fine, with error: latent shape mismatch: torch.Size([16, 88, 136]) != torch.Size([16, 88, 144]) #977
Comments
config.json:
I've changed the multidatabackend.json to use pixel_area:
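For illustration only, a dataset entry switched to pixel_area might look something like the sketch below; the id, paths, and sizes here are placeholders, not the contents of the attached file:

```json
[
  {
    "id": "example-dataset",
    "type": "local",
    "instance_data_dir": "/path/to/images",
    "resolution": 1024,
    "resolution_type": "pixel_area",
    "minimum_image_size": 512,
    "crop": false,
    "caption_strategy": "filename",
    "cache_dir_vae": "/path/to/cache/vae"
  }
]
```

As I understand pixel_area, the resolution is specified as a pixel edge length (e.g. 1024) while bucketing still behaves like the area/megapixel mode.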
you should disable disable-bucket-pruning and see whether the error goes away
I assume that the code generating the cache and the code verifying the cache differ in how they round and calculate the sizes, which is causing the mismatch. From what I see in the code related to disable_bucket_pruning, it only skips removing images; it doesn't use a different calculation for the image sizes. Maybe not using disable_bucket_pruning can hide the issue, but I don't think it's part of the problem or the solution.
if it goes away then i'll know where to look, otherwise i have to assume this is a problem with just your setup and can't do anything about it. it's up to you how to proceed
there's no difference in rounding between cache generation and cache loading. the actual size of the cache element is checked against other cache entries in the batch. if you really want to just keep training with all other settings the same, use a batch size of 1 with gradient accumulation steps > 1 to emulate larger batch sizes with the appropriate slowdown?
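As a concrete sketch of that workaround (the accumulation count of 4 is just an illustrative value, and the key names are the usual flag names rather than anything taken from the attached config), the relevant config.json entries would look roughly like:

```json
{
  "--train_batch_size": 1,
  "--gradient_accumulation_steps": 4
}
```

The effective batch size is then 1 × 4 = 4 per optimizer step, with the corresponding slowdown in wall-clock time per update.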
Ok, I'm testing now with disable_bucket_pruning disabled.
I tested the training with disable_bucket_pruning=false, and I used repeats=3 with the dataset.
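For context, assuming the pruning toggle keeps the same name when written as a config.json flag, the disabled state would be expressed roughly as below; the repeats=3 value belongs in the dataset's entry of multidatabackend.json instead:

```json
{
  "--disable_bucket_pruning": false
}
```

Depending on how the launcher parses booleans, simply omitting the flag may be the equivalent (and safer) way to leave pruning at its default behaviour.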
The exact settings to reproduce this issue: tr_01
The attached dataset and settings reproduce the issue very reliably within a few hundred steps. I don't think any "1 / 0 magic" is needed.
i still can't reproduce it at all on mac or linux, hence not having a solution yet. i set up your dataset in combination with 12 other datasets containing roughly 3800 to 13000 images each, plus one with 576,000 images in it, and there are no problems locally. you will probably have to enable SIMPLETUNER_LOG_LEVEL=DEBUG in config.env, reproduce it with the details in debug.log, and review the contents to determine why the sizes are going AWOL.
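For anyone trying the same thing, that switch is a single line in config.env (shown here with an export prefix, which may or may not be required depending on how your launcher sources the file):

```bash
# config.env — verbose logging so bucket/size decisions end up in debug.log
export SIMPLETUNER_LOG_LEVEL=DEBUG
```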
Using additional datasets might hide the issue, because you will have more buckets with more images in them, and a smaller chance of getting the buckets from my dataset that cause the crash.
it is just a dataset of images. there are not two images with the same name and different sizes. having the one dataset didn't cause the error either. i do in fact train quite frequently with just one dataset, and your configuration here has two datasets. |
My configuration has two resolutions training on the same dataset, the one I attached in the dataset file.
no, just pointing out that when you mention using additional datasets would somehow hide the issue, your config has the two already. not sure what you meant by hiding the issue with more buckets - a dataset is independent from the other datasets. i didn't run the accelerator, so i am able to get through 6000 steps quite quickly. it just validates shapes, and does not train. everything else is the same. |
You said:
I started a new training run with a smaller dataset and got the error:
Debug log:
Here is a debug log - I deleted the cache first:
Error:
A new run with: "--aspect_bucket_rounding": 2
Log:
I added: "--debug_aspect_buckets": true
Log:
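Taken together, the debug-oriented entries added across these runs would sit in config.json roughly like this (a sketch containing only the two values reported above, not the full attached configuration):

```json
{
  "--aspect_bucket_rounding": 2,
  "--debug_aspect_buckets": true
}
```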
My settings:
I tried to run with crop enabled, and it crashed anyway with this error:
Here is the log:
I updated, simplified the settings, converted to SDXL, and tried again with a reduced dataset that has only 41 images. Here is the dataset (v4): I'm attaching all the files I used for training, with all the settings, including a full debug log for each run.
Here is crash log 1:
Here is crash log 2:
Here is crash log 3:
ok. i am in the mood for some pain i guess after dinner. i will proverbially dig in after i literally dig in |
Thank you! |
fixed by #1076 locally here |
I tested the PR with the fix on a small and a medium dataset; both finished the first epoch without crashing, so I think the fix is working.
Using the latest version of SimpleTuner produces an error: "latent shape mismatch: torch.Size([16, 88, 136]) != torch.Size([16, 88, 144])"
Examples:
error_crash_latent_mismatch_01.txt
error_crash_latent_mismatch_02.txt
Both images were used in past training runs without issues.
I have the option in settings: "--delete_problematic_images": true
Tested on two separate servers, with a clean cache on each.
multidatabackend:
s01_multidatabackend.json
Attaching the images that have crashed so far; the zip password is "password":
images_crashed.zip
Interestingly, I just had a crash on step 607, with a checkpoint made on step 600. I resumed the training from the last checkpoint to see whether it would crash on the same step, but training passed step 607 without crashing.