local data backend should have file locking for writes and reads #1160
Conversation
so that we do not read partial writes or corrupt multiprocess writes
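As a rough illustration of the locking named in the title (not the PR's actual implementation), guarding cache reads and writes with an advisory lock might look like the sketch below. It uses Python's fcntl module, so it is POSIX-only; the helper names and the use of torch.save/torch.load for cache entries are assumptions for illustration.

```python
import fcntl
import os
import torch

def locked_write(obj, path: str) -> None:
    """Hold an exclusive lock while writing so readers never see a partial file."""
    with open(path, "wb") as handle:
        fcntl.flock(handle.fileno(), fcntl.LOCK_EX)    # block until no one else holds the lock
        try:
            torch.save(obj, handle)
            handle.flush()
            os.fsync(handle.fileno())                  # ensure the bytes reach disk before unlocking
        finally:
            fcntl.flock(handle.fileno(), fcntl.LOCK_UN)

def locked_read(path: str):
    """Take a shared lock so a concurrent writer cannot replace the file mid-read."""
    with open(path, "rb") as handle:
        fcntl.flock(handle.fileno(), fcntl.LOCK_SH)    # multiple readers may hold this at once
        try:
            return torch.load(handle)
        finally:
            fcntl.flock(handle.fileno(), fcntl.LOCK_UN)
```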
Training gets clogged up here:
Tried again, errored further down:
Update after trying again: It seems like ST doesn't like one of my dataset folders. It will say it can't find any images for 512, and when I remove it from the data loader, it then says it doesn't like 768. In both cases, training start-up doesn't proceed. It will take 1024, however (with the same "No images were discovered" error), move to training, start, and then error out with the same "cannot find cache" error after a few hundred steps. I took that dataset out completely, and training proceeds as normal. At a loss as to what could be the issue with this dataset. Tried a number of things: checked the data loader, converted to jpg, converted to png, checked captions, checked resolution (all were 1000px or more), checked count (26), deleted anything that wasn't an image or caption file. All the other datasets were 100s or 1000s of images and processed without issue.
you'll have to check debug.log - since this isn't a widespread issue it's likely something like captions not getting found
i'm trying this on 3x 4090 and seeing no loss in throughput and it seems like everything is working so i'm going to merge it and assume your issue is now somewhere else 👍
The code could be improved by having each process write to a different temporary file. This would at least solve the issues with the file being missing when os.rename() is called. Adding the process rank after …
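A minimal sketch of that suggestion, assuming the launcher exposes the process rank through the RANK environment variable (as torchrun and accelerate typically do); the helper name is hypothetical and this is not SimpleTuner's actual code:

```python
import os
import torch

def save_atomically(obj, final_path: str) -> None:
    """Write to a rank- and pid-specific temporary file, then rename it into place,
    so other processes never observe a half-written .pt file."""
    rank = os.environ.get("RANK", "0")                          # per-process rank set by the launcher
    tmp_path = f"{final_path}.tmp.rank{rank}.pid{os.getpid()}"
    torch.save(obj, tmp_path)
    # On POSIX filesystems os.rename is atomic when source and destination are on the
    # same device, so readers see either the old file or the complete new one.
    os.rename(tmp_path, final_path)
```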
@mhirki did you hit an error there? i did not. however, i ran into a situation where none of the pt files were being moved into place during VAE Caching, so i have now resolved that. and implemented your idea.
Nah, I don't have a multi-GPU system for testing this. I was just looking at the errors that @playerzer0x was encountering. |
Different datasets, similar issue. Error:
debug.log:
think it needs to be an absolute path to the dataset dir
Hm, relative paths are the only paths that have worked for me as long as I've been using SimpleTuner. Only seem to run into issues when training multi-GPU on a dataset with subdirectories, which is often the case when I train multiple subjects. For now, I've solved my issue by removing subdirectories and having all images + caption files live in the same directory. This isn't ideal long-term, b/c I'd like to have more granular control over repeats at the sub-dataset level. Not super important now, but could be soon if I continue building on this training set. Let me know if fixing this is becoming too onerous for you. Wondering if there's someone we can hire from the Discord to help out as I'm not as technically capable at solving these issues as I seem to be at finding them :). Happy to put resources in that direction in any case.
when using subfolders you need abs paths
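For what it's worth, a quick way to turn a relative dataset directory into the absolute path being asked for here (the directory name below is made up):

```python
import os

relative_dir = "datasets/subject_a"           # hypothetical relative path from the dataloader config
absolute_dir = os.path.abspath(relative_dir)  # e.g. /home/user/SimpleTuner/datasets/subject_a
print(absolute_dir)                           # paste this value back into the config
```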
I changed everything to absolute paths and still run into the same issue. If I change to a single GPU, caching and training start fine, so I think it's pretty isolated to a multi-GPU issue at this point. I can change TRAINING_NUM_PROCESSES=1 and get through the caching process, but I can't quit and increase the GPU count after caching because I run into the same error starting up again. My hunch is it has to do with bucketing per GPU. Maybe if there are more GPUs than there are bucketed images to go around, it throws the error?
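A toy illustration of that hunch (not SimpleTuner's actual sharding code): if a resolution bucket holds fewer images than there are processes, a round-robin split leaves some ranks with nothing to cache.

```python
def shard_bucket(image_paths, num_processes):
    """Split one bucket's images across processes, round-robin by rank."""
    return [image_paths[rank::num_processes] for rank in range(num_processes)]

bucket = ["img1.png", "img2.png"]     # hypothetical bucket with only 2 images
print(shard_bucket(bucket, 3))        # [['img1.png'], ['img2.png'], []] -- rank 2 gets an empty shard
```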