[Experimental] Add cache mechanism for dataset groups to avoid long waiting time for initialization #1178
Thank you for this! This seems really useful. I had been thinking about how to reduce the waiting time for preprocessing, but I had not thought of pickling the dataset group object. However, pickling feels a bit aggressive to me. I wonder whether caching the image sizes might be enough to reduce the waiting time...
Actually, it is not only the size. With the cache we also don't need to wait for listdir or for checking each image. I can ensure that the startup time with a cached dataset is less than 1 minute, from pressing Enter to seeing the tqdm progress bar. Pickling is aggressive; I just used it at first to show how much it helps XD BTW
I will implement a version that caches the absolute path list and the image size for each subset.
That's nice! I think it is straightforward :)
@kohya-ss I have finished the implementation. For now I have only implemented it for DreamboothDataset.
Thank you for the update! This is really nice. I will copy it to the other datasets :) I may change the format to JSON or something else to make it future-proof. That makes the metadata three or more times bigger, but I believe that is not a problem. I appreciate your understanding.
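To illustrate the JSON idea discussed above, here is a minimal sketch of saving and loading per-image metadata (path, caption, size). It is not the code merged in this PR: the function names `save_meta_cache` and `load_meta_cache`, the key names, and the file layout are all hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path


def save_meta_cache(cache_file, entries):
    """Write a list of metadata entries to a JSON file.

    Each entry is assumed (hypothetically) to be a dict with the keys
    "abs_path" (str), "caption" (str) and "size" ((width, height)).
    """
    payload = [
        {
            "abs_path": str(entry["abs_path"]),
            "caption": entry["caption"],
            # JSON has no tuple type, so the size is stored as a list.
            "size": list(entry["size"]),
        }
        for entry in entries
    ]
    Path(cache_file).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_meta_cache(cache_file):
    """Return the cached entries, or None if no cache file exists yet."""
    path = Path(cache_file)
    if not path.is_file():
        return None
    raw = json.loads(path.read_text(encoding="utf-8"))
    return [
        {
            "abs_path": item["abs_path"],
            "caption": item["caption"],
            # Convert the size back to a tuple after loading.
            "size": tuple(item["size"]),
        }
        for item in raw
    ]
```

A caller would try `load_meta_cache` first and fall back to scanning the image directory only when it returns `None`. Unlike a pickle, the resulting file is human-readable and does not depend on the Python class layout, at the cost of a larger file.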
…aiting time for initilization (kohya-ss#1178)
* support meta cached dataset
* add cache meta scripts
* random ip_noise_gamma strength
* random noise_offset strength
* use correct settings for parser
* cache path/caption/size only
* revert mess up commit
* revert mess up commit
* Update requirements.txt
* Add arguments for meta cache.
* remove pickle implementation
* Return sizes when enable cache
---------
Co-authored-by: Kohya S <52813779+kohya-ss@users.noreply.github.com>
For large-scale datasets, sd-scripts suffers from a long waiting time while it reads the image sizes and other metadata.
So I propose two improvements: a cache mechanism for the dataset groups, and using the `imagesize` library to read the image sizes instead of PIL, which is overkill for this.
With my cache script, building the dataset groups takes only about half an hour (it would take about 4 hours if I ran the 4-card training directly).
Loading the cached dataset groups also works fine. I did a quick sanity check that the first few images are the same, but this needs more checking from the community.
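As a rough illustration of the `imagesize` idea, the sketch below collects image sizes for one directory. It assumes the third-party `imagesize` package is installed and uses its `imagesize.get(path)` call, which returns `(width, height)`. The helper name `collect_image_sizes`, the extension list, and the injectable `get_size` argument are hypothetical and are not taken from this PR.

```python
import os

# Hypothetical extension filter; adjust to whatever the dataset accepts.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}


def _imagesize_get(path):
    # Imported lazily so the helper below can also be used with another reader.
    import imagesize  # third-party: pip install imagesize

    # Returns (width, height) from the file header, without decoding pixels.
    return imagesize.get(path)


def collect_image_sizes(image_dir, get_size=_imagesize_get):
    """Map each image's absolute path to its (width, height)."""
    sizes = {}
    for name in sorted(os.listdir(image_dir)):
        if os.path.splitext(name)[1].lower() not in IMAGE_EXTENSIONS:
            continue  # skip captions and other non-image files
        abs_path = os.path.abspath(os.path.join(image_dir, name))
        width, height = get_size(abs_path)
        sizes[abs_path] = (int(width), int(height))
    return sizes
```

The result can be written to a cache file once and reused on later runs, so neither the directory listing nor the per-image size check has to be repeated at startup.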