
[Experimental] Add cache mechanism for dataset groups to avoid long waiting time for initialization #1178

Merged: 13 commits into kohya-ss:dataset-cache on Mar 24, 2024

Conversation

KohakuBlueleaf
Contributor

For large-scale datasets, sd-scripts suffers from long waiting times while reading image sizes and other metadata.

So I propose two improvements:

  1. Cache the built dataset group object to disk so we don't need to recompute it multiple times.
  • For DDP or experiments across settings, this will be SUPER helpful.
  2. Use the imagesize library to read image sizes instead of PIL, which is overkill for this (see the sketch after this list).
  • This gives a 5~10x speedup for reading image sizes on an NVMe SSD.
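
To illustrate the second point, here is a minimal sketch (not the PR's actual code) comparing the two size-reading approaches; `path` is a hypothetical file:

```python
# Minimal sketch: imagesize parses only the file header, while PIL's
# Image.open does format probing and plugin dispatch before exposing .size.
import imagesize
from PIL import Image

path = "example.png"  # hypothetical image file

# Fast: reads just enough bytes to determine the dimensions.
width, height = imagesize.get(path)

# Slower: Image.open decodes lazily, but still has more per-file overhead.
with Image.open(path) as im:
    width, height = im.size
```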

With my cache script, building the dataset groups took only half an hour (it would take about 4 hours if I directly ran 4-card training).
Loading the cached dataset groups also works fine. I did a quick sanity check that the first few images are the same, but it needs more verification from the community.
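
As a minimal sketch of the pickle-cache idea (illustrative only, not this PR's code; `build_fn` and `CACHE_PATH` are hypothetical names):

```python
# Sketch: build the dataset group once, pickle it, and reuse it on later runs.
import os
import pickle

CACHE_PATH = "dataset_group.pkl"  # hypothetical cache location

def load_or_build_dataset_group(build_fn):
    """Return the cached dataset group if present, else build and cache it."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    group = build_fn()  # expensive: listdir + per-image size reads
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(group, f)
    return group
```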

@kohya-ss
Owner

Thank you for this! It seems really useful. I was thinking about how I could reduce the waiting time for preprocessing, but I had not thought of pickling the dataset group object.

However, I feel the pickling is a bit aggressive. I wonder whether caching just the image sizes might be enough to reduce the waiting time...

@KohakuBlueleaf
Contributor Author

> Thank you for this! It seems really useful. I was thinking about how I could reduce the waiting time for preprocessing, but I had not thought of pickling the dataset group object.
>
> However, I feel the pickling is a bit aggressive. I wonder whether caching just the image sizes might be enough to reduce the waiting time...

Actually not only the size: also the absolute path and the bucket assignment, so we don't need to wait for listdir and the per-image checks.

I can ensure the startup time with a cached dataset is less than 1 minute, from the moment I press Enter to when I see the tqdm progress bar.

Pickling is aggressive; I just used it at first to show how much it helps. XD

BTW, absolute path + size + bucket for 5 million images costs only 3 GB in pickle.
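
For a sense of scale, here is a hedged sketch of the per-image record being described (the class and field names are hypothetical): at roughly 600 bytes per entry, 5 million entries would land in the ~3 GB range quoted above.

```python
# Sketch of the cached per-image metadata: path, size, and bucket.
from dataclasses import dataclass

@dataclass
class ImageMeta:
    abs_path: str                  # absolute path to the image file
    size: tuple[int, int]          # (width, height)
    bucket_reso: tuple[int, int]   # assigned bucket resolution
```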

@KohakuBlueleaf
Contributor Author

> Thank you for this! It seems really useful. I was thinking about how I could reduce the waiting time for preprocessing, but I had not thought of pickling the dataset group object.
>
> However, I feel the pickling is a bit aggressive. I wonder whether caching just the image sizes might be enough to reduce the waiting time...

I will implement a version that caches the absolute path list and the image size for each subset,
so the loading procedure stays the same; it just skips the listdir and imagesize steps.
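
A hedged sketch of that per-subset flow (the function and cache-file names are hypothetical, not this PR's API): on a cache hit, both os.listdir and the per-image size reads are skipped entirely.

```python
# Sketch: load per-subset metadata from a JSON cache, building it on a miss.
import json
import os

import imagesize

def load_subset_meta(subset_dir, cache_name="meta_cache.json"):
    cache_path = os.path.join(subset_dir, cache_name)
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)  # {abs_path: [width, height], ...}

    meta = {}
    for name in os.listdir(subset_dir):  # slow on very large directories
        path = os.path.abspath(os.path.join(subset_dir, name))
        if path.lower().endswith((".png", ".jpg", ".jpeg", ".webp")):
            meta[path] = imagesize.get(path)

    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(meta, f)
    return meta
```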

@kohya-ss
Owner

> I will implement a version that caches the absolute path list and the image size for each subset, so the loading procedure stays the same; it just skips the listdir and imagesize steps.

That's nice! I think it is straightforward :)

@KohakuBlueleaf
Contributor Author

@kohya-ss I have finished the implementation,
which caches only the image path/caption and the image size.
With cached metadata, a dataset with 32,768 images needs only 4 seconds from program start to finished dataset setup (including bucket creation).

I have only implemented it for DreamBoothDataset so far,
but I think you can copy the implementation to the other two dataset classes easily.

@kohya-ss
Owner

Thank you for the update! This is really nice. I will copy it to the other datasets :)

I may change the format to JSON or something else for future-proofing. That makes the metadata three or more times larger, but I believe it is not a problem. I appreciate your understanding.
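
Purely as an illustration of that trade-off, a JSON metadata entry might look like the following (the keys and layout are guesses, not the format the maintainer ended up using):

```json
{
  "/data/images/0001.png": {
    "caption": "a photo of a cat",
    "size": [768, 1024]
  }
}
```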

@kohya-ss kohya-ss changed the base branch from dev to dataset-cache March 24, 2024 06:35
@kohya-ss kohya-ss merged commit ae97c8b into kohya-ss:dataset-cache Mar 24, 2024
1 check passed
deepdelirious pushed a commit to deepdelirious/sd-scripts that referenced this pull request Mar 29, 2024
[Experimental] Add cache mechanism for dataset groups to avoid long waiting time for initialization (kohya-ss#1178)

* support meta cached dataset

* add cache meta scripts

* random ip_noise_gamma strength

* random noise_offset strength

* use correct settings for parser

* cache path/caption/size only

* revert mess up commit

* revert mess up commit

* Update requirements.txt

* Add arguments for meta cache.

* remove pickle implementation

* Return sizes when enable cache

---------

Co-authored-by: Kohya S <52813779+kohya-ss@users.noreply.github.com>