
add detail dataset config feature by extra config file #227

Merged: 66 commits merged into kohya-ss:dev on Mar 1, 2023

Conversation

@fur0ut0 (Contributor) commented Feb 24, 2023

README: https://github.com/fur0ut0/sd-yascripts/blob/feature/dataset_config/config_README-ja.md

Solves #58 and #130.

There are several major changes:

  • Add a config file feature via the --config_file option
    • please check the README for details (a minimal example is also sketched below)
  • Add a subset feature for combining multiple directories into a single dataset
    • also enables per-subset configuration
  • Support multiple datasets
    • datasets are combined into a DatasetGroup, which is used as a pseudo dataset
  • Add a switch for inter-dataset bucket shuffling via --bucket_shuffle_across_dataset
    • I'm wondering if I should make this behavior the default
  • Change the metadata scheme to support multiple datasets and multiple subsets
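
For a quick feel of the structure, a minimal hypothetical config with one dataset and two subsets could be loaded like this (paths and values here are made up; see the README for the actual set of options):

import tomllib  # Python 3.11+; the third-party 'toml' package works similarly on older versions

config_text = """
[general]

[[datasets]]
resolution = [768, 512]

  [[datasets.subsets]]
  image_dir = 'train/character_a'

  [[datasets.subsets]]
  image_dir = 'train/regularization'
  is_reg = true
"""

config = tomllib.loads(config_text)
for i, subset in enumerate(config["datasets"][0]["subsets"]):
    print(i, subset)  # each subset is a plain dict nested under its parent dataset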

I have preserved backward compatibility with the current DreamBooth directory handling, which uses the subdirectory structure for class tokens and the number of dataset repeats.

NOTE: Sorry for the somewhat messy history. A squash merge might be the better option.

@fur0ut0 (Contributor, Author) commented Feb 27, 2023

This TOML file seems to cause an error: voluptuous.error.MultipleInvalid: extra keys not allowed @ data['datasets'][0]['subsets'][0]['metadata_file']

Maybe I am doing something wrong, any ideas?

[general]

# This is a fine tuning style dataset
[[datasets]]
resolution = [768, 512]

  [[datasets.subsets]]
  # image_dir = 'D:\Work\SD\Diffusers-DB\data\ds_test\ft_train1\images'
  metadata_file = 'D:\Work\SD\Diffusers-DB\data\ds_test\ft_train1\ft1_lat.json'

  [[datasets.subsets]]
  metadata_file = 'D:\Work\SD\Diffusers-DB\data\ds_test\ft_train2\ft2_cln.json'

We need to provide image_dir for every subset.

However, this error message seems weird, right? This is because the error message of voluptuous only shows the first error when there are multiple errors.

In this case, it first tries to parse the config as a DreamBooth dataset. This causes an extra-key error because the DreamBooth dataset does not support the "metadata_file" option.

After that, it tries to parse the config as a fine tuning dataset. This causes a required-key error because there is no "image_dir" option.
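
For illustration, here is a minimal standalone example of that voluptuous behavior (the schema is a simplified stand-in, not the actual one in this PR):

from voluptuous import MultipleInvalid, Required, Schema

# Simplified stand-in for the DreamBooth subset schema:
# image_dir is required and metadata_file is not an allowed key.
dreambooth_subset = Schema({Required('image_dir'): str, 'is_reg': bool})

try:
    dreambooth_subset({'metadata_file': 'ft1_lat.json'})
except MultipleInvalid as e:
    print(e)         # prints only the first error: "extra keys not allowed @ ..."
    print(e.errors)  # the full list: the extra key AND the missing required key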

I think adding another required config option like is_dreambooth to the dataset would improve this confusing behavior, because we could then parse the dataset config deterministically. What do you think?

@kohya-ss (Owner)

Thank you for the details, I understand the behavior of voluptuous now.

However, this error message seems weird, right? This is because the error message of voluptuous only shows the first error when there are multiple errors.

I agree with that. I forgot to add image_dir and got this error, so I was a bit confused by the error message.

I think adding another required config option like is_dreambooth to the dataset would improve this confusing behavior, because we could then parse the dataset config deterministically. What do you think?

I think it might be a good idea. I also note that metadata_file is required for the fine tuning dataset, so it would be possible to parse as a fine tuning dataset first. If there is no metadata_file, then it would be the DreamBooth dataset.

@fur0ut0 (Contributor, Author) commented Feb 27, 2023

I also note that metadata_file is required for the fine tuning dataset, so it would be possible to parse as a fine tuning dataset first. If there is no metadata_file, then it would be the DreamBooth dataset.

I'm afraid that simply swapping the parsing order would still cause the same issue, because the DreamBooth dataset also has distinct options, "is_reg" and "caption_extension".

However, checking whether metadata_file exists before parsing seems nice. I will try implementing a deterministic dataset parsing feature.
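
Roughly, the deterministic parsing would look like this (schemas here are simplified for illustration; the real ones have many more options):

from voluptuous import Required, Schema

fine_tuning_subset = Schema({Required('metadata_file'): str, Required('image_dir'): str})
dreambooth_subset = Schema({Required('image_dir'): str, 'is_reg': bool, 'caption_extension': str})

def validate_subset(subset: dict) -> dict:
    # Decide the subset type up front from the presence of metadata_file,
    # so any validation error comes from the schema that actually applies.
    if 'metadata_file' in subset:
        return fine_tuning_subset(subset)
    return dreambooth_subset(subset)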

@fur0ut0 (Contributor, Author) commented Feb 27, 2023

Ah, I have found that image_dir is not actually required for FineTuningDataset when the metadata file contains absolute path information.

I think image_dir should be handled as optional in the fine tuning dataset while remaining required in the DreamBooth dataset. I will also fix this.
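
Concretely, the fine tuning subset schema would change along these lines (again simplified for illustration):

from voluptuous import Optional, Required, Schema

# image_dir becomes optional for fine tuning subsets because the metadata file
# may already contain absolute image paths; it stays required for DreamBooth
# subsets, where the directory is the only source of images.
fine_tuning_subset = Schema({Required('metadata_file'): str, Optional('image_dir'): str})
dreambooth_subset = Schema({Required('image_dir'): str, 'is_reg': bool})

fine_tuning_subset({'metadata_file': 'ft1_lat.json'})  # now valid without image_dir
dreambooth_subset({'image_dir': 'train/character_a'})  # image_dir still required here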

@kohya-ss (Owner)

Thank you for updating! Now FineTuningDataset works without the image_dir option.

I think we are about ready to release it :)

I would like to confirm one thing: is my understanding correct that we cannot mix DreamBooth and fine tuning subsets within a single dataset?

@fur0ut0 (Contributor, Author) commented Feb 28, 2023

Update:

  • Check metadata_file existence before dataset parsing
    • This resolves the confusing error message issue we saw above
  • Update the README

I would like to confirm one thing: is my understanding correct that we cannot mix DreamBooth and fine tuning subsets within a single dataset?

You are right. These two subset types cannot be mixed into a single dataset.

This is mainly because I have no idea how to balance the number of regularization images when there are also fine tuning subsets. If you come up with a way, it might become possible to mix these different subset types.

I have added a description of this topic to the README.

@kohya-ss (Owner)

Thank you for updating! The code and README are quite good!

You are right. These two subset types cannot be mixed into a single dataset.

This is mainly because I have no idea how to balance the number of regularization images when there are also fine tuning subsets. If you come up with a way, it might become possible to mix these different subset types.

Thank you for the clarification. I got it.

I think the number of regularization images for the dataset could be based on the sum of all non-regularization subsets. For example, if someone wants to train a particular character using captioned images along with regularization images, it would be preferable for them to be able to use a metadata subset as well as a captioned DreamBooth subset.
However, I am considering a feature to use all regularization images even if repeats * (number of training images) is smaller than the number of regularization images, so this can be updated at that time.
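
As a rough sketch of that idea (hypothetical numbers, not an actual implementation), the regularization target could be derived from the non-regularization subsets like this:

# Hypothetical subsets: (number of images, repeats, is regularization?)
subsets = [
    (100, 2, False),  # captioned DreamBooth subset
    (400, 1, False),  # fine tuning (metadata) subset
    (300, 1, True),   # regularization subset
]

num_train = sum(n * r for n, r, is_reg in subsets if not is_reg)      # 600
num_reg_available = sum(n * r for n, r, is_reg in subsets if is_reg)  # 300

# The regularization images would then be repeated (or, with the planned
# feature, all of them used) to balance the 600 training images.
print(num_train, num_reg_available)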

I will merge the PR after work today 😀

kohya-ss changed the base branch from main to dev on March 1, 2023 at 11:46
kohya-ss merged commit 8abb864 into kohya-ss:dev on Mar 1, 2023
@kohya-ss (Owner) commented Mar 2, 2023

I've finally released the feature! I've changed the name of the option to --dataset_config, because another PR will add the config file feature for training parameters, so I think config_file may confuse users. I appreciate your understanding.

Thank you again for this great contribution!
