
add detail dataset config feature by extra config file #227

Merged: 66 commits merged into kohya-ss:dev on Mar 1, 2023

Conversation

@fur0ut0 (Contributor) commented Feb 24, 2023

README: https://github.com/fur0ut0/sd-yascripts/blob/feature/dataset_config/config_README-ja.md

Solves #58 and #130.

There are several major changes:

  • Add a config file feature via the --config_file option
    • please check the README for details (a minimal example is also sketched below)
  • Add a subset feature for combining multiple directories into a single dataset
    • also enables per-subset configuration
  • Support multiple datasets
    • datasets are combined into a DatasetGroup, which is used as a pseudo dataset
  • Add a switch for inter-dataset bucket shuffling via --bucket_shuffle_across_dataset
    • I'm wondering if I should make this behavior the default
  • Change the metadata scheme to support multiple datasets and multiple subsets
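
For a quick feel of the structure, a minimal hypothetical config with one dataset and two subsets could be loaded like this (paths and values here are made up; see the README for the actual set of options):

import tomllib  # Python 3.11+; the third-party 'toml' package works similarly on older versions

config_text = """
[general]

[[datasets]]
resolution = [768, 512]

  [[datasets.subsets]]
  image_dir = 'train/character_a'

  [[datasets.subsets]]
  image_dir = 'train/regularization'
  is_reg = true
"""

config = tomllib.loads(config_text)
for i, subset in enumerate(config["datasets"][0]["subsets"]):
    print(i, subset)  # each subset is a plain dict nested under its parent dataset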

I have preserved backward compatibility with the current DreamBooth directory handling, which uses the subdirectory structure for class tokens and the number of dataset repeats.

NOTE: Sorry for the somewhat messy history. A squash merge might be the better option.

@fur0ut0 (Contributor, Author) commented Feb 27, 2023

This TOML file seems to cause an error: voluptuous.error.MultipleInvalid: extra keys not allowed @ data['datasets'][0]['subsets'][0]['metadata_file']

Maybe I am doing something wrong, any ideas?

[general]

# This is a fine tuning style dataset
[[datasets]]
resolution = [768, 512]

  [[datasets.subsets]]
  # image_dir = 'D:\Work\SD\Diffusers-DB\data\ds_test\ft_train1\images'
  metadata_file = 'D:\Work\SD\Diffusers-DB\data\ds_test\ft_train1\ft1_lat.json'

  [[datasets.subsets]]
  metadata_file = 'D:\Work\SD\Diffusers-DB\data\ds_test\ft_train2\ft2_cln.json'

We need to provide image_dir for every subset.

However, this error message seems weird, right? This is because the error message of voluptuous only shows the first error when there are multiple errors.

In this case, it first tries to parse the config as a DreamBooth dataset. This causes an extra-key error because the DreamBooth dataset does not support the "metadata_file" option.

After that, it tries to parse the config as a fine tuning dataset. This causes a required-key error because there is no "image_dir" option.
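
For illustration, here is a minimal standalone example of that voluptuous behavior (the schema is a simplified stand-in, not the actual one in this PR):

from voluptuous import MultipleInvalid, Required, Schema

# Simplified stand-in for the DreamBooth subset schema:
# image_dir is required and metadata_file is not an allowed key.
dreambooth_subset = Schema({Required('image_dir'): str, 'is_reg': bool})

try:
    dreambooth_subset({'metadata_file': 'ft1_lat.json'})
except MultipleInvalid as e:
    print(e)         # prints only the first error: "extra keys not allowed @ ..."
    print(e.errors)  # the full list: the extra key AND the missing required key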

I think adding another required config option like is_dreambooth to the dataset would improve this confusing behavior, because we could then parse the dataset config deterministically. What do you think?

@kohya-ss (Owner)

Thank you for the details, I understand the behavior of voluptuous now.

However, this error message seems weird, right? This is because the error message of voluptuous only shows the first error when there are multiple errors.

I agree with that. I forgot to add image_dir and got this error, so I was a bit confused by the error message.

I think adding another required config option like is_dreambooth to the dataset would improve this confusing behavior, because we could then parse the dataset config deterministically. What do you think?

I think it might be a good idea. I also note that metadata_file is required for the fine tuning dataset, so it would be possible to parse as a fine tuning dataset first. If there is no metadata_file, then it would be the DreamBooth dataset.

@fur0ut0 (Contributor, Author) commented Feb 27, 2023

I also note that metadata_file is required for the fine tuning dataset, so it would be possible to parse as a fine tuning dataset first. If there is no metadata_file, then it would be the DreamBooth dataset.

I'm afraid that simply swapping the parsing order would still cause the same issue, because the DreamBooth dataset also has distinct options, "is_reg" and "caption_extension".

However, checking whether metadata_file exists before parsing seems nice. I will try implementing a deterministic dataset parsing feature.
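
Roughly, the deterministic parsing would look like this (schemas here are simplified for illustration; the real ones have many more options):

from voluptuous import Required, Schema

fine_tuning_subset = Schema({Required('metadata_file'): str, Required('image_dir'): str})
dreambooth_subset = Schema({Required('image_dir'): str, 'is_reg': bool, 'caption_extension': str})

def validate_subset(subset: dict) -> dict:
    # Decide the subset type up front from the presence of metadata_file,
    # so any validation error comes from the schema that actually applies.
    if 'metadata_file' in subset:
        return fine_tuning_subset(subset)
    return dreambooth_subset(subset)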

@fur0ut0 (Contributor, Author) commented Feb 27, 2023

Ah, I have found that image_dir is not actually required for FineTuningDataset when the metadata file contains absolute path information.

I think image_dir should be handled as optional in the fine tuning dataset while remaining required in the DreamBooth dataset. I will also fix this.
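
Concretely, the fine tuning subset schema would change along these lines (again simplified for illustration):

from voluptuous import Optional, Required, Schema

# image_dir becomes optional for fine tuning subsets because the metadata file
# may already contain absolute image paths; it stays required for DreamBooth
# subsets, where the directory is the only source of images.
fine_tuning_subset = Schema({Required('metadata_file'): str, Optional('image_dir'): str})
dreambooth_subset = Schema({Required('image_dir'): str, 'is_reg': bool})

fine_tuning_subset({'metadata_file': 'ft1_lat.json'})  # now valid without image_dir
dreambooth_subset({'image_dir': 'train/character_a'})  # image_dir still required here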

@kohya-ss (Owner)

Thank you for updating! Now FineTuningDataset works without the image_dir option.

I think we are about ready to release it :)

I would like to confirm one thing: is my understanding correct that we cannot mix DreamBooth and fine tuning subsets within a single dataset?

@fur0ut0 (Contributor, Author) commented Feb 28, 2023

Update:

  • Check metadata_file existence before dataset parsing
    • This resolves the confusing error message issue we saw above
  • Update the README

I would like to confirm one thing: is my understanding correct that we cannot mix DreamBooth and fine tuning subsets within a single dataset?

You are right. These two subset types cannot be mixed into a single dataset.

This is mainly because I have no idea how to balance the number of regularization images when there are also fine tuning subsets. If you come up with a way, it might become possible to mix these different subset types.

I have added a description of this topic to the README.

@kohya-ss (Owner)

Thank you for updating! The code and README are quite good!

You are right. These two subset types cannot be mixed into a single dataset.

This is mainly because I have no idea how to balance the number of regularization images when there are also fine tuning subsets. If you come up with a way, it might become possible to mix these different subset types.

Thank you for the clarification. I got it.

I think the number of regularization images for the dataset could be based on the sum of all non-regularization subsets. For example, if someone wants to train a particular character using captioned images along with regularization images, it would be preferable for them to be able to use a metadata subset as well as a captioned DreamBooth subset.
However, I am considering a feature to use all regularization images even if repeats * (number of training images) is smaller than the number of regularization images, so this can be updated at that time.
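
As a rough sketch of that idea (hypothetical numbers, not an actual implementation), the regularization target could be derived from the non-regularization subsets like this:

# Hypothetical subsets: (number of images, repeats, is regularization?)
subsets = [
    (100, 2, False),  # captioned DreamBooth subset
    (400, 1, False),  # fine tuning (metadata) subset
    (300, 1, True),   # regularization subset
]

num_train = sum(n * r for n, r, is_reg in subsets if not is_reg)      # 600
num_reg_available = sum(n * r for n, r, is_reg in subsets if is_reg)  # 300

# The regularization images would then be repeated (or, with the planned
# feature, all of them used) to balance the 600 training images.
print(num_train, num_reg_available)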

I will merge the PR after work today 😀

kohya-ss changed the base branch from main to dev on March 1, 2023 at 11:46
kohya-ss merged commit 8abb864 into kohya-ss:dev on Mar 1, 2023
@kohya-ss (Owner) commented Mar 2, 2023

I've finally released the feature! I've changed the name of the option to --dataset_config, because another PR will add the config file feature for training parameters, so I think config_file may confuse users. I appreciate your understanding.

Thank you again for this great contribution!
