Merge pull request #928 from bghira/main
merge
bghira authored Sep 2, 2024
2 parents bf4951d + 0b0d941 commit bebcbee
Showing 6 changed files with 97 additions and 79 deletions.
20 changes: 15 additions & 5 deletions INSTALL.md
@@ -19,6 +19,8 @@
```bash
source .venv/bin/activate
pip install -U poetry pip
```

> ℹ️ You can use your own custom venv path by setting `export VENV_PATH=/path/to/.venv` in your `config/config.env` file.

**Note:** We're currently installing the `release` branch here; the `main` branch may contain experimental features that might have better results or lower memory use.

Depending on your system, you will run one of 3 commands:
@@ -41,15 +43,15 @@ The following must be executed for an AMD MI300X to be usable:
```bash
apt install amd-smi-lib
pushd /opt/rocm/share/amd_smi
python3 -m pip install --upgrade pip
python3 -m pip install .
popd
```

### All platforms

- 2a. **Option One (Recommended)**: Run `configure.py`
- 2b. **Option Two**: Copy `config/config.json.example` to `config/config.json` and then fill in the details.

#### Multiple GPU training

@@ -59,6 +61,7 @@
```bash
TRAINING_NUM_PROCESSES=1
TRAINING_NUM_MACHINES=1
TRAINING_DYNAMO_BACKEND='no'
# This is auto-detected and not necessary, but can be set explicitly.
CONFIG_BACKEND='json'
```
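
As a sketch, the same file on a single machine with two GPUs might look like this (values are illustrative; `TRAINING_NUM_PROCESSES` is one process per GPU):

```bash
# config/config.env sketch for one machine with two GPUs (illustrative values)
TRAINING_NUM_PROCESSES=2
TRAINING_NUM_MACHINES=1
TRAINING_DYNAMO_BACKEND='no'
```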

@@ -74,6 +77,9 @@ Follow the instructions that are printed, to locate your API key and configure i

Once that is done, any of your training sessions and validation data will be available on Weights & Biases.

> ℹ️ If you would like to disable Weights & Biases or TensorBoard reporting entirely, use `--report-to=none`.

4. Launch the `train.sh` script; logs will be written to `debug.log`

```bash
bash train.sh
```

@@ -84,7 +90,11 @@ Once that is done, any of your training sessions and validation data will be ava
### Run unit tests

To run unit tests to ensure that installation has completed successfully:

```bash
poetry run python -m unittest discover tests/
```

## Advanced: Multiple configuration environments

39 changes: 30 additions & 9 deletions OPTIONS.md
@@ -4,14 +4,24 @@

This guide provides a user-friendly breakdown of the command-line options available in SimpleTuner's `train.py` script. These options offer a high degree of customization, allowing you to train your model to suit your specific requirements.

### JSON Configuration file format

The expected JSON filename is `config.json`, and its key names match the `--arguments` listed below. The leading `--` is not required in the JSON file, but it can be left in as well.
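
As an illustration, a minimal `config.json` might look like the following sketch (key names are drawn from the options below; values are placeholders for your own setup, and the last key keeps its leading `--` to show that both forms are accepted):

```json
{
  "model_type": "lora",
  "pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0",
  "resolution": 1024,
  "resolution_type": "pixel_area",
  "--data_backend_config": "config/multidatabackend.json"
}
```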

### Easy configure script (***RECOMMENDED***)

The script `configure.py` in the project root can be used via `python configure.py` to set up a `config.json` file with mostly-ideal default settings.

---

## 🌟 Core Model Configuration

### `--model_type`

- **What**: Select whether a LoRA or full fine-tune is created.
- **Choices**: lora, full.
- **Default**: lora
- If lora is used, `--lora_type` dictates whether PEFT or LyCORIS are in use. Some models (PixArt) work only with LyCORIS adapters.

### `--model_family`

@@ -20,12 +30,12 @@ This guide provides a user-friendly breakdown of the command-line options availa

### `--pretrained_model_name_or_path`

- **What**: Path to the pretrained model or its identifier from https://huggingface.co/models.
- **Why**: To specify the base model you'll start training from. Use `--revision` and `--variant` to specify specific versions from a repository.

### `--pretrained_t5_model_name_or_path`

- **What**: Path to the pretrained T5 model or its identifier from https://huggingface.co/models.
- **Why**: When training PixArt, you might want to use a specific source for your T5 weights so that you can avoid downloading them multiple times when switching the base model you train from.

### `--hub_model_id`
@@ -43,6 +53,10 @@ This guide provides a user-friendly breakdown of the command-line options availa

- **What**: Enables training a custom mixture-of-experts model series. See [Mixture-of-Experts](/documentation/MIXTURE_OF_EXPERTS.md) for more information on these options.

### `--disable_benchmark`

- **What**: Disable the startup validation/benchmark that occurs at step 0 on the base model. These outputs are stitched to the left side of your trained model validation images.

## 📂 Data Storage and Management

### `--data_backend_config`
@@ -75,6 +89,8 @@ This guide provides a user-friendly breakdown of the command-line options availa
- **What**: Retrieve batches ahead-of-time.
- **Why**: Especially when using large batch sizes, training will "pause" while samples are retrieved from disk (even NVMe), impacting GPU utilisation metrics. Enabling dataloader prefetch will keep a buffer full of entire batches, so that they can be loaded instantly.

> ⚠️ This is really only relevant for H100 or better at a low resolution where I/O becomes the bottleneck. For most other use cases, it is an unnecessary complexity.

### `--dataloader_prefetch_qlen`

- **What**: Increase or reduce the number of batches held in memory.
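
Since `config.json` keys mirror these `--arguments`, the prefetch options can be sketched as follows (illustrative values; tune the queue length to your available memory):

```json
{
  "dataloader_prefetch": true,
  "dataloader_prefetch_qlen": 10
}
```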
@@ -97,14 +113,19 @@ A lot of settings are instead set through the [dataloader config](/documentation

- **What**: This tells SimpleTuner whether to use `area` size calculations or `pixel` edge calculations. A hybrid approach of `pixel_area` is also supported, which allows using pixel instead of megapixel for `area` measurements.
- **Options**:
- `resolution_type=pixel_area`
- A `resolution` value of 1024 will be internally mapped to an accurate area measurement for efficient aspect bucketing.
- Example resulting sizes for `1024`: 1024x1024, 1216x832, 832x1216
- `resolution_type=pixel`
- All images in the dataset will have their smaller edge resized to this resolution for training, which could result in a lot of VRAM use due to the size of the resulting images.
- Example resulting sizes for `1024`: 1024x1024, 1766x1024, 1024x1766
- `resolution_type=area`
- An internal option that isn't user-friendly. Use `pixel_area` instead.
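
As a rough illustration of the `pixel_area` arithmetic (a sketch only, not SimpleTuner's actual bucketing code), a `resolution` of 1024 implies an area budget of 1024 × 1024 ≈ 1.05 megapixels, and the example bucket sizes above all land near that budget:

```python
# Illustrative sketch: compare example aspect-bucket sizes against the
# area budget implied by resolution=1024 with resolution_type=pixel_area.
# This is not SimpleTuner's actual bucketing code.
def area_megapixels(width: int, height: int) -> float:
    return width * height / 1_000_000

target = area_megapixels(1024, 1024)  # ~1.05 MP budget for resolution=1024
for w, h in [(1024, 1024), (1216, 832), (832, 1216)]:
    print(f"{w}x{h}: {area_megapixels(w, h):.3f} MP (target {target:.3f} MP)")
```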

### `--resolution`

- **What**: Input image resolution, expressed as pixel edge length.
- **Default**: 1024

### `--validation_resolution`

58 changes: 28 additions & 30 deletions README.md
Expand Up @@ -2,7 +2,7 @@

> ⚠️ **Warning**: The scripts in this repository have the potential to damage your training data. Always maintain backups before proceeding.

**SimpleTuner** is geared towards simplicity, with a focus on making the code easily understood. This codebase serves as a shared academic exercise, and contributions are welcome.

## Table of Contents

@@ -58,38 +58,28 @@ For memory-constrained systems, see the [DeepSpeed document](/documentation/DEEP

### Flux.1

Full training support for Flux.1 is included:

- Low loss training using optimised approach
- Train either Schnell or Dev models
- Classifier-free guidance training
- Leave it disabled and preserve the dev model's distillation qualities
- Or, reintroduce CFG to the model and improve its creativity at the cost of inference speed and training time.
- (optional) T5 attention masked training for superior fine details and generalisation capabilities
- LoRA or full tuning via DeepSpeed ZeRO on a single GPU
- Quantise the base model using `--base_model_precision` to `int8-quanto` or `fp8-quanto` for major memory savings

See [hardware requirements](#flux1-dev-schnell) or the [quickstart guide](/documentation/quickstart/FLUX.md).

### PixArt Sigma

SimpleTuner has extensive training integration with PixArt Sigma - both the 600M & 900M models load without modification.

- Text encoder training is not supported, as T5 is enormous.
- LyCORIS and full tuning both work as expected
- ControlNet training is not yet supported
- [Two-stage PixArt](https://huggingface.co/ptx0/pixart-900m-1024-ft-v0.7-stage1) training support (see: [MIXTURE_OF_EXPERTS](/documentation/MIXTURE_OF_EXPERTS.md))

See the [PixArt Quickstart](/documentation/quickstart/SIGMA.md) guide to start training.

### Stable Diffusion 2.0 & 2.1

Stable Diffusion 2.1 is known for difficulty during fine-tuning, but this doesn't have to be the case. Related features in SimpleTuner include:

- Training only the text encoder's later layers
- Enforced zero SNR on the terminal timestep instead of offset noise for clearer images.
- The use of EMA (exponential moving average) during training to ensure we do not "fry" the model.
- The ability to train on multiple datasets with different base resolutions in each, e.g. 512px and 768px images simultaneously.

### Stable Diffusion 3

- LoRA and full finetuning are supported as usual.
@@ -105,20 +95,29 @@ An SDXL-based model with ChatGLM (General Language Model) 6B as its text encoder

Kolors support is almost as deep as SDXL, minus ControlNet training support.

### Legacy Stable Diffusion models

RunwayML's SD 1.5 and StabilityAI's SD 2.x are both trainable under the `legacy` designation.

---

## Hardware Requirements

### NVIDIA

Pretty much anything 3090 and up is a safe bet. YMMV.

### AMD

LoRA and full-rank tuning are verified working on a 7900 XTX 24GB and MI300X.

Lacking `xformers`, it will use more memory than equivalent NVIDIA hardware.

### Apple

LoRA and full-rank tuning are tested to work on an M3 Max with 128G memory, taking about **12G** of "Wired" memory and **4G** of system memory for SDXL.
- You likely need a 24G or greater machine for machine learning with M-series hardware due to the lack of memory-efficient attention.
- Subscribing to PyTorch issues for MPS is probably a good idea, as random bugs will make training stop working.

### Flux.1 [dev, schnell]

@@ -139,9 +138,8 @@ Flux prefers being trained with multiple large GPUs but a single 16G card should

### Stable Diffusion 2.x, 768px

- 16G or better


## Toolkit

@@ -153,9 +151,9 @@ Detailed setup information is available in the [installation documentation](/INS

## Troubleshooting

Enable debug logs for a more detailed insight by adding `export SIMPLETUNER_LOG_LEVEL=DEBUG` to your environment (`config/config.env`) file.

For performance analysis of the training loop, setting `SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL=DEBUG` will have timestamps that highlight any issues in your configuration.

For a comprehensive list of options available, consult [this documentation](/OPTIONS.md).

4 changes: 3 additions & 1 deletion documentation/DATALOADER.md
@@ -185,12 +185,14 @@ Images are not resized before cropping **unless** `maximum_image_size` and `targ
### `repeats`

- Specifies the number of times all samples in the dataset are seen during an epoch. Useful for giving more impact to smaller datasets or maximizing the usage of VAE cache objects.
- If you have a dataset of 1000 images vs one with 100 images, you would likely want to give the lesser dataset a repeats of `9` **or greater** to bring it to 1000 total images sampled.

> ℹ️ This value behaves differently to the same option in Kohya's scripts, where a value of 1 means no repeats. **For SimpleTuner, a value of 0 means no repeats.** Subtract one from your Kohya config value to obtain the equivalent for SimpleTuner; the total samples seen per epoch works out to `dataset_length + repeats * dataset_length`, so a `repeats` of **9** brings a 100-image dataset to 1000 samples.

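
The arithmetic behind that conversion can be sketched as follows (an illustration only, not SimpleTuner code):

```python
# Total samples seen per epoch under SimpleTuner's repeats semantics,
# where repeats=0 means every sample is seen exactly once.
def samples_per_epoch(dataset_length: int, repeats: int) -> int:
    return dataset_length + repeats * dataset_length

# A 100-image dataset with repeats=9 is sampled as often per epoch
# as a 1000-image dataset with repeats=0.
print(samples_per_epoch(100, 9))   # 1000
print(samples_per_epoch(1000, 0))  # 1000
```
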
### `vae_cache_clear_each_epoch`

- When enabled, all VAE cache objects are deleted from the filesystem at the end of each dataset repeat cycle. This can be resource-intensive for large datasets, but combined with `crop_style=random` and/or `crop_aspect=random` you'll want this enabled to ensure you sample a full range of crops from each image.
- In fact, this option is **enabled by default** when using random bucketing or crops.

### `skip_file_discovery`
