support SD3 #1374
base: dev
Conversation
this is a chance to just use Diffusers modules instead of doing everything from scratch. why not take it?
There are several reasons for this, but the biggest is that it is difficult to extend: for example, LoRA, custom ControlNet, Deep Shrink, etc. Also, considering the various processes in the training scripts, such as conditional loss, SNR, masked loss, etc., the training scripts need to be written from scratch.
all of that is done via peft other than deepshrink, but you can make a pipeline callback for that.
i mean to use the sd3 transformer module from the diffusers project. it is frustrating to always see bespoke versions of things with unreadable comments in this repository. can you at least leave better comments?
I think the transformer module should be extensible for the future. In addition, the SD3 transformer is based on sd3-ref (Stability AI's official repo) and was modified by KBlueLeaf to support xformers etc. So it predates Diffusers and is not written from scratch. I appreciate your understanding. I will add better comments in future code, including this PR.
Hello, I have been trying out SD3 training. It seems to be working pretty well. 😊 One thing I noticed is that generation of sample images while training is not yet implemented. This made it hard for me to see how my SD3 training was going and make adjustments. Implementing full support for sample images was difficult, but I found a cheap way to get most features working, and now I have sample images working again. This code is not properly integrated with the usual sample image generation code, but if people want to use it while they wait for a real, well-integrated implementation, it does the basics of what's needed. Just go into your SD3 training script and find this commented-out block:

```
# sdxl_train_util.sample_images(
#     accelerator,
#     args,
#     None,
#     global_step,
#     accelerator.device,
#     vae,
#     [tokenizer1, tokenizer2],
#     [text_encoder1, text_encoder2],
#     mmdit,
# )
```

and replace it with this:

```
# Generate sample images
if args.sample_every_n_steps is not None and global_step % args.sample_every_n_steps == 0:
    import argparse
    import datetime
    import random
    import shlex

    import numpy as np
    from PIL import Image

    from sd3_minimal_inference import do_sample

    assert args.save_t5xxl, "When generating sample images in SD3, --save_t5xxl parameter must be set"

    with open(args.sample_prompts, "r") as file:
        lines = [line.strip() for line in file if line.strip()]

    vae.to("cuda")
    for line in lines:
        logger.info(f"Generating image: {line}")

        # split the caption from any trailing --w/--h/--s/--l/--d options
        if line.find("--") != -1:
            prompt = line[: line.find("--") - 1].strip()
            line = line[line.find("--") :]
        else:
            prompt = line
            line = ""
        parser_s = argparse.ArgumentParser()
        parser_s.add_argument("--w", type=int, action="store", default=1024, help="image width")
        parser_s.add_argument("--h", type=int, action="store", default=1024, help="image height")
        parser_s.add_argument("--s", type=int, action="store", default=30, help="sample steps")
        parser_s.add_argument("--l", type=int, action="store", default=4, help="CFG")
        parser_s.add_argument("--d", type=int, action="store", default=random.randint(0, 2**32 - 1), help="seed")
        prompt_args = shlex.split(line)
        args_s = parser_s.parse_args(prompt_args)

        # prepare embeddings: positive prompt, then an empty negative prompt
        lg_out, t5_out, pooled = sd3_utils.get_cond(prompt, sd3_tokenizer, clip_l, clip_g, t5xxl)
        cond = torch.cat([lg_out, t5_out], dim=-2), pooled
        lg_out, t5_out, pooled = sd3_utils.get_cond("", sd3_tokenizer, clip_l, clip_g, t5xxl)
        neg_cond = torch.cat([lg_out, t5_out], dim=-2), pooled

        latent_sampled = do_sample(
            args_s.h, args_s.w, None, args_s.d, cond, neg_cond, mmdit, args_s.s, args_s.l, weight_dtype, accelerator.device
        )

        # latent to image
        with torch.no_grad():
            image = vae.decode(latent_sampled)
        image = image.float()
        image = torch.clamp((image + 1.0) / 2.0, min=0.0, max=1.0)[0]
        decoded_np = 255.0 * np.moveaxis(image.cpu().numpy(), 0, 2)
        decoded_np = decoded_np.astype(np.uint8)
        out_image = Image.fromarray(decoded_np)

        # save image
        output_dir = os.path.join(args.output_dir, "sample")
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, f"{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.png")
        out_image.save(output_path)
    vae.to("cpu")
```

It supports a caption followed by the usual optional --w, --h, --s, --l, --d (for width, height, steps, cfg, and seed). It doesn't support negative captions, and it won't work right with captions longer than 75 tokens. I'm finding sample image generation to be helpful. For example, I noticed that most of my sample output images start off looking brighter than expected (with white or bright backgrounds). Edit: Might have been my cfg of 7.5; SD3 seems to want lower cfgs. I had to push the sample step count up as the cfg was lowered. Image quality still seems poor though, compared to what some people are getting out of SD3.
Think I've found an issue that's causing the poor-quality SD3 samples. The do_sample() function is not filling in the `shift` parameter of `ModelSamplingDiscreteFlow`:

```
class ModelSamplingDiscreteFlow:
    """Helper for sampler scheduling (ie timestep/sigma calculations) for Discrete Flow models"""

    def __init__(self, shift=1.0):
        self.shift = shift
        timesteps = 1000
        self.sigmas = self.sigma(torch.arange(1, timesteps + 1, 1))
```

From sd-scripts' sd3_minimal_inference.py:

```
model_sampling = sd3_utils.ModelSamplingDiscreteFlow()
```

so `shift` is left at its default of 1.0.
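A minimal sketch of what the fix might look like; the 3.0 value is an assumption based on the shift commonly used for SD3, not something confirmed in this thread:

```
# Hypothetical fix sketch: pass shift explicitly instead of relying on the
# default of 1.0; 3.0 is the value commonly used for SD3 (assumption).
model_sampling = sd3_utils.ModelSamplingDiscreteFlow(shift=3.0)
```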
Thank you! I fixed it. The generated images seem better now.
I agree that the sample image generation is really useful. In my understanding, T5XXL is on CPU, so I think it might be necessary to get the TE's output for the sampling prompts in advance, at the same time as the TE caching. However, if T5XXL works on CPU in an acceptable time, the implementation of the sample generation will be much easier (like your implementation :) .
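A rough sketch of that precompute idea, reusing get_cond from the snippet above (the cache dict and its use are hypothetical, not code from this PR):

```
# Hypothetical sketch: encode every sample prompt once, at the same time as
# text-encoder output caching, so T5XXL need not run again during training.
sample_cond_cache = {}
with open(args.sample_prompts, "r") as f:
    prompts = [line.strip() for line in f if line.strip()]
for p in prompts:
    lg_out, t5_out, pooled = sd3_utils.get_cond(p, sd3_tokenizer, clip_l, clip_g, t5xxl)
    sample_cond_cache[p] = (torch.cat([lg_out, t5_out], dim=-2), pooled)
# at sampling time: cond = sample_cond_cache[prompt]
```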
it takes about 30-50 seconds to run T5 XL on the CPU; i think XXL has even worse latency for each embed
@kohya-ss, the calls to get_cond() only take around 2 seconds each on my machine. The whole sample image generation takes just 16 seconds per image for me, and I am still doing 80 sample steps for the images. :D My PC is an ordinary (but good) home PC with a 13th gen Intel i7, and I've got 64 GB of CPU RAM. Perhaps the people finding T5 XL to be very slow are running out of CPU memory and swapping it out to disk without realizing? @bghira
@kohya-ss thank you for the reply. Even 512 fails; can you check if something is wrong? Full train logs: https://gist.github.com/FurkanGozukara/b13e2c263138afd5e8548eb6ae9786ce toml: https://gist.github.com/FurkanGozukara/f01c76c4eaa2172352ebf1b8e08a395f
no, i don't think the fused backward pass works with DDP. it's an Accelerate thing
There doesn't seem to be a problem with the settings, but the memory is really tight, so it may also depend on the environment. Things may be different if you have 3 or more GPUs. Here's my case:
Hmm... I don't know the details, but it seems that removing …
thank you so much. this is 512px, right? i will try different torch and accelerate versions, let's see if it helps
Yes, it is 512x512 and batch size=1.
Fixed a bug where train()/eval() was not called correctly with schedule-free optimizers. train()/eval() is now called at every step, so if training becomes slow, please let me know. I will add another fix if that happens.
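For anyone unfamiliar with the schedule-free contract: the optimizer itself exposes train()/eval() methods that must bracket training and evaluation, which is why this mattered. A minimal sketch using the facebookresearch schedulefree package (a standalone toy model, not sd-scripts code):

```
import torch
import schedulefree

model = torch.nn.Linear(16, 16)  # stand-in for the real network
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

optimizer.train()  # must be in train mode before optimizer.step()
for _ in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 16)).pow(2).mean()
    loss.backward()
    optimizer.step()

optimizer.eval()  # must be in eval mode before validation or checkpointing
# ... evaluate or save the model here ...
optimizer.train()  # switch back before resuming training
```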
@kohya-ss which would be the proper repo to open an issue in regarding the DDP overhead when training FLUX? A run taking 25 GB on a single GPU becomes more than 48 GB even on 2x GPU :/
Just did a LoRA test with …
I just did the same, and it seems much better to me too. I had Adafactor before I switched over. I'm planning to use AdamWScheduleFree as my new default optimizer. (Edit: I also deleted my max_grad_norm=1.0 setting.) Thanks go to @sdbds (and @kohya-ss too) for the implementation, and for recognizing that it was worth adding to sd-scripts.
I actually continued training LoRAs that had been trained for thousands of steps with Adafactor (not schedule-free) with the new schedule-free optimizer (and also got rid of my max_grad_norm=1.0), and saw immediate image quality improvements. The key lengths probably didn't shrink way back in the first 30 or so steps before I output sample images, so I guess the improvements can't be entirely down to shorter key lengths?
Note that I mentioned the max_grad_norm=1.0 setting above.
do you need changes in the learning rate, and has VRAM usage changed?
The Facebookresearch page (which has the code for this optimizer) says a higher LR is suggested.
In my case, I left my LoRA training rate at 8e-5 and saw immediate quality gains 30 iterations later when my first sample images were generated. But that's picking up the LoRA from where I left off training it before. I haven't tried training a new one from scratch; maybe that would need a higher LR? I haven't tested the memory requirements, but the Facebookresearch page covers them. We're currently not using their 'wrapper' version of the optimizer, which uses more memory. The page I'm quoting from is here, if you want more detail: https://github.com/facebookresearch/schedule_free
@araleza thank you. Which extra optimizer arguments does it need, like weight_decay=0.01 or any other? Sadly it uses more VRAM than Adafactor (29000 MB vs 27700 MB at fp16), but it is faster per step.
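For what it's worth, extra optimizer arguments in sd-scripts are normally passed through --optimizer_args; a hypothetical invocation might look like this (the weight_decay value is only an illustration, not a recommendation, and the 8e-5 rate is the one mentioned above):

```
--optimizer_type AdamWScheduleFree --optimizer_args "weight_decay=0.01" --learning_rate 8e-5
```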
I've been playing with Flux for almost 3 weeks and found some interesting things I'd like others to corroborate.
This overhead of DDP seems to be expected, and more efficient training seems to require DeepSpeed or FSDP etc. |
well, DeepSpeed only works with the CPU-based AdamW optimizer, and FSDP is all-or-nothing for sharding and slower than ZeRO
Currently the alpha channel is dropped by `pil_resize()` when `--alpha_mask` is supplied and the image width does not exceed the bucket. This codepath is entered on the last line, here:

```
def trim_and_resize_if_required(
    random_crop: bool, image: np.ndarray, reso, resized_size: Tuple[int, int]
) -> Tuple[np.ndarray, Tuple[int, int], Tuple[int, int, int, int]]:
    image_height, image_width = image.shape[0:2]
    original_size = (image_width, image_height)  # size before resize

    if image_width != resized_size[0] or image_height != resized_size[1]:
        # resize the image
        if image_width > resized_size[0] and image_height > resized_size[1]:
            image = cv2.resize(image, resized_size, interpolation=cv2.INTER_AREA)  # use cv2 because we want INTER_AREA
        else:
            image = pil_resize(image, resized_size)
```
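A sketch of one possible fix for pil_resize, assuming it receives images in cv2's BGR(A) channel order as elsewhere in the loading code; the signature here is illustrative, not necessarily the repo's actual one:

```
import cv2
import numpy as np
from PIL import Image

def pil_resize(image: np.ndarray, size):
    # Fix sketch: keep the alpha channel when the input has one.
    # Assumes cv2-style BGR(A) channel order for the numpy array.
    has_alpha = image.ndim == 3 and image.shape[2] == 4
    if has_alpha:
        pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGRA2RGBA))
    else:
        pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    resized = np.array(pil_image.resize(size, Image.LANCZOS))
    return cv2.cvtColor(resized, cv2.COLOR_RGBA2BGRA if has_alpha else cv2.COLOR_RGB2BGR)
```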
Cleanup
fix: backward compatibility for text_encoder_lr
Retain alpha in `pil_resize` for `--alpha_mask`
Thanks to everyone in the community for conducting the tests, @araleza @recris
gen_img.py to use refactored Text Encoding etc.