Fine-tune Stable Diffusion on medical images #11589
Replies: 2 comments 1 reply
-
Hi! You're being too generic to receive real help: you're doing something very specific but only providing very basic information. From what you've shared we can try to guess, but essentially the only way to get real help is if you provide the full details of what you're doing.
What I can say is that if you're looking to get something realistic and detailed from a diffusion model that you can use in the real world, you will need to do a full fine-tune of the model, and with a lot more images than 2,500; most full fine-tunes use at least a couple of million images, and even then you will still see some artifacts and errors, as you can in existing fine-tuned models. If you just need lower-quality, somewhat random generations for a specific class like "cancer", you can probably do that with a LoRA like you're doing.
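To make the "full fine-tune" suggestion concrete, this is a minimal sketch of how the diffusers example script is usually launched. It assumes `accelerate config` has already been run; the model id, dataset layout, paths, and hyperparameter values are illustrative assumptions, not tested settings:

```python
# Minimal sketch of a full fine-tune (no LoRA) with the diffusers
# example script. Assumes ./medical_dataset is an imagefolder dataset
# with a metadata.jsonl providing a "text" caption column.
# All paths and hyperparameters here are illustrative assumptions.
import subprocess

subprocess.run([
    "accelerate", "launch", "train_text_to_image.py",
    "--pretrained_model_name_or_path", "runwayml/stable-diffusion-v1-5",
    "--train_data_dir", "./medical_dataset",
    "--caption_column", "text",
    "--resolution", "512",
    "--train_batch_size", "1",
    "--gradient_accumulation_steps", "4",
    "--gradient_checkpointing",
    "--mixed_precision", "fp16",
    "--learning_rate", "1e-5",
    "--max_train_steps", "15000",
    "--output_dir", "./sd-medical-full",
], check=True)
```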
-
Hi @asomoza, thanks for your answer. Before SD I tried to train an LDM (from the original CompVis GitHub repo, not from HF) class-conditioned on a smaller dataset, and after fine-tuning the VAE and the UNet I got pretty good results, so I hoped I could at least get reasonable images from SD fine-tuning. Maybe in this case too (SD instead of LDM) I should fine-tune the UNet and VAE (and the text encoder? I'm not sure) before training the LoRA? If so, how can I do this? With the train_text_to_image.py script (without LoRA)? Thanks again
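In other words, the two-stage workflow I'm imagining is roughly the following; this is only a sketch under the assumption that a prior full run wrote its checkpoint to ./sd-medical-full, and the paths and hyperparameters are placeholders:

```python
# Rough sketch of the two-stage idea: train the LoRA on top of the
# fully fine-tuned checkpoint instead of the base model. The paths
# below are assumptions (./sd-medical-full from a prior full run).
import subprocess

subprocess.run([
    "accelerate", "launch", "train_text_to_image_lora.py",
    "--pretrained_model_name_or_path", "./sd-medical-full",
    "--train_data_dir", "./medical_dataset",
    "--caption_column", "text",
    "--resolution", "512",
    "--train_batch_size", "1",
    "--learning_rate", "1e-4",
    "--max_train_steps", "5000",
    "--rank", "8",
    "--output_dir", "./sd-medical-lora",
], check=True)
```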
-
Hi all,
I'm trying to fine-tune an SD model on a custom medical dataset. The goal is to be able to create new samples from a text prompt.
I have ~20 different classes and, for each image, a binary mask of the relevant region that also determines the image's class.
I wrote ~30 generic caption templates into which I insert the class name, plus a location and size derived from the binary mask, and from those I build an image-caption dataset (~2,500 pairs).
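For illustration, the caption-generation step looks roughly like this; the file layout, the class-in-filename convention, the templates, and the size threshold are all simplified assumptions:

```python
# Sketch of the caption-generation step. Assumes masks live in ./masks
# as PNGs, the class name is encoded in the filename, each mask is
# non-empty, and images sit in medical_dataset/images/. The templates
# and the 5% area threshold are arbitrary illustrative choices.
import json
import random
from pathlib import Path

import numpy as np
from PIL import Image

TEMPLATES = [
    "a scan showing a {size} {cls} in the {loc} region",
    "medical image with a {size} {cls} located at the {loc}",
]

def describe_mask(mask: np.ndarray) -> tuple[str, str]:
    """Derive coarse location and size words from a binary mask."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean() / mask.shape[0], xs.mean() / mask.shape[1]
    loc = ("upper " if cy < 0.5 else "lower ") + ("left" if cx < 0.5 else "right")
    size = "small" if mask.mean() < 0.05 else "large"
    return loc, size

records = []
for mask_path in sorted(Path("masks").glob("*.png")):
    mask = np.array(Image.open(mask_path).convert("L")) > 0
    loc, size = describe_mask(mask)
    cls = mask_path.stem.split("_")[0]  # e.g. "tumor_0001.png" -> "tumor"
    caption = random.choice(TEMPLATES).format(cls=cls, loc=loc, size=size)
    records.append({"file_name": f"images/{mask_path.name}", "text": caption})

# HF imagefolder datasets pick up captions from a metadata.jsonl that
# maps each file_name to a "text" column (the script's default).
with open("medical_dataset/metadata.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```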
Next I run the train_text_to_image_lora.py script on this data, but I'm getting invalid images (how I generate with the trained LoRA is sketched after these questions).
What am I doing wrong?
Should I train a separate LoRA for each class?
Should I fine-tune the entire SD model first? The VAE/tokenizer/UNet? I saw that they are frozen in the train_text_to_image_lora.py script.
Do you know of any hyperparameters I should tweak?
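This is roughly how I generate with the trained LoRA, in case the problem is in that step; the model id, LoRA path, and prompt are simplified placeholders:

```python
# Sanity check: load the base model, attach the trained LoRA, and
# generate from one of the training captions. Paths are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./sd-medical-lora")  # directory written by the script

image = pipe(
    "a scan showing a small tumor in the upper left region",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("lora_check.png")
```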
I'd appreciate any help.
Thanks