
What changes would we need to make if we used our own dataset? #1

awais00012 opened this issue Dec 16, 2023 · 86 comments

@awais00012

Thanks for the awesome explanation. Could you tell me which changes we need to make before training the model on our own data?

@explainingai-code
Owner

Hello,

Thanks for the appreciation. I apologize, that should have been part of the README; I have updated it now.
Can you take a look at https://github.com/explainingai-code/DDPM-Pytorch/blob/main/README.md#training-on-your-own-images and let me know in case you face any issues?

@awais00012
Author

Thanks for updating the repo for training the model on a custom dataset. However, I am facing this issue when I tried the model on my own data. My dataset, named ultrasound256CH1, contains train and test images. All the images are 256x256 with a single channel.
image

@explainingai-code
Owner

explainingai-code commented Dec 18, 2023

Can you tell me the im_path value you used in the config?
And also the directory structure of your dataset. Is it $REPO_ROOT/ultrasound256CH1/train/*.png ?

The error basically means that the code wasn't able to find any png files in the location it was searching.

@awais00012
Author

Can you tell me the im_path value you used in the config ? And also the directory structure of your dataset. Is it $REPO_ROOT/ultrasound256CH1/train/*.png ?

The error basically means that the code wasn't able to find any png files in the location it was searching.

Yes, the dataset is in the repo root, but I am still getting this error. How can I solve it?
image

image

@explainingai-code
Owner

Got it. Create a subfolder 'images' inside the train directory and put all training png files in there,
so $REPO_ROOT/ultrasound256CH1/train/images/*.png (see the layout sketch below).

Leave the config as it is, pointing to "ultrasound256CH1/train".
Can you try that and let me know if it works?
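For clarity, the expected layout would look like this (the file names are placeholders):

    ultrasound256CH1/
        train/
            images/
                img_0001.png
                img_0002.png
                ...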

@awais00012
Author

Yes, I tried that; unfortunately it does not work.

@explainingai-code
Owner

Can you print the directory and path the code is searching at https://github.com/explainingai-code/DDPM-Pytorch/blob/main/dataset/mnist_dataset.py#L40 and share that:

print(d_name, os.path.join(im_path, d_name, '*.{}'.format(self.im_ext)))

Also comment out the line at https://github.com/explainingai-code/DDPM-Pytorch/blob/main/dataset/mnist_dataset.py#L42
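For reference, a rough sketch of what that debugging step looks like inside the dataset's load_images loop; this is simplified from the repo's dataset class, so the exact lines may differ:

    import glob
    import os

    def load_images(self, im_path):
        ims = []
        for d_name in os.listdir(im_path):
            # Debug print: which sub-directory and which glob pattern is being searched
            print(d_name, os.path.join(im_path, d_name, '*.{}'.format(self.im_ext)))
            ims += glob.glob(os.path.join(im_path, d_name, '*.{}'.format(self.im_ext)))
        # assert len(ims) > 0, 'No images found'  # an assertion like this is what gets commented out while debugging
        print('Found {} images'.format(len(ims)))
        return ims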

@awais00012
Author

That error has been resolved. It was occurring because of the arrangement of the dataset. I created 5 class folders of different images for the train data, used "data/train/" as the path, and it worked.
Now I am encountering this error:
image

@explainingai-code
Owner

You are training on CPU as of now, right?
Also, can you confirm that your conda environment has Python 3.8 and has the requirements installed, as mentioned in https://github.com/explainingai-code/DDPM-Pytorch/tree/main?tab=readme-ov-file#quickstart

@awais00012
Author

Hi, I kept the batch size at 10 and just want to run for 40 epochs, and the total number of images is only 828. Could you please tell me why the model requires so much memory and how I can handle this issue?

RuntimeError: CUDA out of memory. Tried to allocate 640.00 GiB (GPU 0; 14.75 GiB total capacity; 2.16 GiB already allocated; 11.63 GiB free; 2.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

@explainingai-code
Owner

explainingai-code commented Dec 19, 2023

It's because the images are 256x256 and by default the model config downsamples only twice.
A few things you can try to make sure you are able to train (a config sketch for items 2-4 follows below):

  1. Resize the images to 64x64 in the loader and train the diffusion model on these 64x64 images
  2. Have all three down blocks downsample by setting down_sample : [True, True, True] in the config
  3. Try with num_mid_layers : 1 in the config
  4. Reduce the number of mid blocks by changing mid_channels : [256, 128] in the config

I think that should reduce the model size considerably and should allow you to train.
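For illustration, a rough sketch of how items 2-4 could look in the training config YAML; the surrounding keys and values here are placeholders rather than the repo's actual defaults:

    model_params:
      im_channels: 1                    # single-channel ultrasound images
      im_size: 256                      # or 64 if you resize in the loader (item 1)
      mid_channels: [256, 128]          # item 4: fewer mid blocks
      down_sample: [True, True, True]   # item 2: downsample in all three down blocks
      num_mid_layers: 1                 # item 3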

@awais00012
Author

Thanks, the model works very well; I trained it on my datasets.
I would request you to add a few more things to the repo for better results and for comparative analysis between the original and the generated images, like classifier-free guidance, exponential moving average, and IS and FID scores. Looking forward to your outstanding and easy implementation!

@explainingai-code
Owner

Yes, I wanted to have this repo as an intro to diffusion, which is why I didn't want to add those and preferred to leave this as a bare-minimum diffusion repo. I do plan to create a stable diffusion repo which should have some of these incorporated. Once that is done I will try to add the parts you mentioned here as well (if I am able to do that without adding too much complexity to the current implementation).

@thatdev6

Hello,

I made all the relevant changes mentioned in the README and in this thread, but after my images are loaded I get an AttributeError.

image

@explainingai-code
Owner

Hello @thatdev6, this code expects the path to have png files, but it seems like that's not the case for the path you have provided. Is it an npy file?
Because in that case you would have to change this line.

@thatdev6

Hello @thatdev6 , this code expects the path to have png files. But seems like thats not the case for the path you have provided. Is it npy file? Cause In that case you would have to change this line

No, my path has png files:

image

@explainingai-code
Owner

explainingai-code commented Mar 16, 2024

Are you using the same code or have you made some modifications? Your list at the end of dataset initialization is a list of numpy.ndarray objects (according to the error), which cannot be the case, because the dataset class during initialization just fetches the filenames.
Also, only 19 training images?

@thatdev6

Are you using the same code or have you made some modifications ? Your list at the end of dataset initialization is a list of numpy.ndarray objects(according to the error), which cannot be because the dataset class during initialization just fetches the filenames.

Yes, I modified the loader function to load and downsample my images. They are rectangular and in jpg format.
I figured out my mistake and have corrected it.

This is how I modified the loader function:
image

I also changed im_channels to 3. Now I get a runtime error while training:
image
image

@explainingai-code
Owner

explainingai-code commented Mar 16, 2024

The shapes of two images that your dataset returns are different (3x3264x2448 and 3x2448x3264).

@thatdev6

thatdev6 commented Mar 16, 2024

Before converting to tensor by any chance did you forget to convert the numpy arrays from HxWx3 to 3xWxH ?

How would I fix that?

@thatdev6

I also modified the sample function for rectangular images
image

@explainingai-code
Owner

I don't think the 3xWxH is an issue, because the error says that your image shapes are already 3xWxH, so that's fine. But I think your path does not have all same-size images; some images are 3264x2448 and some are 2448x3264.
Can you check this?

@thatdev6

I dont think the 3xwxh is an issue because the error says that your image shapes are 3xWxh so thats fine. But I think your path does not have all same size images. Some images are 3264x2448 and some are 2448x3264 . Can you check this.

Yes, I think you're right, so the solution would be to downsample all of them to 64x64?

@explainingai-code
Owner

Yes, center square crop to 2448x2448 and then resize to 64x64 (see the sketch below).
How many images are there in your dataset?
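A minimal sketch of that crop-and-resize using PIL (the file path is a placeholder):

    from PIL import Image

    im = Image.open('train/images/example.jpg')         # placeholder path
    w, h = im.size
    side = min(w, h)                                     # 2448 for these photos
    left, top = (w - side) // 2, (h - side) // 2
    im = im.crop((left, top, left + side, top + side))   # center square crop
    im = im.resize((64, 64))                             # then downsample to 64x64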

@thatdev6

Yes center square crop of (2448x2448) and then resize to 64x64. How many images are there in your dataset?

Around 600 images

@thatdev6

These are the changes I made to the loader and getitem functions. I assume there is no problem here, but for some reason the training gets interrupted (^C).

image
image
image

@explainingai-code
Owner

explainingai-code commented Mar 16, 2024

A couple of things. Move the image reading to the dataset's __getitem__ method, just like the code in the repo; simply collect the filenames in the load_images method and nothing else. You can also do the cropping and resizing in __getitem__ (a sketch follows below).
Secondly, can you check why it's printing "Found 19 images" when it should actually be 600?
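As a rough illustration of that split of responsibilities (the class name, extension handling, and normalization here are assumptions, not the repo's exact code):

    import glob
    import os
    import torchvision.transforms as T
    from PIL import Image
    from torch.utils.data import Dataset

    class CrackImageDataset(Dataset):                    # hypothetical name
        def __init__(self, im_path, im_ext='jpg', im_size=64):
            self.im_size = im_size
            self.to_tensor = T.ToTensor()
            # __init__ only collects filenames; no image decoding happens here
            self.images = glob.glob(os.path.join(im_path, '**', '*.{}'.format(im_ext)), recursive=True)
            print('Found {} images'.format(len(self.images)))

        def __len__(self):
            return len(self.images)

        def __getitem__(self, index):
            im = Image.open(self.images[index]).convert('RGB')
            # Per-item center square crop and resize, instead of doing it once at load time
            w, h = im.size
            side = min(w, h)
            left, top = (w - side) // 2, (h - side) // 2
            im = im.crop((left, top, left + side, top + side)).resize((self.im_size, self.im_size))
            im_tensor = self.to_tensor(im)               # CxHxW in [0, 1]
            return (2 * im_tensor) - 1                   # scale to [-1, 1], as DDPM training typically expects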

@thatdev6

Couple of things. Move the image reading to the data loader get_item method just like the code in repo. Simply collect the filenames from load_images method and nothing else. You can do the cropping and resize also in get_item method. Secondly can you check why its printing "Found 19 images" when actually it should be 600.

Okay, so first of all I should leave the loader function as it is and just modify it for the jpg images; secondly, I should do the image formatting in the getitem function.
It says found 19 images because at the moment I have only uploaded a subset of the dataset; it was quite annoying to wait for the images to load only to encounter an error in training.

@thatdev6

thatdev6 commented Mar 16, 2024

How do you suggest I fix this?
image

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.75 GiB. GPU 0 has a total capacity of 14.75 GiB of which 57.06 MiB is free. Process 15150 has 14.69 GiB memory in use. Of the allocated memory 11.21 GiB is allocated by PyTorch, and 3.35 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

These are the modifications I made:
image
image
image

@explainingai-code
Owner

That's something you would have to experiment with and tune, but I am assuming you are anyway limited by compute and 4 is the max you can go?

@thatdev6

thatdev6 commented Mar 18, 2024

Yes, 4 is the highest I can go. I will obviously need help in tuning the model parameters for better results, so I will keep you updated.

Thank you so much for the help in the meantime.

@thatdev6

Yes 4 is the highest I can go, I will obviously need help in tuning the model parameters for better results so I will keep you updated.

Thank you so much for the help in the mean time.

Is there any way I can save my progress while training? What I want to do is, say, train up to 130 epochs, stop training, and then continue from epoch 130 again.

@thatdev6

This is my result after 100 epochs with batch size 6.
image

Why are the generated outputs getting so dark? Shouldn't they start to mimic the actual images at some point?

@explainingai-code
Owner

This is actually fine. When I was also training on an RGB dataset, I first used to get these outputs, so more training should ultimately lead to actual outputs (close to the dataset images). In these generation results, when you look at x0_400/500, were they closer to your actual dataset images? Also, did you end up training for more epochs? Did the outputs improve?

@thatdev6

This is at x0_500
image

And this is at x0_999
image

@thatdev6

For reference, these are some dataset images:
image
image
image

@thatdev6

I would also like to mention that as of now my training time per epoch is 40s, so training 100 epochs took me a little over an hour. I remember in your YouTube video you trained your model for around 3 hours over 60 epochs, so my dataset might also be a limiting factor here (257 images after split).
Let me know what you think of this conclusion; also, how would you update the model params to get better results?

@explainingai-code
Owner

->Is there any way I can save my progress while training? What I want to do is, say, train up to 130 epochs, stop training, and then continue from epoch 130 again.
For resuming training, this should already be happening when the code loads the checkpoint here. So all you would need to do is, after 130 epochs, download the checkpoint and, before running the next 130 epochs, simply place the downloaded checkpoint at the right path (a rough sketch of the idea follows below).
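Roughly, the resume behavior amounts to the following (the checkpoint name and the model are stand-ins, not the repo's exact code):

    import os
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(8, 8))   # stand-in for the U-Net
    ckpt_path = 'ddpm_ckpt.pth'              # illustrative checkpoint file name

    # If a checkpoint from an earlier run exists, load it before training starts
    if os.path.exists(ckpt_path):
        model.load_state_dict(torch.load(ckpt_path, map_location='cpu'))

    # ... training loop ...

    # Save after every epoch so the run can be stopped and resumed later
    torch.save(model.state_dict(), ckpt_path)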
->so my dataset might also be a limiting factor over here (257 images after split).
Yes, 250 images is very little. The number that I gave was for training MNIST: 60,000 images of 28x28 on an Nvidia V100, where I trained for ~50 epochs with batch size 64, which means about 50,000 steps. With 250 images and batch size 6, that is effectively 40 steps per epoch, so even after 100 epochs that is actually just 4,000 steps.
I would suggest doing the following things:

  1. If you can get more data then that would definitely help.
  2. Continue training for longer if, by analyzing the x0-x999 images, you are continuing to see improvement in the model's generation capabilities.
  3. Use augmentations, flipping, and even cropping, based on your images. I am assuming your goal is to generate images with those cracks, so you could take multiple crops from the same image, like below.
Screenshot 2024-03-19 at 8 51 38 AM
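A hedged sketch of that crop-based augmentation using torchvision (the crop size and pipeline are illustrative):

    import torchvision.transforms as T

    # Random crops plus flips multiply the number of distinct views the diffusion model sees per photo
    augment = T.Compose([
        T.RandomCrop(1024),           # random 1024x1024 patch from the full photo
        T.RandomHorizontalFlip(),
        T.Resize((64, 64)),
        T.ToTensor(),                 # CxHxW in [0, 1]
    ])
    # Applied inside __getitem__, each epoch then sees a different crop of every image.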

@thatdev6

So

->Is there any way i can save my progress while training, what i want to do is say train up to 130 epochs stop my training and then continue training from 130 epochs again? For resuming training, this should be already happening when the code is loading the checkpoint here So all you would need to do is after 130 epochs, download the checkpoint and before running the next 130 epochs, simply place this downloaded checkpoint in the right path.

Are the checkpoints saved in the .pth file? So I would have to transfer that file before starting training again, then?

@explainingai-code
Owner

Yes, download the .pth file after one round of training, and put it back at the necessary path before starting the second round.

@thatdev6

thatdev6 commented Mar 19, 2024

Okay, I will work on the suggestions. In your opinion, what is the better alternative:
opting for a larger dataset with images from different sources and sizes,
or the same images cropped to form a larger dataset?

@explainingai-code
Owner

explainingai-code commented Mar 19, 2024

I would say a larger dataset is beneficial, and then using cropping you can further increase the number of images the diffusion model gets to see during training.
But obviously, if that cannot be done due to some constraint, then I would still suggest trying out just the cropping solution.

@thatdev6

image
image
image

These are my images after crop and resize in the data loader function. Are these okay? I think they have lost a lot of quality.

Also, I do have a larger dataset of around 45k images, but the images in it are too inconsistent in every way (size, quality),
so what effect will my current crop in the data loader function have on images which are not 3264x2448 or 2448x3264?

@explainingai-code
Owner

explainingai-code commented Mar 19, 2024

For the 45K images, my guess is that the center crop and resize to 64x64 should handle the inconsistencies in size (I don't know how inconsistent they are in quality). But if you have 45K images, I would say why not try with them, just to see how good an output quality you get from DDPM.

@thatdev6

thatdev6 commented Mar 19, 2024

Yes, that is the goal, but training on those images will take a considerable amount of time, and if the output is still noisy, all the time spent will have been wasted.

Also, are the black outputs normal? Currently I am saving my progress and increasing epochs on the 257 images, but at 130 epochs the final outputs were mostly black, so do they indicate that the final noise-free images will be black, or is this part of the process?

For the 45K images, my guess is that the centre crop and resize to 64x64 should handle the inconsistencies in size(I dont know how inconsistent they are in quality). But if you have 45K iamges, I would say why not try with that just to see how good of an output quality you get from ddpm.

@explainingai-code
Owner

Yes, diffusion models require a decent amount of data to train, so unless you throw in more compute power I don't see any other way to reduce the time.
The single-color outputs are normal during the training process (as I mentioned, it happened during my training as well). And I would suggest not comparing based on epochs: 100 epochs on a dataset of 200 images is not the same as 100 epochs on a dataset of 50,000 images. Rather, use the number of steps/iterations that have happened in your training; in your case that is just 5K steps (compared to the 80K steps that I used in the video). I don't think you need as many steps as that, because you have less variation in your dataset, but I just wanted to give some perspective.

@thatdev6

Yes diffusion models require decent amount of data to train so unless you throw in more compute power, I dont see any other way to reduce the time. The single color outputs are normal to occur during the training process(as I mentioned it happened during my training as well). And I would suggest to not compare based on epochs, 100 epochs on a dataset of 200 is not the same as 100 epochs on a dataset of 50000 images. Rather use the number of steps/iterations that has happened in your training, in your case its just 5K steps(compared to the 80K steps that I used in the video). I dont think you need as many steps as that cause you have lesser variation in your dataset but still just wanted to give a perspective.

image
I am finally starting to see results on the 257-image dataset after 400 epochs of training. I am assuming there are always going to be noisy images in the generated batch, even if I train on the 45k images for around 500 epochs.

Also, is there a way I can rescale the generated images to a higher resolution after generation?

@explainingai-code
Owner

Great. I don't think that assumption is correct; once your model has converged (it looks like that point may be somewhere around 1000 epochs) it will not produce these noisy images at all. And with 45K images, after around 200 epochs you will most likely be able to see decent outputs (obviously you will have to experiment to assert that).

For higher resolution, you can try pre-trained super-resolution models (just searching the web should give you some models, and you can test them to see how good a result you get).
Or you can go the stable diffusion route by first training an autoencoder to convert from 256x256 to 32x32, training the diffusion model on the 32x32 latents, generating 32x32 latents at sampling time, and then decoding them to 256x256 using the autoencoder. But that would require training multiple models (diffusion and autoencoder).

@thatdev6

Hello, it's me again.
I have a question: can this model generate images based on prompts? For example, say I prompt it to generate an image with snow and cracks; is this achievable?

@explainingai-code
Owner

Hello :) Yes, the diffusion model can be conditioned on class (using class embeddings) or text (using cross attention), but this repo does not have that; this one only allows you to generate images unconditionally.
I do have it in the stable diffusion repo (https://github.com/explainingai-code/StableDiffusion-PyTorch); you can either use that, or you can look at how class/text conditioning is achieved there and replicate the same with DDPM.

@dangdinh17

Hello, I have a question about my model.
I want to train on my data with config params like this:
image
My images are in jpg format. I have tried other methods like DataParallel but it doesn't work,
so please help me with this.

@explainingai-code
Owner

@dangdinh17 Can you tell me what error you are facing? Out of memory?

@dangdinh17

Yes, my error is this:
image

@explainingai-code
Owner

Yeah, can you try with 64x64? In the config set im_size to 64, and in the dataset class's getitem method resize the image to 64x64.
Can you see if that gets rid of the error?

@dangdinh17

I have tried with 64x64 and it worked, but I want to train with the shape 128x128 because of my study. Let me introduce my study; can you give me some essential suggestions, please?
I have a dataset of 150x150 images, and I have blurred versions with motion blur; I want to use a DDPM model to restore the images to the highest quality. So I want to train my model with the shape 128x128 or 256x256 to suit my data shape. I have some questions:

  • If I divide my images (for example 128x128) into 64x64 parts and train on these 64x64 parts, will the trained model work well at the shape 128x128?
  • I have tried many models like GANs but the results were not good, so now I am trying DDPM and maybe stable diffusion next; can you give me some must-try models for my study?

Thank you.

@explainingai-code
Owner

If your images are from a dataset (or match a dataset) that has a super-resolution model available, then you can train the DDPM on 64x64 images and then use that super-resolution model checkpoint to get 128x128 images. Or you can train a super-resolution model yourself.
The second option is, like you said, to try an LDM; your trained autoencoder will then take care of converting 256x256 images to 64x64 latents (as well as converting the 64x64 generated latents back to 256x256), and the diffusion model will be trained on the smaller 64x64 latent images.

@dangdinh17

Oh yes, I see. Thank you so much!

@dangdinh17

I have another question: if my data involves motion blur or exposure blur, will the original code work as is, or must I train the model by adding more noise types, like motion blur and light blur, rather than only Gaussian noise?

@explainingai-code
Owner

@dangdinh17, apologies for the late reply, but I have responded to your issue on the Stable Diffusion repo. Do take a look at the repo mentioned in that reply - explainingai-code/StableDiffusion-PyTorch#21 (comment). I think that implementation does exactly what you need.

@xXCoffeeColaXc

@dangdinh17 Hey, just don't allow the network to compute attention at resolution 128, only at lower resolutions. This should solve your memory issues (a sketch of the idea follows below).
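As an illustration of that idea (not this repo's actual block), a sketch of a self-attention wrapper that skips attention above a chosen resolution:

    import torch
    import torch.nn as nn

    class GatedSelfAttention(nn.Module):
        """Self-attention that runs only when the feature map is small enough;
        larger maps pass through unchanged, keeping memory use bounded."""
        def __init__(self, channels, num_heads=4, max_attn_resolution=64):
            super().__init__()
            self.max_attn_resolution = max_attn_resolution
            self.norm = nn.GroupNorm(8, channels)   # assumes channels divisible by 8
            self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

        def forward(self, x):
            b, c, h, w = x.shape
            if max(h, w) > self.max_attn_resolution:
                return x                            # skip attention at high resolutions
            feat = self.norm(x).reshape(b, c, h * w).transpose(1, 2)   # B x HW x C
            out, _ = self.attn(feat, feat, feat)
            return x + out.transpose(1, 2).reshape(b, c, h, w)

    # e.g. GatedSelfAttention(128)(torch.randn(1, 128, 128, 128)) returns the input unchanged.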
