
Stable Diffusion PR optimizes VRAM, generate 576x1280 images with 6 GB VRAM #364

Closed

mgcrea opened this issue Sep 4, 2022 · 180 comments

@mgcrea
Contributor

mgcrea commented Sep 4, 2022

Seen on HN, might be interesting to pull into this repo? (The PR looks a bit dirty with a lot of extra changes, though.)

basujindal/stable-diffusion#103

@neonsecret

Check out my other CompVis PR: CompVis/stable-diffusion#177
It might be more suitable for you.

@lstein
Collaborator

lstein commented Sep 4, 2022

Thanks for the tip! I'll check them both out.

@sunija-dev

Would be cool to get this implemented! ❤️
I've got two users with only 4 GB VRAM for whom the model won't even load. If I read it correctly, that should work with basujindal's version.

@lstein
Collaborator

lstein commented Sep 4, 2022

Don't I know it!

@Vargol
Contributor

Vargol commented Sep 4, 2022

By the looks of it, without all the whitespace changes, we get...

diff ldm/modules/attention.py ldm/modules/attention.py.opt
181a182
>         del context, x
187a189
>         del q, k
193a196
>             del mask
196c199,200
<         attn = sim.softmax(dim=-1)
---
>         sim[4:] = sim[4:].softmax(dim=-1)
>         sim[:4] = sim[:4].softmax(dim=-1)
198,200c202,204
<         out = einsum('b i j, b j d -> b i d', attn, v)
<         out = rearrange(out, '(b h) n d -> b n (h d)', h=h)
<         return self.to_out(out)
---
>         sim = einsum('b i j, b j d -> b i d', sim, v)
>         sim = rearrange(sim, '(b h) n d -> b n (h d)', h=h)
>         return self.to_out(sim)

Attached in diff -u format for patching:

attn.patch.txt
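For readers who just want the net effect of the patch, here is a minimal sketch of what the patched CrossAttention.forward roughly amounts to. This is an illustration assuming the standard CompVis layer layout; mask handling and dropout are omitted, so it is not the exact file:

```python
from torch import einsum, nn
from einops import rearrange

class CrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64):
        super().__init__()
        inner_dim = dim_head * heads
        context_dim = context_dim or query_dim
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x, context=None):
        h = self.heads
        context = x if context is None else context
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        del context, x                     # drop references as soon as possible
        q, k, v = (rearrange(t, 'b n (h d) -> (b h) n d', h=h) for t in (q, k, v))
        sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
        del q, k                           # free the big intermediates early
        # softmax applied in two slices to lower the peak size of the live tensors
        sim[4:] = sim[4:].softmax(dim=-1)
        sim[:4] = sim[:4].softmax(dim=-1)
        sim = einsum('b i j, b j d -> b i d', sim, v)
        sim = rearrange(sim, '(b h) n d -> b n (h d)', h=h)
        return self.to_out(sim)
```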

@lstein
Collaborator

lstein commented Sep 4, 2022

Oh thank you very much for that! I actually just did the same thing with @neonsecret 's attention optimization and it works amazingly. Without any change to execution speed my test prompt now uses 3.60G of VRAM. Previously it was using 4.42G.

Now I'm looking to see if image quality is affected.

Any reason to prefer basujindal's attention.py optimization?
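For reference, the "Max VRAM used" numbers quoted throughout this thread come from PyTorch's CUDA allocator counters; a minimal way to take the same kind of measurement yourself (not the repo's exact reporting code) is:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one generation here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Max VRAM used for this generation: {peak_gb:.2f}G")
```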

@smoke2007

nothing to add to this conversation, except to say that i'm excited for this lol ;)

@lstein
Collaborator

lstein commented Sep 4, 2022

I've merged @neonsecret 's optimizations into the development branch "refactoring-simplet2i" and would welcome people testing it and sending feedback. This branch probably still has major bugs in it, but I refactored the code to make it much easier to add optimizations and new features (particularly inpainting, which I'd hoped to have done by today).

@Vargol
Contributor

Vargol commented Sep 4, 2022

Hmmm... on my barely-coping 8G M1 it's not so hot: the image is different and it took twice as long. But it's an old clone, let me try it on a fresher one.

@lstein
Collaborator

lstein commented Sep 4, 2022

Darn. I'd hoped that there was such a thing as a free lunch.

I'm on an atypical system with 32G of VRAM, so maybe my results aren't representative. I did timing and peak VRAM usage, and then looked at two images generated with the same seed and they were indistinguishable to the eye. Let me know what you find out.

Are you on an Apple machine? I didn't know there were clones. The M1 MPS support in this fork is really new, and I wouldn't be surprised if it needs additional tweaking to work properly with the optimization.

@Vargol
Contributor

Vargol commented Sep 4, 2022

Sorry, I meant it's an old local clone of your repo; I didn't want to make changes in my local clone of the current one, as that works quite nicely :-) But yes, I'm not surprised MPS is breaking things, and PyTorch is pretty buggy too; I've raised a few MPS-related issues over there that Stable Diffusion hits.

@neonsecret

neonsecret commented Sep 4, 2022

@Vargol
Contributor

Vargol commented Sep 4, 2022

okay, on the main branch the images are the same, but it is really slow, even compared to my normal times...

10/10 [06:49<00:00, 40.94s/it]
compared
10/10 [02:14<00:00, 13.44s/it]
(spot the man with the 8GB M1)

I'll do some more digging

@magnusviri any chance you can check this out on a bigger M1 ?

@lstein
Collaborator

lstein commented Sep 4, 2022

Ok, this is @neonsecret 's PR, which I just tested and merged into the refactor branch. I'm seeing a 20% reduction in memory footprint, but unfortunately not the 35% reduction reported in the Reddit post. Presumably the difference is due to the earlier optimizations in basujindal's branch. I haven't really wanted to use those opts because the code is complex and I hear it carries a performance hit. Advice?

@Vargol
Contributor

Vargol commented Sep 4, 2022

Last I looked at basujindal's, there were loads of assumptions about using CUDA, and a big chunk of the memory saving seemed to come from forcing half precision. That was a week ago, so things might have changed.

@lstein
Collaborator

lstein commented Sep 4, 2022

I've already got half precision on as the default. I think what I'm missing is basujindal's optimization of splitting the task into several chunks and loading them onto the GPU sequentially.
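As a rough illustration of that idea (split the work and move pieces onto the GPU one at a time), and emphatically not basujindal's actual code, the pattern looks something like this:

```python
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
stages = [nn.Linear(512, 512) for _ in range(4)]   # stand-ins for model pieces

def run_sequentially(x, stages, device=device):
    for stage in stages:
        stage.to(device)               # bring only this piece into VRAM
        x = stage(x.to(device))
        stage.to('cpu')                # evict it before loading the next piece
        if device == 'cuda':
            torch.cuda.empty_cache()   # return the freed blocks to the allocator
    return x

out = run_sequentially(torch.randn(1, 512), stages)
```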

@lstein
Collaborator

lstein commented Sep 4, 2022

Frankly, I'm happy with the 20% savings for now.

@Vargol
Contributor

Vargol commented Sep 4, 2022

Seems the speed loss is coming from the twin calls to softmax.

<         attn = sim.softmax(dim=-1)
---
>         sim[4:] = sim[4:].softmax(dim=-1)
>         sim[:4] = sim[:4].softmax(dim=-1)

If I change it to use
sim = sim.softmax(dim=-1)

instead, I get all my speed back (I assume more memory usage, though I need better diagnostic tools than Activity Monitor).

I can do 640x512 images now, so there does appear to be some memory saving even after reverting that change. It would be interesting to see what happens on a larger box, and whether it's worth wrapping the two versions in an "if mps" statement.

EDIT:

seems I can also now do 384x320 without it using swap.

@lstein
Collaborator

lstein commented Sep 4, 2022

No measurable slowdown at all on CUDA. Maybe we make the twin softmax conditional on not MPS?
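A sketch of what that conditional could look like (the helper name is illustrative, not code from any branch):

```python
import torch

def attn_softmax(sim: torch.Tensor) -> torch.Tensor:
    """Single softmax on MPS (where the split version measured ~2x slower),
    sliced softmax elsewhere to keep peak VRAM down."""
    if sim.device.type == 'mps':
        return sim.softmax(dim=-1)
    sim[4:] = sim[4:].softmax(dim=-1)
    sim[:4] = sim[:4].softmax(dim=-1)
    return sim
```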

@veprogames
Contributor

veprogames commented Sep 4, 2022

test run:

751283a (main) EDIT: corrected hash
"test" -s50 -W512 -H512 -C7.5 -Ak_lms -S42
00:15<00:00, 3.30it/s
Max VRAM used for this generation: 4.44G

89a7622 (refactoring-simplet2i)
"test" -s50 -W512 -H512 -C7.5 -Ak_lms -S42
00:13<00:00, 3.77it/s
Max VRAM used for this generation: 3.61G

XOR of both images gave a pitch black result -> seems to be no difference

maybe this helps. Inference even seems to be a little faster in this test run.
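For anyone who wants to repeat the pixel-exactness check, something like this does it (the file names are placeholders):

```python
import numpy as np
from PIL import Image

a = np.asarray(Image.open("main_000042.png"))
b = np.asarray(Image.open("refactor_000042.png"))
# XOR of identical uint8 images is all zeros, i.e. a pitch-black image
print("identical" if not np.bitwise_xor(a, b).any() else "differs")
```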

@lstein
Collaborator

lstein commented Sep 4, 2022

CUDA platform? What hardware?

test run:

4406fd1 (main) "test" -s50 -W512 -H512 -C7.5 -Ak_lms -S42 00:15<00:00, 3.30it/s Max VRAM used for this generation: 4.44G

89a7622 (refactoring-simplet2i) "test" -s50 -W512 -H512 -C7.5 -Ak_lms -S42 00:13<00:00, 3.77it/s Max VRAM used for this generation: 3.61G

XOR of both images gave a pitch black result -> seems to be no difference

maybe this helps. Inference even seems to be a little faster in this test run.

@veprogames
Contributor

Windows 10, NVIDIA GeForce RTX 2060 SUPER 8GB, CUDA

if there's info missing I'll edit it in

@blessedcoolant
Collaborator

I made a new local PR with just the changes to attention.py. There are definite memory improvements, but nothing as drastic as what the PR claims.

Here are some of my test results after extensive testing - RTX 3080 8GB Card.

Base Repo:

  • Max Possible Resolution: 512x768
  • Max VRAM Usage: 7.11GB

Updated attention.py

  • Max Possible Resolution: 576x768
  • Max VRAM Usage: 6.91GB

For a 512x768 image, the updated repo consumes 5.94GB of memory. That's approximately an 18% memory saving.


I saw no difference in performance or inference time when using a single or twin softmax. On CUDA, the difference seems to be negligible if there is any.


tl;dr -- Just the attention.py changes are giving an approximate 18% VRAM saving.

@Vargol
Contributor

Vargol commented Sep 4, 2022

I think it's a side effect of the unified memory architecture: it looks okay at 256x256, when the whole Python process image fits in memory, but as soon as swapping kicks in I get half the speed compared to the original code or a single softmax.

@bmaltais

bmaltais commented Sep 4, 2022

I just tried regenerating an image from the current development branch and from the new refactor branch, and the results are totally different... Not sure if it is the memory-saving feature doing it or something else. Is there a switch to activate/deactivate the memory saving?

Would be nice to isolate whether the difference is related to that or to something else. Running on an RTX 3060.

EDIT:

OK... for some reason the picture I got when running the command the 1st time is different from the result when running the prompt logged in the prompt log file... Strange... but using the log file prompt on both the dev and the refactoring does indeed produce the same result with much less VRAM usage...

I will try to reproduce the variation... This might be an issue with the variation code base not producing consistent result on 1st run vs reruns from logs.

EDIT 2: I tracked the issue with the different outputs... it was a PEBKAC... I pasted the file name and directory info in front of the prompt in the log... this is why it resulted in a different output... so all good, it was my error.

So as far as I can see, the memory optimisation has no side effect on the time or quality of image generation.

6.4G on the dev branch vs 4.84G on the refactoring branch... so a 34% memory usage reduction and exactly the same run time.

@thelemuet

thelemuet commented Sep 4, 2022

I did not do extensive testing to compare generation times but so far I have gotten the same exact results visually when comparing to images I've generated yesterday on main with same prompt/seeds.

And I can crank up the resolution from a max of 576x576 up to 640x704, using 6.27G on my RTX 2070.

Last time I tried basujindal's, I could manage 704x768 but it was very slow. However, if they implemented this PR + their original optimization, and it uses even less memory than it used to, I can imagine doing even higher resolutions. Very impressive.

@cvar66

cvar66 commented Sep 4, 2022

The basujindal fork is very slow. I would take a 20% memory improvement with no speed hit over 35% that is much slower any day.

@bmaltais

bmaltais commented Sep 4, 2022

The basujindal fork is very slow. I would take a 20% memory improvement with no speed hit over 35% that is much slower any day.

On my 3060 with 12GB VRAM I am seeing a 34% memory improvement... so this is pretty great. That is when generating 512x704 images.

@Ratinod

Ratinod commented Sep 4, 2022

768x2048 or 1216x1216 on 8 GB VRAM (neonsecret/stable-diffusion). Incredible.
1024x1024 on 8 GB VRAM (and maybe even more)

My mistake... I believed what was written before checking...

Update: It works!

@ryudrigo

ryudrigo commented Sep 8, 2022

I think it's possible to incorporate the best of both methods. I've been working on adapting Doggettx's dynamic threshold. Will update the branch as soon as I can

@lstein
Collaborator

lstein commented Sep 8, 2022

@Doggettx 's dynamic threshold works great, but has the issue that it makes calls to torch.cuda's memory stats functions. These aren't supported on Apple M1 hardware or other non-CUDA devices. See Issue #431
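For context, the gist of the dynamic-threshold approach is to pick the number of attention slices from how much VRAM is actually free. The sketch below is only an illustration of that idea (not @Doggettx 's code); the CUDA memory-stats calls in it are exactly what is unavailable on MPS, hence the fallback branch:

```python
import torch

def attention_slices(sim_bytes: int, device: torch.device, fallback: int = 8) -> int:
    """Choose how many slices to split the attention matmul into, based on free VRAM."""
    if device.type != 'cuda':
        return fallback                     # no CUDA memory stats on MPS / CPU
    stats = torch.cuda.memory_stats(device)
    reserved = stats['reserved_bytes.all.current']
    total = torch.cuda.get_device_properties(device).total_memory
    free = total - reserved
    slices = 1
    while sim_bytes / slices > free * 0.5 and slices < 64:   # keep ~2x headroom
        slices *= 2
    return slices
```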

@Any-Winter-4079
Contributor

Any-Winter-4079 commented Sep 8, 2022

I'm porting @Doggettx optimizations to M1 and it looks promising :)
Initial results.

Development branch

Size      | Time for 3 images (s) | Peak RAM Used (GB)
----------|-----------------------|-------------------
512x512   | 89.33                 | 15.83
640x448   | 102.43                | 17.44
704x512   | 132.35                | 20.74
832x768   | 312.43                | 29.15
896x896   | X                     | X
1024x768  | X                     | X
1024x1024 | X                     | X

Doggettx-optimizations branch (M1 workarounds)

Size      | Time for 3 images (s) | Peak RAM Used (GB)
----------|-----------------------|-------------------
512x512   | 95.24                 | 15.15
640x448   | 108.42                | 19.00
704x512   | 135.68                | 23.26
832x768   | 345.68                | 38.52
896x896   | 520.39                | 38.29
1024x768  | 464.30                | 21.36
1024x1024 | X                     | X

X means failed test.
Overall, I got 2 more image sizes to work using Doggettx's code, and I'm hopeful the performance drop (6-10%) can be recovered. It's still a very basic adaptation for M1. I'll add it to #431 in case someone wants to test it and will try to improve it.

@lstein
Collaborator

lstein commented Sep 8, 2022

Great! I want to make a release soon and the choice of optimization(s) is a key milestone.

@willcohen

After reviewing the back and forth on this thread, I do have one clarifying question:
Is it correct that none of the branches on this fork under discussion work on a 4GB card, and that the @basujindal fork's optimized scripts are the primary way to go there?

I've very much appreciated the set of optimizations and additional functionality provided by dream and the rest of this fork on my personal M2 machine -- having some option to deploy this superset of functionality (even if with reduced speed) on 4GB cards would be quite helpful too. At my place of employment, most of my colleagues and I are on 4GB GPUs which are clearly not primarily built for ML, but this particular use case is still right up many of our alleys.

@JohnAlcatraz

JohnAlcatraz commented Sep 8, 2022

  1. It looks like the 'ryudrigo-optimizations' branch uses significantly more VRAM at high resolutions, causing it to run out of memory much sooner than the 'doggettx-optimizations' branch. The 'doggettx-optimizations' branch can generate resolutions where the 'ryudrigo-optimizations' branch already is OOM, so the winner regarding VRAM usage is the 'doggettx-optimizations' branch.
  2. It also looks like the 'ryudrigo-optimizations', at least on some hardware, is significantly slower than the 'doggettx-optimizations' branch, especially at lower resolutions. I have not yet seen any data that shows that the 'ryudrigo-optimizations' branch is faster than the 'doggettx-optimizations' branch anywhere. So in this regard, the 'doggettx-optimizations' branch is also the clear winner.
  3. At some resolutions on some hardware, the 'ryudrigo-optimizations' branch manages the same speed as the 'doggettx-optimizations' branch while using less VRAM, but there is no real benefit to that: it does not translate into being able to generate higher resolutions; it's actually the opposite.

So overall, the 'doggettx-optimizations' branch seems to be superior in every regard.

@JohnAlcatraz

After reviewing the back and forth on this thread, I do have one clarifying question: Is it correct that none of the branches on this fork under discussion work on a 4GB card, and that the @basujindal fork's optimized scripts are the primary way to go there?

I don't have a 4 GB GPU to test that, but my guess would be that the 'doggettx-optimizations' branch works fine on 4 GB GPUs.

@willcohen

I thought it might too, but with a Quadro P1000 (4GB), I get a CUDA OOM using doggettx-optimizations, ryudrigo-optimizations, and development on Windows 11 via conda. The optimizedSD fork works, though.

@lstein
Collaborator

lstein commented Sep 8, 2022

I thought it might too, but with a Quadro P1000 (4GB), I get a CUDA OOM using doggettx-optimizations, ryudrigo-optimizations, and development on Windows 11 via conda. The optimizedSD fork works, though.

All the memory optimizations have been directed at reducing memory requirements during image generation. Loading the model itself, which happens during initialization, requires 4.2 GB on its own. So 4GB cards are currently not supported by this fork. Sorry!

@willcohen

Right, of course. Thanks!

@lstein
Collaborator

lstein commented Sep 8, 2022

I should add that the basujindal fork does reduce the memory used during model loading and allows you to run SD on 4 GB cards.

@lkewis

lkewis commented Sep 8, 2022

Have tested out these branches and just want to say the work being done here is amazing, both for people with less VRAM and for those like myself who can squeeze out a larger res. Though I notice larger res is not always better, since the model is still effectively 512x512; there's definitely a point where you hit tradeoffs.

Just wondered if these changes would / could pass through into the upstitching methods like txt2imghd allowing for larger sized chunks to be processed during upscaling and detailing?

@JohnAlcatraz

Though I notice larger res is not always better since the model is still effectively 512x512, there's definitely a point where you hit tradeoffs.

Take a look at what I explained here, that way you can generate at high resolutions without tradeoffs: #364 (comment)

@lstein
Collaborator

lstein commented Sep 9, 2022

I'm not sure how etiquette goes in these cases.

What @tildebyte said. The goal is to make the repo better and better. The whole idea of open source is to allow everyone to contribute so we can end up with the best experience. Some discussions might lead to debates but as long as we keep it civil, reasonable and reach an amicable understanding, I think we're all good.

No code is perfect. Everything can be bettered in some way or another.

Just piping in here to second what @tildebyte and @blessedcoolant said. 99% of this repository is other people's code, and I and my collaborators are just trying to pull together the best innovations that are out there to create a stable base and a good user experience. All contributions are gratefully accepted, but are subject to review and testing.

@lstein
Collaborator

lstein commented Sep 9, 2022

After reviewing the back and forth on this thread, I do have one clarifying question: Is it correct that none of the branches on this fork under discussion work on a 4GB card, and that the @basujindal fork's optimized scripts are the primary way to go there?

I don't have a 4 GB GPU to test that, but my guess would be that the 'doggettx-optimizations' branch works fine on 4 GB GPUs.

No, I don't think they will. All the optimizations we have been working on affect VRAM usage during image inference and generation. The loading of the model that takes place during initialization takes more than 4 GB (I think it's 4.2, so tantalizingly close!) and will cause an OOM error before you get to the inference prompt on 4 GB cards.

The basujindal optimizations reduce memory requirements at load time, but unfortunately they slow down performance noticeably, and so they haven't been incorporated here.

@lstein
Collaborator

lstein commented Sep 9, 2022

After reviewing the back and forth on this thread, I do have one clarifying question: Is it correct that none of the branches on this fork under discussion work on a 4GB card, and that the @basujindal fork's optimized scripts are the primary way to go there?

I've very much appreciated the set of optimizations and additional functionality provided by dream and the rest of this fork on my personal M2 machine -- having some option to deploy this superset of functionality (even if with reduced speed) on 4GB cards would be quite helpful too. At my place of employment, most of my colleagues and I are on 4GB GPUs which are clearly not primarily built for ML, but this particular use case is still right up many of our alleys.

Personally I don't like the memory->speed tradeoff in optimizedSD. I'm hoping that stable-diffusion-v1.5 will have reduced memory requirements and will run on 4 GB cards out of the box.

@willcohen

That makes a lot of sense, and I think all the testing above bears out that in any situation where the model loads it’s probably the right move.

The only potential thing that might be worth considering (even with a potential sub-4GB later release) is that many of the potential users with 4GB cards may be precisely the people who can’t fully free the full 4GB if it’s their laptop or machine’s sole GPU. A colleague with a small-ish XPS laptop can get the optimizedSD fork to run, but only with full precision and all the optimizations active and only after closing out all other open apps. In an odd way, that kind of user would be especially appreciative of the other UX improvements here — batch queuing a set of commands to run, logging all the output in a clean out file for easier reference given that even low DDIM runs still can take some time, etc.

I guess after this next round of optimization settles here, it might still be worth considering the possibility of including a flag that splits the model for the dream workflow — with all the serious performance caveats documented in a big way — as a singular final fallback for the machines that have no other choice.

Separately — thank you so much to all for all of your work. Watching this develop so rapidly and collaboratively the last few weeks has been absolutely fascinating!

@lstein
Collaborator

lstein commented Sep 9, 2022

Yes, I think we'll get the inference optimization squared away and then can look into the model loading optimizations as a user flag.

Thanks for the kind words! This is a fun project to work on.

@lstein
Collaborator

lstein commented Sep 9, 2022

I'm porting @Doggettx optimizations to M1 and it looks promising :)
Initial results.

@Any-Winter-4079, how's your progress on the M1 port?

@Any-Winter-4079
Contributor

I'm porting @Doggettx optimizations to M1 and it looks promising :)
Initial results.

@Any-Winter-4079, how's your progress on the M1 port?

Check #431 (comment)
It works reasonably well. Speed is on par with the best I've had, and memory is a bit better.
There are more improvements to be made (for a future PR), but those may take a bit more time because there seems to be a bug pertaining to Metal.

@lstein
Collaborator

lstein commented Sep 9, 2022

Thank you everyone for the wonderful work you did benchmarking and debugging the various optimizations. I have chosen the @Doggettx optimizations, with fixes contributed by @Any-Winter-4079 to run correctly on Macintosh M1 hardware. These optimizations have gone into the development branch and will be in the soon-forthcoming (I hope) 1.14 release.

@Doggettx

I made some more minor improvements when running in auto_cast or half mode, which seem to have made it run a lot faster and need much less VRAM. I can even run at 2560x2560 on my 3090 now; I haven't tested higher. Performance has gone up by about 45% at 1024x1024 when running in auto_cast.

Could use some testing though, since I don't know what it does on lower VRAM cards. If someone is willing to try, I've put it in a separate branch at

https://github.com/Doggettx/stable-diffusion/tree/autocast-improvements
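For anyone unfamiliar with what running under auto_cast means here, the general PyTorch mechanism (nothing specific to that branch) looks like this:

```python
import torch
import torch.nn as nn

# A toy stand-in for the model; the point is the pair of context managers.
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)).cuda()
x = torch.randn(8, 64, device='cuda')

with torch.no_grad(), torch.autocast(device_type='cuda', dtype=torch.float16):
    y = model(x)   # matmuls run in fp16, numerically sensitive ops stay in fp32
print(y.dtype)     # torch.float16
```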

@lstein reopened this Sep 10, 2022
@lstein
Collaborator

lstein commented Sep 10, 2022

Just a heads up. Mac users who have been testing on the release candidate (which contains the previous set of @Doggettx optimizations) are reporting a 2-3x decrease in speed on M1 hardware. This needs to be fixed before we announce a release, unfortunately.

@lstein unpinned this issue Sep 11, 2022
@mh-dm
Contributor

mh-dm commented Sep 17, 2022

With the latest performance changes (#540, #582, #569, #495, and #653) it's possible to generate standard 512x512 images faster than ever (e.g. 2.0 it/s, or 25.97 s for a full 50-step run), or larger and larger images, up to 1532x1280 on a video card with 8 GB.
Closing this issue.
