
Not freeing RAM when changing between checkpoints #2180

Closed
FerrahWolfeh opened this issue Oct 10, 2022 · 18 comments
Labels
bug-report Report of a bug, yet to be confirmed

Comments

@FerrahWolfeh

Describe the bug
When you start the webui with checkpoint X, it fills system RAM to a certain amount. If you then change the checkpoint to Y in the webui, RAM usage increases as if both models were loaded. If you change back to checkpoint X, RAM usage remains unusually high and the system begins to swap violently as soon as you start generating images.

To Reproduce
Steps to reproduce the behavior:

  1. Start webui.sh with any model (e.g. Waifu-Diffusion 1.3) and measure system RAM once startup finishes
  2. Use the selector at the top of the page to change to another model (e.g. Stable-Diffusion 1.4)
  3. Change back to the first model and check RAM again.

Expected behavior
As soon as you switch checkpoints, the program should free most of the memory used by the currently loaded model and fill it with the newly selected model. When you switch back to the first checkpoint, the memory should again be freed, returning to roughly the same level the program had when it was freshly started.
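To make those levels easy to compare, here is a minimal measurement sketch (assuming psutil is installed; the PID is just a placeholder) for sampling the webui process's resident memory around each switch:

```python
# Minimal sketch, assuming psutil is installed: sample the webui process's
# resident set size so checkpoint switches can be compared numerically.
import psutil

def rss_gib(pid: int) -> float:
    """Resident memory of the given process, in GiB."""
    return psutil.Process(pid).memory_info().rss / 1024**3

# Replace 12345 with the webui PID; call this before the switch, after
# switching to model Y, and after switching back to model X.
print(f"webui RSS: {rss_gib(12345):.2f} GiB")
```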

Screenshots
Here are some screenshots of the memory usage of my system (notice the used and available columns)

[Screenshot] Base system usage (only firefox open with some active youtube tabs)

[Screenshot] Usage right after initialization

[Screenshot] Usage after switching to another checkpoint

[Screenshot] Usage after switching back to first model

Desktop (please complete the following information):

  • OS: Arch Linux 5.19.13
  • Browser: Firefox
  • Commit revision ce37fdd

Additional context
This is most visible on a system that doesn't have much RAM to begin with (16 GB in my case), and the effects are visible even without generating anything. It gets worse if you start switching checkpoints between generations.

@FerrahWolfeh FerrahWolfeh added the bug-report Report of a bug, yet to be confirmed label Oct 10, 2022
@bmaltais

bmaltais commented Oct 10, 2022

This probably explains why loading a new model usually crashes the webui after 7 or 8 model swaps on my system with 16 GB of RAM allocated to WSL2.

@CoffeeMomoPad

Experiencing the same thing on 16 GB of RAM; it never happened before until now.

@TechOtakupoi233

When loading a new ckpt, the program starts loading it but leaves the old one in VRAM and RAM. I have only 6 GB of VRAM, which can't hold two models at once. It would be nice if the program freed up VRAM and RAM BEFORE loading a new ckpt, roughly along the lines of the sketch below.
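As a hypothetical illustration only (placeholder names, not the webui's actual functions), the general pattern would be:

```python
# Hypothetical sketch of releasing the current checkpoint before loading the
# next one; load_checkpoint() is a placeholder, not a real webui function.
import gc
import torch

def swap_checkpoint(old_model, ckpt_path, device="cuda"):
    old_model.to("cpu")       # get the weights off the GPU first
    del old_model             # drop our reference to the old model
    gc.collect()              # only frees it if nothing else still references it
    torch.cuda.empty_cache()  # hand cached VRAM blocks back to the driver

    new_model = load_checkpoint(ckpt_path)  # placeholder loader
    return new_model.to(device)
```

This only helps if nothing else in the program keeps a reference to the old model, which is exactly what later comments in this thread suggest is going wrong.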

@nerdyrodent

Same here. Switching models uses more and more RAM. I've tried changing "Checkpoints to cache in RAM", but it appears to make no difference.

@anonymous721

It's causing me a lot of annoyance too. 15-20 minutes testing a couple different models, and I'm already over 20GB of system RAM used.

@RandomLegend

This is a serious issue for me now.
I have 16 GB of RAM and I never had any issues switching between models before.

I ran an old version of the webui perfectly fine, upgraded to the newest git because of new features, and now I cannot even swap a model once. It will just crash violently.

@GeorgiaM-honestly

Hello,

I'm trying to replicate this by using, as your screenshots show, 15 GiB of RAM. My setup differs: the bare-metal OS is Gentoo Linux, then I'm using a QEMU VM with Devuan (Debian without systemd), and within that, the webui (auto) running inside a Docker container. Hopefully that added complexity doesn't screw my testing up.

And yes, you are seeing that correctly: I don't have swap. I didn't bother, because this stuff lives on a host that has 128 GB of RAM and I can just dial in whatever I want to give to the VM.

The formatting here is getting completely hosed, I'm not sure what is going on, sorry about that.

After the initial start and before visiting the UI, which here loads the standard 1.5 model (v1-5-pruned-emaonly.ckpt | 81761151):

GiB:

              total        used        free      shared  buff/cache   available
Mem:             14           5           4           0           4           8
Swap:             0           0           0

After visiting the UI and switching the model to the standard v1.4 ( 7460a6fa ):

GiB:

              total        used        free      shared  buff/cache   available
Mem:             14           8           1           0           4           5
Swap:             0           0           0

After switching back to the standard 1.5 model ( v1-5-pruned-emaonly.ckpt | 81761151 ):

GiB:

              total        used        free      shared  buff/cache   available
Mem:             14           8           1           0           4           5
Swap:             0           0           0

As such, I am not able to replicate this. Please let me know if I missed something, or if you'd like me to try something else! You can also look into zram / compressed RAM on Linux; it is a handy and tunable set of options that begins to compress the oldest RAM contents (gently, and more heavily if resources continue to run out) with the goal of delaying when the very slow swap space is used.

@0xdevalias

0xdevalias commented Nov 9, 2022

The formatting here is getting completely hosed, I'm not sure what is going on, sorry about that.

@GeorgiaM-honestly Have you wrapped it in triple backticks to make it a code block? (```)


Random thought/musing, and I'm not sure if this actually relates to how things are done in the code at all, but is the model ckpt hash used for caching it anywhere (or was it at some point in the past)? I know there are some other issues here (can't remember the links off the top of my head; see link below) that were talking about different model ckpts that had the same hash even though they were different files. I'm wondering if switching back and forth between models with that 'hash clash' might somehow be causing this memory leak.
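For context on why a clash is even possible: if I remember right, the short hash is computed from only a small slice of the checkpoint file, roughly like the sketch below (paraphrased from memory, not the exact webui code):

```python
# Rough, from-memory paraphrase of the old short-hash scheme (not the exact
# webui code): only a 64 KiB slice of the file is hashed, so two different
# checkpoints can plausibly end up with the same 8-character hash.
import hashlib

def short_model_hash(filename: str) -> str:
    with open(filename, "rb") as f:
        f.seek(0x100000)         # skip the first 1 MiB
        chunk = f.read(0x10000)  # hash only the next 64 KiB
    return hashlib.sha256(chunk).hexdigest()[:8]
```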

Edit:

This one:


I ran an old version of the webui perfectly fine, upgraded to the newest git because of new features, and now I cannot even swap a model once. It will just crash violently.

@RandomLegend Does this happen when swapping between any models at all? Are you able to provide the model hashes for some of the models that cause it to crash? Do they happen to have the same hash, as per my theory above, by chance?


Also, this is a separate issue, but I saw it linked here, and wanted to backlink to it in case it's relevant:

And this one may also be related:

Changing to an inpainting model calls load_model() and creates a new model, but the previous model is not removed from memory; even calling gc.collect() does not remove the old model from memory.

So if you keep changing from inpainting to non-inpainting or vice versa, the leak keeps increasing.

Originally posted by @jn-jairo in #3449 (comment)

The fact that gc.collect() doesn't clear the old model is interesting however. This means that something is keeping a pointer to the old model alive and preventing it from being cleaned up.

Originally posted by @random-thoughtss in #3449 (comment)

Just to report the progress I made: it is indeed a reference problem. Some places keep a reference to the model, which prevents the garbage collector from freeing the memory.

I am checking it with ctypes.c_long.from_address(id(shared.sd_model)).value and there are multiple references.

I am eliminating the references, but there are still some left to find; it will take a while to find everything.

Originally posted by @jn-jairo in #3449 (comment)
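For anyone else digging into this, the two checks described in the quotes above can be combined into a small debugging snippet (standard ctypes and gc calls; shared.sd_model is the webui's global model reference mentioned above):

```python
# Debugging helpers for the reference problem described above: read CPython's
# internal refcount for the loaded model and list the objects still holding it.
import ctypes
import gc

def refcount(obj) -> int:
    # Same trick as quoted above: read ob_refcnt straight from the object header.
    return ctypes.c_long.from_address(id(obj)).value

def who_holds(obj):
    # Frames, dicts and lists that still reference obj show up here; anything
    # unexpected is a candidate for the leak.
    return gc.get_referrers(obj)

# Inside the webui process, for example:
#   refcount(shared.sd_model)
#   who_holds(shared.sd_model)
```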

@0xdevalias

0xdevalias commented Nov 9, 2022

Looking at the 'references timeline' on #3449 also pointed me to this PR by @jn-jairo that was merged ~9 days ago:

@GeorgiaM-honestly I wonder if that's why you can't replicate the issues here anymore?

@RandomLegend have you updated to a version of the code that has that fix merged, and if so, are you still seeing issues despite it?

@0xdevalias

@0xdevalias when I observed and reported this issue I was on the latest code, yes.

However, I just completely wiped the installation, including the venv and the repos, and reinstalled from scratch. That fixed it. I assume it was some incompatibility with old stuff lying around that wasn't cleared in recent commits.

Originally posted by @RandomLegend in #2264 (comment)

@tzwel

tzwel commented Dec 5, 2022

how to downgrade?

@clementine-of-whitewind

please fix

@Coderx7

Coderx7 commented Jul 30, 2023

I'm having the same issue on the latest commit. I never had this issue before and it just popped up out of nowhere!
I'm on Ubuntu 22.04 with 32 GB of RAM (and no swap), and:

Python revision: 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0]
Dreambooth revision: 9f4d931a319056c537d24669cb950d146d1537b0
SD-WebUI revision: 68f336bd994bed5442ad95bad6b6ad5564a5409a

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.16rc425 installed.
[+] torch version 1.13.1+cu117 installed.
[+] torchvision version 0.14.1+cu117 installed.

Side note: I did install google-perftools and then removed it, thinking it might have something to do with it. Nothing changed.

@catboxanon
Collaborator

catboxanon commented Aug 7, 2023

The dev branch and the upcoming 1.6.0 may have resolved this with the rework in b235022. I'm going to leave this open for the time being, but those who would like to test it earlier can switch to the dev branch to do so.

@Avsynthe

Hey all. I'm having this issue also. I'm using 1.6.0 and it never releases RAM. The more I generate, the higher it goes.

The server went down today and I couldn't figure out why the last snapshot of the system showed 99% of 64 GB of memory used. I realised SD is just compounding away. This happens no matter what model I use, with VAE models increasing it more quickly for obvious reasons. Switching models makes no difference; it just continues on.

I've had to limit SD to 20 GB of RAM, so it'll eventually crash when it hits that.

@Wynneve

Wynneve commented Oct 14, 2023

@Avsynthe Hello there! I've been having the same issue all day and it seems like I found a “solution”.
I've tried switching some settings in the webui, changing my CUDA toolkit version in the PATH, changing the CUDA version of PyTorch, updating to the “dev” branch of the webui, etc. Nothing worked.

Then I realized that I had updated PyTorch before this problem appeared, so I tried downgrading to PyTorch 2.0.1. And it worked! No more memory leak; now it properly offloads the weights from RAM to VRAM and vice versa each generation.

For your convenience, here is the command for installing that previous version of PyTorch:
pip3 install torch==2.0.1 torchvision --index-url https://download.pytorch.org/whl/cu118
As I remember, I deleted it before reinstalling, so if it refuses to downgrade, you can manually remove it before executing the command:
pip3 uninstall torch torchvision
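To confirm which build is actually active afterwards, a quick sanity check using standard PyTorch attributes:

```python
# Quick sanity check after the downgrade; both attributes are standard PyTorch.
import torch
print(torch.__version__)   # expect something like '2.0.1+cu118'
print(torch.version.cuda)  # CUDA version the installed wheel was built against
```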

Seems like it's more an issue with the new PyTorch itself, something related to moving tensors between devices.

If you aren't using Torch 2.1.0, well, my sincere apologies for not helping you :(

@DanielXu123

@Avsynthe Same thing on Linux; it added up to 100 GB of RAM. Are there any possible solutions?

@DanielXu123

@Wynneve I reinstalled torch, going from 2.1.0 to 2.0.1, but now it shows that my xformers cannot be activated correctly. Could you please check what your xformers version is?
