Model Isn't Learning #4

ExponentialML · 2023-01-13T23:21:14Z

Using Stable Diffusion 1.5 on torch 1.13.1, CUDA 11.6, and the latest version of xformers (0.0.16). I cannot build torch 1.12.1 on my machine.
The model won't learn. After every epoch, the output simply looks the same as it did at the first iteration.

(0-500 all look like this)

Ericxgao · 2023-01-20T10:04:59Z

Same issue here. I'm using Python 3.9.12 with torch 1.12.1 and CUDA 11.6.

JulianJuaner · 2023-01-30T08:50:31Z

Same issue. Whether using this repo or the official repo.
It seems the model is not updating during training.
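One way to confirm that the weights really are not updating is to snapshot a few parameters before and after an optimizer step and compare them. The following is only a minimal sketch: the function name `changed_parameters` and the sample parameter names are hypothetical, and the snapshots are plain lists of floats so the check does not depend on any particular framework.

```python
from typing import Dict, List, Sequence


def changed_parameters(
    before: Dict[str, Sequence[float]],
    after: Dict[str, Sequence[float]],
    tol: float = 0.0,
) -> List[str]:
    """Return the names of parameters whose values moved by more than tol."""
    changed = []
    for name, old_values in before.items():
        new_values = after[name]
        if any(abs(new - old) > tol for old, new in zip(old_values, new_values)):
            changed.append(name)
    return changed


# Hypothetical snapshots taken before and after one training step.
before = {"unet.w": [0.10, 0.20], "unet.b": [0.0]}
after_stuck = {"unet.w": [0.10, 0.20], "unet.b": [0.0]}
after_ok = {"unet.w": [0.11, 0.19], "unet.b": [0.0]}

print(changed_parameters(before, after_stuck))
print(changed_parameters(before, after_ok))
```

With PyTorch, a snapshot could be built with something like `{k: v.detach().flatten().tolist() for k, v in model.state_dict().items()}`, ideally restricted to a handful of small layers since copying a full UNet this way is slow. An empty result after a step would support the idea that the update is being lost somewhere.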

ExponentialML · 2023-01-30T09:40:36Z

@Ericxgao and @JulianJuaner I found a fix that worked for me. You guys can give it a go and report back.

1. Install from requirements.txt. Torch version must be 1.12.1 on CUDA 11.3 or 11.6:
   pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
2. Clone xformers: git clone https://github.com/facebookresearch/xformers
3. After cloning: cd xformers
4. Run: git reset --hard 0bad001ddd56c080524d37c84ff58d9cd030ebfd
5. Run: git submodule update --init --recursive
6. Run: pip install -e .

After install, try running the script.
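Before running the script, it may help to confirm that the environment actually ended up with the pinned versions. This is a rough sketch using only the standard library; the `EXPECTED` table and the helper name `version_mismatches` are my own, and the check strips local build tags such as `+cu116` before comparing.

```python
from importlib import metadata

# Versions taken from the pip command above; adjust if you install a different build.
EXPECTED = {"torch": "1.12.1", "torchvision": "0.13.1", "torchaudio": "0.12.1"}


def version_mismatches(expected, get_version=metadata.version):
    """Map each package that is missing or has the wrong version to what was found."""
    problems = {}
    for package, wanted in expected.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        # Drop a local build tag such as "+cu116" before comparing.
        if found.split("+")[0] != wanted:
            problems[package] = found
    return problems


if __name__ == "__main__":
    print(version_mismatches(EXPECTED) or "versions match")
```

The `get_version` parameter is only there so the lookup can be swapped out when testing; in normal use the default reads the installed package metadata.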

JulianJuaner · 2023-01-30T12:20:24Z

> @Ericxgao and @JulianJuaner I found a fix that worked for me. You guys can give it a go and report back.
>
> Install from requirements.txt. Torch version must be 1.12.1 on CUDA 11.3.
> pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
> Clone xformers: git clone https://github.com/facebookresearch/xformers
> After cloning: cd xformers
> Run: git reset --hard 0bad001ddd56c080524d37c84ff58d9cd030ebfd
> git submodule update --init --recursive
> pip install -e .
>
> After install, try running the script.

Thanks! It works for me. It seems the version of xformers is essential.

Ericxgao · 2023-01-30T21:16:22Z

Hmm, I'm still having trouble getting this version of xformers installed. What GPU and Python version are you using, @ExponentialML @JulianJuaner? I'm using a cloud A100.

ExponentialML · 2023-01-30T23:07:26Z

@Ericxgao If you're using an A100, you should be able to fit the model in 40GB of VRAM when training, so xformers shouldn't be needed. Is this not the case?

Ericxgao · 2023-01-30T23:36:08Z

I still get OOM errors. I disabled 8-bit Adam, as that was also failing on my system (bitsandbytes doesn't seem to install properly).
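For setups where bitsandbytes may or may not import cleanly, the optimizer choice can be made at runtime instead of by editing the config. A hedged sketch follows: `pick_optimizer` is a hypothetical helper, not something from this repo, and it only returns the module and class name to use rather than constructing the optimizer.

```python
import importlib


def pick_optimizer(import_module=importlib.import_module):
    """Prefer 8-bit AdamW when bitsandbytes imports, else fall back to torch's AdamW."""
    try:
        import_module("bitsandbytes")
    except Exception:
        # A broken install can fail with errors other than ImportError,
        # so catch broadly and fall back.
        return ("torch.optim", "AdamW")
    return ("bitsandbytes.optim", "AdamW8bit")


module_name, class_name = pick_optimizer()
print(module_name, class_name)
```

Note that the fallback keeps full-precision optimizer state, so it uses more memory than the 8-bit variant and will not help with the OOM by itself.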

mili-inch · 2023-02-05T03:48:10Z

facebookresearch/xformers#631
This is probably caused by the same problem as the issue linked above. That issue has since been resolved, and the model was successfully trained using xFormers version 0.0.17.dev441.
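To check whether an installed xformers build is at least as new as 0.0.17.dev441, the version strings have to be ordered so that dev builds sort before their final release. Below is a small sketch that handles only plain `X.Y.Z` and `X.Y.Z.devN` strings (optionally with a `+local` tag); the names `version_key` and `at_least` are made up for illustration, and the `packaging` library would be the more robust choice if it is available.

```python
import re


def version_key(version):
    """Sort key for 'X.Y.Z' and 'X.Y.Z.devN' strings; dev builds sort before the release."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:\.dev(\d+))?(?:\+.*)?", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    dev = match.group(2)
    # (release, 0, N) for dev builds, (release, 1, 0) for final releases.
    return (release, 0, int(dev)) if dev is not None else (release, 1, 0)


def at_least(installed, minimum="0.0.17.dev441"):
    """True if the installed version is the same as or newer than minimum."""
    return version_key(installed) >= version_key(minimum)


print(at_least("0.0.16"))
print(at_least("0.0.17.dev441"))
```

An older version string is not proof of a broken build, since the pinned commit earlier in this thread also worked; this only tells you whether you are on or past the build reported to work here.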

ExponentialML · 2023-02-05T08:38:30Z

Closing with solutions from me and @mili-inch. If there are any other issues, feel free to ask for a re-open to discuss.

ExponentialML closed this as completed Feb 5, 2023
