Model Isn't Learning #4

Closed
ExponentialML opened this issue Jan 13, 2023 · 9 comments

Comments

@ExponentialML

I'm using Stable Diffusion 1.5 with torch 1.13.1, CUDA 11.6, and the latest xformers (0.0.16). I cannot build torch 1.12.1 on my machine.
The model won't learn. After every epoch, the sample output still looks like the one from the first iteration.

(Steps 0-500 all look like this.)
[Image: step_0]

@Ericxgao

Same issue here. I'm using Python 3.9.12 with torch 1.12.1 and CUDA 11.6.

@JulianJuaner

Same issue, whether I use this repo or the official repo.
It seems the model weights are not being updated during training.
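One way to test the "not updating" suspicion is to snapshot the trainable weights, run a single training step, and compare. The sketch below is a minimal, framework-free version of that comparison. The function name `changed_parameters` and the flat-list snapshot format are assumptions made for illustration and are not part of this repo.

```python
from typing import Dict, List, Sequence


def changed_parameters(
    before: Dict[str, Sequence[float]],
    after: Dict[str, Sequence[float]],
    tol: float = 0.0,
) -> List[str]:
    """Return the names of parameters that moved by more than `tol`."""
    changed = []
    for name, old in before.items():
        new = after[name]
        # A parameter counts as updated if any element differs beyond `tol`.
        if any(abs(a - b) > tol for a, b in zip(old, new)):
            changed.append(name)
    return changed


# Toy snapshots standing in for real weights (hypothetical names and values).
# With PyTorch, a snapshot could be taken as (assumption, adapt to your script):
#   {n: p.detach().flatten().tolist() for n, p in model.named_parameters()}
before = {"attn.weight": [0.10, -0.20], "attn.bias": [0.00]}
after = {"attn.weight": [0.10, -0.25], "attn.bias": [0.00]}
print(changed_parameters(before, after))  # ['attn.weight']
```

If the returned list is empty after an optimizer step, nothing trainable changed, which would match the symptom reported here. Converting a full UNet to Python lists is slow, so in practice it may be enough to snapshot a handful of layers.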

@ExponentialML
Author

ExponentialML commented Jan 30, 2023

@Ericxgao and @JulianJuaner I found a fix that worked for me. You guys can give it a go and report back.

  1. Install from requirements.txt. Torch version must be 1.12.1 on CUDA 11.3 or 11.6.
    pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
  2. Clone xformers: git clone https://github.com/facebookresearch/xformers
  3. After cloning: cd xformers
  4. Run: git reset --hard 0bad001ddd56c080524d37c84ff58d9cd030ebfd
  5. git submodule update --init --recursive
  6. pip install -e .

After install, try running the script.
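Before rerunning training, it can help to confirm that the environment actually matches the pins from the steps above. The following is a rough sketch using only the standard library. The helper names (`base_version`, `installed_version`, `find_mismatches`) are made up for this example, and the pins are copied from the pip command in step 1.

```python
from importlib import metadata
from typing import Dict, Optional

# Pins taken from the install steps above; adjust if your setup differs.
REQUIRED = {"torch": "1.12.1", "torchvision": "0.13.1", "torchaudio": "0.12.1"}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu116' from a version string."""
    return version.split("+", 1)[0]


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    installed: Dict[str, Optional[str]], required: Dict[str, str]
) -> Dict[str, str]:
    """Map each package that is missing or off-pin to a short description."""
    problems = {}
    for package, wanted in required.items():
        found = installed.get(package)
        if found is None:
            problems[package] = f"not installed (want {wanted})"
        elif base_version(found) != wanted:
            problems[package] = f"found {found} (want {wanted})"
    return problems


if __name__ == "__main__":
    # xformers is only reported, not pinned, since it was built from a commit.
    names = list(REQUIRED) + ["xformers"]
    installed = {name: installed_version(name) for name in names}
    for name, version in installed.items():
        print(f"{name}: {version}")
    for name, problem in find_mismatches(installed, REQUIRED).items():
        print(f"MISMATCH {name}: {problem}")
```

This only reads package metadata and does not import torch, so it will not catch a CUDA build that fails to load. It is meant as a quick sanity check on the pinned versions, nothing more.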

@JulianJuaner

> @Ericxgao and @JulianJuaner I found a fix that worked for me. You guys can give it a go and report back.
>
>   1. Install from requirements.txt. Torch version must be 1.12.1 on CUDA 11.3.
>     pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
>   2. Clone xformers: git clone https://github.com/facebookresearch/xformers
>   3. After cloning: cd xformers
>   4. Run: git reset --hard 0bad001ddd56c080524d37c84ff58d9cd030ebfd
>   5. git submodule update --init --recursive
>   6. pip install -e .
>
> After install, try running the script.

Thanks! It works for me. It seems the version of xformers is essential.

@Ericxgao

Hmm, I'm still having trouble getting this version of xformers installed. What GPU and Python version are you using, @ExponentialML @JulianJuaner? I'm using a cloud A100.

@ExponentialML
Author

@Ericxgao If you're using an A100, you should be able to fit the model in 40GB of VRAM when training, so xformers shouldn't be needed. Is this not the case?

@Ericxgao

I still get OOM errors. I disabled 8-bit Adam because that was also failing on my system (bitsandbytes doesn't seem to install properly).

@mili-inch

facebookresearch/xformers#631
This is probably caused by the same problem as the xformers issue linked above. That issue has since been resolved, and the model was trained successfully using xFormers 0.0.17.dev441.

@ExponentialML
Author

Closing with the solutions from me and @mili-inch. If there are any other issues, feel free to ask for this to be reopened.


4 participants