Textual Inversion Training on M1 (works!) #517
I started working with the training functionality last night as well and ran into problems on CUDA. The textual inversion modifications to ddpm.py seem to have adversely affected vanilla training, and we'll have to do a careful comparison with the original CompVis implementation in order to isolate the conflicts. @tmm1, have you tried main.py on M1 using any of the other (multitudinous) forks? If so, any success? |
If there's a fork that advertises M1 training support I would be happy to try it. I have not seen one, but I have not looked much either. My understanding was that most of the M1 work was happening here. |
@Any-Winter-4079, when you've finished the latest round of code tweaking, could you have a look at training? It seems to be messed up on M1. |
I made some progress today and was able to get through all the setup and start training, with commands being sent to the mps backend: development...tmm1:dev-train-m1 Currently stuck here:
I'm looking to see if there's a way to turn off the |
Made some more progress by changing strategy from ddp to dp (https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html) However, it seems
EDIT: Found solution in Lightning-AI/pytorch-lightning#10315 |
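For anyone following along, the change being described here boils down to handing Lightning a different strategy string. A rough sketch of the idea (the exact placement in main.py is an assumption, not copied from the linked branch):
# assumed placement: near where main.py builds the Trainer from trainer_opt
trainer_opt.strategy = 'dp'   # single-process DataParallel instead of multi-process 'ddp'
trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)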
It is training!
I recall tho others saying training was broken on CUDA too in this fork, so I'm not sure if this is actually working or just appearing to. But a lot of the blockers are solved and we can get into the guts of the impl now. EDIT: I am not seeing any of the warnings mentioned on the CUDA thread (related to batch_size) |
Died at the end. Everything turned to nan at some point, I don't know if that's a bad sign. I will try to make
|
Hmm, got past the last error but hit a new one now:
|
Okay, now it's able to move on to epoch 1!
|
I have a |
I started fresh and by epoch 2 everything turns to nan. I think that is causing the black images?
|
when I encountered black images with the k-diffusion sampler, it was due to this problem (with ±Inf): the fix was just to detach and clone the tensor. If you're having NaN (rather than ±Inf), maybe that's unrelated. I recommend narrowing down which line first introduces NaN. You can use this check to do so: mycooltensor.isnan().any()
# returns a boolean |
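A small self-contained sketch of that suggestion (the helper name and the example call sites are made up for illustration):
import torch

def assert_finite(t: torch.Tensor, label: str) -> torch.Tensor:
    # raise as soon as a tensor stops being finite, naming where it happened
    if t.isnan().any():
        raise RuntimeError(f'first NaN appeared at: {label}')
    if t.isinf().any():
        raise RuntimeError(f'first +/-Inf appeared at: {label}')
    return t

# sprinkle through the forward pass you are debugging, e.g.:
# h = assert_finite(self.attn(h), 'attention output')
# h = assert_finite(self.ff(h), 'feed-forward output')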
Thanks @Birch-san! I see this warning at the start of training which may be related.
|
This looks interesting. I will have a look. |
Thanks @Any-Winter-4079! You could use the ugly-sonic training samples along with the instructions in TEXTUAL_INVERSION.md. I am going to try |
Caught something:
|
This is all very new to me, but if I'm interpreting the output correctly it seems to suggest the gradients from einsum are nan. Important bits:
This includes the changes merged into development this morning. |
@lstein What sort of problems did you run into on CUDA? I wonder if you can try this and see if any anomalies are detected?
diff --git a/main.py b/main.py
index c45194d..57c8832 100644
--- a/main.py
+++ b/main.py
@@ -864,6 +864,7 @@ if __name__ == '__main__':
]
trainer_kwargs['max_steps'] = trainer_opt.max_steps
+ trainer_opt.detect_anomaly = True
trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)
trainer.logdir = logdir ###
|
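The same anomaly detection can also be switched on directly in PyTorch for quick experiments outside of main.py. A sketch, where the training call is a hypothetical stand-in:
import torch

# anomaly mode makes autograd raise the moment a backward pass produces NaN,
# and reports which forward op created the offending value
with torch.autograd.detect_anomaly():
    loss = model.training_step(batch, batch_idx)  # hypothetical call
    loss.backward()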
I switched. I don't have a CUDA setup to test with, so maybe nan at this step is expected. This happens right away for me, whereas loss was fine for a few hundred steps before, so there must be a different anomaly later on. |
I'm actually getting a core dump at a step that says "validating". I'm trying to get IT to install gdb on the cluster so that I can do a stack trace, not that it will be very helpful.
Lincoln
|
Maybe this is another opportunity to try replacing einsum with matmul? Birch-san/stable-diffusion@d2d533d It's like 30% slower, but it might do something different regarding NaN? Context: |
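For reference, a minimal sketch of what the einsum-to-matmul swap looks like for the attention scores (shapes follow the usual CrossAttention layout; this is not a copy of the linked commit):
import torch

def scores_einsum(q, k, scale):
    return torch.einsum('bid,bjd->bij', q, k) * scale

def scores_matmul(q, k, scale):
    # q @ k^T yields the same [batch, q_tokens, k_tokens] score matrix
    return torch.matmul(q, k.transpose(-2, -1)) * scale

q = torch.randn(2, 64, 40)   # [batch*heads, tokens, dim_head], illustrative sizes
k = torch.randn(2, 77, 40)
assert torch.allclose(scores_einsum(q, k, 0.158), scores_matmul(q, k, 0.158), atol=1e-5)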
Good idea! But it just failed in the same way, so either nan is normal or something else bigger is the problem.
|
If it's about replacing |
@tmm1 Have you encountered this error? |
Hm no I didn't see that one. |
Well, for anyone that encounters the issue, it's fixed with |
wow, that's cool. yeah, it's just a drop-in replacement: unfortunately I'm finding opt_einsum to be about 30x slower on MPS. 8 steps inference: |
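For anyone wanting to try the same experiment, a sketch of what "drop-in replacement" means here (with the MPS slowdown noted above as the caveat):
import torch
from opt_einsum import contract   # pip install opt_einsum

q = torch.randn(2, 64, 40)
k = torch.randn(2, 77, 40)

# same subscript string, same result; only the call changes
ref = torch.einsum('bid,bjd->bij', q, k)
alt = contract('bid,bjd->bij', q, k)
assert torch.allclose(ref, alt, atol=1e-5)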
Yes I used |
@Birch-san by the way, if you got it to work, there is this other version (Dreambooth) you may be interested in: XavierXiao/Dreambooth-Stable-Diffusion#4. They claim even better results than with regular TI. Also, 256x256 sounds interesting/promising (4x promise), but there is an issue where 2 characters sometimes appear on the sample images. |
@Any-Winter-4079 Excellent results and information about your experiments! This is all really helpful for me, and I'll continue to test Textual Inversion (also in combination with prior training in Dreambooth, which seems to require less overall training). As a direct comparison, these were my results with the same training data using Dreambooth. The results are far easier to stylise and don't require prompt weighting, though one thing I noticed about this particular training is that it has an exact likeness in the face but rarely produces the dreadlock hairstyle. Unless I specify 'with short dreadlocks' in the prompt, then it produces the dreadlocks and bleached tips, but it never fully gets the exact hair style (depending on the artistic modifiers also used in the prompt; these examples were both Norman Rockwell for comparison) |
@Birch-san I've tried using your embeddings, but they don't seem to generate plush dolls.
I assume it's due to changes in your |
It would be great if that fixes my core dump as well. I'll give it a try. Meanwhile, I just got a version of Python installed with debugging symbols, so hopefully I'll be able to do a backtrace.
Lincoln
…On Mon, Sep 26, 2022 at 9:38 PM Aman Gupta Karmani wrote:
My segfault went away with this change from @Any-Winter-4079's branch:
--- a/ldm/modules/embedding_manager.py
+++ b/ldm/modules/embedding_manager.py
@@ -170,7 +170,7 @@ class EmbeddingManager(nn.Module):
             )
             placeholder_rows, placeholder_cols = torch.where(
-                tokenized_text == placeholder_token.to(device)
+                tokenized_text == placeholder_token#.to(device)
             )
             if placeholder_rows.nelement() == 0:
|
my k-diffusion integration is just for the logger used during training. I think the explanation is simpler: my embedding is weak. I also used it could explain why it turns everyone into pillows when I use it on waifu-diffusion: (this was "photo of Reimu * plush doll") |
By the way, is training deterministic? I aborted training, started over, and I'm getting the same images in my train/val folders. Maybe it's because of our Same value for |
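A guess at the determinism question: if the training entry point seeds everything up front, dataset shuffling and crops will repeat across runs. A sketch (the seed value is illustrative):
from pytorch_lightning import seed_everything

# a fixed seed makes the Python, NumPy and torch RNGs reproducible, so an
# aborted and restarted run draws the same train/val samples
seed_everything(23)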
Hello @tmm1, you said that you faced this error and solved it. How did you fix it? Care to elaborate? I tried using different flags to use the CPU, but it didn't fix it for me. Is there another method? Maybe I should change the package versions or something? I have an Apple M1 Max. |
I wonder if the model learns better/worse with e.g. waifu diffusion. |
Update to say I've had my best success so far training with images of myself. |
I tried training with different repos: this one, the Mac-optimized one by Birch-san, web-ui by AUTOMATIC1111 (currently broken for training), and DreamBoothMac by SujeethJinesh. I tried applying the optimizations mentioned in this thread (most of them, as I understand, are already implemented in the code), but every time, after the 1st iteration, the loss becomes nan. Updated to the latest pytorch-nightly. No idea where to dig next. My last bet is to upgrade macOS to Ventura in the hope of changes in the Metal backend, since I'm currently on an old 12.3.1. |
@remixer-dec try the previous PyTorch stable, 1.12.1. There's a bug in PyTorch 1.13 stable which means autograd returns NaN gradients. I'm nearly done making a minimal repro of it. I can reproduce it using just the autoencoder's decoder. IIRC detect_anomaly said the NaN comes from NativeGroupNorm. This also breaks CLIP guidance on Mac. |
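Not the minimal repro mentioned above, just a sketch of the kind of probe that points the finger at a particular layer: run a bare GroupNorm forward/backward on the mps device and look for NaN in the gradients.
import torch

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
gn = torch.nn.GroupNorm(num_groups=32, num_channels=128).to(device)
x = torch.randn(1, 128, 64, 64, device=device, requires_grad=True)

gn(x).sum().backward()
print('weight grad has NaN:', gn.weight.grad.isnan().any().item())
print('input grad has NaN:', x.grad.isnan().any().item())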
My best Dreambooth result has been with https://colab.research.google.com/drive/1-HIbslQd7Ei_mAt25ipqSUMvbe3POm98?usp=sharing#scrollTo=CnBAZ4eje2Sl (in case it's helpful). I will try out https://github.com/SujeethJinesh/DreamBoothMac. Thanks! If you get nan, you may need:
And |
@Birch-san thank you! It actually worked! You are a legend! |
@Any-Winter-4079 be careful with the LastBen ones; he turned off class + regularisation to make training faster in some of them, and in our testing it caused dogs to have human eyes lol. You should be doing regularisation unless you don't care about using the model like a regular SD one. |
if prior preservation loss / coarse classes / regularisation are removed… is it still Dreambooth? is there any way in which it differs from regular finetuning at that point? |
@remixer-dec I got to the bottom of the problem of autograd returning NaN gradients on 1.13.0. Minimal repro here: |
@Birch-san A lot more people are going down the image + caption route now for multi-trained models, and it is basically just full fine tuning at that point, albeit in smaller quantities |
Lots of people, including myself, have had trouble getting textual inversion/fine-tuning working with InvokeAI. If someone has it working reliably, could you share the recipe so that I can update the docs?
Lincoln
|
The notebook I shared has prior preservation and regularisation, I think, and thanks to Joe Penna and these other contributors it also has images available to be used (instead of creating them yourself, or allowing them to be created on the spot...). All of this as far as I know, of course. I'm not an expert :) In any case, I tested 1-2 days ago. I'll re-test and fork TheLastBen's repo (if it still works), if there is fear of it changing. |
WIP HERE: development...tmm1:dev-train-m1
I started experimenting with running main.py on M1 and wanted to document some immediate issues.
Looks like we need a newer pytorch-lightning for MPS. Currently using 1.6.5 but latest is 1.7.5
However bumping it causes this error:
which is because TestTubeLogger was deprecated: Lightning-AI/pytorch-lightning#13958 (comment)
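One possible way around that (an assumption about the fix, not necessarily what the branch does) is to hand the Trainer one of the loggers newer Lightning still ships, with trainer_kwargs being the dict main.py already builds:
from pytorch_lightning.loggers import CSVLogger

# assumed substitution for the removed TestTubeLogger
trainer_kwargs['logger'] = CSVLogger(save_dir='logs', name='textual-inversion')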