Add bf16 support for VAE as a fallback #9295
The only VAE that produces black squares is the VAE from nai/any/k8, just saying. |
The VAE should be converted to bf16 beforehand, I don't think the current implementation is the correct way to go about this because then it wastes time decoding the latent image twice. Also this shouldn't be merged yet since only pytorch 2.1 nightly supports this iirc, and the only other PR for upgrading torch is for the 2.0 release. Should be marked as a draft for now. |
In theory, this is the case. As a matter of preference I did not add a version check, but in practice the change has no effect unless the NaN check is triggered (and none at all if the NaN check is disabled), so merging is still feasible. If someone finds it necessary, exception handling can also be added to report that the installed PyTorch version does not support bf16.
I don't agree with this opinion. Ideally this conversion is done only once, so it takes very little time. Conversely, bf16 performs worse than fp16 in terms of speed, VRAM consumption, and accuracy, so bf16 overall is not ideal. Considering that even a problematic VAE does not necessarily generate a black image, lazy conversion is a good solution. |
What kind of performance and VRAM impact are you seeing to make this sort of statement? Do you have numbers to back this up? Personally I've not seen this after using a BF16 VAE full-time during the past 4 months, and performance improved after PyTorch merged BF16 interpolate support into nightly. I just did a quick test with
This is to be expected, since TF32<->BF16 should be a faster and higher quality conversion than TF32<->FP16 on NVIDIA. Both FP16 and BF16 have identical 16-bit data sizes, so there should be no additional VRAM usage unless an unneeded conversion with duplication of data is occurring somewhere in WebUI.
I'm also of the opinion that it would make more sense to implement this similar to |
You can refer to the data at the end of this issue, but there may be some differences for VAE. I am most concerned about the accuracy, but as you said, it may be necessary to do a comparison to test which of bf16 and fp16 has better results.
I do the bf16 conversion as a trigger operation, so it can be considered as part of the |
It seems like that old resolved issue was about a perf regression in Lightning only, compared to PyTorch itself, when using the CUDNN V7 API, which was resolved with the CUDNN V8 API. So I don't think that will affect us, as Torch switched to using the CUDNN V8 API for BF16 convolution support a long time ago. The statement at the bottom of the issue is also true currently, since BF16 mixed-precision can indeed have poor performance under certain workflows. This can cause situations where working in TF32 is faster because it eliminates slow casts that result in a negative perf benefit, but that doesn't seem to apply to webui SD inference. I've tested BF16 autocast before in webui and it does indeed have a significant memory and performance impact on inference, since Torch doesn't autocast many ops to BF16 like it does for FP16, resulting in it mostly working in TF32 when BF16 autocasting is enabled. Still faster than TF32 only for our use case, but there is very little benefit in doing so for inference, unlike training. Similarly, converting other components of a model such as UNET or CLIP to BF16 has a rather large impact on seed output, so that should likely be avoided as well unless someone trained a BF16 precision model directly.
Yes, but rather than specifically VAE, I believe it has more to do with not using BF16 mixed-precision (autocast) here. We are only doing a single manual cast to BF16 VAE bias, while using FP16 mixed-precision. |
@Cyberbeing I briefly tested the VAE running under different types, and posted two pictures here to show that there is no obvious content difference in these test samples. On these samples, there is no significant difference in speed between fp16 and bf16. The following is the content difference of fp16 and bf16 compared with tf32 on the test samples; bf16 shows more deviation on all samples. In previous tests I found an example where global bf16 caused significant content differences (I didn't keep it), which is why I insist that bf16 is only suitable as a fallback. bf16 has no advantage in the current implementation. |
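For context on why bf16 may deviate more than fp16 on these samples: the two formats split their 16 bits differently. The sketch below derives the largest finite value and the relative step size near 1.0 from the standard bit layouts (fp16: 5 exponent bits and 10 mantissa bits; bf16: 8 and 7; fp32: 8 and 23). The helper name `format_props` is made up for illustration and is not part of webui or PyTorch.

```python
def format_props(exp_bits, man_bits):
    """Largest finite value and machine epsilon of an IEEE-style binary float format."""
    bias = 2 ** (exp_bits - 1) - 1           # also the largest usable exponent
    max_finite = (2.0 - 2.0 ** -man_bits) * 2.0 ** bias
    eps = 2.0 ** -man_bits                   # spacing between 1.0 and the next value
    return {"max": max_finite, "eps": eps}

# (exponent bits, mantissa bits) of each format
formats = {"fp16": (5, 10), "bf16": (8, 7), "fp32": (8, 23)}

for name, (exp_bits, man_bits) in formats.items():
    props = format_props(exp_bits, man_bits)
    print(f"{name}: max={props['max']:.4g} eps={props['eps']:.3g}")
```

The step size of bf16 is 8 times coarser than that of fp16, which is at least consistent with the larger deviations reported above, while its range matches fp32.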
It would appear your testing may have been done with the non-default webui options which I mentioned recently in discussions. Can you double check?
I did a quick test myself and I can only reproduce results similar to yours when those options are set. It does seem to be true that BF16 always degrades output more than FP16 does, but the difference is basically invisible to the human eye unless you pixel peep. I think what happened was that months ago, when I last tested this, I was still using xformers (non-deterministic), and I had only set the following before I discovered the massive degradation recently if I used FP16 VAE without that also set to false.
Which is why I got the impression that BF16 VAE was closer to FP32 VAE than FP16 VAE was, since with those settings it was. Yet my testing wasn't apples to apples. This reminds me that we really should create a pull request to disable the reduced_precision_reduction options, which is a huge quality boost for FP16 & BF16 and seems to bring seed reproduction very close to FP32 VAE. At least on my GPU, setting both to False had no impact on inference performance, but someone should really test the impact on training.
[Image comparisons: FP32.png vs TF32.png, FP32.png vs FP16.png, FP32.png vs BF16.png (all with reduced_precision_reduction disabled)]
With webui/pytorch defaults they had nearly identical seed output, with image output noticeably different from FP32. Technically FP16 was still 0.03% better, but that was invisible to the human eye on the noise floor:
[Image comparisons: FP32.png vs FP16-pytorch-defaults.png, FP32.png vs BF16-pytorch-defaults.png, BF16-pytorch-defaults.png vs FP16-pytorch-defaults.png]
I'm beginning to see the merit of your approach, but I'd still prefer this to be a command line argument which is optionally enabled. Even better would be implementing both methods and have
The main problems with the auto-on-NaN method are indeed that the first time you'd be repeating processing, but also that the VAE then stays permanently as bfloat16 until you load a new model or VAE, which could lead to inconsistent seed output. In other words, your output could change depending on the order and type of generations you perform. |
I didn't set it in the code. Unless the PyTorch documentation is wrong, the default is fp16 on and bf16 off, and enabling allow bf16 is not recommended. |
@Cyberbeing This is the result of disabling allow fp16. Similar to the previous one, fp16 is still more similar, and there is no obvious performance difference between the two. These samples are significantly different from not disabling allow fp16.
I don't see any examples where disabling it improves the quality, fp16 can show a high similarity to fp32 whether or not allow fp16 is disabled. Predictably, disabling it slows down training.
Adding an enable parameter is trivial, but I want it to work out of the box so that users on supported devices don't have to suffer through VAE problems anymore. I tied it to the NaN check partly because the NaN check is relatively useless by itself (I don't think it makes sense to merely prevent a black image from being saved: you have already spent the time, so it just saves you one press of delete), and partly because it does not increase the complexity of use. Since I don't see any advantage of bf16, I won't support global bf16. It just reduces accuracy for nothing; maybe I can try another VAE. I think another suitable option is to add VAE type recognition to go with a pre-converted bf16 VAE. It can be reproduced stably, but it increases the cost of use. I know that this lazy conversion may cause inconsistent output on a specific VAE. This is a matter of trade-offs, but thanks to the similarity between fp32 and fp16, this inconsistency can be regarded as a difference from fp32. |
[Edit: I realized a potential oversight that I may have forgotten to test FP32 VAE with reduced precision reduction enabled which was a missing data point, so I removed most of this post. It's irrelevant to the PR anyway, so not worth further discussion here.]
The PyTorch documentation is indeed wrong, which surprised me as well when I first discovered it months ago. As you can see, the commit message states they've disabled it, but if you look at the code itself you'll see both fp16 & bf16 reduced_precision_reduction are enabled by default (pytorch/pytorch@909a989), which is still the case in the PyTorch 2.1 master branch. Yet both degrade inference quality unless set to False, without any real performance benefit for SD inference. It does significantly change seeds though, so it would need to be made optional. You can easily double-check the PyTorch defaults by just importing torch and calling them. Either way this is getting a bit off-topic, since setting those options to False should likely be made a separate PR, which will then likely need to be added to compatibility options so people can reproduce their old reduced precision seeds. I only brought this up since it seems to make my results closer to yours with them set to False. I'll step out of this PR for now and just let AUTOMATIC1111 decide the course. |
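For anyone who wants to check this on their own install, here is a minimal sketch that reads the two flags from `torch.backends.cuda.matmul`. The helper name `reduced_precision_flags` is made up for illustration; it reports `None` for a flag when torch is not installed or the attribute is missing in that PyTorch version, and it makes no claim about what the defaults are.

```python
def reduced_precision_flags():
    """Read the two cuBLAS reduced-precision reduction flags, if available."""
    names = (
        "allow_fp16_reduced_precision_reduction",
        "allow_bf16_reduced_precision_reduction",
    )
    try:
        import torch
    except ImportError:
        # torch is not installed, so nothing can be reported
        return {name: None for name in names}

    matmul = torch.backends.cuda.matmul
    flags = {}
    for name in names:
        try:
            flags[name] = getattr(matmul, name)
        except (AttributeError, RuntimeError):
            # flag not present in this PyTorch version or build
            flags[name] = None
    return flags


print(reduced_precision_flags())
```

The same attributes can be assigned to, for example `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False`, which is the setting discussed above.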
@Cyberbeing I checked your example, and it also clearly shows that fp16 maintains better accuracy under the same settings. As for the backend setting, it is not under discussion here. You're right about one thing: we need an opinion from @AUTOMATIC1111. |
@YHD233 What version of PyTorch and what type of GPU are you using? |
python: 3.10.6
torch: 2.1.0.dev20230407+rocm5.4.2
GPU: RX6800
system: Kubuntu 22.04
But when I reopen the console the error doesn't appear again. |
@YHD233 I think I need to know your XYZ Plot parameters. |
X type: CFG Scale
X values: 8,9,10,11,12,13,14,15,16
Once I get this error, it also occurs when I close XYZ Plot and generate directly. It does not work again until I close the console and reopen it. |
@YHD233 My mistake, fixed. |
I found that if I stop a generation while using XYZ plot and then start generating again, this error is output when the progress bar is full. |
@AUTOMATIC1111 What do you think about this PR? |
Since we are on torch 2.0 and this appears to need torch 2.1, I have not considered it yet. I don't like adding a command-line flag: if it works and is supported by the GPU, I think it should be enabled without asking the user to enable it. Also, the most important question is: does it really help with black square images in VAE? |
I originally did it as part of the
On my test case, it is clearly effective, and in theory it can solve the same problem as |
PyTorch lacks an explicit method to check for bf16 support on AMD GPUs. |
Can't you just create a one-number bf16 tensor and do something with it, like multiply it by 0.5, to test if bf16 is supported? |
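A minimal sketch of that probe, assuming that an unsupported dtype or device surfaces as an exception. The helper name `bf16_supported` is hypothetical and this is not the check the PR ends up using. Note that some devices may emulate bf16 slowly rather than fail, so a passing probe shows only that the operation runs.

```python
def bf16_supported(device="cuda"):
    """Probe bf16 support by running one tiny bf16 multiplication on `device`."""
    try:
        import torch
    except ImportError:
        return False  # no torch, so no bf16 either

    try:
        x = torch.ones(1, dtype=torch.bfloat16, device=device)
        y = x * 0.5
        # 0.5 is exactly representable in bf16, so the comparison is exact
        return bool(y.float().item() == 0.5)
    except Exception:
        # missing device, unsupported dtype, or a build without this backend
        return False


print(bf16_supported("cuda"))
```

ROCm builds of PyTorch also expose AMD GPUs under the `"cuda"` device name, so the same call applies to them.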
I know it works, it's just that it's not aesthetically pleasing and I don't have the equipment to test it. |
I found a method on PyTorch that will try to enable this feature by default. |
The title for this PR isn't really clear either imo. It isn't adding support for bf16 VAEs, that's already supported out-of-the-box with Torch 2.1. All this adds is the rollback feature for when a fp16 VAE produces NaNs. |
WebUI doesn't have code to handle bf16, so it looks like it will switch to fp even though PyTorch supports bf16. But it doesn't matter; the name of the PR has no effect. |
I think 23c947a supersedes this? |
You are wrong, both fp16 and bf16 are designed to save VRAM. If users can accept fp32 at any time, it is better to run VAE with fp32 globally, and it is even more useless to fall back to fp32. |
Now that PyTorch 2.1 has been released, any news on this? |
Ever since this was made, the webui got a similar mechanism (and I used the idea from this PR) to deal with SDXL VAE errors, but converting to FP32 instead of BF16. So this PR would have to be integrated into the existing system, which I did in ac0ecf3. |
Describe what this pull request is trying to achieve.
According to the description here, bf16 can solve the problem of VAE working in half precision to generate black images, so I made this commit.
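One plausible mechanism, stated here as an assumption rather than something established in this PR, is that an intermediate value exceeds the fp16 maximum of 65504, becomes infinity, and later turns into NaN. The sketch below emulates both formats with the standard library only. `to_fp16` and `to_bf16` are made-up helpers, the value 70000.0 is hypothetical, and the bf16 emulation truncates where real hardware would normally round to nearest.

```python
import math
import struct


def to_fp16(x):
    """Round x to IEEE half precision, mapping overflow to infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_bf16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


activation = 70000.0  # hypothetical intermediate value above the fp16 maximum

fp16_value = to_fp16(activation)
bf16_value = to_bf16(activation)

# inf - inf is NaN, so a single overflow can poison later arithmetic
print("fp16:", fp16_value, fp16_value - fp16_value)
print("bf16:", bf16_value, bf16_value - bf16_value)
```

In this emulation the bf16 value stays finite but is coarser than the input, which is the range-for-precision trade the rest of the thread argues about.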
Additional notes and description of your changes
bf16 is great to use as a fallback: when the webui detects an empty image generation, it tries to convert and retry on supported devices. This works fine on my test case. Note that if you want to use this feature, you need a GPU that supports bf16 and a webui running on PyTorch 2.1. For unsupported devices, you can still only use --no-half-vae.
Edit: In theory, AMD GPUs are also supported.
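The retry logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `decode_with_fallback`, `fake_decode`, and the dtype names passed as strings are all hypothetical, and the stand-in decoder simply returns NaNs for float16.

```python
import math


def decode_with_fallback(decode, latent, dtypes=("float16", "bfloat16")):
    """Try each dtype in order and return the first NaN-free result with its dtype."""
    for dtype in dtypes:
        image = decode(latent, dtype)
        if not any(math.isnan(value) for value in image):
            return image, dtype
    raise ValueError("every dtype produced NaNs")


def fake_decode(latent, dtype):
    """Stand-in for a VAE decode: pretends the fp16 path always yields NaNs."""
    if dtype == "float16":
        return [math.nan for _ in latent]
    return [value * 0.5 for value in latent]


image, used = decode_with_fallback(fake_decode, [0.2, 0.4])
print(used, image)
```

Returning the dtype that succeeded lets the caller keep using it for later generations, which avoids decoding twice every time but is also the source of the order-dependent output raised in the discussion.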
Environment this was tested in