[Performance] Keep sigmas on CPU #15823
Conversation
It seems it interferes with #15751? I get
That is a problem on their end, and can be resolved by not sending the sigmas to the device.
* Consistent with implementations in k-diffusion.
* Makes this compatible with AUTOMATIC1111#15823
I was testing with hires fix and it seems it doesn't work?
It only happens when the hi-res pass starts to generate (after the tiled upscale process).
That's not a hi-res fix issue; it's a k-diffusion issue specific to that sampler (Euler and Heun at least; DPM++ 2M works fine). I'll have to work on an upstream patch for k-diffusion for this fix to be viable, then. Marking as draft until then.
So, for this to be resolved, either this PR needs to be merged upstream (crowsonkb/k-diffusion#109) or I have to monkey-patch the function. This probably isn't going to be a merge candidate any time soon, so I'll give the upstream PR some time.
Now that the new schedulers are merged, I've gone ahead and patched the to_d method there and have stripped the
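For context, k-diffusion's to_d helper converts a denoiser output into a Karras ODE derivative by dividing by the broadcast sigma; if the sigmas stay on the CPU while the latent x is on the GPU, that division is the one cross-device operation that breaks. Below is a minimal sketch of the kind of patch described, assuming the standard k-diffusion to_d signature; the actual patched code in this repository may differ.

```python
import k_diffusion.sampling as k_sampling
from k_diffusion import utils


def to_d_cpu_safe(x, sigma, denoised):
    """Karras ODE derivative that tolerates a CPU-resident sigma.

    Moving sigma onto x's device only at this point keeps the rest of
    the sampling loop free of GPU<->CPU synchronization.
    """
    return (x - denoised) / utils.append_dims(sigma.to(x.device), x.ndim)


# Hypothetical monkey patch; the actual change may be applied elsewhere.
k_sampling.to_d = to_d_cpu_safe
```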
Is there a performance improvement?
Yes. As stated, the current implementation induces a forced device sync during every sampling step, and the CPU has to wait for the GPU to finish its work before certain control-flow operations can run. I have a more detailed explanation of the impact here: crowsonkb/k-diffusion#108. It is also completely safe to keep the sigmas tensor on the CPU (apart from the single operation that I had to patch), because the sigmas are only used for control flow and scalar operations. Notably, this is also the last blocking operation that I am aware of in the main inference loop, so PyTorch dispatch will be completely unblocked after this and you should see nearly uninterrupted 100% GPU usage.
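To illustrate the blocking behavior described above (the tensors and shapes here are illustrative, not the actual webui code): any comparison or branch on a CUDA tensor has to copy the value back to the host, which stalls the CPU until the GPU has drained its queue, while the same check on a CPU-resident sigmas tensor is essentially free.

```python
import torch

sigmas_gpu = torch.linspace(10.0, 0.0, 20, device="cuda")
sigmas_cpu = sigmas_gpu.cpu()

# Branching on a CUDA tensor: converting it to a Python bool copies the
# value to the host, forcing a sync with all previously queued GPU work.
if sigmas_gpu[3] > 0:   # implicit GPU -> CPU sync on every step
    pass

# The same control flow on a CPU tensor reads host memory directly, so
# kernel launches for the next step keep queuing without a stall.
if sigmas_cpu[3] > 0:   # no device sync
    pass
```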
What are the best conditions for observing the improvements? Doing 50-step 512x768 SD1 generations on a 3090, iterations per second between this and dev are not noticeably different.
It's probably easiest to look at GPU usage with a fast update interval, since it ideally should almost never drop below 100%. Without this change, you will notice a small dip between each step. Since it happens once per step it may be subtle; it would probably be more pronounced on slower CPUs. But if you can see that dip, you know the GPU is idle for a period of time when it should be working. My standard setup when testing this was 512x512 at batch size 4. For completeness, I was also using --opt-channelslast (which should be noticeably faster on Ampere and above after the other performance patches, where it wasn't always before). Live preview will probably also need to be off, since it also includes blocking operations.
All right, I'm ready to merge this functionality in, but after playing with it, my conclusion is that the effect can be achieved by just doing
There's no need for all the other changes that remove the device argument from the samplers code. If you want, you can edit this PR to only have that change, or I can do that change myself. Or if I'm wrong, correct me.
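The exact snippet being referred to was not captured above, so the following is a rough illustration only; the helper name is hypothetical, and only the idea of moving the schedule to the CPU while leaving the samplers' device arguments alone comes from this thread.

```python
import torch


def get_sigmas_on_cpu(model_wrap, steps):
    """Hypothetical helper: build the noise schedule, then keep it host-side.

    The samplers' device arguments stay untouched; only the sigmas
    tensor itself is moved to the CPU.
    """
    sigmas = model_wrap.get_sigmas(steps)   # may be created on the GPU
    return sigmas.to(torch.device("cpu"))
```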
Yeah, I agree the device arguments don't necessarily have to be removed. Setting it to CPU should work. The