support HyperTile optimization #13948

aria1th · 2023-11-11T14:39:28Z

https://github.com/tfernd/HyperTile

Description

HyperTile is an optimization that adds yet another form of split attention.

Currently, testing it with other extensions such as ControlNet requires a modified repository.

For that reason requirements.txt is not modified here; HyperTile has to be installed manually:

pip install git+https://github.com/tfernd/HyperTile

Screenshots/videos:

The tests cover 3 environments:

512x512 (and 2x latent upscale)
768x768 (and 2x latent upscale)
OpenPose ControlNet with 512x768 2x latent upscale

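Throughput in it/s, with HyperTile off (NO) and on (YES):

| HyperTile | 512x512 | 512x512, 2x latent | 768x768 | 768x768, 2x latent |
|---|---|---|---|---|
| NO | 6.02 | 3.64 | 5.56 | 1.03 |
| YES | 6.04 | 4.65 | 5.86 | 2.05 |
| NO (b2) | 4.97 | 1.7 | 3.6 | 0.45045 |
| YES (b2) | 5.38 | 2.43 | 4.26 | 0.943396 |
| B1 improvement | 0.33% | 27.75% | 5.40% | 99.03% |
| B2 improvement | 8.25% | 42.94% | 18.33% | 109.43% |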
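The improvement rows can be reproduced from the it/s figures. This is a minimal sketch, assuming the percentage is the relative throughput gain of the YES row over the NO row; the function name is made up for illustration.

```python
def improvement_percent(baseline_its: float, hypertile_its: float) -> float:
    """Relative throughput gain, in percent, of hypertile_its over baseline_its."""
    return (hypertile_its / baseline_its - 1.0) * 100.0


# NO and YES rows from the table above (it/s), keyed by test resolution.
without_ht = {
    "512x512": 6.02,
    "512x512, 2x latent": 3.64,
    "768x768": 5.56,
    "768x768, 2x latent": 1.03,
}
with_ht = {
    "512x512": 6.04,
    "512x512, 2x latent": 4.65,
    "768x768": 5.86,
    "768x768, 2x latent": 2.05,
}

for name, baseline in without_ht.items():
    gain = improvement_percent(baseline, with_ht[name])
    print(f"{name}: {gain:.2f}%")
```

The same function applied to the b2 rows gives the B2 percentages.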

ControlNet (OpenPose, 512x768, 2x latent upscale)

| HyperTile | it/s |
|---|---|
| NO | 1.33 |
| YES | 1.92 |
| Improvement | 44.36% |

The tests used the animefull-latest-pruned model with the prompt 1girl and the negative prompt easynegative.
They were run on an RTX 4090 with an i7-8700 and DDR4-3200 RAM, with a batch count of 16 for each test.

The patch makes HyperTile deterministic.
Here is the behavior with / without HyperTile:

Without HyperTile - original image. The result should be reproducible without the patch.

With HyperTile - tested twice with the same seed. The result is slightly different from the original, but deterministic.

TODO: we need infotext for HyperTile enabled / disabled.

Checklist:

I have read contributing wiki page
I have performed a self-review of my own code
My code follows the style guidelines
My code passes tests

AndreyRGW · 2023-11-11T17:53:48Z

Finally someone decided to make HyperTile for Automatic1111!

aria1th · 2023-11-12T13:34:36Z

Note:
There is a problem with non-standard heights / widths; HyperTile usually supports multiples of 128.

With just the txt2img process, some shapes like 512x704 simply work.

Unfortunately, for the img2img process, the image has to be resized to multiples of 128 first, or you will see these artifacts:

(Well, it's kinda beautiful(?) but still)

So we need more testing for this.

Still, it should not cause any problems with extensions or other stuff.
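The resize to multiples of 128 mentioned for img2img could be computed along these lines. This is only a sketch: the helper name and the round-to-nearest choice are assumptions for illustration, not what the PR does.

```python
def snap_to_multiple(value: int, multiple: int = 128) -> int:
    """Round value to the nearest multiple (ties round up), with a floor of one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def snap_size(width: int, height: int, multiple: int = 128) -> tuple[int, int]:
    """Snap both image dimensions to the nearest multiple."""
    return snap_to_multiple(width, multiple), snap_to_multiple(height, multiple)


# 512 is already a multiple of 128; 704 sits halfway between 640 and 768.
print(snap_size(512, 704))
```

Rounding to the nearest multiple changes the aspect ratio slightly; rounding both dimensions down or up instead is an equally valid choice.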

gel-crabs · 2023-11-14T21:12:38Z

Hey, if anyone wants to test this on SDXL, I created an amateur port of hypertile.py to use the SDXL depth layers.

I spent a lot of time tuning the numbers and tile sizes and whatnot over the past few days and I think I've found the best settings, i.e. best performance, no artifacts and it only slightly changes the seed compared to without.

I also added the LDM key for the VAE. The original only had the diffusers key (A1111 only uses LDM/SGM and not diffusers), so this PR wasn't tiling the VAE at all. The VAE hasn't changed across SD versions, so this is relevant for both 1.5 and SDXL.

Big thank you to the creator of https://github.com/arenasys/stable-diffusion-webui-model-toolkit as the components directory in that repo is the only place I could find any info about what layers 1.5 and XL use.

hypertile.py.txt (Rename to hypertile.py in the modules directory)

This only works for SDXL and not 1.5. Without it I get 1.7 it/s, with it I get 1.82 it/s, with no loss to quality or determinism. Definitely worth it.

(Oh yeah, forgot to mention, I also commented out the line that prints every layer it hijacks to the console. SDXL has a LOT of layers.)

aria1th · 2023-11-15T06:10:16Z

SD Base (1.4-1.5)
512x768 1.5x Latent - 5.61it/s
Without - 4.36it/s

SD XL
512x768 1.5x - 2.71it/s
Without - 2.60it/s

(3 pass, batch count 6)

Co-Authored-By: Kieran Hunt <kph@hotmail.ca>

gel-crabs · 2023-11-15T20:52:03Z

Also note that increasing the max_depth further increases it/s, as it hijacks more layers.

I'm not sure about 1.5, but on SDXL I've gotten the best results with max_depth 2 (max_depth 1 is about the same speed as max_depth 2, but with a reduction in quality).

aria1th · 2023-11-16T00:52:46Z

The tile size / depth / etc. options will be added to the settings soon ™️
(Except for auto-determining the largest tile size; I guess the current implementation is correct for that.)

Also, I found vladmandic's old implementation, which says it is not compatible with ToMe and other extensions of that type. I guess it can just work if we apply the HyperTile hijack at the last moment; confirmed with ToMe ratio 0.3, etc.

So if anyone finds a bug, please ping me.

gel-crabs · 2023-11-16T23:27:42Z

I'm unable to get the newest version of the patch to work (on SDXL at least), I'm unsure as to what's causing it

I'm pretty sure I was sleepy while implementing this

aria1th · 2023-11-17T00:36:16Z

@gel-crabs Thank you for the comment! You're right, I confirmed the issue came from typos introduced while refactoring. (I may have to refactor again...)

A. The options were inverted, so enabling it actually disabled it...
B. The hijack was only applied to the VAE, so there was only a very minor speed improvement. It is now fixed.

Confirmed working for SD Base 1.5 now.
768x768 2x upscale 3 images - 2.15it/s vs 1.60it/s
512x512 2x upscale 6 images - 4.95it/s vs 3.72it/s

SD XL - 768x768 depth 0
2.73it/s vs 2.58it/s
(Not so dramatic maybe?)

gel-crabs · 2023-11-17T16:58:47Z

It works! With the newest commits and full max_depth, my it/s now goes from 1.7 to 1.88. Not bad at all!

If I'm able to find any information about SDXL depth layers in diffusers, I will hook it up in case A1111 gets diffusers support in the future (plz, I need inpaint)

AUTOMATIC1111 · 2023-11-26T08:25:59Z

I wanted this to work without changes to processing.py so I partially reworked the file into a built-in extension; additionally added an option to only apply unet hypertile to a hires fix pass. Still no infotext params - adding them is easy but I think before that reasonable defaults should be figured out - ones that give most speed improvement with least image difference.

AUTOMATIC1111 · 2023-11-27T07:31:45Z

| Settings | 1024x1024, it/s | 1600x1600, it/s |
|---|---|---|
| without hypertile | 3.68 | 1.03 |
| ht d=3, tile=256, s=3 | 4.68 | 2.33 |
| ht d=2, tile=256, s=3 | 4.74 | 2.31 |
| ht d=1, tile=256, s=3 | 4.72 | 2.34 |
| ht d=0, tile=256, s=3 | 4.86 | 2.15 |
| ht d=3, tile=128, s=3 | 4.84 | 2.35 |
| ht d=3, tile=64, s=3 | 5.64 | 2.32 |
| ht d=3, tile=512, s=3 | 4.42 | 2.33 |
| ht d=3, tile=512, s=0 | --- | --- |
| ht d=3, tile=512, s=1 | 5.39 | 2.54 |
| ht d=3, tile=512, s=2 | 5.41 | 2.32 |
| ht d=3, tile=512, s=4 | 4.75 | 2.33 |
| ht d=3, tile=512, s=5 | 4.72 | 2.18 |
| ht d=3, tile=512, s=6 | 4.73 | 1.94 |
| ht d=3, tile=64, s=1 | 5.98 | 2.57 |
| ht d=2, tile=64, s=1 | 6.09 | 2.5 |
| ht d=1, tile=64, s=1 | 6.02 | 2.51 |
| ht d=0, tile=64, s=1 | 2.95 | 2.15 |
| ht d=0, tile=512, s=1 | 5.52 | 2.17 |
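To make the table easier to compare, the rows can be ranked by speed-up over the run without HyperTile. This is a small sketch that copies the figures above; the d, tile and s labels are kept exactly as the table writes them, and the helper name is hypothetical.

```python
# Rows from the table above: (d, tile, s, it/s at 1024x1024, it/s at 1600x1600).
# The d=3, tile=512, s=0 row is left out because it has no measurements.
BASELINE = (3.68, 1.03)  # "without hypertile" row
ROWS = [
    (3, 256, 3, 4.68, 2.33),
    (2, 256, 3, 4.74, 2.31),
    (1, 256, 3, 4.72, 2.34),
    (0, 256, 3, 4.86, 2.15),
    (3, 128, 3, 4.84, 2.35),
    (3, 64, 3, 5.64, 2.32),
    (3, 512, 3, 4.42, 2.33),
    (3, 512, 1, 5.39, 2.54),
    (3, 512, 2, 5.41, 2.32),
    (3, 512, 4, 4.75, 2.33),
    (3, 512, 5, 4.72, 2.18),
    (3, 512, 6, 4.73, 1.94),
    (3, 64, 1, 5.98, 2.57),
    (2, 64, 1, 6.09, 2.50),
    (1, 64, 1, 6.02, 2.51),
    (0, 64, 1, 2.95, 2.15),
    (0, 512, 1, 5.52, 2.17),
]


def best_row(column: int):
    """Return ((d, tile, s), speedup) for the fastest row.

    column 0 selects the 1024x1024 figures, column 1 the 1600x1600 figures.
    """
    row = max(ROWS, key=lambda r: r[3 + column])
    return row[:3], row[3 + column] / BASELINE[column]


for column, label in enumerate(("1024x1024", "1600x1600")):
    (d, tile, s), speedup = best_row(column)
    print(f"{label}: d={d}, tile={tile}, s={s} -> {speedup:.2f}x over no HyperTile")
```

This only ranks by speed; the table says nothing about how much each setting changes the image.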

FurkanGozukara · 2023-12-04T10:22:17Z

ok i found it how do we use

aria1th · 2023-12-04T11:22:57Z

@FurkanGozukara Go to Settings - Hypertile options and enable the optimizations (and set the swap size to something large, like 12, for safety). Then there you go.

FurkanGozukara · 2023-12-04T11:25:54Z

thank you. i am testing right now SDXL on RTX 3060 - 12 GB - i don't see any difference in speed for 1024x1024

outputs changing

what does each option do

depth
swap size
max tile size

ArxFusion · 2023-12-04T12:06:44Z

Sorry but I don't quite understand how this works, is this for txt2img with hires fix or only img2img upscale? And as for the options, do I enable Enable Hypertile U-Net, Enable Hypertile U-Net for hires fix second pass and Enable Hypertile VAE? I have tried to use it with all the options enabled and one-by-one for txt2img, I do not see any real difference in speed or image quality....unless I am doing something wrong. There are some very minor changes in 1.5 but none that I can see in SDXL.

FurkanGozukara · 2023-12-04T12:22:36Z

i got very little speed improvement

testing with RTX 3090 TI

from 1280x1024 to 2176x1740

without hyper tile : 1.13 second / it - second pass
with hyper tile : 1.07 second / it - second pass
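These two figures are in seconds per iteration, while the earlier tables use it/s. A quick conversion, as a sketch with hypothetical names, puts them on the same scale:

```python
def seconds_per_it_to_its(seconds_per_it: float) -> float:
    """Convert seconds per iteration to iterations per second."""
    return 1.0 / seconds_per_it


without_ht = seconds_per_it_to_its(1.13)  # second pass, HyperTile off
with_ht = seconds_per_it_to_its(1.07)     # second pass, HyperTile on
gain_percent = (with_ht / without_ht - 1.0) * 100.0

print(f"without: {without_ht:.3f} it/s, with: {with_ht:.3f} it/s, gain: {gain_percent:.1f}%")
```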

nothing like @AUTOMATIC1111 provided table above

tested settings

aria1th · 2023-12-04T12:24:59Z

@ArxFusion @FurkanGozukara
The speed-up only appears when the GPU is struggling with large image tiles, which usually means high resolutions. In normal cases, as tested, it is not really noticeable.

That is why the options are separated into 'first pass', 'hires pass' and 'VAE stage', so each can be enabled for the corresponding bottleneck.

(In other words, if you just use 512x512 then you usually don't need it)

The depth option is noticeable if you are creating gigantic images (well, it depends on your ratio...)

Max tile size - larger is better (adjusted by ratio)

Swap size - smaller is usually faster, but can produce artifacts, so there is a trade-off between speed and aesthetic score.
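A rough way to see why the gain shows up mostly at high resolutions is to count the query-key pairs in self-attention. The sketch below assumes the usual 8x downscale from image to latent and assumes HyperTile splits the latent into a grid of tiles that attend independently. The function name and the 2x2 grid are made up for illustration, and the count ignores the deeper, lower-resolution U-Net layers.

```python
def attention_pairs(height: int, width: int, tiles_h: int = 1, tiles_w: int = 1,
                    latent_scale: int = 8) -> int:
    """Count query-key pairs for self-attention over a latent split into tiles.

    Each tile attends only within itself, so the cost is
    (number of tiles) * (tokens per tile) ** 2.
    """
    tokens_h = height // latent_scale
    tokens_w = width // latent_scale
    tile_tokens = (tokens_h // tiles_h) * (tokens_w // tiles_w)
    return tiles_h * tiles_w * tile_tokens ** 2


for size in (512, 1024, 1600):
    full = attention_pairs(size, size)
    tiled = attention_pairs(size, size, tiles_h=2, tiles_w=2)
    print(f"{size}x{size}: full={full:,} tiled(2x2)={tiled:,} ratio={full / tiled:.0f}x")
```

The relative saving from a fixed grid is the same at every size, but the absolute attention cost grows with the square of the pixel count, so it is only a significant share of the step time for large images.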

FurkanGozukara · 2023-12-04T12:26:47Z

thank you i tested like this. shouldnt i see super speed improvement at high res fix pass?

aria1th · 2023-12-04T12:31:46Z

@FurkanGozukara Did you get hit with any memory problem while generating images?

But as mentioned above, SD XL does not show that dramatic improvement compared to 1.5-type models, as expected.

There are too many layers in SD XL, which can be the cause for this issue... (comfyUI shows same behavior - afaik it does not do anything for SD XL)

FurkanGozukara · 2023-12-04T14:53:44Z

I see. Well I tested without any VRAM limiting issue. I have 24 GB VRAM with RTX 3090 TI

For SD 1.5 where can we utilize this? I mean when we make it higher resolution it produces garbage. So which places we could utilize?

aria1th · 2023-12-04T14:57:07Z

@FurkanGozukara It can be used for 1024x1024 or 1600x1600, combined with kohya's hires fix, or even used for extremely high resolutions with low denoising strength (which was the original author's main purpose).

FurkanGozukara · 2023-12-04T14:59:42Z

can you show a screenshot of such sd 1.5 settings so i would like to test here

like generating 1600x1600 image with sd 1.5

aria1th · 2023-12-04T15:10:34Z

The simple setting ™️ for 1.5 will be like this - (unfortunately I'm currently running training, so I can't take screenshot of speed)

FurkanGozukara · 2023-12-04T15:12:45Z

i see thanks. yes i also saw some real improvement at sd 1.5. tensor RT brings more improvement will this work with TensorRT? @aria1th

FurkanGozukara · 2023-12-04T15:20:56Z

i will combine and test with tensorRT. if both works huge speed improvement for SD 1.5

aria1th · 2023-12-04T15:34:15Z

@FurkanGozukara Yes, but note that TensorRT requires the code to be 'included' at compile time, so HyperTile has to be part of the model itself... Still, I'd say it is barely possible.

FurkanGozukara · 2023-12-05T00:29:08Z

ok i see. ty

ArxFusion · 2023-12-05T10:37:30Z

Thank you for updating the text with some info on the various options. I can see that for SDXL it's not really working, but for 1.5 there are some slight improvements. So far it's only really noticeable when running at more than 30 steps, and it seems deterministic on your own hardware. I do notice that the generated images can deviate between settings, some for the better and some for the worse depending on the options selected. I won't post my results since I'm not sure how to even benchmark this, because I feel everyone will have different experiences.

zcatharisis · 2023-12-06T11:40:52Z

Since Hypertile is intended for large images usually, could an option be added so that it's only enabled for hiresfix pass and img2img? I tried doing it myself but I didn't understand enough of the code to guess where to change.

aria1th · 2023-12-06T12:05:49Z

@zcatharisis Yes, Enable Hypertile for Unet second pass will exclusively allow Hypertile to be used for hires.fix.

Img2Img is, though, a first pass.

zcatharisis · 2023-12-07T05:30:16Z

Sorry, I didn't make myself clear; the option would enable HyperTile for the hires fix pass and img2img passes only, while keeping it disabled for the regular txt2img first pass. Sometimes I flip-flop between upscaling by 3x in img2img, generating a 576x768 image in txt2img, and back. It's a bit of a pain having to turn HyperTile on for img2img and off for txt2img (since, as you pointed out, it generates artifacts if the resolution isn't a multiple of 128).

EDIT: Or maybe a cleaner implementation would be to detect the resolution of the image being generated and toggle HyperTile on or off? For example, it is disabled at 1024x1024 and below and enabled above that resolution?
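The resolution-based toggle suggested here might look like the sketch below. It is purely illustrative and not part of the merged extension: the function name, the pixel-count threshold, and the choice to also require multiples of 128 are all assumptions.

```python
def should_enable_hypertile(width: int, height: int,
                            min_pixels: int = 1024 * 1024,
                            multiple: int = 128) -> bool:
    """Hypothetical toggle: enable only for large images whose sides are multiples of 128."""
    if width % multiple != 0 or height % multiple != 0:
        # Non-128-multiple sizes were reported to produce artifacts in img2img.
        return False
    # Strictly above the threshold, so 1024x1024 itself stays disabled.
    return width * height > min_pixels


for size in ((576, 768), (1024, 1024), (1536, 1536)):
    print(size, should_enable_hypertile(*size))
```

Comparing total pixel count rather than each side separately lets non-square images such as 1024x2048 qualify.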

add hyperTile

294f8a5

https://github.com/tfernd/HyperTile

aria1th requested a review from AUTOMATIC1111 as a code owner November 11, 2023 14:39

aria1th marked this pull request as draft November 11, 2023 14:41

aria1th marked this pull request as ready for review November 12, 2023 13:35

Implement Hypertile

b29fc6d

Co-Authored-By: Kieran Hunt <kph@hotmail.ca>

aria1th force-pushed the hypertile-in-sample branch from db0f9a1 to b29fc6d Compare November 15, 2023 06:14

copy LDM VAE key from XL

af45872

aria1th added 2 commits November 16, 2023 18:43

convert/add hypertile options

bcfaf39

fix ruff - add newline

472c22c

aria1th added 2 commits November 17, 2023 09:22

Fix critical issue - unet apply

c40be22

Fix inverted option issue

c0725ba

I'm pretty sure I was sleepy while implementing this

aria1th added 2 commits November 17, 2023 09:54

set empty value for SD XL 3rd layer

ffd0f8d

fix double gc and decoding with unet context

97431f2

AUTOMATIC1111 approved these changes Nov 26, 2023

AUTOMATIC1111 merged commit fd8674a into AUTOMATIC1111:dev Nov 26, 2023
3 checks passed

w-e-w mentioned this pull request Dec 4, 2023

1.7.0-RC #14196

Closed

w-e-w mentioned this pull request Dec 16, 2023

1.7.0 #14323

Closed
