no speed up #19
If you're not using xformers, ToMe should increase inference speed even with 512x512 images, so something is going wrong here. A couple of steps to help debug:
Also, what stable diffusion code base are you using? Is it one of the supported ones?
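One way to sanity-check whether ToMe is actually being applied is to time the pipeline with and without the patch. Below is a minimal, hypothetical timing helper; the commented usage assumes the `tomesd` `apply_patch`/`remove_patch` API and a `diffusers` pipeline already loaded on a GPU:

```python
import time

def benchmark(fn, warmup=1, iters=3):
    """Average the runtime of a callable, discarding warmup runs
    (the first call often pays one-time costs like CUDA kernel setup)."""
    for _ in range(warmup):
        fn()
    start = time.time()
    for _ in range(iters):
        fn()
    return (time.time() - start) / iters

# Hypothetical usage (requires a GPU, diffusers, and tomesd):
#   baseline = benchmark(lambda: pipe("cat").images)
#   tomesd.apply_patch(pipe, ratio=0.5)
#   patched = benchmark(lambda: pipe("cat").images)
#   tomesd.remove_patch(pipe)
#   print(f"speedup: {baseline / patched:.2f}x")
```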
Thank you for your comment!
@liujianzhi @dbolya I am having a similar issue running on an A100. My baseline is 25 it/s at float16. At a 50% ratio or anything below it, the speed actually gets slower, and increasing max_downsample slows it down further. I imagine the merging/unmerging process is causing much of the latency?

At 1024x1024, batch size 1, the results were more exciting. The benefit improves going from max_downsample 1 to 2 but then decreases at higher values, which seems to support latency from the merging process itself, since I imagine merging tokens when the hidden states are already quite small isn't as useful. (2.35 at 1024x1024, batch size 8.)
@liujianzhi @ethansmith2000 Similarly, that's why the default max_downsample is 1, where there are the most tokens. Even without xformers, there's not much benefit to applying ToMe deeper into the network, where there are fewer tokens (see the ablations in the paper).
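As a rough illustration of why depth matters (assuming SD 1.x's 64x64 latent for a 512x512 image, and ignoring exact layer counts), the number of self-attention tokens shrinks quadratically with each downsample level, so there is far less for ToMe to merge deeper in the UNet:

```python
# Rough token counts per UNet resolution for a 512x512 image (64x64 latent).
# Illustrative numbers only, not an exact trace of the SD architecture.
latent_side = 64

for downsample in (1, 2, 4, 8):
    side = latent_side // downsample
    tokens = side * side
    merged = int(0.5 * tokens)  # tokens ToMe would merge away at ratio=0.5
    print(f"max_downsample={downsample}: {tokens:5d} tokens, {merged:5d} merged")
# -> at downsample 1 there are 4096 tokens (2048 merged), but only 64 at downsample 8
```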
Thank you Daniel, makes sense!
Sorry. I upgraded my
I just tried to reproduce using your code above:

Now these results aren't stellar, probably because of the batch size of 1, but they at least improve. Can you try running with a higher image / batch size? The way to do that is to pass arguments when you call the pipeline:

```python
for i in range(2000 // 5):
    image = pipe("cat", num_images_per_prompt=5).images[0]
```

for a batch size of 5. Similarly, you can use … Running with a batch size of 5, I get ~2000s for ToMe with 3 from above and ~3300s without (i.e., 1 from above).
Hey there, I'm reviving this issue as I was facing a similar problem and I think I found out a part of what was going on in my case. I am using a single A10 GPU, and I'm trying to reproduce @dbolya's results. It turns out that recent versions of diffusers use the efficient `torch.nn.functional.scaled_dot_product_attention` by default (with PyTorch 2.0), which leaves little room for ToMe to help. However, resetting to the default attention (non-optimized) using `pipe.unet.set_default_attn_processor()` brings back a measurable speed-up.

Main Packages Versions

```
python==3.10.11
diffusers==0.16.1
pytorch==2.0.1
```

Minimum code example to reproduce

```python
import time
from collections import defaultdict

import tomesd
import torch
from diffusers import StableDiffusionPipeline
from tqdm import tqdm, trange


def test(batch_size=1, total_number_of_images=20, tome_ratio=0):
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    if tome_ratio > 0:
        tomesd.apply_patch(
            pipe, ratio=tome_ratio
        )  # Can also use pipe.unet in place of pipe here
    pipe.set_progress_bar_config(disable=True)
    # skip efficient torch.nn.functional.scaled_dot_product_attention based attention
    pipe.unet.set_default_attn_processor()
    start = time.time()
    for i in trange(total_number_of_images // batch_size):
        images = pipe(
            ["A photo of a dog riding a bicycle"],
            num_images_per_prompt=batch_size,
            generator=torch.manual_seed(4251142),
        ).images
    return (time.time() - start) / total_number_of_images


def ddict():
    return defaultdict(ddict)


runtimes = ddict()

## Run Experiments
for batch_size in [1, 2, 4, 5, 10]:
    print(f"Batch size: {batch_size}")
    for tome_ratio in [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]:
        print(f"\tToMe ratio: {tome_ratio}")
        runtimes[batch_size][tome_ratio] = test(
            batch_size=batch_size, total_number_of_images=20, tome_ratio=tome_ratio
        )
        print(f"\t\tRuntime: {runtimes[batch_size][tome_ratio]:.3f}")

## Print Results
for batch_size, runtimes_batch in runtimes.items():
    no_tome = runtimes_batch[0]
    print(f"Batch size: {batch_size}.")
    for tome_ratio, runtimes_batch_ratio in runtimes_batch.items():
        time_perc = 100 * (no_tome - runtimes_batch_ratio) / no_tome
        speed_perc = 100 * (no_tome - runtimes_batch_ratio) / runtimes_batch_ratio
        print(f"  ToMe ratio: {tome_ratio:.1f} -- runtime reduction: {time_perc:5.2f}% -- speed increase: {speed_perc:5.2f}%")
```

Results

Results with no memory-efficient attention

Batch size: 1.
ToMe ratio: 0.0 -- runtime reduction: 0.00% -- speed increase: 0.00%
ToMe ratio: 0.1 -- runtime reduction: -2.75% -- speed increase: -2.67%
ToMe ratio: 0.2 -- runtime reduction: 4.09% -- speed increase: 4.27%
ToMe ratio: 0.3 -- runtime reduction: 10.43% -- speed increase: 11.64%
ToMe ratio: 0.4 -- runtime reduction: 16.16% -- speed increase: 19.27%
ToMe ratio: 0.5 -- runtime reduction: 21.43% -- speed increase: 27.28%
ToMe ratio: 0.6 -- runtime reduction: 23.97% -- speed increase: 31.53%
Batch size: 2.
ToMe ratio: 0.0 -- runtime reduction: 0.00% -- speed increase: 0.00%
ToMe ratio: 0.1 -- runtime reduction: -4.79% -- speed increase: -4.57%
ToMe ratio: 0.2 -- runtime reduction: 3.11% -- speed increase: 3.21%
ToMe ratio: 0.3 -- runtime reduction: 9.73% -- speed increase: 10.78%
ToMe ratio: 0.4 -- runtime reduction: 16.22% -- speed increase: 19.36%
ToMe ratio: 0.5 -- runtime reduction: 22.89% -- speed increase: 29.68%
ToMe ratio: 0.6 -- runtime reduction: 25.23% -- speed increase: 33.74%
Batch size: 4.
ToMe ratio: 0.0 -- runtime reduction: 0.00% -- speed increase: 0.00%
ToMe ratio: 0.1 -- runtime reduction: -6.14% -- speed increase: -5.79%
ToMe ratio: 0.2 -- runtime reduction: 3.58% -- speed increase: 3.71%
ToMe ratio: 0.3 -- runtime reduction: 10.18% -- speed increase: 11.34%
ToMe ratio: 0.4 -- runtime reduction: 16.55% -- speed increase: 19.83%
ToMe ratio: 0.5 -- runtime reduction: 24.01% -- speed increase: 31.60%
ToMe ratio: 0.6 -- runtime reduction: 26.53% -- speed increase: 36.10%
Batch size: 5.
ToMe ratio: 0.0 -- runtime reduction: 0.00% -- speed increase: 0.00%
ToMe ratio: 0.1 -- runtime reduction: -5.18% -- speed increase: -4.92%
ToMe ratio: 0.2 -- runtime reduction: 3.88% -- speed increase: 4.03%
ToMe ratio: 0.3 -- runtime reduction: 10.39% -- speed increase: 11.59%
ToMe ratio: 0.4 -- runtime reduction: 17.37% -- speed increase: 21.03%
ToMe ratio: 0.5 -- runtime reduction: 24.37% -- speed increase: 32.22%
ToMe ratio: 0.6 -- runtime reduction: 27.11% -- speed increase: 37.19%
Batch size: 10.
ToMe ratio: 0.0 -- OOM
ToMe ratio: 0.1 -- runtime reduction: 0.00% -- speed increase: 0.00%
ToMe ratio: 0.2 -- runtime reduction: 6.22% -- speed increase: 6.63%
ToMe ratio: 0.3 -- runtime reduction: 11.22% -- speed increase: 12.64%
ToMe ratio: 0.4 -- runtime reduction: 17.42% -- speed increase: 21.10%
ToMe ratio: 0.5 -- runtime reduction: 25.86% -- speed increase: 34.88%
ToMe ratio: 0.6 -- runtime reduction: 27.91% -- speed increase: 38.71%

Results with memory-efficient attention

For the sake of completeness, here are the results using the efficient attention (the default `scaled_dot_product_attention`-based processor):

Batch size: 1.
ToMe ratio: 0.0 -- runtime reduction: 0.00% -- speed increase: 0.00%
ToMe ratio: 0.1 -- runtime reduction: -3.11% -- speed increase: -3.02%
ToMe ratio: 0.2 -- runtime reduction: -2.54% -- speed increase: -2.48%
ToMe ratio: 0.3 -- runtime reduction: -0.66% -- speed increase: -0.65%
ToMe ratio: 0.4 -- runtime reduction: 1.40% -- speed increase: 1.42%
ToMe ratio: 0.5 -- runtime reduction: 3.15% -- speed increase: 3.26%
ToMe ratio: 0.6 -- runtime reduction: 4.69% -- speed increase: 4.92%
Batch size: 2.
ToMe ratio: 0.0 -- runtime reduction: 0.00% -- speed increase: 0.00%
ToMe ratio: 0.1 -- runtime reduction: -0.25% -- speed increase: -0.25%
ToMe ratio: 0.2 -- runtime reduction: 2.37% -- speed increase: 2.43%
ToMe ratio: 0.3 -- runtime reduction: 5.41% -- speed increase: 5.72%
ToMe ratio: 0.4 -- runtime reduction: 7.43% -- speed increase: 8.02%
ToMe ratio: 0.5 -- runtime reduction: 9.64% -- speed increase: 10.67%
ToMe ratio: 0.6 -- runtime reduction: 11.24% -- speed increase: 12.67%
Batch size: 4.
ToMe ratio: 0.0 -- runtime reduction: 0.00% -- speed increase: 0.00%
ToMe ratio: 0.1 -- runtime reduction: -0.45% -- speed increase: -0.44%
ToMe ratio: 0.2 -- runtime reduction: 2.36% -- speed increase: 2.42%
ToMe ratio: 0.3 -- runtime reduction: 4.53% -- speed increase: 4.74%
ToMe ratio: 0.4 -- runtime reduction: 7.20% -- speed increase: 7.76%
ToMe ratio: 0.5 -- runtime reduction: 9.43% -- speed increase: 10.41%
ToMe ratio: 0.6 -- runtime reduction: 10.90% -- speed increase: 12.24%
Batch size: 5.
ToMe ratio: 0.0 -- runtime reduction: 0.00% -- speed increase: 0.00%
ToMe ratio: 0.1 -- runtime reduction: -1.03% -- speed increase: -1.02%
ToMe ratio: 0.2 -- runtime reduction: 2.72% -- speed increase: 2.79%
ToMe ratio: 0.3 -- runtime reduction: 4.72% -- speed increase: 4.96%
ToMe ratio: 0.4 -- runtime reduction: 7.64% -- speed increase: 8.27%
ToMe ratio: 0.5 -- runtime reduction: 9.98% -- speed increase: 11.09%
ToMe ratio: 0.6 -- runtime reduction: 10.58% -- speed increase: 11.83%
Batch size: 10.
ToMe ratio: 0.0 -- runtime reduction: 0.00% -- speed increase: 0.00%
ToMe ratio: 0.1 -- runtime reduction: -0.73% -- speed increase: -0.72%
ToMe ratio: 0.2 -- runtime reduction: 2.53% -- speed increase: 2.60%
ToMe ratio: 0.3 -- runtime reduction: 5.22% -- speed increase: 5.51%
ToMe ratio: 0.4 -- runtime reduction: 7.68% -- speed increase: 8.32%
ToMe ratio: 0.5 -- runtime reduction: 10.23% -- speed increase: 11.39%
ToMe ratio: 0.6 -- runtime reduction: 11.83% -- speed increase: 13.42%

Discussion Points
@dbolya Is this the expected performance boost, or am I missing something here? Does anyone else have similar results?
Hi @alex-bene, thanks for the detailed write-up. The experiments in the paper were performed in the original stable diffusion repo (namely the runway-ml one). I think users have consistently found that the diffusers implementation doesn't give them the same speed-up. Perhaps diffusers does a bunch of extra things, like different memory management? Even still, 38% does seem low. I was getting at least 60% or higher with 0.5 reduction using diffusers (on a 4090), so I'm unsure why you would be getting such low results.

As for the torch SDPA: in performance it should be equivalent to xformers or flash attention, which I already have a disclaimer about in the readme / paper. For small images, that means not much extra speed-up when using ToMe on top. But for bigger images, ToMe still leads to a large speed-up there. The ToMe + xFormers figure in the paper used a 2048px image, for instance.
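For anyone benchmarking this themselves, the relevant diffusers switches look roughly like the sketch below. `set_default_attn_processor` and `enable_xformers_memory_efficient_attention` are real diffusers methods, but the wrapper function is just a hypothetical convenience, and exact behavior depends on your diffusers / torch versions:

```python
def set_attention_backend(pipe, backend):
    """Switch the attention implementation on a diffusers pipeline (a sketch)."""
    if backend == "default":
        # Plain, non-fused attention -- this is where ToMe's relative gains
        # are largest, since attention cost grows quadratically in tokens.
        pipe.unet.set_default_attn_processor()
    elif backend == "xformers":
        # Memory-efficient attention; already fast, so ToMe adds less on top,
        # mainly helping for large images (e.g. 2048px).
        pipe.enable_xformers_memory_efficient_attention()
    else:
        raise ValueError(f"unknown backend: {backend!r}")
```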
Hey @dbolya, and thanks for the quick response.
To be honest, I'm not sure. I didn't even write the diffusers implementation of ToMe; it was added well after release. Do you think you could try using the runway-ml repo that I used for the paper? I'm actually travelling to CVPR right now, so I don't have access to the machine I did the original testing on.
Hey @dbolya, hope CVPR went well! Unfortunately, I haven't found the time to test this yet. If by any chance you have this already set up (the runway-ml environment and/or a "diffusers" environment to cross-check my results), I'd much appreciate the help. |
I have the same issue as #6. I tested ToMe without xformers on a 3090, and the inference speed is the same as without ToMe. I applied ToMe to all three components. Using 512x512 images, the result is 7036s with ToMe and 6827s without ToMe when generating 2000 images. Why is that?