
Taking about 40 minutes to generate one sentence, is this speed normal? #186

Open
kingdoom1 opened this issue Sep 26, 2024 · 3 comments

Comments

@kingdoom1

I have set both the input max length and the output max length to 128. Output is very slow: it takes about 40 minutes to generate one sentence. I am using the Qwen2.5 7B model. Is this speed normal? My GPU is an NVIDIA 3090 with 12GB of VRAM, and it's using around 5GB.
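
For reference, the usual AirLLM call shape with both limits at 128 looks roughly like the sketch below. This is a minimal sketch only; the model ID and prompt are placeholders, not the exact script from this issue.

```python
# Minimal AirLLM sketch with input and output both capped at 128 tokens.
# Model ID and prompt are placeholders, not the exact script from this issue.
from airllm import AutoModel

MAX_LENGTH = 128  # input cap

model = AutoModel.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # assumed model ID

input_tokens = model.tokenizer(
    ["Why is the sky blue?"],
    return_tensors="pt",
    truncation=True,
    max_length=MAX_LENGTH,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=MAX_LENGTH,  # output cap
    use_cache=True,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))
```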

@parsa-pico

Same here with an RTX A4000 running llama3:8b.

@ggaaooppeenngg

I guess each split group of layers is around 4GB. Is there a way to load more groups at a time for GPUs that have more VRAM?

@AsocPro

AsocPro commented Dec 17, 2024

In my experience, disk speed is the primary bottleneck, at least in my config (RTX 3050 6GB). If the model is stored on my spinning HDD, I'm sitting at just over 2 minutes per token with Qwen2.5-Coder-32B-Instruct and 4-bit compression. But if I create a ramdisk and store the model there (I have 32 gigs of system memory, so the 17.566 gigs the compressed model takes up easily fits), using the layer_shards_saving_path parameter of AutoModel.from_pretrained to point at it, I'm down to about 13 seconds per token, which makes everything MUCH more usable.
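
For anyone trying the same thing, the relevant bit is just pointing layer_shards_saving_path at the faster storage. Rough sketch below; the ramdisk mount point is an assumption (create it however you like, e.g. a tmpfs mount with enough free space):

```python
# Sketch: store the per-layer shards on a ramdisk instead of the HDD.
# /mnt/ramdisk is an assumed tmpfs mount with roughly 18 GB of free space.
from airllm import AutoModel

model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    compression="4bit",                              # 4-bit compressed shards
    layer_shards_saving_path="/mnt/ramdisk/airllm",  # shards land here instead of the default cache
)
```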

If the whole model fits in your GPU, as should be the case for @kingdoom1 and @parsa-pico with 7B and 8B models on a 3090 and an A4000, it would be best to just use something else to run them. At least in my experience, AirLLM is really only good for running models larger than your VRAM; it isn't well suited to smaller models, since I assume there is some overhead in the VRAM-management machinery even when the whole model fits in VRAM at once.

@ggaaooppeenngg I think this is what the prefetching parameter is for, but I really don't know, because with it set to true or false I'm not seeing much of a difference in either speed or GPU usage. The README says it only gives about a 10% benefit. I don't have any hard evidence, but I assume loading the model from disk is still the biggest bottleneck.
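
If anyone wants to check the prefetching effect on their own hardware, a rough A/B timing harness could look like the sketch below. This is a hypothetical helper, not part of AirLLM; the model ID and ramdisk path are assumptions, and prefetching is the boolean parameter discussed above.

```python
# Rough A/B harness: measure seconds per token with prefetching on vs. off.
# Hypothetical helper, not from the repo; model ID and ramdisk path are assumptions.
import time
from airllm import AutoModel

def seconds_per_token(prefetching: bool, n_new_tokens: int = 8) -> float:
    model = AutoModel.from_pretrained(
        "Qwen/Qwen2.5-Coder-32B-Instruct",
        compression="4bit",
        layer_shards_saving_path="/mnt/ramdisk/airllm",
        prefetching=prefetching,
    )
    tokens = model.tokenizer(["Hello"], return_tensors="pt")
    start = time.time()
    model.generate(tokens["input_ids"].cuda(), max_new_tokens=n_new_tokens, use_cache=True)
    return (time.time() - start) / n_new_tokens

for flag in (True, False):
    print(f"prefetching={flag}: {seconds_per_token(flag):.1f} s/token")
```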
