
Taking about 40 minutes to generate one sentence, is this speed normal? #186

Open
kingdoom1 opened this issue Sep 26, 2024 · 3 comments

Comments

@kingdoom1

I have set both the input max length and the output max length to 128. Output is very slow: it takes about 40 minutes to generate one sentence. I am using the Qwen2.5 7B model. Is this speed normal? My GPU is an NVIDIA 3090 with 12GB of VRAM, and it's using around 5GB.
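
For reference, the usual AirLLM call shape with both limits at 128 looks roughly like the sketch below. This is a minimal sketch only; the model ID and prompt are placeholders, not the exact script from this issue.

```python
# Minimal AirLLM sketch with input and output both capped at 128 tokens.
# Model ID and prompt are placeholders, not the exact script from this issue.
from airllm import AutoModel

MAX_LENGTH = 128  # input cap

model = AutoModel.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # assumed model ID

input_tokens = model.tokenizer(
    ["Why is the sky blue?"],
    return_tensors="pt",
    truncation=True,
    max_length=MAX_LENGTH,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=MAX_LENGTH,  # output cap
    use_cache=True,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))
```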

@parsa-pico

Same here with an RTX A4000 running llama3:8b.

@ggaaooppeenngg

I guess each split group of layers is around 4GB. Is there a way to load more groups at a time for GPUs that have more VRAM?

@AsocPro

AsocPro commented Dec 17, 2024

In my experience, disk speed is the primary bottleneck, at least in my config (RTX 3050 6GB). If the model is stored on my spinning HDD, I'm sitting at just over 2 minutes per token with Qwen2.5-Coder-32B-Instruct and 4-bit compression. But if I create a ramdisk and store the model there (I have 32 gigs of system memory, so the 17.566 gigs the compressed model takes up easily fits), using the layer_shards_saving_path parameter of AutoModel.from_pretrained to point at it, I'm down to about 13 seconds per token, which makes everything MUCH more usable.
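
For anyone trying the same thing, the relevant bit is just pointing layer_shards_saving_path at the faster storage. Rough sketch below; the ramdisk mount point is an assumption (create it however you like, e.g. a tmpfs mount with enough free space):

```python
# Sketch: store the per-layer shards on a ramdisk instead of the HDD.
# /mnt/ramdisk is an assumed tmpfs mount with roughly 18 GB of free space.
from airllm import AutoModel

model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    compression="4bit",                              # 4-bit compressed shards
    layer_shards_saving_path="/mnt/ramdisk/airllm",  # shards land here instead of the default cache
)
```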

If the whole model fits in your GPU, as should be the case for @kingdoom1 and @parsa-pico with 7B and 8B models on a 3090 and an A4000, it would be best to just use something else to run them. At least in my experience, AirLLM is really only good for running models larger than your VRAM; it isn't well suited to smaller models, since I assume there is some overhead in the VRAM-management machinery even when the whole model fits in VRAM at once.

@ggaaooppeenngg I think this is what the prefetching parameter is for, but I really don't know, because with it set to true or false I'm not seeing much of a difference in either speed or GPU usage. The README says it only gives about a 10% benefit. I don't have any hard evidence, but I assume loading the model from disk is still the biggest bottleneck.
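
If anyone wants to check the prefetching effect on their own hardware, a rough A/B timing harness could look like the sketch below. This is a hypothetical helper, not part of AirLLM; the model ID and ramdisk path are assumptions, and prefetching is the boolean parameter discussed above.

```python
# Rough A/B harness: measure seconds per token with prefetching on vs. off.
# Hypothetical helper, not from the repo; model ID and ramdisk path are assumptions.
import time
from airllm import AutoModel

def seconds_per_token(prefetching: bool, n_new_tokens: int = 8) -> float:
    model = AutoModel.from_pretrained(
        "Qwen/Qwen2.5-Coder-32B-Instruct",
        compression="4bit",
        layer_shards_saving_path="/mnt/ramdisk/airllm",
        prefetching=prefetching,
    )
    tokens = model.tokenizer(["Hello"], return_tensors="pt")
    start = time.time()
    model.generate(tokens["input_ids"].cuda(), max_new_tokens=n_new_tokens, use_cache=True)
    return (time.time() - start) / n_new_tokens

for flag in (True, False):
    print(f"prefetching={flag}: {seconds_per_token(flag):.1f} s/token")
```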
