
Guidance's gen function is unreasonably slow compared to oobabooga/text-generation-webui #727

Closed
MikoAL opened this issue Mar 31, 2024 · 2 comments

Comments

MikoAL commented Mar 31, 2024

The bug
Guidance's gen function is unreasonably slow compared to oobabooga/text-generation-webui. I'm wondering whether a bug is causing the issue.

guidance_is_slow.mp4

To Reproduce
Run this script:

from guidance import models, gen, select
import guidance
import logging
print(guidance.__version__)

# Load the AWQ-quantized model through the Transformers backend on CUDA
lm = models.Transformers('Ichigo2899/MixTAO-7Bx2-MoE-v8.1-AWQ', device_map="cuda", echo=True)
prompt = f"""\
You are Sarah in this roleplay between Sarah and Me. We're classmates in college and occasionally chat together.

Sarah has a secret no one is aware of: she has a mental illness that resulted in her having two split personalities. 

The first is a gentle personality that always smiles, is polite and supportive, loves making the peace sign with her fingers, often winks with her left eye, greets people with 'Aloha', loves oranges, and hates apples.

The second is a hostile personality that always frowns, is rude and aggressive, loves giving people the middle finger, often winks with her right eye, greets people with 'Meh', loves apples, and hates oranges. 

Sarah wakes up with one personality and keeps it until the next day. So once I meet her, she will keep the same personality throughout the chat.

<START OF ROLEPLAY>

Me: (I arrive early in class and spot Sarah eating an apple.) Hi, Sarah. How's it going?

Sarah:"""

lm + prompt + gen(temperature=0.8)  # unconstrained generation; this call is where the slowdown appears

System info (please complete the following information):

Harsha-Nori (Collaborator) commented

Hi @MikoAL,

Do you know what backend oobabooga/text-generation-webui is using for the model? I wonder if this is just a difference between LlamaCpp with a CUDA backend and Transformers; LlamaCpp tends to be much faster. For the prompt you have, guidance shouldn't be adding any overhead, so I think the speeds you are getting are just the natural rate from Transformers.

We support LlamaCpp in guidance, so you could consider switching to that backend for your model to get faster generations.
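If the text-generation-webui setup was using a GPU-accelerated llama.cpp loader, the rough equivalent in guidance is the LlamaCpp backend. A minimal sketch, assuming a GGUF copy of the model is available locally and llama-cpp-python is installed with CUDA support; the file path, n_gpu_layers, and n_ctx values below are placeholders, and extra keyword arguments are forwarded to llama-cpp-python:

from guidance import models, gen

# Hypothetical path to a GGUF conversion/quantization of the model
lm = models.LlamaCpp(
    "path/to/model.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU (assumes a CUDA build of llama-cpp-python)
    n_ctx=4096,       # context window; adjust to the prompt length you need
    echo=True,
)

prompt = "..."  # same roleplay prompt as in the reproduction script above
lm + prompt + gen(temperature=0.8)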

MikoAL (Author) commented Apr 2, 2024

I was under the impression that the webui was using stock Transformers. Now that you mention it, that doesn't make much sense given that we can choose loaders. I'll look into the issue further; thank you for responding!

MikoAL closed this as completed Apr 4, 2024