Support text-generation in InferenceClient #1513
Conversation
The documentation is not available anymore as the PR was closed or merged.
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## main #1513 +/- ##
==========================================
+ Coverage 77.97% 82.67% +4.69%
==========================================
Files 55 58 +3
Lines 5835 6332 +497
==========================================
+ Hits 4550 5235 +685
+ Misses 1285 1097 -188
☔ View full report in Codecov by Sentry.
The backend forces `details=True` if `decoder_input_details=True`, so maybe it should be the same here.
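If we do mirror that server-side behavior, a minimal sketch (illustrative only, not necessarily the merged implementation) could be as simple as:

```python
# Illustrative sketch: mirror the TGI backend, which forces `details=True`
# whenever `decoder_input_details=True`, so client and server stay consistent.
def _resolve_details(details: bool, decoder_input_details: bool) -> bool:
    # The helper name is hypothetical; only the coercion logic matters here.
    return details or decoder_input_details
```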
# Whether to prepend the prompt to the generated text
return_full_text: bool = False
# Stop generating tokens if a member of `stop_sequences` is generated
stop: List[str] = field(default_factory=lambda: [])
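For context, the parameter classes in this PR are built with `pydantic.dataclasses` (see the description at the end of this page); a minimal, illustrative sketch covering just these two fields (the class name is an assumption here, not the actual definition) could look like:

```python
from dataclasses import field
from typing import List

from pydantic.dataclasses import dataclass


@dataclass
class TextGenerationParameters:
    # Whether to prepend the prompt to the generated text
    return_full_text: bool = False
    # Stop generating tokens if a member of `stop_sequences` is generated
    stop: List[str] = field(default_factory=lambda: [])
```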
I noticed we are not validating the stop sequences. I believe the max is 4
Thanks for the heads up @StephenHodgson!
@OlivierDehaene could you confirm this 4-item limit? I found the same in the openapi.json specs but I'd prefer to cross-check with you.
`max_best_of` and `max_stop_sequences` are both parameters that can be modified in TGI. The defaults are 2 and 4, but they can also be turned off.
Ok so let's not validate them client-side. Thanks for confirming
Impressive conditional response type decided by the values of `details`/`stream`! Is this so that pydantic behaves correctly?
I didn't find a guide/example showcasing how you'd use it in practice. I think it'd be worth adding to the docs as a docstring example. For example, four small examples showcasing the difference between the `details` and `stream` modes, with self-explanatory results about their content. I'm sure a motivated user could go and search for the docs of `TextGenerationResponse` and `Token`, but something like this would get that info straight away in the doc page they'd be looking at:

This method has four different return possibilities, according to the values of the `details` and `stream` parameters passed to it.

With `details` and `stream` as `False`:
>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient()
>>> client.text_generation("What's up")
" with the weather?\nI'm sorry, I am an AI language model and do not have"
If `details` is `True` and `stream` is `False`:
>>> client.text_generation("What's up", details=True)
TextGenerationResponse(
generated_text=" with the weather?\nI'm sorry, I am an AI language model and do not have",
details=Details(
finish_reason=<FinishReason.Length: 'length'>,
generated_tokens=20,
seed=None,
prefill=[InputToken(id=1562, text='What', logprob=None), InputToken(id=18, text="'", logprob=-2.5390625), InputToken(id=94, text='s', logprob=-0.13061523), InputToken(id=510, text=' up', logprob=-4.2382812)],
tokens=[Token(id=335, text=' with', logprob=-1.4267578, special=False), Token(id=248, text=' the', logprob=-1.4677734, special=False), Token(id=5015, text=' weather', logprob=-3.15625, special=False), Token(id=42, text='?', logprob=-0.4638672, special=False), Token(id=193, text='\n', logprob=-0.06262207, special=False), Token(id=52, text='I', logprob=-0.22729492, special=False), Token(id=18, text="'", logprob=-0.0769043, special=False), Token(id=88, text='m', logprob=-0.0022010803, special=False), Token(id=6893, text=' sorry', logprob=-0.027435303, special=False), Token(id=23, text=',', logprob=-0.033081055, special=False), Token(id=295, text=' I', logprob=-0.7036133, special=False), Token(id=653, text=' am', logprob=-0.9658203, special=False), Token(id=267, text=' an', logprob=-0.3400879, special=False), Token(id=8317, text=' AI', logprob=-0.052856445, special=False), Token(id=3599, text=' language', logprob=-0.0051193237, special=False), Token(id=2308, text=' model', logprob=-0.0007414818, special=False), Token(id=273, text=' and', logprob=-0.019866943, special=False), Token(id=441, text=' do', logprob=-0.40722656, special=False), Token(id=416, text=' not', logprob=-0.00076293945, special=False), Token(id=413, text=' have', logprob=-0.011177063, special=False)],
best_of_sequences=None
)
)
...
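And the two streaming cases would look roughly like this (illustrative sketch; the actual tokens depend on the model):

```python
>>> # stream=True, details=False: an Iterable[str] yielding the generated text token by token
>>> for token in client.text_generation("What's up", stream=True):
...     print(token, end="")

>>> # stream=True, details=True: an iterable of TextGenerationStreamResponse objects
>>> for chunk in client.text_generation("What's up", stream=True, details=True):
...     print(chunk.token.text, chunk.token.logprob)
```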
Overall, it works very well. The tests look good. The cassettes could be offloaded to a dataset repo if you don't want to weigh down the repo, but there's no strong need to move them out as it's code only.
`Union[str, TextGenerationResponse, Iterable[str], Iterable[TextGenerationStreamResponse]]`: generated response.
Format depends on the parameters. If `details=False` (the default), the generated text is returned as a string. If
`details=False` and `stream=True`, an `Iterable[str]` is returned. If `details=True`, a [`~huggingface_hub.inference._text_generation.TextGenerationResponse`]
object is returned, containing details about the generated text. Finally, if `details=True` and `stream=True`
are passed, an iterable of [`~huggingface_hub.inference._text_generation.TextGenerationStreamResponse`] is returned.
This is a bit hard to read when converted to the docs. Would it make sense to have it be a list with the different possibilities?
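For instance, something along these lines (one possible rendering, exact wording to be decided):

```
Returns:
    `Union[str, TextGenerationResponse, Iterable[str], Iterable[TextGenerationStreamResponse]]`:
    Generated text, whose type depends on the `details` and `stream` parameters:
    - `stream=False`, `details=False` (default): the generated text as a `str`
    - `stream=True`, `details=False`: an `Iterable[str]` yielding the generated text token by token
    - `stream=False`, `details=True`: a `TextGenerationResponse` with full generation details
    - `stream=True`, `details=True`: an iterable of `TextGenerationStreamResponse` yielding a token and its details at each step
```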
Thanks @LysandreJik for the review and suggesting some improvements in the docs! ❤️ I've made the requested changes (mainly the return type + adding some examples). Check it out in the docs.
About the return type with the
Yes that could be a possibility but for now I'd prefer not to, just for the sake of keeping it simple. The main advantage of having it directly in the repo is that updating the cassette is as simple as running pytest with
Original implementation taken from the `text-generation-inference` Python client (see client library and repo) from @OlivierDehaene. The vast majority of the code comes from there, so kudos goes to him 🙏.

Changes compared to the original implementation:
- `pydantic.dataclasses` instead of `BaseModel` (no `BaseModel`, but `dataclasses` yes)
- integrated in `huggingface_hub.InferenceClient`
- `stream: bool` and `details: bool` arguments in the `text_generation` method instead of having different methods for each use case

If the model is not served with a TGI backend (example: "gpt2"-like models), some parameters are ignored. The client always considers that TGI is enabled but defaults back to a normal call if that's not the case. A warning is triggered for the user, and `details=True` is not possible.

Integration is now functional and locally tested.
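For illustration, the non-TGI fallback described above looks roughly like this from the caller's side (the model name, warning, and error behavior here are assumptions for the example, not guarantees):

```python
from huggingface_hub import InferenceClient

client = InferenceClient()

# "gpt2" is not served with TGI: TGI-only parameters such as `watermark`
# are dropped with a warning, and the call falls back to a regular request.
text = client.text_generation("The sky is", model="gpt2", watermark=True)

# `details=True` cannot be emulated on a non-TGI endpoint, so this is expected to fail.
try:
    client.text_generation("The sky is", model="gpt2", details=True)
except ValueError as err:
    print(err)
```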
Docs: `text_generation` and dataclasses descriptions.
TODO:
- Normal use (with details)
- Streaming (no details)