[hot-fix] Handle [DONE] signal from TGI + remove logic for "non-TGI servers" #2410

Merged
merged 8 commits into main from hot-fix-inference-client-chat-completion on Jul 23, 2024

Conversation

Wauplin
Contributor

@Wauplin Wauplin commented Jul 23, 2024

What's in this PR?

  1. Related to fix: append DONE message to chat stream text-generation-inference#2221. TGI now returns a b"data: [DONE]" stop signal when iterating through generated tokens (stream=True). This PR adds support for that stop signal (see the sketch below this description). Once this PR is merged, I'll make a hot-fix release, since InferenceClient is currently broken on the newest versions of TGI for text_generation and chat_completion.

  2. I also removed all the "non-TGI" logic in chat_completion, since every model is now served by TGI, even when it's a transformers-only model (e.g. "microsoft/DialoGPT-small"). This simplifies the logic a lot and avoids hiding relevant errors from users. If the model is transformers-served and not compatible with chat completion, a 422 Unprocessable Entity error ("Template error: template not found") is returned.

  3. I also took the opportunity to rename some private helpers for consistency.

It's better to hide whitespace changes when reviewing this PR.
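
For context, handling the new stop signal boils down to checking for the [DONE] payload while iterating over the raw SSE byte stream. The snippet below is only a minimal sketch of that idea; the helper name and the exact shape of the stream are assumptions, not the actual code added to _common.py:

```python
import json
from typing import Iterable, Iterator

def _iter_sse_chunks(byte_lines: Iterable[bytes]) -> Iterator[dict]:
    """Hypothetical helper: decode a TGI token stream (stream=True) line by line."""
    for byte_line in byte_lines:
        line = byte_line.strip()
        if not line or not line.startswith(b"data:"):
            continue  # skip empty keep-alive lines
        payload = line[len(b"data:"):].strip()
        if payload == b"[DONE]":
            return  # stop signal emitted by recent TGI versions (TGI #2221)
        yield json.loads(payload)
```

With something like this, the client stops cleanly when the server sends b"data: [DONE]" instead of failing to JSON-decode it.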

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -860,10 +860,10 @@ def chat_completion(
                 stream=stream,
             )
         except HTTPError as e:
-            if e.response.status_code in (400, 404, 500):
+            if e.response.status_code in (400, 500):
Contributor

There should really be only one error code here.
Can you check on a real deployment, or in the TGI source code, what happens for models with no existing template?

I think it should be a 4xx since it's not really a server error. No idea which one is more appropriate.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#client_error_responses

Contributor Author


Can you check on a real deployment, or in the TGI source code, what happens for models with no existing template?

It returns a

huggingface_hub.utils._errors.HfHubHTTPError: 422 Client Error: Unprocessable Entity for url: https://api-inference.huggingface.co/models/gpt2/v1/chat/completions (Request ID: 3Y0mKxT7AmdMSZsfNjcQA)

Template error: template not found

But in fact this is not even the problem. Since client-side template rendering was completely removed in #2258, there is no point treating TGI-served and other models differently. I've updated the PR to simplify the logic, which finally avoids hiding errors from the user.
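
In practice, with the simplified logic a user calling chat_completion on a transformers-backed model without a chat template now sees the server error directly. A rough illustration (the model name and error text are just the example from this thread, not a guaranteed behavior of every deployment):

```python
from huggingface_hub import InferenceClient
from huggingface_hub.utils import HfHubHTTPError

client = InferenceClient()
try:
    client.chat_completion(
        messages=[{"role": "user", "content": "Hello!"}],
        model="gpt2",  # transformers-backed model with no chat template
    )
except HfHubHTTPError as err:
    # The 422 "Template error: template not found" is surfaced as-is
    # instead of being hidden behind client-side fallback logic.
    print(err)
```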

Contributor

@Narsil Narsil left a comment

Some nits.

Review threads (resolved):
src/huggingface_hub/inference/_common.py (outdated)
src/huggingface_hub/inference/_common.py
tests/test_inference_client.py (outdated)
src/huggingface_hub/inference/_common.py (outdated)
Member

@LysandreJik LysandreJik left a comment

Thanks for the PR and the swift fix, @Wauplin

@Wauplin Wauplin changed the title from "[hot-fix] Handle [DONE] signal from TGI" to "[hot-fix] Handle [DONE] signal from TGI + remove logic for "non-TGI servers"" on Jul 23, 2024
@Wauplin Wauplin requested review from LysandreJik and Narsil July 23, 2024 14:23
@Wauplin
Contributor Author

Wauplin commented Jul 23, 2024

Sorry for the mess; I realized I should have opened two separate PRs for these fixes. @Narsil's comment #2410 (comment) made me realize that we don't even need two different behaviors in chat_completion anymore (depending on whether the model is TGI-served or not), since all model templates are rendered server-side anyway. This makes the implementation much cleaner.

Member

@LysandreJik LysandreJik left a comment

I didn't dive into the changes given the time sensitivity, but from a quick look it looks good to me.

@Wauplin Wauplin merged commit 91fe78e into main Jul 23, 2024
17 checks passed
@Wauplin Wauplin deleted the hot-fix-inference-client-chat-completion branch July 23, 2024 14:37
Wauplin added a commit that referenced this pull request Jul 23, 2024
…ervers" (#2410)

* Handle [DONE] signal from TGI

* fix text_generation as well

* Handle error 404 correctly

* consistency + stop treating transformers-backed models differently

* fix test

* fix broken test on main

* fix test

* cleaner