Add wait-for-model header when sending request to Inference API #2318
Conversation
Oh, by the way, regarding the "stuck" feeling you were mentioning: what about doing one query without the wait, and only adding the wait after the first retry? Two queries, but at least with a good error message.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Nice idea! Implemented it in 8ea3f1d
Sounds good!
Thanks for the review!
Should fix #2175.
In the current implementation, InferenceClient sends a request every 1s as long as the model is unavailable (HTTP 503). This can lead users to be rate limited even though they don't actually consume the API (reported here). This PR adds "X-wait-for-model": "1" as a header, which tells the server to wait for the model to be loaded before returning a response. This way the client doesn't make calls every X seconds for nothing. The X-wait-for-model header is added only when requesting the serverless Inference API.

EDIT: based on @Narsil's comment, the header is added to the request only on the second call. This way, users don't hit the rate limit, but we are still able to log a message telling them the model is not loaded yet.
cc @Narsil (from private slack thread)
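For context, here is a rough sketch of the retry flow described above. This is not the actual InferenceClient code; the `query` helper, URL template, and retry count are hypothetical, but it shows the intended behavior: the first request goes out without the header, and `X-wait-for-model` is only added once the server answers 503.

```python
import time
import requests

# Hypothetical serverless Inference API endpoint template (illustration only).
INFERENCE_API_URL = "https://api-inference.huggingface.co/models/{model_id}"


def query(model_id: str, payload: dict, token: str, max_retries: int = 10) -> dict:
    """Illustrative helper: retry on 503, adding X-wait-for-model after the first failure."""
    url = INFERENCE_API_URL.format(model_id=model_id)
    headers = {"Authorization": f"Bearer {token}"}

    for _ in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code != 503:
            response.raise_for_status()
            return response.json()

        # First 503: log a message so the user knows what is happening, then
        # retry with the wait-for-model header so the server holds the request
        # open until the model is loaded, instead of the client polling every
        # second and burning through its rate limit.
        print(f"Model {model_id} is not loaded yet, waiting for it to load...")
        headers["X-wait-for-model"] = "1"
        time.sleep(1)

    raise TimeoutError(f"Model {model_id} did not load in time.")
```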