Implement API for Inference Endpoints #1779
Conversation
The documentation is not available anymore as the PR was closed or merged.
Super nice, the `huggingface_hub` library is becoming so versatile! 🚀
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
I should know this for sure, but what's the difference between paused & scaled to zero, again?
Another question: do you have an example of how to chain `.wait()` and an actual inference call?
Would you do something like this?
```python
create_inference_endpoint(
    "my-endpoint-name6",
    repository="gpt2",
).wait().client.text_generation("I am").scale_to_zero()
```
I'll let @philschmid confirm, I'm not sure myself what the difference is.
Not exactly, no. It would be:

```python
endpoint = create_inference_endpoint("my-endpoint-name6", repository="gpt2", ...).wait()
endpoint.client.text_generation("I am")
endpoint.scale_to_zero()
```
@philschmid @stevhliu I have added a guide about how to manage Inference Endpoints from `huggingface_hub`.
Nice guide, especially like the end-to-end example at the end that puts everything together in context :)
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Few nits, but lgtm overall!
Nice job!
@@ -0,0 +1,48 @@
# Inference Endpoints

Inference Endpoints offers a secure production solution to easily deploy any `transformers`, `sentence-transformers`, and `diffusers` models from the Hub on a dedicated and autoscaling infrastructure managed by Hugging Face.
You can deploy any kind of model; I don't think we're limited to the libraries you mention here.
cc @philschmid
I took that part from the official documentation there: https://huggingface.co/docs/inference-endpoints/main/en/index
Those models are what we support by default, without a custom handler or container. But it's not limited to them.
Maybe the wording could be improved; when I first read this, it wasn't clear to me that you can run any model.
I've changed the wording in 1688851 to be more generic:
> Inference Endpoints provides a secure production solution to easily deploy models on a dedicated and autoscaling infrastructure managed by Hugging Face. An Inference Endpoint is built from a model from the Hub. This page is a reference for `huggingface_hub`'s integration with Inference Endpoints. For more information about the Inference Endpoints product, check out its official documentation.

Keeping it vague on purpose. If the user wants more details about supported models, the official Inference Endpoints documentation should be the appropriate location to list them. These docs are referenced twice from the `huggingface_hub` docs (from this PR) so it should be fine.
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
Looks awesome! Only left a few nits. The API is intuitive.
Co-authored-by: Lysandre Debut <hi@lysand.re>
Thanks @LysandreJik for the review! I think we are good to merge when CI is green then :)
Failing tests are unrelated. Merging this!
Implement #1541 (+fix #1605).
Ping @philschmid @jeffboudier. Feedback is very welcome if you see anything that can be improved product-wise :)
EDIT / TL;DR: here is the guide written in this PR.
This PR adds support for Inference Endpoints, following the Swagger API docs.
HfApi methods:

- `list_inference_endpoints()`
- `get_inference_endpoint()`
- `create_inference_endpoint()`
- `update_inference_endpoint()`
- `delete_inference_endpoint()`
- `pause_inference_endpoint()`
- `resume_inference_endpoint()`
- `scale_to_zero_inference_endpoint()`
InferenceEndpoint object + methods:
- `client` / `async_client` => return an `InferenceClient` object
- `wait()` => wait until fully deployed
- `update()` => alias for `update_inference_endpoint`
- `resume()` => alias for `resume_inference_endpoint`
- `pause()` => alias for `pause_inference_endpoint`
- `scale_to_zero()` => alias for `scale_to_zero_inference_endpoint`
- `delete()` => alias for `delete_inference_endpoint`
I intentionally did not implement metrics and logs endpoints yet as I don't think that should be a priority.
Listing/getting Inference Endpoints information is quite straightforward, given the user namespace or the endpoint name. Same goes for the resume/pause/scale_to_zero features, which AFAIK are the most useful ones in scripts. Creating and updating endpoints is more difficult, as the user needs to know exactly which configuration they want to use.
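As a rough, non-authoritative sketch of that scripted usage (the namespace and endpoint names below are placeholders, and it assumes `list_inference_endpoints` accepts a `namespace` argument as in the current library), listing, fetching, and pausing/resuming endpoints could look like this:

```python
from huggingface_hub import get_inference_endpoint, list_inference_endpoints

# List the endpoints of a namespace (placeholder namespace name)
for endpoint in list_inference_endpoints(namespace="my-org"):
    print(endpoint.name, endpoint.status)

# Fetch a single endpoint by name and pause it to stop paying for it
endpoint = get_inference_endpoint("my-endpoint-name")
endpoint.pause()

# Later on, resume it and wait until it is ready to serve requests again
endpoint.resume()
endpoint.wait()
```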
The main object returned by most methods is an `InferenceEndpoint` dataclass with useful information like name, status, url, model, framework, task, created_at/updated_at,... It also has 2 properties, `.client` and `.async_client`, to run inference.

Regarding testing the API, I am not sure we can or want to do that in the CI. Like for the Spaces API, it's quite hard (and costly) to do end-to-end tests on production, and for little benefit IMO (as we can assume v2 endpoints will not be updated). I ran some tests locally to make sure everything runs as expected though.
Example:
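The snippet below is a hedged sketch rather than the exact example from the PR: it simply chains the methods listed above (create, wait, run inference via `.client`, scale to zero, delete), reusing the toy endpoint name and `gpt2` repository from the conversation, and it intentionally leaves out the other deployment arguments that `create_inference_endpoint` would need.

```python
from huggingface_hub import create_inference_endpoint

# Deploy a new endpoint and block until it is fully deployed
endpoint = create_inference_endpoint(
    "my-endpoint-name6",
    repository="gpt2",
    # other deployment arguments (instance type, region, ...) intentionally omitted
).wait()

# Run inference through the InferenceClient exposed by the `client` property
print(endpoint.client.text_generation("I am"))

# The `async_client` property exposes the same API asynchronously, e.g.
# `await endpoint.async_client.text_generation("I am")` inside an async function.

# Scale down to zero when idle, then delete the endpoint entirely
endpoint.scale_to_zero()
endpoint.delete()
```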
Useful resources: