
Implement API for Inference Endpoints #1779

Merged: 23 commits into main, Oct 30, 2023

Conversation

@Wauplin (Contributor) commented Oct 25, 2023

Implement #1541 (+fix #1605).

Ping @philschmid @jeffboudier. Feedback is very welcome if you see anything that can be improved product-wise :)

EDIT / TL;DR: here is the guide written in this PR.


This PR adds support for Inference Endpoints, following the Swagger API docs.

I intentionally did not implement the metrics and logs endpoints yet, as I don't think they should be a priority.

Listing and getting inference endpoint information is quite straightforward, given the user namespace or endpoint name. The same goes for the resume/pause/scale_to_zero features, which AFAIK are the most useful ones in scripts. Creating and updating endpoints is more difficult, as the user needs to know exactly which configuration they want to use.
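
For illustration, here is a minimal sketch of the listing and management helpers (using the functions added in this PR; the endpoint name is just an example):

>>> from huggingface_hub import list_inference_endpoints, get_inference_endpoint

# List all endpoints in the authenticated user's namespace
>>> list_inference_endpoints()
[InferenceEndpoint(name='my-endpoint-name6', ...), ...]

# Fetch a single endpoint by name, then pause it
>>> endpoint = get_inference_endpoint("my-endpoint-name6")
>>> endpoint.pause()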

The main object returned by most methods is an InferenceEndpoint dataclass with useful information such as name, status, url, model, framework, task, created_at/updated_at, etc. It also has two properties, .client and .async_client, to run inference.
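
As a quick sketch of the async flow (assuming .async_client mirrors .client with coroutine methods):

>>> import asyncio
>>> from huggingface_hub import get_inference_endpoint

>>> endpoint = get_inference_endpoint("my-endpoint-name6")
# .async_client is an AsyncInferenceClient; its methods are coroutines
>>> asyncio.run(endpoint.async_client.text_generation("I am"))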

Regarding testing the API, I am not sure we can or want to do that in the CI. As with the Spaces API, it's quite hard (and costly) to run end-to-end tests against production, and for little benefit IMO (as we can assume the v2 endpoints will not be updated). I ran some tests locally to make sure everything runs as expected, though.


Example:

>>> from huggingface_hub import create_inference_endpoint

# Create endpoint
>>> endpoint = create_inference_endpoint(
...     "my-endpoint-name6",
...     repository="gpt2",
...     framework="pytorch",
...     task="text-generation",
...     accelerator="cpu",
...     vendor="aws",
...     region="us-east-1",
...     type="protected",
...     instance_size="medium",
...     instance_type="c6i"
... )

# pending creation => no url => cannot get client
>>> endpoint
InferenceEndpoint(name='my-endpoint-name6', namespace='Wauplin', repository='gpt2', status='pending', url=None)
>>> endpoint.client 
*** huggingface_hub._inference_endpoints.InferenceEndpointException: Cannot create a client for this endpoint as it is not yet deployed. Please wait for the endpoint to be deployed and try again.

# ... wait until it's initialized
>>> endpoint.wait()
InferenceEndpoint(name='my-endpoint-name6', namespace='Wauplin', repository='gpt2', status='running', url='https://kqehm5t0lfe628b2.us-east-1.aws.endpoints.huggingface.cloud')

# endpoint running => url => client available
>>> endpoint.client 
<InferenceClient(model='https://kqehm5t0lfe628b2.us-east-1.aws.endpoints.huggingface.cloud', timeout=None)>
>>> endpoint.client.text_generation("I am")
' not a fan of the idea of a "big-budget" movie. I think it\'s a'

# pause endpoint => no more url
>>> endpoint.pause()
InferenceEndpoint(name='my-endpoint-name6', namespace='Wauplin', repository='gpt2', status='paused', url=None)
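
To round off the lifecycle, a sketch of resuming and cleaning up (using the other methods added in this PR):

# resume the paused endpoint, or delete it entirely once no longer needed
>>> endpoint.resume()
>>> endpoint.wait()   # block until it is deployed again
>>> endpoint.delete()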


@HuggingFaceDocBuilderDev commented Oct 25, 2023

The documentation is not available anymore as the PR was closed or merged.

@stevhliu (Member) left a comment:

Super nice, the huggingface_hub library is becoming so versatile! 🚀

Review comments (now resolved) on:
- docs/source/en/package_reference/inference_endpoints.md
- src/huggingface_hub/_inference_endpoints.py
- src/huggingface_hub/hf_api.py
@julien-c (Member) left a comment:

I should know this for sure but what's the ≠ between paused & scaled to zero, again?

Another question: do you have an example of how to chain .wait and an actual inference call?

would you do something like this?

create_inference_endpoint(
    "my-endpoint-name6",
    repository="gpt2",
).wait().client.text_generation("I am").scale_to_zero()

@Wauplin (Author) commented Oct 26, 2023

I should know this for sure but what's the ≠ between paused & scaled to zero, again?

I'll let @philschmid confirm, as I'm not sure myself what the difference is.
What I understood is that a paused endpoint must be resumed manually, while a scaled-to-zero endpoint is automatically restarted on the next call (with a cold start).
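
In code, the difference would look something like this (a sketch based on my understanding above, not confirmed behavior):

# paused: stays down until explicitly resumed
>>> endpoint.pause()
>>> endpoint.resume()   # required before it serves requests again

# scaled to zero: wakes up by itself on the next request (after a cold start)
>>> endpoint.scale_to_zero()
>>> endpoint.client.text_generation("I am")   # may have to wait through the cold start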

would you do something like this?

Not exactly, no, since InferenceEndpoint and InferenceClient are different objects.
What you can do:

endpoint = create_inference_endpoint("my-endpoint-name6", repository="gpt2",...).wait()
endpoint.client.text_generation("I am")
endpoint.scale_to_zero()

@Wauplin (Author) commented Oct 26, 2023

@philschmid @stevhliu I have added a guide about how to manage Inference Endpoints from huggingface_hub. Would you mind having a look at it? 🙏

@stevhliu (Member) left a comment:

Nice guide, especially like the end-to-end example at the end that puts everything together in context :)

Review comments (now resolved) on docs/source/en/guides/inference_endpoints.md
@McPatate (Member) left a comment:

A few nits, but LGTM overall!

Nice job!

Review comments on docs/source/en/guides/inference_endpoints.md (some resolved)
@@ -0,0 +1,48 @@
# Inference Endpoints

Inference Endpoints offers a secure production solution to easily deploy any `transformers`, `sentence-transformers`, and `diffusers` models from the Hub on a dedicated and autoscaling infrastructure managed by Hugging Face.
A Member commented:

To deploy any kind of model, I don't think we're limited to the libraries you mention here.
cc @philschmid

@Wauplin (Author) replied:

I took that part from the official documentation there: https://huggingface.co/docs/inference-endpoints/main/en/index

A Member replied:

Those models are what we support by default, without a custom handler or container. But it's not limiting.

Another Member commented:

Maybe the wording could be improved; when I read this at first, it wasn't clear to me that you can run any model.

@Wauplin (Author) replied:

I've changed the wording in 1688851 to be more generic:

Inference Endpoints provides a secure production solution to easily deploy models on a dedicated and autoscaling infrastructure managed by Hugging Face. An Inference Endpoint is built from a model from the Hub. This page is a reference for huggingface_hub's integration with Inference Endpoints. For more information about the Inference Endpoints product, check out its official documentation.

I'm keeping it vague on purpose. If the user wants more details about supported models, the official Inference Endpoints documentation is the appropriate place to list them. Those docs are referenced twice from huggingface_hub's documentation (in this PR), so it should be fine.

Review comments (now resolved) on docs/source/en/package_reference/inference_endpoints.md
Wauplin and others added 2 commits October 27, 2023 11:49
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
@Wauplin (Author) commented Oct 27, 2023

Thanks @stevhliu and @McPatate for your feedback on the documentation, it helped a lot! I've made the requested changes :)

@LysandreJik (Member) left a comment:

Looks awesome! Only left a few nits. The API is intuitive.

Review comments on:
- docs/source/en/guides/inference_endpoints.md
- docs/source/en/package_reference/inference_endpoints.md
Co-authored-by: Lysandre Debut <hi@lysand.re>
@Wauplin (Author) commented Oct 30, 2023

Thanks @LysandreJik for the review! I think we are good to merge when CI is green then :)
Thanks everyone here for the feedback on this PR! 🤗

@Wauplin (Author) commented Oct 30, 2023

Failing tests are unrelated. Merging this!

@Wauplin merged commit 91d38dd into main on Oct 30, 2023 (12 of 16 checks passed)
@Wauplin deleted the 1541-inference-endpoints-api branch on October 30, 2023 at 09:21
Successfully merging this pull request may close these issues:
- Wrong error to handle Paused or Scaled to Zero endpoints.