v0.22.0: Chat completion, inference types and hub mixins!
Discuss this release in our Community Tab. Feedback is welcome! 🤗
✨ InferenceClient
Support for inference tools continues to improve in huggingface_hub. On the menu in this release? A new chat_completion API and fully typed inputs/outputs!
Chat-completion API!
A long-awaited API has just landed in huggingface_hub
! InferenceClient.chat_completion
follows most of OpenAI's API, making it much easier to integrate with existing tools.
Technically speaking, it uses the same backend as the text-generation task but requires a preprocessing step to format the list of messages into a single text prompt. The chat template is rendered server-side when models are powered by TGI, which is the case for most LLMs: Llama, Zephyr, Mistral, Gemma, etc. Otherwise, the templating happens client-side, which requires the minijinja package to be installed. We are actively working on bridging this gap, aiming to render all templates server-side in the future.
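To give an intuition for what client-side templating does, here is a rough, self-contained sketch that flattens a message list into a single prompt in a Zephyr-like style. This is only an illustration: the real client renders the model's own Jinja chat template (via minijinja), not this hard-coded format.

```python
# Illustrative sketch: flatten a chat message list into one prompt string,
# roughly in the Zephyr style. NOT the actual template logic used by
# huggingface_hub / minijinja -- each model ships its own chat template.
def render_zephyr_style(messages, add_generation_prompt=True):
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}</s>")
    if add_generation_prompt:
        parts.append("<|assistant|>")  # cue the model to answer
    return "\n".join(parts)

messages = [{"role": "user", "content": "What is the capital of France?"}]
print(render_zephyr_style(messages))
# <|user|>
# What is the capital of France?</s>
# <|assistant|>
```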
>>> from huggingface_hub import InferenceClient
>>> messages = [{"role": "user", "content": "What is the capital of France?"}]
>>> client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")
# Chat completion (non-streaming)
>>> client.chat_completion(messages, max_tokens=100)
ChatCompletionOutput(
choices=[
ChatCompletionOutputChoice(
finish_reason='eos_token',
index=0,
message=ChatCompletionOutputChoiceMessage(
content='The capital of France is Paris. The official name of the city is "Ville de Paris" (City of Paris) and the name of the country\'s governing body, which is located in Paris, is "La République française" (The French Republic). \nI hope that helps! Let me know if you need any further information.'
)
)
],
created=1710498360
)
# Stream new tokens one by one
>>> for token in client.chat_completion(messages, max_tokens=10, stream=True):
... print(token)
ChatCompletionStreamOutput(choices=[ChatCompletionStreamOutputChoice(delta=ChatCompletionStreamOutputDelta(content='The', role='assistant'), index=0, finish_reason=None)], created=1710498504)
ChatCompletionStreamOutput(choices=[ChatCompletionStreamOutputChoice(delta=ChatCompletionStreamOutputDelta(content=' capital', role='assistant'), index=0, finish_reason=None)], created=1710498504)
(...)
ChatCompletionStreamOutput(choices=[ChatCompletionStreamOutputChoice(delta=ChatCompletionStreamOutputDelta(content=' may', role='assistant'), index=0, finish_reason=None)], created=1710498504)
ChatCompletionStreamOutput(choices=[ChatCompletionStreamOutputChoice(delta=ChatCompletionStreamOutputDelta(content=None, role=None), index=0, finish_reason='length')], created=1710498504)
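Each streamed chunk carries a delta with a small piece of new content, so rebuilding the full message is just concatenation. A minimal sketch using stand-in dataclasses shaped like the stream output above (the real objects are huggingface_hub's ChatCompletionStreamOutput types):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in shapes mirroring the ChatCompletionStreamOutput* objects above
# (simplified for illustration).
@dataclass
class Delta:
    content: Optional[str]

@dataclass
class Choice:
    delta: Delta
    finish_reason: Optional[str]

@dataclass
class Chunk:
    choices: list

def collect_stream(chunks):
    """Concatenate delta contents; the finish reason comes from the last chunk."""
    text, finish_reason = [], None
    for chunk in chunks:
        choice = chunk.choices[0]
        if choice.delta.content is not None:
            text.append(choice.delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(text), finish_reason

stream = [
    Chunk([Choice(Delta("The"), None)]),
    Chunk([Choice(Delta(" capital"), None)]),
    Chunk([Choice(Delta(None), "length")]),
]
print(collect_stream(stream))  # ('The capital', 'length')
```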
- Implement InferenceClient.chat_completion + use new types for text-generation by @Wauplin in #2094
- Fix InferenceClient.text_generation for non-tgi models by @Wauplin in #2136
- #2153 by @Wauplin in #2153
Inference types
We are currently working towards more consistency in task definitions across the Hugging Face ecosystem. This is no easy job, but a major milestone has recently been achieved! All inputs and outputs of the main ML tasks are now fully specified as JSON schema objects. This is the first brick needed to have consistent expectations when running inference across our stack: transformers (Python), transformers.js (TypeScript), Inference API (Python), Inference Endpoints (Python), Text Generation Inference (Rust), Text Embeddings Inference (Rust), InferenceClient (Python), Inference.js (TypeScript), etc.
Integrating those definitions will require more work but huggingface_hub
is one of the first tools to integrate them. As a start, all InferenceClient
return values are now typed dataclasses. Furthermore, typed dataclasses have been generated for all tasks' inputs and outputs. This means you can now integrate them in your own library to ensure consistency with the Hugging Face ecosystem. Specifications are open-source (see here) meaning anyone can access and contribute to them. Python's generated classes are documented here.
Here is a short example showcasing the new output types:
>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient()
>>> client.object_detection("people.jpg")
[
ObjectDetectionOutputElement(
score=0.9486683011054993,
label='person',
box=ObjectDetectionBoundingBox(xmin=59, ymin=39, xmax=420, ymax=510)
),
...
]
Note that these dataclasses are backward-compatible with the dict-based interface previously in use. In the example above, both ObjectDetectionBoundingBox(...).xmin and ObjectDetectionBoundingBox(...)["xmin"] are valid, though attribute access is the preferred approach from now on.
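One way such dual access can work is for the dataclass to delegate item lookup to its attributes. The sketch below is illustrative only (the actual huggingface_hub implementation may differ), but it shows the mechanism behind supporting both styles:

```python
from dataclasses import dataclass

# Illustrative sketch: a dataclass that answers both obj.field and obj["field"].
# NOT the actual huggingface_hub implementation, which may differ.
@dataclass
class BoundingBox:
    xmin: int
    ymin: int
    xmax: int
    ymax: int

    def __getitem__(self, key):
        # Dict-style access simply delegates to attribute access.
        return getattr(self, key)

box = BoundingBox(xmin=59, ymin=39, xmax=420, ymax=510)
assert box.xmin == box["xmin"] == 59
```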
- Generate inference types + start using output types by @Wauplin in #2036
- Add = None at optional parameters by @LysandreJik in #2095
- Fix inference types shared between tasks by @Wauplin in #2125
🧩 ModelHubMixin
ModelHubMixin
is an object that can be used as a parent class for the objects in your library in order to provide built-in serialization methods to upload and download pretrained models from the Hub. This mixin is adapted into a PyTorchModelHubMixin
that can serialize and deserialize any PyTorch model. The 0.22 release brings its share of improvements to these classes:
- Better support of init values. If you instantiate a model with some custom arguments, the values will be automatically stored in a config.json file and restored when reloading the model from pretrained weights. This should unlock integrations with external libraries in a much smoother way.
- Library authors integrating the hub mixin can now define custom metadata for their library: library name, tags, document url and repo url. These are to be defined only once when integrating the library. Any model pushed to the Hub using the library will then be easily discoverable thanks to those tags.
- A base modelcard is generated for each saved model. This modelcard includes default tags (e.g. model_hub_mixin) and custom tags from the library (see 2.). You can extend/modify this modelcard by overriding the generate_model_card method.
>>> import torch
>>> import torch.nn as nn
>>> from huggingface_hub import PyTorchModelHubMixin
# Define your Pytorch model exactly the same way you are used to
>>> class MyModel(
... nn.Module,
... PyTorchModelHubMixin, # multiple inheritance
... library_name="keras-nlp",
... tags=["keras"],
... repo_url="https://github.com/keras-team/keras-nlp",
... docs_url="https://keras.io/keras_nlp/",
... # ^ optional metadata to generate model card
... ):
... def __init__(self, hidden_size: int = 512, vocab_size: int = 30000, output_size: int = 4):
... super().__init__()
... self.param = nn.Parameter(torch.rand(hidden_size, vocab_size))
... self.linear = nn.Linear(vocab_size, output_size)  # in_features must match the last dim of (x + self.param)
... def forward(self, x):
... return self.linear(x + self.param)
# 1. Create model
>>> model = MyModel(hidden_size=128)
# Config is automatically created based on input + default values
>>> model._hub_mixin_config
{"hidden_size": 128, "vocab_size": 30000, "output_size": 4}
# 2. (optional) Save model to local directory
>>> model.save_pretrained("path/to/my-awesome-model")
# 3. Push model weights to the Hub
>>> model.push_to_hub("my-awesome-model")
# 4. Initialize model from the Hub => config has been preserved
>>> model = MyModel.from_pretrained("username/my-awesome-model")
>>> model._hub_mixin_config
{"hidden_size": 128, "vocab_size": 30000, "output_size": 4}
# Model card has been correctly populated
>>> from huggingface_hub import ModelCard
>>> card = ModelCard.load("username/my-awesome-model")
>>> card.data.tags
["keras", "pytorch_model_hub_mixin", "model_hub_mixin"]
>>> card.data.library_name
"keras-nlp"
For more details on how to integrate these classes, check out the integration guide.
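Under the hood, the config can be inferred by binding the arguments passed to `__init__` against its signature. Here is a simplified, self-contained sketch of that mechanism; it is illustrative only and not the actual ModelHubMixin code, which is more involved:

```python
import inspect

# Simplified sketch of how init values can be captured into a config dict,
# mimicking the behavior of _hub_mixin_config. Illustrative only; the real
# ModelHubMixin implementation differs.
class ConfigCaptureMixin:
    def __new__(cls, *args, **kwargs):
        instance = super().__new__(cls)
        sig = inspect.signature(cls.__init__)
        bound = sig.bind(instance, *args, **kwargs)
        bound.apply_defaults()                 # fill in unspecified defaults
        config = dict(bound.arguments)
        config.pop("self", None)               # drop the instance itself
        instance._captured_config = config
        return instance

class MyModel(ConfigCaptureMixin):
    def __init__(self, hidden_size: int = 512, vocab_size: int = 30000):
        self.hidden_size = hidden_size

model = MyModel(hidden_size=128)
print(model._captured_config)  # {'hidden_size': 128, 'vocab_size': 30000}
```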
- Fix ModelHubMixin: pass config when __init__ accepts **kwargs by @Wauplin in #2058
- [PyTorchModelHubMixin] Fix saving model with shared tensors by @NielsRogge in #2086
- Correctly inject config in PytorchModelHubMixin by @Wauplin in #2079
- Fix passing kwargs in PytorchHubMixin by @Wauplin in #2093
- Generate modelcard in ModelHubMixin by @Wauplin in #2080
- Fix ModelHubMixin: save config only if doesn't exist by @Wauplin in #2105
- Fix ModelHubMixin - kwargs should be passed correctly when reloading by @Wauplin in #2099
- Fix ModelHubMixin when kwargs and config are both passed by @Wauplin in #2138
- ModelHubMixin overwrite config if preexistant by @Wauplin in #2142
🛠️ Misc improvements
HfFileSystem
download speed was limited by some internal logic in fsspec
. We've now updated the get_file
and read
implementations to improve their download speed to a level similar to hf_hub_download
.
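The speed gap was largely about how many small requests were issued per file. As a rough, self-contained illustration of why fewer, larger reads help (a counting stand-in byte source, not the actual HfFileSystem or fsspec code):

```python
import io

# Stand-in "remote file" that counts how many read requests it serves.
# Illustrative only -- real HfFileSystem reads go over HTTP.
class CountingReader:
    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)
        self.requests = 0

    def read(self, n: int) -> bytes:
        self.requests += 1  # each call stands in for one round-trip
        return self._buf.read(n)

def download(reader, block_size):
    chunks = []
    while True:
        chunk = reader.read(block_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)

data = b"x" * 1_000_000
small, large = CountingReader(data), CountingReader(data)
download(small, 1024)        # many round-trips
download(large, 256 * 1024)  # few round-trips
print(small.requests, large.requests)
```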
We are aiming to move all errors raised by huggingface_hub into a single module, huggingface_hub.errors, to ease the developer experience. This work has started as a community contribution from @Y4suyuki.
The HfApi class now accepts a headers parameter that is then passed to every HTTP call made to the Hub.
📚 More documentation in Korean!
💔 Breaking changes
- The new types returned by InferenceClient methods should be backward compatible, especially for accessing values either as attributes (.my_field) or as items (["my_field"]). However, dataclasses and dicts do not always behave exactly the same, so you might notice some breaking changes. Those breaking changes should be very limited.
- ModelHubMixin internals changed quite a bit, breaking some use cases. We don't think those use cases were in use, and changing them should benefit 99% of integrations. If you witness any inconsistency or error in your integration, please let us know and we will do our best to mitigate the problem. One of the biggest changes is that config values are no longer attached to the mixin instance as instance.config but as instance._hub_mixin_config. The .config attribute was mistakenly introduced in 0.20.x, so we hope it has not been used much yet.
- huggingface_hub.file_download.http_user_agent has been removed in favor of the officially documented huggingface_hub.utils.build_hf_headers. It had been deprecated since 0.18.x.
Small fixes and maintenance
⚙️ CI optimization
The CI pipeline has been greatly improved, especially thanks to the efforts from @bmuskalla. Most tests now pass in under 3 minutes, compared to 8 to 10 minutes previously. Some long-running tests have been greatly simplified, and all tests now run in parallel with pytest-xdist, thanks to making them fully independent of one another.
We are now also using the great uv
installer instead of pip
in our CI, which saves around 30-40s per pipeline.
- More optimized tests by @Wauplin in #2054
- Enable python-xdist on all tests by @bmuskalla in #2059
- do not list all models by @Wauplin in #2061
- update ruff by @Wauplin in #2071
- Use uv in CI to speed-up requirements install by @Wauplin in #2072
⚙️ fixes
- Fix Space variable when updatedAt is missing by @Wauplin in #2050
- Fix tests involving temp directory on macOS by @bmuskalla in #2052
- fix glob no magic by @lhoestq in #2056
- Point out that the token must have write scope by @bmuskalla in #2053
- Fix commonpath in read-only filesystem by @stevelaskaridis in #2073
- rm unnecessary early makedirs by @poedator in #2092
- Fix unhandled filelock issue by @Wauplin in #2108
- Handle .DS_Store files in _scan_cache_repos by @sealad886 in #2112
- Fix REPO_API_REGEX by @Wauplin in #2119
- Fix uploading to HF proxy by @Wauplin in #2120
- Fix --delete in huggingface-cli upload command by @Wauplin in #2129
- Explicitly fail on Keras3 by @Wauplin in #2107
- Fix serverless naming by @Wauplin in #2137
⚙️ internal
- tag as 0.22.0.dev + remove deprecated code by @Wauplin in #2049
- Some cleaning by @Wauplin in #2070
- Fix test test_delete_branch_on_missing_branch_fails by @Wauplin in #2088
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @Y4suyuki
- Start defining custom errors in one place (#2122)
- @bmuskalla
- Enable python-xdist on all tests (#2059)