Describe the bug
We recently introduced the Nvidia NIM API for selected models. The recommended usage is via the OpenAI (OAI) client, like this (with a fine-grained token specific to an enterprise org):
from openai import OpenAI

client = OpenAI(
    base_url="https://huggingface.co/api/integrations/dgx/v1",
    api_key="YOUR_FINE_GRAINED_TOKEN_HERE"
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 500"}
    ],
    stream=True,
    max_tokens=1024
)

# Iterate and print the stream
for message in chat_completion:
    print(message.choices[0].delta.content, end='')
How can users use this API with the HF inference client directly?
The InferenceClient.chat_completion docs provide this example snippet for the OAI syntax (example 3):
# instead of `from openai import OpenAI`
from huggingface_hub import InferenceClient

# instead of `client = OpenAI(...)`
client = InferenceClient(
    base_url=...,
    api_key=...,
)

output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)
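As an aside, the docs snippet's client.chat.completions.create(...) call is an OpenAI-compatible alias for the client's native chat_completion method (which is what shows up in the traceback below). A minimal sketch of the equivalent native call, assuming the NIM base URL and a placeholder token:

from huggingface_hub import InferenceClient

# Same configuration as above; the token below is a placeholder.
client = InferenceClient(
    base_url="https://huggingface.co/api/integrations/dgx/v1",
    api_key="YOUR_FINE_GRAINED_TOKEN_HERE",
)

# Native huggingface_hub method behind the OpenAI-compatible syntax.
output = client.chat_completion(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)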
When I transpose the logic from the NIM OAI code snippet to the docs example, I get this:
# instead of `from openai import OpenAI`
from huggingface_hub import InferenceClient

# instead of `client = OpenAI(...)`
client = InferenceClient(
    api_key="enterprise-org-token",
    base_url="https://huggingface.co/api/integrations/dgx/v1",
)

output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)
This throws the following error:
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py:304, in hf_raise_for_status(response, endpoint_name)
    303 try:
--> 304     response.raise_for_status()
    305 except HTTPError as e:

File ~/miniconda/lib/python3.9/site-packages/requests/models.py:1024, in Response.raise_for_status(self)
   1023 if http_error_msg:
-> 1024     raise HTTPError(http_error_msg, response=self)

HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/integrations/dgx/v1/chat/completions

The above exception was the direct cause of the following exception:

BadRequestError                           Traceback (most recent call last)
Cell In[48], line 10
      4 # instead of `client = OpenAI(...)`
      5 client = InferenceClient(
      6     api_key="hf_****",
      7     base_url="https://huggingface.co/api/integrations/dgx/v1",
      8 )
---> 10 output = client.chat.completions.create(
     11     model="meta-llama/Meta-Llama-3-8B-Instruct",
     12     messages=[
     13         {"role": "system", "content": "You are a helpful assistant."},
     14         {"role": "user", "content": "Count to 10"},
     15     ],
     16     stream=True,
     17     max_tokens=1024,
     18 )
     20 for chunk in output:
     21     print(chunk.choices[0].delta.content)

File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/inference/_client.py:837, in InferenceClient.chat_completion(self, messages, model, stream, frequency_penalty, logit_bias, logprobs, max_tokens, n, presence_penalty, response_format, seed, stop, temperature, tool_choice, tool_prompt, tools, top_logprobs, top_p)
    833 # `model` is sent in the payload. Not used by the server but can be useful for debugging/routing.
    834 # If it's a ID on the Hub => use it. Otherwise, we use a random string.
    835 model_id = model if not is_url and model.count("/") == 1 else "tgi"
--> 837 data = self.post(
    838     model=model_url,
    839     json=dict(
    840         model=model_id,
    841         messages=messages,
    842         frequency_penalty=frequency_penalty,
    843         logit_bias=logit_bias,
    844         logprobs=logprobs,
    845         max_tokens=max_tokens,
    846         n=n,
    847         presence_penalty=presence_penalty,
    848         response_format=response_format,
    849         seed=seed,
    850         stop=stop,
    851         temperature=temperature,
    852         tool_choice=tool_choice,
    853         tool_prompt=tool_prompt,
    854         tools=tools,
    855         top_logprobs=top_logprobs,
    856         top_p=top_p,
    857         stream=stream,
    858     ),
    859     stream=stream,
    860 )
    862 if stream:
    863     return _stream_chat_completion_response(data)  # type: ignore[arg-type]

File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/inference/_client.py:304, in InferenceClient.post(self, json, data, model, task, stream)
    301     raise InferenceTimeoutError(f"Inference call timed out: {url}") from error  # type: ignore
    303 try:
--> 304     hf_raise_for_status(response)
    305     return response.iter_lines() if stream else response.content
    306 except HTTPError as error:

File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py:358, in hf_raise_for_status(response, endpoint_name)
    354 elif response.status_code == 400:
    355     message = (
    356         f"\n\nBad request for {endpoint_name} endpoint:" if endpoint_name is not None else "\n\nBad request:"
    357     )
--> 358     raise BadRequestError(message, response=response) from e
    360 elif response.status_code == 403:
    361     message = (
    362         f"\n\n{response.status_code} Forbidden: {error_message}."
    363         + f"\nCannot access content at: {response.url}."
    364         + "\nIf you are trying to create or update content, "
    365         + "make sure you have a token with the `write` role."
    366     )

BadRequestError:

Bad request:
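The message after "Bad request:" is empty, so it is hard to tell what the server objects to. One way to inspect the raw 400 response body is to replay the same payload with plain requests. This is only a minimal debugging sketch: the token is a placeholder, and it assumes the endpoint accepts the same JSON body as the OpenAI-compatible API.

import requests

# Placeholder token; use the same fine-grained enterprise token as above.
headers = {"Authorization": "Bearer YOUR_FINE_GRAINED_TOKEN_HERE"}

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    "stream": False,  # non-streaming keeps the response body easy to inspect
    "max_tokens": 1024,
}

resp = requests.post(
    "https://huggingface.co/api/integrations/dgx/v1/chat/completions",
    headers=headers,
    json=payload,
)

# Print the status code and whatever error detail the server returns.
print(resp.status_code)
print(resp.text)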
Reproduction
No response
Logs
No response
System info