How to use the HF Nvidia NIM API with the HF inference client? #2480

Closed
MoritzLaurer opened this issue Aug 22, 2024 · 0 comments · Fixed by #2482
Labels: bug (Something isn't working)

MoritzLaurer (Contributor) commented Aug 22, 2024

Describe the bug

We recently introduced the Nvidia NIM API for selected models. The recommended usage is via the OpenAI client, like this (with a fine-grained token for an enterprise org):

from openai import OpenAI

client = OpenAI(
    base_url="https://huggingface.co/api/integrations/dgx/v1",
    api_key="YOUR_FINE_GRAINED_TOKEN_HERE"
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 500"}
    ],
    stream=True,
    max_tokens=1024
)

# Iterate and print stream
for message in chat_completion:
    print(message.choices[0].delta.content, end='')

How can users call this API directly with the HF inference client?
The InferenceClient.chat_completion docs provide this example snippet for OpenAI-compatible syntax (example 3):

# instead of `from openai import OpenAI`
from huggingface_hub import InferenceClient

# instead of `client = OpenAI(...)`
client = InferenceClient(
    base_url=...,
    api_key=...,
)

output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)

When I transpose the logic from the NIM OpenAI snippet to the code above, I get this:

# instead of `from openai import OpenAI`
from huggingface_hub import InferenceClient

# instead of `client = OpenAI(...)`
client = InferenceClient(
    api_key="enterprise-org-token",
    base_url="https://huggingface.co/api/integrations/dgx/v1",
)

output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)

This throws the following error:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py:304, in hf_raise_for_status(response, endpoint_name)
    303 try:
--> 304     response.raise_for_status()
    305 except HTTPError as e:

File ~/miniconda/lib/python3.9/site-packages/requests/models.py:1024, in Response.raise_for_status(self)
   1023 if http_error_msg:
-> 1024     raise HTTPError(http_error_msg, response=self)

HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/integrations/dgx/v1/chat/completions

The above exception was the direct cause of the following exception:

BadRequestError                           Traceback (most recent call last)
Cell In[48], line 10
      4 # instead of `client = OpenAI(...)`
      5 client = InferenceClient(
      6     api_key="hf_****",
      7     base_url="https://huggingface.co/api/integrations/dgx/v1",
      8 )
---> 10 output = client.chat.completions.create(
     11     model="meta-llama/Meta-Llama-3-8B-Instruct",
     12     messages=[
     13         {"role": "system", "content": "You are a helpful assistant."},
     14         {"role": "user", "content": "Count to 10"},
     15     ],
     16     stream=True,
     17     max_tokens=1024,
     18 )
     20 for chunk in output:
     21     print(chunk.choices[0].delta.content)

File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/inference/_client.py:837, in InferenceClient.chat_completion(self, messages, model, stream, frequency_penalty, logit_bias, logprobs, max_tokens, n, presence_penalty, response_format, seed, stop, temperature, tool_choice, tool_prompt, tools, top_logprobs, top_p)
    833 # `model` is sent in the payload. Not used by the server but can be useful for debugging/routing.
    834 # If it's a ID on the Hub => use it. Otherwise, we use a random string.
    835 model_id = model if not is_url and model.count("/") == 1 else "tgi"
--> 837 data = self.post(
    838     model=model_url,
    839     json=dict(
    840         model=model_id,
    841         messages=messages,
    842         frequency_penalty=frequency_penalty,
    843         logit_bias=logit_bias,
    844         logprobs=logprobs,
    845         max_tokens=max_tokens,
    846         n=n,
    847         presence_penalty=presence_penalty,
    848         response_format=response_format,
    849         seed=seed,
    850         stop=stop,
    851         temperature=temperature,
    852         tool_choice=tool_choice,
    853         tool_prompt=tool_prompt,
    854         tools=tools,
    855         top_logprobs=top_logprobs,
    856         top_p=top_p,
    857         stream=stream,
    858     ),
    859     stream=stream,
    860 )
    862 if stream:
    863     return _stream_chat_completion_response(data)  # type: ignore[arg-type]

File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/inference/_client.py:304, in InferenceClient.post(self, json, data, model, task, stream)
    301         raise InferenceTimeoutError(f"Inference call timed out: {url}") from error  # type: ignore
    303 try:
--> 304     hf_raise_for_status(response)
    305     return response.iter_lines() if stream else response.content
    306 except HTTPError as error:

File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py:358, in hf_raise_for_status(response, endpoint_name)
    354 elif response.status_code == 400:
    355     message = (
    356         f"\n\nBad request for {endpoint_name} endpoint:" if endpoint_name is not None else "\n\nBad request:"
    357     )
--> 358     raise BadRequestError(message, response=response) from e
    360 elif response.status_code == 403:
    361     message = (
    362         f"\n\n{response.status_code} Forbidden: {error_message}."
    363         + f"\nCannot access content at: {response.url}."
    364         + "\nIf you are trying to create or update content, "
    365         + "make sure you have a token with the `write` role."
    366     )

BadRequestError: 

Bad request:
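Looking at the traceback, chat_completion replaces the model id with the placeholder "tgi" in the payload as soon as the target is a URL (line 835 in _client.py above), which the NIM integration endpoint presumably rejects, hence the 400. As a rough interim workaround, here is an untested sketch that bypasses chat_completion and posts an OpenAI-style payload straight to the endpoint via InferenceClient.post, keeping the real model id (this assumes the NIM endpoint accepts the full .../chat/completions URL and a plain JSON body):

import json

from huggingface_hub import InferenceClient

client = InferenceClient(api_key="YOUR_FINE_GRAINED_TOKEN_HERE")

# Post the OpenAI-style payload directly so the real model id is kept
# (chat_completion would replace it with "tgi" because the target is a URL).
raw = client.post(
    model="https://huggingface.co/api/integrations/dgx/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Count to 10"},
        ],
        "max_tokens": 1024,
    },
)
print(json.loads(raw)["choices"][0]["message"]["content"])

The proper fix, presumably, is for chat_completion to keep the user-supplied model id in the payload when a custom base_url is given.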

Reproduction

No response

Logs

No response

System info

{'huggingface_hub version': '0.24.6',
 'Platform': 'Linux-5.10.205-195.807.amzn2.x86_64-x86_64-with-glibc2.31',
 'Python version': '3.9.5',
 'Running in iPython ?': 'Yes',
 'iPython shell': 'ZMQInteractiveShell',
 'Running in notebook ?': 'Yes',
 'Running in Google Colab ?': 'No',
 'Token path ?': '/home/user/.cache/huggingface/token',
 'Has saved token ?': True,
 'Who am I ?': 'MoritzLaurer',
 'Configured git credential helpers': 'store',
 'FastAI': 'N/A',
 'Tensorflow': 'N/A',
 'Torch': 'N/A',
 'Jinja2': '3.1.4',
 'Graphviz': 'N/A',
 'keras': 'N/A',
 'Pydot': 'N/A',
 'Pillow': 'N/A',
 'hf_transfer': 'N/A',
 'gradio': 'N/A',
 'tensorboard': 'N/A',
 'numpy': '2.0.1',
 'pydantic': '2.8.2',
 'aiohttp': 'N/A',
 'ENDPOINT': 'https://huggingface.co',
 'HF_HUB_CACHE': '/home/user/.cache/huggingface/hub',
 'HF_ASSETS_CACHE': '/home/user/.cache/huggingface/assets',
 'HF_TOKEN_PATH': '/home/user/.cache/huggingface/token',
 'HF_HUB_OFFLINE': False,
 'HF_HUB_DISABLE_TELEMETRY': False,
 'HF_HUB_DISABLE_PROGRESS_BARS': None,
 'HF_HUB_DISABLE_SYMLINKS_WARNING': False,
 'HF_HUB_DISABLE_EXPERIMENTAL_WARNING': False,
 'HF_HUB_DISABLE_IMPLICIT_TOKEN': False,
 'HF_HUB_ENABLE_HF_TRANSFER': False,
 'HF_HUB_ETAG_TIMEOUT': 10,
 'HF_HUB_DOWNLOAD_TIMEOUT': 10}