How to use the HF Nvidia NIM API with the HF inference client? #2480

Closed
MoritzLaurer opened this issue Aug 22, 2024 · 0 comments · Fixed by #2482
Labels: bug (Something isn't working)

MoritzLaurer (Contributor) commented Aug 22, 2024

Describe the bug

We recently introduced the Nvidia NIM API for selected models. The recommended usage is via the OpenAI client, like this (with a fine-grained token for an enterprise org):

from openai import OpenAI

client = OpenAI(
    base_url="https://huggingface.co/api/integrations/dgx/v1",
    api_key="YOUR_FINE_GRAINED_TOKEN_HERE"
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 500"}
    ],
    stream=True,
    max_tokens=1024
)

# Iterate and print stream
for message in chat_completion:
    print(message.choices[0].delta.content, end='')

How can users call this API directly with the HF inference client?
The InferenceClient.chat_completion docs provide this example snippet for OpenAI-compatible syntax (example 3):

# instead of `from openai import OpenAI`
from huggingface_hub import InferenceClient

# instead of `client = OpenAI(...)`
client = InferenceClient(
    base_url=...,
    api_key=...,
)

output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)

When I transpose the logic from the NIM OpenAI snippet to the code above, I get this:

# instead of `from openai import OpenAI`
from huggingface_hub import InferenceClient

# instead of `client = OpenAI(...)`
client = InferenceClient(
    api_key="enterprise-org-token",
    base_url="https://huggingface.co/api/integrations/dgx/v1",
)

output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)

This throws the following error:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py:304, in hf_raise_for_status(response, endpoint_name)
    303 try:
--> 304     response.raise_for_status()
    305 except HTTPError as e:

File ~/miniconda/lib/python3.9/site-packages/requests/models.py:1024, in Response.raise_for_status(self)
   1023 if http_error_msg:
-> 1024     raise HTTPError(http_error_msg, response=self)

HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/integrations/dgx/v1/chat/completions

The above exception was the direct cause of the following exception:

BadRequestError                           Traceback (most recent call last)
Cell In[48], line 10
      4 # instead of `client = OpenAI(...)`
      5 client = InferenceClient(
      6     api_key="hf_****",
      7     base_url="https://huggingface.co/api/integrations/dgx/v1",
      8 )
---> 10 output = client.chat.completions.create(
     11     model="meta-llama/Meta-Llama-3-8B-Instruct",
     12     messages=[
     13         {"role": "system", "content": "You are a helpful assistant."},
     14         {"role": "user", "content": "Count to 10"},
     15     ],
     16     stream=True,
     17     max_tokens=1024,
     18 )
     20 for chunk in output:
     21     print(chunk.choices[0].delta.content)

File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/inference/_client.py:837, in InferenceClient.chat_completion(self, messages, model, stream, frequency_penalty, logit_bias, logprobs, max_tokens, n, presence_penalty, response_format, seed, stop, temperature, tool_choice, tool_prompt, tools, top_logprobs, top_p)
    833 # `model` is sent in the payload. Not used by the server but can be useful for debugging/routing.
    834 # If it's a ID on the Hub => use it. Otherwise, we use a random string.
    835 model_id = model if not is_url and model.count("/") == 1 else "tgi"
--> 837 data = self.post(
    838     model=model_url,
    839     json=dict(
    840         model=model_id,
    841         messages=messages,
    842         frequency_penalty=frequency_penalty,
    843         logit_bias=logit_bias,
    844         logprobs=logprobs,
    845         max_tokens=max_tokens,
    846         n=n,
    847         presence_penalty=presence_penalty,
    848         response_format=response_format,
    849         seed=seed,
    850         stop=stop,
    851         temperature=temperature,
    852         tool_choice=tool_choice,
    853         tool_prompt=tool_prompt,
    854         tools=tools,
    855         top_logprobs=top_logprobs,
    856         top_p=top_p,
    857         stream=stream,
    858     ),
    859     stream=stream,
    860 )
    862 if stream:
    863     return _stream_chat_completion_response(data)  # type: ignore[arg-type]

File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/inference/_client.py:304, in InferenceClient.post(self, json, data, model, task, stream)
    301         raise InferenceTimeoutError(f"Inference call timed out: {url}") from error  # type: ignore
    303 try:
--> 304     hf_raise_for_status(response)
    305     return response.iter_lines() if stream else response.content
    306 except HTTPError as error:

File ~/miniconda/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py:358, in hf_raise_for_status(response, endpoint_name)
    354 elif response.status_code == 400:
    355     message = (
    356         f"\n\nBad request for {endpoint_name} endpoint:" if endpoint_name is not None else "\n\nBad request:"
    357     )
--> 358     raise BadRequestError(message, response=response) from e
    360 elif response.status_code == 403:
    361     message = (
    362         f"\n\n{response.status_code} Forbidden: {error_message}."
    363         + f"\nCannot access content at: {response.url}."
    364         + "\nIf you are trying to create or update content, "
    365         + "make sure you have a token with the `write` role."
    366     )

BadRequestError: 

Bad request:
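Looking at the traceback, chat_completion replaces the model id with the placeholder "tgi" in the payload as soon as the target is a URL (line 835 in _client.py above), which the NIM integration endpoint presumably rejects, hence the 400. As a rough interim workaround, here is an untested sketch that bypasses chat_completion and posts an OpenAI-style payload straight to the endpoint via InferenceClient.post, keeping the real model id (this assumes the NIM endpoint accepts the full .../chat/completions URL and a plain JSON body):

import json

from huggingface_hub import InferenceClient

client = InferenceClient(api_key="YOUR_FINE_GRAINED_TOKEN_HERE")

# Post the OpenAI-style payload directly so the real model id is kept
# (chat_completion would replace it with "tgi" because the target is a URL).
raw = client.post(
    model="https://huggingface.co/api/integrations/dgx/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Count to 10"},
        ],
        "max_tokens": 1024,
    },
)
print(json.loads(raw)["choices"][0]["message"]["content"])

The proper fix, presumably, is for chat_completion to keep the user-supplied model id in the payload when a custom base_url is given.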

Reproduction

No response

Logs

No response

System info

{'huggingface_hub version': '0.24.6',
 'Platform': 'Linux-5.10.205-195.807.amzn2.x86_64-x86_64-with-glibc2.31',
 'Python version': '3.9.5',
 'Running in iPython ?': 'Yes',
 'iPython shell': 'ZMQInteractiveShell',
 'Running in notebook ?': 'Yes',
 'Running in Google Colab ?': 'No',
 'Token path ?': '/home/user/.cache/huggingface/token',
 'Has saved token ?': True,
 'Who am I ?': 'MoritzLaurer',
 'Configured git credential helpers': 'store',
 'FastAI': 'N/A',
 'Tensorflow': 'N/A',
 'Torch': 'N/A',
 'Jinja2': '3.1.4',
 'Graphviz': 'N/A',
 'keras': 'N/A',
 'Pydot': 'N/A',
 'Pillow': 'N/A',
 'hf_transfer': 'N/A',
 'gradio': 'N/A',
 'tensorboard': 'N/A',
 'numpy': '2.0.1',
 'pydantic': '2.8.2',
 'aiohttp': 'N/A',
 'ENDPOINT': 'https://huggingface.co',
 'HF_HUB_CACHE': '/home/user/.cache/huggingface/hub',
 'HF_ASSETS_CACHE': '/home/user/.cache/huggingface/assets',
 'HF_TOKEN_PATH': '/home/user/.cache/huggingface/token',
 'HF_HUB_OFFLINE': False,
 'HF_HUB_DISABLE_TELEMETRY': False,
 'HF_HUB_DISABLE_PROGRESS_BARS': None,
 'HF_HUB_DISABLE_SYMLINKS_WARNING': False,
 'HF_HUB_DISABLE_EXPERIMENTAL_WARNING': False,
 'HF_HUB_DISABLE_IMPLICIT_TOKEN': False,
 'HF_HUB_ENABLE_HF_TRANSFER': False,
 'HF_HUB_ETAG_TIMEOUT': 10,
 'HF_HUB_DOWNLOAD_TIMEOUT': 10}