Failed to do tool-call in ReAct agent example with 90B meta-reference-gpu server

### System Info

```
python -m "torch.utils.collect_env"
<frozen runpy>:128: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version: Could not collect
CMake version: version 3.26.5
Libc version: glibc-2.34

Python version: 3.12.0 | packaged by Anaconda, Inc. | (main, Oct  2 2023, 17:29:18) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.4.3-0_fbk15_zion_2630_gf27365f948db-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA H100
GPU 1: NVIDIA H100
GPU 2: NVIDIA H100
GPU 3: NVIDIA H100
GPU 4: NVIDIA H100
GPU 5: NVIDIA H100
GPU 6: NVIDIA H100
GPU 7: NVIDIA H100

Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.2
/usr/lib64/libcudnn.so.9.7.1
/usr/lib64/libcudnn_adv.so.9.7.1
/usr/lib64/libcudnn_adv_infer.so.8.9.2
/usr/lib64/libcudnn_adv_train.so.8.9.2
/usr/lib64/libcudnn_cnn.so.9.7.1
/usr/lib64/libcudnn_cnn_infer.so.8.9.2
/usr/lib64/libcudnn_cnn_train.so.8.9.2
/usr/lib64/libcudnn_engines_precompiled.so.9.7.1
/usr/lib64/libcudnn_engines_runtime_compiled.so.9.7.1
/usr/lib64/libcudnn_graph.so.9.7.1
/usr/lib64/libcudnn_heuristic.so.9.7.1
/usr/lib64/libcudnn_ops.so.9.7.1
/usr/lib64/libcudnn_ops_infer.so.8.9.2
/usr/lib64/libcudnn_ops_train.so.8.9.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             384
On-line CPU(s) list:                0-383
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9654 96-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 2
Core(s) per socket:                 96
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU(s) scaling MHz:                 85%
CPU max MHz:                        3707.8120
CPU min MHz:                        1500.0000
BogoMIPS:                           4792.65
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          6 MiB (192 instances)
L1i cache:                          6 MiB (192 instances)
L2 cache:                           192 MiB (192 instances)
L3 cache:                           768 MiB (24 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-95,192-287
NUMA node1 CPU(s):                  96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Vulnerable: eIBRS with unprivileged eBPF
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] torch==2.6.0
[pip3] torchvision==0.21.0
[pip3] triton==3.2.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] torch                     2.6.0                    pypi_0    pypi
[conda] torchvision               0.21.0                   pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi
```
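
The dump above covers PyTorch but not the llama-stack packages, and it reports Python 3.12.0 while the server traceback paths point at a `python3.10` environment. A minimal sketch for recording the interpreter and package versions from inside the server's environment, assuming the distributions are installed under the PyPI names `llama-stack` and `llama-stack-client` (the helper name `package_versions` is only illustrative):

```python
import sys
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Report the interpreter actually running in this environment.
    print("python:", sys.version.split()[0])
    # Distribution names are assumptions; adjust if installed differently.
    for name, version in package_versions(["llama-stack", "llama-stack-client"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in the same conda environment that launches the server would show which versions the traceback below corresponds to.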

### Information

- [x] The official example scripts
- [ ] My own modified scripts

### 🐛 Describe the bug

Start a llama-stack server with the meta-reference-gpu distribution serving Llama 3.2 90B (`meta-llama/Llama-3.2-90B-Vision-Instruct`), then run the [ReAct agent example](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/react_agent.py):
```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import os
import uuid

import fire
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.client_tool import client_tool
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.lib.agents.react.agent import ReActAgent


@client_tool
def torchtune(query: str = "torchtune"):
    """
    Answer information about torchtune.
    :param query: The query to use for querying the internet
    :returns: Information about torchtune
    """
    dummy_response = """
            torchtune is a PyTorch library for easily authoring, finetuning and experimenting with LLMs.
            torchtune provides:
            PyTorch implementations of popular LLMs from Llama, Gemma, Mistral, Phi, and Qwen model families
            Hackable training recipes for full finetuning, LoRA, QLoRA, DPO, PPO, QAT, knowledge distillation, and more
            Out-of-the-box memory efficiency, performance improvements, and scaling with the latest PyTorch APIs
            YAML configs for easily configuring training, evaluation, quantization or inference recipes
            Built-in support for many popular dataset formats and prompt templates
    """
    return dummy_response


def main(host: str, port: int):
    client = LlamaStackClient(
        base_url=f"http://{host}:{port}",
        provider_data={"tavily_search_api_key": os.getenv("TAVILY_SEARCH_API_KEY")},
    )
    model = "meta-llama/Llama-3.2-90B-Vision-Instruct"
    agent = ReActAgent(
        client=client,
        model=model,
        builtin_toolgroups=["builtin::websearch"],
        client_tools=[torchtune],
        # json_response_format=True,
    )
    session_id = agent.create_session(f"ttest-session-{uuid.uuid4().hex}")
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": "Whats the best place in new york for a pizza slice at 2am ?",
            }
        ],
        session_id=session_id,
        stream=True,
    )
    for log in EventLogger().log(response):
        log.print()
    response2 = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": "What are the popular llms supported in torchtune?",
            }
        ],
        session_id=session_id,
        stream=True,
    )
    for log in EventLogger().log(response2):
        log.print()


if __name__ == "__main__":
    fire.Fire(main)
```

### Error logs

Server-side log:
```
ValueError: Non supported ToolPromptFormat ToolPromptFormat.python_list
Traceback (most recent call last):
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 206, in sse_generator
    async for item in event_gen:
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agents.py", line 164, in _create_agent_turn_streaming
    async for event in agent.create_and_execute_turn(request):
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 190, in create_and_execute_turn
    async for chunk in self._run_turn(request, turn_id):
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 279, in _run_turn
    async for chunk in self.run(
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 354, in run
    async for res in self._run(
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 512, in _run
    async for chunk in await self.inference_api.chat_completion(
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 210, in chat_completion
    return (chunk async for chunk in await provider.chat_completion(**params))
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/providers/inline/inference/meta_reference/inference.py", line 277, in chat_completion
    request.messages = chat_completion_request_to_messages(request, self.llama_model.core_model_id.value)
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/providers/utils/inference/prompt_adapter.py", line 304, in chat_completion_request_to_messages
    messages = augment_messages_for_tools_llama_3_1(request)
  File "/home/kaiwu/.conda/envs/omni/lib/python3.10/site-packages/llama_stack/providers/utils/inference/prompt_adapter.py", line 385, in augment_messages_for_tools_llama_3_1
    tool_gen = JsonCustomToolGenerator()
ValueError: Non supported ToolPromptFormat ToolPromptFormat.python_list
```
Client-side error:
```
~/work/llama-stack-apps (computer-use)]$ python test_tool.py localhost 8321
`agent_config` is deprecated. Use inlined parameters instead.
`client_tools` is deprecated. Use `tools` instead.
inference> 400: Invalid value: Non supported ToolPromptFormat ToolPromptFormat.python_list
inference> 400: Invalid value: Non supported ToolPromptFormat ToolPromptFormat.python_list
```

### Expected behavior

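The ReAct agent should run against the Llama 3.2 vision model and complete its tool calls. Instead, the traceback shows the request going through `augment_messages_for_tools_llama_3_1`, which raises `ValueError` for `ToolPromptFormat.python_list`. I believe the tool prompt format chosen for this model and the message-augmentation path it is routed to do not match.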
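
To make the suspected mismatch concrete, here is a self-contained sketch of the failure mode. It is not the llama-stack source: the enum is a stand-in whose members other than `python_list` are assumed, and `LLAMA_3_1_SUPPORTED` and `pick_tool_generator_3_1` are hypothetical names used only for illustration.

```python
from enum import Enum


class ToolPromptFormat(Enum):
    """Stand-in for the enum named in the error message (members assumed)."""

    json = "json"
    function_tag = "function_tag"
    python_list = "python_list"


# Hypothetical: the formats a Llama 3.1-style augmentation path would accept.
LLAMA_3_1_SUPPORTED = {ToolPromptFormat.json, ToolPromptFormat.function_tag}


def pick_tool_generator_3_1(fmt: ToolPromptFormat) -> str:
    """Return a generator label for a supported format, else raise ValueError."""
    if fmt in LLAMA_3_1_SUPPORTED:
        return f"{fmt.value}_generator"
    # Same message shape as the server log in this issue.
    raise ValueError(f"Non supported ToolPromptFormat {fmt}")


if __name__ == "__main__":
    try:
        # A request that carries python_list but lands on the 3.1-style path.
        pick_tool_generator_3_1(ToolPromptFormat.python_list)
    except ValueError as err:
        print(err)
```

If the real code follows a similar shape, either the 90B vision model should not be sent down the 3.1 path with `python_list`, or the request should carry a format that path accepts.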

Failed to do tool-call in ReAct agent example with 90B meta-reference-gpu server #1519
