
Commit a0e0ff6

Merge branch 'main' into fix/issue-2586-logging-test-intermittent-failures
2 parents 2077a3a + 0bd4995

File tree

7 files changed: +19 −155 lines changed

components/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yaml

Lines changed: 13 additions & 11 deletions
````diff
@@ -14,24 +14,26 @@
 # limitations under the License.

 backend: pytorch
-tensor_parallel_size: 8
-moe_expert_parallel_size: 8
-max_batch_size: 1
-max_num_tokens: 8192
-max_seq_len: 8192
-print_iter_log: true
-disable_overlap_scheduler: true
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+max_batch_size: 192
+max_num_tokens: 3072
+disable_overlap_scheduler: false

 # Enable Speculative Decoding in the model engine
 speculative_config:
   decoding_type: Eagle
   max_draft_len: 3
   speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3
-  eagle3_one_model: True
+  eagle3_one_model: true

 kv_cache_config:
-  free_gpu_memory_fraction: 0.5
+  free_gpu_memory_fraction: 0.2
   enable_block_reuse: false

-cache_transceiver_config:
-  backend: default
+cuda_graph_config:
+  enable_padding: true
+  batch_sizes: [1,2,3,4,5,6,7,8,16,32,48,64,128,190,191,192]
+
+print_iter_log: true
+
````
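For reference, a sketch of the aggregated config as it reads after this hunk, reconstructed from the diff above (the license header and other unchanged lines outside the hunk are elided; trailing comments are editorial):

```yaml
backend: pytorch
tensor_parallel_size: 4
moe_expert_parallel_size: 4
max_batch_size: 192
max_num_tokens: 3072
disable_overlap_scheduler: false

# Enable Speculative Decoding in the model engine
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3
  eagle3_one_model: true

kv_cache_config:
  free_gpu_memory_fraction: 0.2   # down from 0.5
  enable_block_reuse: false

cuda_graph_config:
  enable_padding: true
  batch_sizes: [1,2,3,4,5,6,7,8,16,32,48,64,128,190,191,192]

print_iter_log: true
```

The lower `free_gpu_memory_fraction` plausibly leaves headroom for the CUDA graphs captured at the larger batch sizes, though the commit does not say so explicitly.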
components/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yaml

Lines changed: 3 additions & 5 deletions
````diff
@@ -17,23 +17,21 @@ backend: pytorch
 tensor_parallel_size: 4
 moe_expert_parallel_size: 4
 max_batch_size: 256
-max_num_tokens: 512
+max_num_tokens: 1024
 # 8704 = 8192 ISL + 512 OSL
 max_seq_len: 8704
 disable_overlap_scheduler: true
-enable_autotuner: false

 # Enable Speculative Decoding in the model engine
 speculative_config:
   decoding_type: Eagle
-  max_draft_len: 1
+  max_draft_len: 3
   speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3
-  eagle3_one_model: false
+  eagle3_one_model: true

 kv_cache_config:
   free_gpu_memory_fraction: 0.5
   enable_block_reuse: false
-  dtype: fp8

 cuda_graph_config:
   enable_padding: true
````
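The net effect on the decode engine, shown as a sketch of the resulting fragment (comments are editorial annotations, not part of the file):

```yaml
max_batch_size: 256
max_num_tokens: 1024   # doubled from 512
# 8704 = 8192 ISL + 512 OSL
max_seq_len: 8704
disable_overlap_scheduler: true   # enable_autotuner: false was dropped

speculative_config:
  decoding_type: Eagle
  max_draft_len: 3               # was 1
  speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3
  eagle3_one_model: true         # was false (two-engine base/draft mode)

kv_cache_config:
  free_gpu_memory_fraction: 0.5
  enable_block_reuse: false      # dtype: fp8 was dropped
```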

components/backends/trtllm/engine_configs/llama4/eagle/eagle_prefill.yaml

Lines changed: 2 additions & 4 deletions
````diff
@@ -21,19 +21,17 @@ max_num_tokens: 8192
 max_seq_len: 8192
 print_iter_log: true
 disable_overlap_scheduler: true
-enable_autotuner: false

 # Enable Speculative Decoding in the model engine
 speculative_config:
   decoding_type: Eagle
-  max_draft_len: 1
+  max_draft_len: 3
   speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3
-  eagle3_one_model: false
+  eagle3_one_model: true

 kv_cache_config:
   free_gpu_memory_fraction: 0.5
   enable_block_reuse: false
-  dtype: fp8

 cache_transceiver_config:
   backend: default
````
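The prefill engine picks up the same speculative settings as decode; a sketch of the affected fragment after this hunk (comments editorial):

```yaml
disable_overlap_scheduler: true
# speculative_config now matches the decode engine's:
# max_draft_len: 3, eagle3_one_model: true

kv_cache_config:
  free_gpu_memory_fraction: 0.5
  enable_block_reuse: false   # the fp8 kv-cache dtype was dropped

cache_transceiver_config:
  backend: default
```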

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml

Lines changed: 0 additions & 38 deletions
This file was deleted.

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml

Lines changed: 0 additions & 43 deletions
This file was deleted.

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml

Lines changed: 0 additions & 41 deletions
This file was deleted.

components/backends/trtllm/llama4_plus_eagle.md

Lines changed: 1 addition & 13 deletions
````diff
@@ -30,16 +30,7 @@ This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Specu
 For advanced control over how requests are routed between prefill and decode workers in disaggregated mode, refer to the [Disaggregation Strategy](./README.md#disaggregation-strategy) section.

 ## Notes
-* To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria:
-  * Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21)
-* If you need to download model weights off huggingface, make sure you run the command `huggingface-cli login` and have access to the necessary gated models.
-
-## Eagle3-one-model
-* Eagle3-one-model (`eagle3_one_model=True`) config is added in `engine_configs/llama4/eagle_one_model`. Build dynamo with the latest commit `66f299a` in TRTLLM 1.0.0.rc2 [Link](https://github.com/NVIDIA/TensorRT-LLM/commits/v1.0.0rc2/).
-* The configs in `engine_configs/llama4/eagle_one_model` are tested with 8xH100 cluster. Be sure to change the `NUM_GPUS_PER_NODE` accordingly or change TP/EP size in config. 1 8xH100 node for aggregated .yml file, 2 8xH100 for prefill/decode .yml file.
-* The current `./multinode/start_frontend_services.sh` may got ran `NUM_GPUS_PER_NODE` times depending on how srun/mpi is launched, beware that the frontend service only needs to be ran once.
-* Eagle3-one-model appends the eagle3 layer at the end of the TRTLLM engine, instead of sending base/draft requests between 2 engines. Visit TRTLLM for more information.
-
+* Make sure the (`eagle3_one_model: true`) is set in the LLM API config inside the `engine_configs/llama4/eagle` folder.

 ## Setup

````
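In practice, the remaining note means each LLM API config under `engine_configs/llama4/eagle` should carry a block like the following (values as set by this commit's YAML changes):

```yaml
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: nvidia/Llama-4-Maverick-17B-128E-Eagle3
  eagle3_one_model: true
```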
````diff
@@ -66,7 +57,6 @@ export NUM_NODES=1
 export ENGINE_CONFIG="/mnt/engine_configs/llama4/eagle/eagle_agg.yaml"
 ./multinode/srun_aggregated.sh
 ```
-* Known Issue: In Aggregated Serving, setting `max_num_tokens` to higher values (e.g. `max_num_tokens: 8448`) can lead to Out of Memory (OOM) errors. This is being investigated by the TRTLLM team.

 ## Disaggregated Serving

````
````diff
@@ -77,8 +67,6 @@ export NUM_DECODE_NODES=1
 export DECODE_ENGINE_CONFIG="/mnt/engine_configs/llama4/eagle/eagle_decode.yaml"
 ./multinode/srun_disaggregated.sh
 ```
-* Known Issue: In Aggregated Serving, setting `max_num_tokens` to higher values (e.g. `max_num_tokens: 8448`) can lead to Out of Memory (OOM) errors. This is being investigated by the TRTLLM team.
-

 ## Example Request

````