[TRTLLM-6406, TRTLLM-5172] feat: Enable guided decoding with overlap scheduler #6000


Merged: 6 commits into NVIDIA:main, Jul 17, 2025

Conversation

@syuoni (Collaborator) commented Jul 14, 2025

[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler

Description

This PR enables guided decoding with the overlap scheduler.

Core changes

With the overlap scheduler, the original loop is:

  • Launch model forward iteration i (captured by CUDA graph)
  • Launch sampling iteration i
  • Sync sampling state iteration i-1

This PR changes the loop to:

  • Launch model forward iteration i (captured by CUDA graph)
  • Sync sampling state iteration i-1
  • [Optional] Guided decoding iteration i
    • Get the token IDs (iteration i-1) on CPU, call XGrammar to generate the bitmask, and asynchronously copy it to the device
    • Launch logitsBitmaskKernel
  • Launch sampling iteration i

The key point is that "sync sampling state i-1" must be placed before "launch sampling iteration i", so that the guided decoder has the chance to get the correct token ID on CPU, generate the bitmask, and apply it before sampling i.

Since the model forward pass is normally much heavier than sampling, this change should not reduce the effectiveness of the overlap scheduler.
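The revised loop can be sketched as a toy simulation in Python. All names (`forward`, `sync_sampling_state`, `guided_decode`, `sample`) are illustrative stand-ins, not the actual TensorRT-LLM API; the point is only the ordering invariant described above.

```python
# Toy simulation of the revised overlap-scheduler loop.
# Records the order of operations so the invariant can be checked:
# the sync of iteration i-1 happens before guided decoding of iteration i,
# which happens before sampling of iteration i.
log = []

def forward(i):
    log.append(f"forward {i}")

def sync_sampling_state(i):
    # On iteration 0 there is nothing to sync yet (i == -1).
    log.append(f"sync {i}")

def guided_decode(i):
    # Needs the CPU token IDs from iteration i-1, hence must run after sync.
    log.append(f"guided {i}")

def sample(i):
    log.append(f"sample {i}")

for i in range(2):
    forward(i)                  # launch model forward iteration i
    sync_sampling_state(i - 1)  # sync sampling state iteration i-1
    guided_decode(i)            # build + apply bitmask before sampling
    sample(i)                   # launch sampling iteration i

print(log)
```

Because the forward pass of iteration i is launched first, the CPU-side bitmask work overlaps with the GPU forward pass rather than delaying it.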

Results

Please see the nsys timeline (screenshot omitted): the overlap scheduler stays enabled with guided decoding in the loop.

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

@syuoni syuoni requested review from lowsfer, QiJune and nvbrantz July 14, 2025 09:39
@syuoni syuoni requested a review from a team as a code owner July 14, 2025 09:39
@syuoni (Collaborator, Author) commented Jul 14, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #11799 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #11799 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8743 completed with status: 'SUCCESS'

@syuoni syuoni force-pushed the guided-with-overlap branch from a09f6b3 to ef064c3 Compare July 16, 2025 06:27
@syuoni (Collaborator, Author) commented Jul 16, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #12035 [ run ] triggered by Bot

@syuoni syuoni requested a review from Funatiq July 16, 2025 06:42
@tensorrt-cicd (Collaborator)

PR_Github #12035 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8937 completed with status: 'FAILURE'

@syuoni (Collaborator, Author) commented Jul 16, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #12076 [ run ] triggered by Bot

@Funatiq (Collaborator) commented Jul 16, 2025

The key point is, we need to place "sync sampling state i-1" before "launch sampling iteration i", so that guided decoder can have the chance to get the correct token ID and apply the bitmask before sampling i.

Can you help me understand why this is required? Launch model forward iteration i also gets the token ID (from previous_tensors_device) without requiring the sync. Can we also use the device tensors for the guided decoder?

@syuoni (Collaborator, Author) commented Jul 16, 2025

The key point is, we need to place "sync sampling state i-1" before "launch sampling iteration i", so that guided decoder can have the chance to get the correct token ID and apply the bitmask before sampling i.

Can you help me understand why this is required? Launch model forward iteration i also gets the token ID (from previous_tensors_device) without requiring the sync. Can we also use the device tensors for the guided decoder?

Hi @Funatiq, xgrammar (like other guided decoding backends) is a purely CPU library: it takes CPU token IDs as input and generates a CPU bitmask. Hence, the GPU token IDs don't work for xgrammar.
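To illustrate the CPU/GPU split: the grammar backend consumes CPU token IDs and emits an allow/deny bitmask over the vocabulary, which is then applied to the logits on device (the role of logitsBitmaskKernel) before sampling. Below is a plain-Python stand-in; `build_bitmask`, `apply_bitmask`, and the tiny vocabulary are hypothetical, not the xgrammar API.

```python
import math

VOCAB = ["{", "}", '"', "a", ":"]  # toy vocabulary

def build_bitmask(allowed_ids, vocab_size):
    # CPU-side stand-in for the grammar matcher: given the tokens accepted
    # so far, mark which token IDs are grammatically allowed next.
    return [i in allowed_ids for i in range(vocab_size)]

def apply_bitmask(logits, bitmask):
    # Device-side stand-in for logitsBitmaskKernel: disallowed tokens get
    # -inf logits so they can never be sampled.
    return [x if ok else -math.inf for x, ok in zip(logits, bitmask)]

logits = [0.5, 2.0, 1.0, 3.0, 0.1]
bitmask = build_bitmask({0, 2}, len(VOCAB))  # grammar only allows '{' or '"'
masked = apply_bitmask(logits, bitmask)
best = max(range(len(masked)), key=masked.__getitem__)
print(VOCAB[best])
```

Even though token 3 has the highest raw logit, masking forces sampling to pick from the grammar-allowed set, which is why the mask must be built (on CPU) and applied before the sampling kernel runs.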

syuoni added 6 commits July 16, 2025 12:33, each signed off by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
@syuoni syuoni force-pushed the guided-with-overlap branch from d7d7a72 to 4cb3d5a Compare July 16, 2025 12:34
@syuoni (Collaborator, Author) commented Jul 16, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #12084 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #12076 [ run ] completed with state ABORTED

@nvbrantz (Collaborator) left a comment

The execution flow looks correct: enqueue model inference 0 -> build mask 0 -> enqueue apply mask 0 -> enqueue sample 0 -> enqueue model inference 1 -> event sync sample 0 -> build mask 1 -> enqueue apply mask 1 -> enqueue sample 1 -> ...
Thanks!
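The enqueue/event-sync flow described above can be mimicked with a worker thread standing in for the CUDA stream and a `threading.Event` standing in for the CUDA event recorded after each sampling kernel. All names here are illustrative, not TensorRT-LLM code.

```python
import threading
import queue

stream = queue.Queue()  # stand-in for the CUDA stream's work queue
events = {}             # iteration -> Event "recorded" after its sampling kernel
timeline = []

def gpu_worker():
    # Stand-in for the GPU: executes enqueued work strictly in order.
    while True:
        item = stream.get()
        if item is None:
            break
        name, ev = item
        timeline.append(name)
        if ev is not None:
            ev.set()  # "record event" once the sampling kernel finishes

t = threading.Thread(target=gpu_worker)
t.start()

for i in range(2):
    stream.put((f"forward {i}", None))       # enqueue model inference i
    if i > 0:
        events[i - 1].wait()                 # event sync: sample i-1 done
    timeline.append(f"build mask {i} (CPU)") # CPU bitmask build for iteration i
    stream.put((f"apply mask {i}", None))    # enqueue apply mask i
    ev = threading.Event()
    events[i] = ev
    stream.put((f"sample {i}", ev))          # enqueue sample i, record event

stream.put(None)
t.join()
print(timeline)
```

The key property is that "build mask 1 (CPU)" can only run after the event for "sample 0" fires, while "forward 1" is already enqueued and free to overlap with the CPU mask build.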

@tensorrt-cicd (Collaborator)

PR_Github #12084 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8975 completed with status: 'SUCCESS'

@QiJune (Collaborator) left a comment

LGTM

@syuoni syuoni merged commit 21efb50 into NVIDIA:main Jul 17, 2025
3 checks passed
@syuoni syuoni changed the title [TRTLLM-6406] feat: Enable guided decoding with overlap scheduler [TRTLLM-6406, TRTLLM-5172] feat: Enable guided decoding with overlap scheduler Jul 17, 2025
reasonsolo pushed a commit to reasonsolo/TensorRT-LLM that referenced this pull request Jul 21, 2025 (NVIDIA#6000)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
5 participants