
Conversation

@RonaldBXu
Contributor

@RonaldBXu RonaldBXu commented May 19, 2025

This PR adds support for Llama4-type EAGLE heads to be used for speculative decoding in vLLM v1. This is my first major PR in vLLM, so feedback is appreciated :)
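For context, a minimal usage sketch of what this enables. The draft-head name is a placeholder (no compatible head has been released yet), and the speculative_config keys follow vLLM's existing EAGLE support:

# Hedged usage sketch: a Llama4-type EAGLE draft head in vLLM v1.
# "org/llama4-eagle-head" is a placeholder, not a released checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # example target model
    speculative_config={
        "model": "org/llama4-eagle-head",  # placeholder EAGLE head
        "method": "eagle",
        "num_speculative_tokens": 5,
    },
)
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)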

Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@sarckk
Collaborator

sarckk commented May 19, 2025

cc @zixi-qi @morgendave

@RonaldBXu RonaldBXu closed this May 19, 2025
@RonaldBXu RonaldBXu reopened this May 19, 2025
@RonaldBXu RonaldBXu closed this May 19, 2025
@RonaldBXu RonaldBXu reopened this May 21, 2025
@RonaldBXu
Contributor Author

Ready for review now

@WoosukKwon
Collaborator

@RonaldBXu Looks good to me overall. Could you please add a test? Also, is there any available EAGLE head we can test this on?

@aarnphm
Collaborator

aarnphm commented Jun 5, 2025

Could you please add a test? Also, is there any available EAGLE head we can test this on?

I found this from NVIDIA: https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3, but it seems they are using the EAGLE-3 architecture.

@mergify mergify bot added the llama Related to Llama models label Jun 9, 2025
@RonaldBXu
Contributor Author

Hi @WoosukKwon, when you say add a test, do you mean an e2e test like in https://github.com/vllm-project/vllm/blob/main/tests/spec_decode/e2e/test_eagle_correctness.py or https://github.com/vllm-project/vllm/blob/main/tests/models/registry.py#L407? I think I'd have to open-source a compatible eagle head first, right?

Could you point me to other tests I could work on while I wait for approval for a compatible eagle head? Thanks!

RonaldBXu and others added 4 commits June 14, 2025 14:13
Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 15, 2025
@WoosukKwon
Collaborator

Hi @RonaldBXu, the PR looks good to me overall, but we'd like to have a test or at least a way to run the code.

Please refer to https://github.com/vllm-project/vllm/blob/main/tests/v1/spec_decode/test_eagle.py and the test_eagle_correctness test.

I think I'd have to open-source a compatible eagle head first, right?

Yes. We need an EAGLE head for Llama 4. Could we use https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3 (which @aarnphm mentioned)?

@RonaldBXu
Contributor Author

Thanks, I'll look at those tests. I don't think we can use that head since it is EAGLE3, but the good news is I got approval to release a compatible eagle head for my code. I should hopefully have it ready sometime next week!

RonaldBXu and others added 2 commits June 21, 2025 00:07
Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
@RonaldBXu
Contributor Author

Hi @WoosukKwon, I added the tests. Just wanted to call out that for Llama 4 Maverick, tp=1 was not sufficient (CUDA out-of-memory error), so I made my test initialize the LLM with tp=8. Although I guess I could change it to Llama 4 Scout... Please let me know what you think would be the best option here. Thanks!
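For concreteness, the initialization in question looks roughly like this (a sketch; argument names follow the vLLM LLM API, and the draft-head name is a placeholder):

# Sketch of the test's LLM setup: Maverick OOMs at tp=1, so tp=8 is used.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # example target
    tensor_parallel_size=8,  # tp=1 hits CUDA OOM for Maverick
    speculative_config={
        "model": "org/llama4-eagle-head",  # placeholder EAGLE head
        "method": "eagle",
        "num_speculative_tokens": 3,
    },
)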

Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
@RonaldBXu
Contributor Author

Oh wait, I know what the problem is.

@aarnphm
Collaborator

aarnphm commented Jun 30, 2025

Yeah, you should update the oracle.

Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
@RonaldBXu
Contributor Author

RonaldBXu commented Jun 30, 2025

I think the oracle is fine. The problem is that since the initialization test doesn't have a "method" field in the speculative config, the oracle falls back to V0 (which is correct, so all the existing EAGLE models are tested in V0). However, my implementation is not compatible with V0 (unlike the existing EAGLE models). I added two new fields to the initialization tests, v1_only and speculative_method, for my model.
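Roughly, the idea looks like this (a simplified sketch; the real registry dataclass in vLLM differs, and only the two field names come from the comment above):

# Simplified sketch of the registry metadata described above. Only v1_only
# and speculative_method come from the comment; the rest is illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExampleModelInfo:
    model: str
    v1_only: bool = False                     # True: model only runs on the V1 engine
    speculative_method: Optional[str] = None  # lets the oracle pick the V1 eagle path

llama4_eagle = ExampleModelInfo(
    model="org/llama4-eagle-head",  # placeholder draft head
    v1_only=True,
    speculative_method="eagle",
)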

Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
@RonaldBXu
Contributor Author

Using Maverick as the target model leads to OOM; changing to Scout.

Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
@RonaldBXu
Contributor Author

It seems like we still get OOM. I'm reducing max_model_len. I think the test is being run on small hardware, so it will probably still OOM. Is there a way to designate larger hardware for this specific test? By the way, this test works on my local machine now.
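Concretely, the knobs being turned are along these lines (values illustrative, not the PR's actual settings):

# Illustrative memory knobs for the CI OOM; values are guesses.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # Scout instead of Maverick
    max_model_len=4096,          # smaller context -> smaller KV cache
    gpu_memory_utilization=0.9,  # fraction of GPU memory vLLM may claim
)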

@RonaldBXu
Contributor Author

RonaldBXu commented Jun 30, 2025

Also, I noticed there is an existing FIXME in the initialization test mentioning OOM/memory leaks. I wonder if this is related to the OOM I'm getting in CI.

@RonaldBXu
Contributor Author

Hi @aarnphm what do you think?

@RonaldBXu
Contributor Author

RonaldBXu commented Jul 8, 2025

Hi @aarnphm @benchislett @WoosukKwon, what do you think is the best course of action here? The updated initialization test for my eagle head works locally, but I get OOM in CI. Thanks!

Edit: for now, since another test was already being skipped, I added mine to the skip list as well.
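The skip is the standard pytest mechanism, roughly as follows (the model name is a placeholder, and the real test file has its own skip list):

# Rough sketch of skipping the OOM-prone entry in the initialization test.
import pytest

CI_OOM_SKIPS = {"org/llama4-eagle-head"}  # placeholder model id

def test_can_initialize(model_id: str):
    if model_id in CI_OOM_SKIPS:
        pytest.skip(f"{model_id} OOMs on CI hardware; passes locally")
    ...  # build the model and assert it initializes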

RonaldBXu and others added 2 commits July 8, 2025 07:04
Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
@aarnphm
Collaborator

aarnphm commented Jul 9, 2025

I'm good with merging this in for now. We will probably have something in the works soon.

@RonaldBXu
Contributor Author

Hi @WoosukKwon, could you review this again?

@RonaldBXu
Contributor Author

RonaldBXu commented Jul 9, 2025

Some acceptance-rate results from running spec_decode.py:

total_num_output_tokens: 39840
num_drafts: 27936
num_draft_tokens: 167616
num_accepted_tokens: 11801
mean acceptance length: 1.42
--------------------------------------------------
acceptance at token 0: 0.37
acceptance at token 1: 0.05
acceptance at token 2: 0.00
acceptance at token 3: 0.00
acceptance at token 4: 0.00
acceptance at token 5: 0.00
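As a sanity check, with the usual definition (one token from the target model per step plus the accepted draft tokens per draft), the counters above reproduce the reported mean acceptance length:

# Reproducing the reported mean acceptance length from the counters above.
num_drafts = 27936
num_accepted_tokens = 11801
mean_acceptance_length = 1 + num_accepted_tokens / num_drafts
print(f"{mean_acceptance_length:.2f}")  # -> 1.42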

@mergify mergify bot added the new-model Requests to new models label Jul 10, 2025
@mergify

mergify bot commented Jul 12, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @RonaldBXu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 12, 2025
@mergify mergify bot removed the needs-rebase label Jul 12, 2025
@mergify

mergify bot commented Jul 13, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @RonaldBXu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 13, 2025
@mergify mergify bot removed the needs-rebase label Jul 13, 2025
@aarnphm
Collaborator

aarnphm commented Jul 13, 2025

Hi @RonaldBXu, thanks for all of the hard work. But per the other thread, we might want to have Meta help with the implementation here. Sorry about this.

@aarnphm aarnphm closed this Jul 13, 2025

Labels

ci/build
llama (Related to Llama models)
new-model (Requests to new models)
ready (ONLY add when PR is ready to merge/full CI is needed)
speculative-decoding
v1
