[Model] Update support for NemotronNAS models #15008
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 2d17480 to cd78380
cc @simon-mo regarding deprecating older DeciLM models.
@DarkLight1337 - thanks for immediately addressing our PR! A quick update and question, now that the model is public on HF. We intend to release a similar 253B "Nemotron Ultra" model. It will require slight changes, during which we could restore support for older DeciLM models if needed.
Personally I think we should avoid deprecating older models unless they become incompatible with the current transformers version. But I'll leave it to @simon-mo to decide |
Is your team the original author of DeciLM? If the model team is telling us to deprecate, we are happy to follow that direction.
Does this mean we want to keep both the DeciLM and the NemotronNAS architectures? How do you envision the model files being laid out in the repo?
Yes, I am part of the DeciLM team (now part of Nvidia). We are content with deprecating the older "decilm, DeciLMForCausalLM" models in favor of "nemotron-nas, DeciLMForCausalLM" for now. We hope to release future models and versions as "nemotron-nas, NemotronNASForCausalLM" models, which will make supporting all the architectures and model types listed above much cleaner under separate decilm.py and nemotron_nas.py modeling files. Currently supporting both is possible, though not very aesthetic. If you prefer, we can add a commit that prevents deprecation but is a bit hacky (it would choose which DeciLMForCausalLM to create based on the provided config's fields). If you don't have a strong opinion, we'd prefer to go with the deprecation at this point and address older models in the future if necessary (while possibly enabling proper variable-GQA KV caching so that older models reach their full throughput potential).
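For what it's worth, a minimal sketch of the config-based dispatch mentioned above could look like the following. Field and module names here are assumptions for illustration only, not the actual vLLM registry code:

```python
# Purely illustrative sketch (not the actual vLLM code): pick which
# DeciLMForCausalLM implementation to build based on fields present in the
# HF config, so both old "decilm" and new "nemotron-nas" checkpoints load.
def resolve_decilm_impl(hf_config) -> str:
    # Newer nemotron-nas configs describe per-layer blocks (variable GQA,
    # NOOP attention/MLP) via a "block_configs" field; older DeciLM configs
    # do not. The field name is taken from the public HF configs and is an
    # assumption here.
    if getattr(hf_config, "block_configs", None) is not None:
        return "nemotron_nas.DeciLMForCausalLM"
    return "decilm.DeciLMForCausalLM"
```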
The simpler option works for me. We can note in the release notes that older DeciLM models are no longer supported and that an older version of vLLM should be used for them.
Great, thanks. Then I have no more planned changes here (other than fixes to address future CR feedback).
@simon-mo @DarkLight1337 What will be the next step here?
@WoosukKwon are you okay with the related changes?
Works like a charm, thank you very much!
vllm/config.py
Outdated
Suggested change:
- return ModelRegistry.is_noops_model(architectures)
+ return self.registry.is_noops_model(architectures)
Minor nit. Also this is to ping @WoosukKwon
👍 good point. Fixed and rebased.
Thanks @Naveassaf. I can confirm that I can run this model on 2x A100. Outside the scope of this PR: did you try any reasoning parser for this model? The DeepSeek R1 parser failed due to an obvious start/end token mismatch.
Hi @lokeshwer - glad to hear this worked for you. I have not tried running with the R1 parser. If there is functionality lacking there, feel free to create an issue and tag me there so that the exact ask is refined and we can allocate engineering time to it.
If you'd like to get a sense of the effect of reasoning with this model, you can try the NIM preview endpoint and its "enable reasoning" toggle here: https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1
Actually, the reasoning works fine per se without specifying --enable-reasoning --reasoning-parser deepseek_r1 at all. So far the only difference in that regard compared to running a DS-R1 model (with --enable-reasoning --reasoning-parser deepseek_r1) seems to be that the reasoning tokens then count towards the max_tokens limit specified with the request.
In case anyone reads this who needs guidance:
On 2x A100 and vLLM 0.8.2 + this PR, I run my FP8-Dynamic quant (https://huggingface.co/Ithanil/Llama-3_3-Nemotron-Super-49B-v1-FP8-Dynamic) with --tensor-parallel-size 2 --pipeline-parallel-size 1 --gpu-memory-utilization 0.95 --trust-remote-code --max-model-len 131072 --enable-prefix-caching --enable-chunked-prefill --distributed-executor-backend ray, and use temp 0.6 and top_p 0.95 with reasoning enabled and temp 0 with reasoning disabled (as per Nvidia's suggestion). You will get about 45 tokens/s for a single request.
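For illustration, a rough offline-inference equivalent of that server setup using vLLM's Python API might look like the sketch below. The flag-to-argument mapping and the quantized checkpoint name are taken from the comment above and should be treated as assumptions, not a verified recipe:

```python
from vllm import LLM, SamplingParams

# Sketch mirroring the quoted server flags (2x A100, FP8-Dynamic quant,
# 131072 context); not a verified configuration.
llm = LLM(
    model="Ithanil/Llama-3_3-Nemotron-Super-49B-v1-FP8-Dynamic",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    trust_remote_code=True,
    max_model_len=131072,
    enable_prefix_caching=True,
)

# Nvidia's suggested sampling: temp 0.6 / top_p 0.95 with reasoning on,
# temperature 0 with reasoning off.
reasoning_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Explain KV cache paging in one paragraph."], reasoning_params)
print(outputs[0].outputs[0].text)
```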
With a system message turning on reasoning, we get the LLM response, but the response should be parsed. For the text <think>abc</think>xyz:
- 'abc' goes to reasoning_content
- 'xyz' goes to content
Reference
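A minimal sketch of that split on plain strings (not the actual vLLM reasoning-parser implementation, which works on token streams and handles streaming output):

```python
import re

def split_reasoning(text: str) -> tuple[str | None, str]:
    """Split '<think>abc</think>xyz' into (reasoning_content, content).

    Illustrative only; the real reasoning parsers in vLLM operate on token
    IDs and support streaming/partial responses.
    """
    match = re.match(r"\s*<think>(.*?)</think>(.*)", text, flags=re.DOTALL)
    if match is None:
        return None, text
    return match.group(1), match.group(2)

print(split_reasoning("<think>abc</think>xyz"))  # ('abc', 'xyz')
```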
Yes, but depending on the frontend that doesn't matter, except that the reasoning tokens count towards the regular response tokens.
True. But when switching between DeepSeek and NemotronNAS we have to keep changing the frontend; I think an OpenAI-compatible response format simplifies that. On a related note, the parser assumes the start and end markers are each a single unique token, whereas for NemotronNAS, despite using the same format, the underlying tokenizer doesn't map them to single prompt tokens but to multiple tokens. So the parser won't work out of the box even if the response text looks alike.
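A quick way to check that single-token assumption, sketched with the Hugging Face tokenizer API (the model ID is the one discussed in this thread; adjust as needed):

```python
from transformers import AutoTokenizer

# Check whether the reasoning markers encode to a single token for a given
# model; illustrative only, assuming the tokenizer is available on HF.
tok = AutoTokenizer.from_pretrained(
    "nvidia/Llama-3_3-Nemotron-Super-49B-v1", trust_remote_code=True
)
for tag in ("<think>", "</think>"):
    ids = tok.encode(tag, add_special_tokens=False)
    print(tag, ids, "single token" if len(ids) == 1 else "multiple tokens")
```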
The comments so far make sense. I was curious if anyone added a reasoning parser so we don't have to sort through the <think> tokens on the frontend?
Force-pushed from cd78380 to 734dab0
Since @WoosukKwon is busy, I'll just merge this for now. We can make further changes in another PR if necessary. Sorry for the delay!
Head branch was pushed to by a user without write access
Force-pushed from 734dab0 to f088e48
@DarkLight1337 - had to reformat a commit for DCO. Zero code changes. Could you click the merge button please? Thanks!
Add support for the latest (as of March 18, 2025) `nemotron-nas` type models. These models are currently of architecture `DeciLMForCausalLM`. Practically:
- Support `has_noops` models that may have decoder blocks with NOOP attention or MLPs (see the sketch below).
- Moved `DeciLMForCausalLM` to be under `nemotron_nas` (the newer model type) instead of `deci`.
- Deprecated older `DeciLMForCausalLM` models. These were throughput-oriented models with variable GQA. The implementation over-allocated KV caches and thus made them less interesting anyways.
- Added a `nemotron-nas` model to the model regression tests.

FIX #15068
FIX #15779
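For context on the `has_noops` bullet above, here is a rough sketch of what detecting NOOP blocks from a nemotron-nas style config could look like. The field names (`block_configs`, `attention.no_op`, `ffn.no_op`) follow the public HF configs for these models but are assumptions here, not the exact vLLM implementation:

```python
# Hypothetical sketch: decide whether a nemotron-nas style config contains
# NOOP attention or MLP decoder blocks. Not the actual vLLM code path.
def config_has_noops(hf_config) -> bool:
    block_configs = getattr(hf_config, "block_configs", None) or []
    for block in block_configs:
        attention = getattr(block, "attention", None)
        ffn = getattr(block, "ffn", None)
        if getattr(attention, "no_op", False) or getattr(ffn, "no_op", False):
            return True
    return False
```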