
Conversation

@Naveassaf (Contributor) commented Mar 18, 2025

Add support for the latest (as of March 18, 2025) nemotron-nas model type. These models currently use the DeciLMForCausalLM architecture.

Practically:

  • Define has_noops for models that may have decoder blocks with NOOP attention or MLPs (see the sketch below).
  • Move the modeling definition of DeciLMForCausalLM under nemotron_nas (the newer model type) instead of deci.
  • Deprecate support for older DeciLMForCausalLM models. These were throughput-oriented models with variable GQA; the implementation over-allocated KV caches, which made them less interesting anyway.
  • Add the latest nemotron-nas model to the model regression tests.

FIX #15068
FIX #15779
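
To make the has_noops idea concrete, here is a minimal sketch of a decoder block whose attention or MLP can be skipped entirely, as in NAS-derived nemotron-nas checkpoints. This is illustrative only, not the vLLM implementation; the BlockConfig field names are assumptions for the example.

```python
# Illustrative sketch only -- not vLLM's DeciLMForCausalLM / nemotron_nas code.
# The BlockConfig field names below are assumptions made for this example.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class BlockConfig:
    attention_is_noop: bool  # assumed flag: this block has no attention sub-layer
    mlp_is_noop: bool        # assumed flag: this block has no MLP sub-layer


class NoopableDecoderBlock(nn.Module):
    """Decoder block where attention and/or MLP may be a no-op (identity)."""

    def __init__(self, hidden_size: int, num_heads: int, cfg: BlockConfig):
        super().__init__()
        self.attn = None if cfg.attention_is_noop else nn.MultiheadAttention(
            hidden_size, num_heads, batch_first=True)
        self.mlp = None if cfg.mlp_is_noop else nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.attn_norm = nn.LayerNorm(hidden_size)
        self.mlp_norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.attn is not None:
            h = self.attn_norm(x)
            attn_out, _ = self.attn(h, h, h)
            x = x + attn_out  # residual around attention
        # When attention is a no-op, x simply passes through
        # (conceptually, no KV cache would be needed for this block).
        if self.mlp is not None:
            x = x + self.mlp(self.mlp_norm(x))  # residual around MLP
        return x
```

A has_noops-style flag lets the engine know that some blocks contribute nothing to attention, presumably so that, for example, KV-cache allocation can skip attention-less blocks.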

@github-actions (bot)

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify (bot) added the documentation label (Improvements or additions to documentation) on Mar 18, 2025
@Naveassaf force-pushed the feat/support_nemotron_nas branch from 2d17480 to cd78380 on March 18, 2025 08:29
@DarkLight1337 requested a review from simon-mo on March 18, 2025 09:32
@DarkLight1337 (Member)

cc @simon-mo regarding deprecating older DeciLMForCausalLM models

@Naveassaf (Contributor, Author)

@DarkLight1337 - thanks for addressing our PR so quickly! A quick update and question, now that the model is public on HF.

We intend to release a similar 253B "Nemotron Ultra" model. It will require slight changes, during which we could restore support for the older DeciLMForCausalLM architectures. Does that help with the decision regarding deprecation? Is it a serious blocker?

@DarkLight1337 (Member)

Personally I think we should avoid deprecating older models unless they become incompatible with the current transformers version. But I'll leave it to @simon-mo to decide

@simon-mo (Collaborator)

> Deprecate support for older DeciLMForCausalLM models.

Is your team the original author of DeciLM? If the model team is telling us to deprecate, we are happy to follow that direction.

> We intend to release a similar 253B "Nemotron Ultra" model. It will require slight changes, during which we could restore support for the older DeciLMForCausalLM architectures.

Does this mean we want both the DeciLM and NemotronNAS architectures? How do you envision the model files being laid out in the repo?

@Naveassaf (Contributor, Author)

Yes, I am part of the DeciLM team (now part of Nvidia). We are content with deprecating the older "decilm, DeciLMForCausalLM" models in favor of the "nemotron-nas, DeciLMForCausalLM" models for now.

We hope to release future models and versions as "nemotron-nas, NemotronNASForCausalLM" models, which will make supporting all the architectures and model types listed above much cleaner under separate decilm.py and nemotron_nas.py modeling files. Currently supporting both is possible, though not very aesthetic.

If you prefer, we can add a commit that avoids the deprecation, but it would be a bit hacky (it would choose which DeciLMForCausalLM to create based on the provided config's fields).

And if you don't have a strong opinion, we'd prefer to go with the deprecation at this point and address older models in the future if necessary (while possibly enabling proper variable-GQA KV caching so that older models reach their full throughput potential).
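
For illustration, the config-based dispatch mentioned above could look roughly like the following. This is a hypothetical sketch, not code from this PR; the block_configs attribute check is an assumed way to tell newer nemotron-nas checkpoints apart from older DeciLM ones.

```python
# Hypothetical sketch of dispatching on config fields -- not part of this PR.
# The `block_configs` check is an assumed discriminator, used only for illustration.
from typing import Any


def resolve_decilm_impl(hf_config: Any) -> str:
    """Pick a modeling module for architecture 'DeciLMForCausalLM' based on its config."""
    if getattr(hf_config, "block_configs", None) is not None:
        # Assumed: newer nemotron-nas style checkpoints describe per-block NOOP attention/MLP.
        return "nemotron_nas"
    # Older variable-GQA DeciLM checkpoints (deprecated by this PR).
    return "decilm"
```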

Collaborator

The simpler option works for me. We can note in the release notes that the older DeciLM is no longer supported and that users should use an older version of vLLM for it.

@Naveassaf (Contributor, Author)

Great, thanks.

Then I have no more planned changes here (other than fixes to address future CR feedback).

@WoosukKwon (Collaborator)

@simon-mo @DarkLight1337 What will be the next step here?

@simon-mo (Collaborator)

@WoosukKwon are you okay with the NOOP-related changes?

@Ithanil (Contributor) commented Mar 30, 2025

Works like a charm, thank you very much!

vllm/config.py (Outdated)

@DarkLight1337 (Member) commented Mar 31, 2025

Suggested change:
- return ModelRegistry.is_noops_model(architectures)
+ return self.registry.is_noops_model(architectures)

Minor nit. Also this is to ping @WoosukKwon

@Naveassaf (Contributor, Author)

👍 good point. Fixed and rebased.

@lokeshwer

Thanks @Naveassaf. I can confirm that I can run this model on 2x A100. Outside the scope of this PR: did you try any reasoning parser for this model? The DeepSeek R1 parser failed due to an obvious start/end token mismatch.

@Naveassaf (Contributor, Author)

Hi @lokeshwer - glad to hear this worked for you. I have not tried running with the R1 parser. If functionality is lacking there, feel free to create an issue and tag me so that the exact ask can be refined and we can allocate engineering time to it.

If you'd like to get a sense of the effect of reasoning with this model, you can try the NIM preview endpoint and its "enable reasoning" toggle here: https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1

@Ithanil (Contributor) commented Mar 31, 2025

Actually, the reasoning works fine per se without specifying --enable-reasoning --reasoning-parser deepseek_r1 at all. So far, the only difference in that regard compared to running a DS-R1 model (with --enable-reasoning --reasoning-parser deepseek_r1) seems to be that the reasoning tokens then count towards the max_tokens limit specified with the request.

In case anyone reads this and needs guidance:
On 2x A100 with vLLM 0.8.2 + this PR, I run my FP8-Dynamic quant (https://huggingface.co/Ithanil/Llama-3_3-Nemotron-Super-49B-v1-FP8-Dynamic) with --tensor-parallel-size 2 --pipeline-parallel-size 1 --gpu-memory-utilization 0.95 --trust-remote-code --max-model-len 131072 --enable-prefix-caching --enable-chunked-prefill --distributed-executor-backend ray, and use temperature 0.6 with top_p 0.95 when reasoning is enabled and temperature 0 when reasoning is disabled (as per Nvidia's suggestion). You will get about 45 tokens/s for a single request.
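
For anyone who prefers vLLM's offline Python API over the CLI flags above, a roughly equivalent setup might look like the sketch below (model path and sampling values taken from the comment above; the prompt is just a placeholder, and this is a sketch rather than official guidance).

```python
# Rough Python-API equivalent of the serving flags above -- a sketch, not official guidance.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Ithanil/Llama-3_3-Nemotron-Super-49B-v1-FP8-Dynamic",
    tensor_parallel_size=2,
    pipeline_parallel_size=1,
    gpu_memory_utilization=0.95,
    trust_remote_code=True,
    max_model_len=131072,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    distributed_executor_backend="ray",
)

# Reasoning on: temperature 0.6 / top_p 0.95; reasoning off: temperature 0 (per the comment above).
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Briefly explain what a NOOP attention block is."], params)
print(outputs[0].outputs[0].text)
```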

@lokeshwer commented Apr 1, 2025

With a system message turning on reasoning, we get an LLM response, but the response should be parsed. For the text <think>abc</think>xyz:
  • 'abc' goes to reasoning_content
  • 'xyz' goes to content
(Reference)
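
A minimal standalone illustration of that split (not vLLM's reasoning parser, just a sketch of the expected behavior):

```python
# Standalone sketch of splitting <think>...</think> output -- not vLLM's parser.
import re


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning_content, content) for text like '<think>abc</think>xyz'."""
    match = re.match(r"\s*<think>(.*?)</think>(.*)", text, flags=re.DOTALL)
    if match is None:
        return "", text  # no reasoning block found
    return match.group(1).strip(), match.group(2).strip()


print(split_reasoning("<think>abc</think>xyz"))  # -> ('abc', 'xyz')
```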

Contributor

Yes, but depending on the frontend that doesn't matter, except that the reasoning tokens count towards the regular response tokens.

@lokeshwer commented Apr 1, 2025

True. But when switching between DeepSeek and NemotronNAS, we have to keep changing the frontend; I think an OpenAI-compatible response format simplifies that. On a related note, the parser assumes the start and end tags are each a unique token, whereas NemotronNAS, despite using the <think> format, has a tokenizer that doesn't map them to a single prompt token but to multiple tokens. So the parser won't work outright, even if the response text looks alike.
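
One way to check that claim for a given checkpoint (a sketch assuming the tokenizer is available from the FP8 repo linked earlier in this thread):

```python
# Sketch: check whether <think> / </think> map to single tokens for this tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ithanil/Llama-3_3-Nemotron-Super-49B-v1-FP8-Dynamic")
for tag in ("<think>", "</think>"):
    ids = tok.encode(tag, add_special_tokens=False)
    status = "single token" if len(ids) == 1 else f"{len(ids)} tokens"
    print(f"{tag!r} -> {ids} ({status})")
```

If the tags tokenize to multiple IDs, a parser that matches on a single special token ID will not detect them, which matches the behavior described above.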


The comments so far make sense. I was curious if anyone added a reasoning parser so we don't have to sort through the <think> tokens on the frontend?

@Naveassaf force-pushed the feat/support_nemotron_nas branch from cd78380 to 734dab0 on March 31, 2025 09:03
@Naveassaf requested a review from DarkLight1337 on March 31, 2025 09:04
@DarkLight1337 (Member) commented Mar 31, 2025

Since @WoosukKwon is busy I'll just merge this for now. We can make further changes in another PR if necessary. Sorry for the delay!

@DarkLight1337 enabled auto-merge (squash) on March 31, 2025 09:17
@github-actions (bot) added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Mar 31, 2025
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
auto-merge was automatically disabled March 31, 2025 10:02

Head branch was pushed to by a user without write access

@Naveassaf force-pushed the feat/support_nemotron_nas branch from 734dab0 to f088e48 on March 31, 2025 10:02
@Naveassaf (Contributor, Author)

@DarkLight1337 - had to reformat a commit for DCO. Zero code changes. Could you click the merge button please? Thanks!

@DarkLight1337 merged commit 3aa2b6a into vllm-project:main on Mar 31, 2025
35 checks passed
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
@janekl mentioned this pull request Apr 30, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>

Labels

documentation: Improvements or additions to documentation
ready: ONLY add when PR is ready to merge / full CI is needed

Projects

None yet

7 participants