[Model] Update support for NemotronNAS models #15008
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 2d17480 to cd78380
cc @simon-mo regarding deprecating older DeciLM models.
@DarkLight1337 - thanks for immediately addressing our PR! A quick update and question, now that the model is public on HF. We intend to release a similar 253B "Nemotron Ultra" model. It will require slight changes, during which we could restore support for older DeciLM models if needed.
Personally I think we should avoid deprecating older models unless they become incompatible with the current transformers version. But I'll leave it to @simon-mo to decide |
Is your team the original author of DeciLM? If the model team is telling us to deprecate, we are happy to follow that direction.
Does this mean we want to keep both the DeciLM and the NemotronNAS architectures? How do you envision the model files being laid out in the repo?
Yes, I am part of the DeciLM team (now part of Nvidia). We are content with deprecating the older "decilm, DeciLMForCausalLM" models in favor of "nemotron-nas, DeciLMForCausalLM" for now. We hope to release future models and versions as "nemotron-nas, NemotronNASForCausalLM" models, which will make supporting all the architectures and model types listed above much cleaner under separate decilm.py and nemotron_nas.py modeling files. Currently supporting both is possible, though not very aesthetic. If you prefer, we can add a commit that prevents deprecation but is a bit hacky (it would choose which DeciLMForCausalLM to create based on the provided config's fields). If you don't have a strong opinion, we'd prefer to go with the deprecation at this point and address older models in the future if necessary (while possibly enabling proper variable-GQA KV caching so that older models reach their full throughput potential).
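For what it's worth, a minimal sketch of the config-based dispatch mentioned above could look like the following. Field and module names here are assumptions for illustration only, not the actual vLLM registry code:

```python
# Purely illustrative sketch (not the actual vLLM code): pick which
# DeciLMForCausalLM implementation to build based on fields present in the
# HF config, so both old "decilm" and new "nemotron-nas" checkpoints load.
def resolve_decilm_impl(hf_config) -> str:
    # Newer nemotron-nas configs describe per-layer blocks (variable GQA,
    # NOOP attention/MLP) via a "block_configs" field; older DeciLM configs
    # do not. The field name is taken from the public HF configs and is an
    # assumption here.
    if getattr(hf_config, "block_configs", None) is not None:
        return "nemotron_nas.DeciLMForCausalLM"
    return "decilm.DeciLMForCausalLM"
```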
The simpler option works for me. We can note in the release notes that older DeciLM models are no longer supported and that an older version of vLLM should be used for them.
Great, thanks. Then I have no more planned changes here (other than fixes to address future CR feedback).
@simon-mo @DarkLight1337 What will be the next step here?
@WoosukKwon are you okay with the related changes?
Works like a charm, thank you very much!
vllm/config.py
Outdated
Suggested change:
- return ModelRegistry.is_noops_model(architectures)
+ return self.registry.is_noops_model(architectures)
Minor nit. Also this is to ping @WoosukKwon
👍 good point. Fixed and rebased.
Thanks @Naveassaf. I can confirm that I can run this model on 2x A100. Outside the scope of this PR: did you try any reasoning parser for this model? The DeepSeek R1 parser failed due to an obvious start/end token mismatch.
Hi @lokeshwer - glad to hear this worked for you. I have not tried running with the R1 parser. If there is functionality lacking there, feel free to create an issue and tag me there so that the exact ask is refined and we can allocate engineering time to it.
If you'd like to get a sense of the effect of reasoning with this model, you can try the NIM preview endpoint and its "enable reasoning" toggle here: https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1
Actually, the reasoning works fine per se without specifying --enable-reasoning --reasoning-parser deepseek_r1 at all. So far the only difference in that regard compared to running a DS-R1 model (with --enable-reasoning --reasoning-parser deepseek_r1) seems to be that the reasoning tokens then count towards the max_tokens limit specified with the request.
In case anyone reads this who needs guidance:
On 2x A100 and vLLM 0.8.2 + this PR, I run my FP8-Dynamic quant (https://huggingface.co/Ithanil/Llama-3_3-Nemotron-Super-49B-v1-FP8-Dynamic) with --tensor-parallel-size 2 --pipeline-parallel-size 1 --gpu-memory-utilization 0.95 --trust-remote-code --max-model-len 131072 --enable-prefix-caching --enable-chunked-prefill --distributed-executor-backend ray, and use temp 0.6 and top_p 0.95 with reasoning enabled and temp 0 with reasoning disabled (as per Nvidia's suggestion). You will get about 45 tokens/s for a single request.
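For illustration, a rough offline-inference equivalent of that server setup using vLLM's Python API might look like the sketch below. The flag-to-argument mapping and the quantized checkpoint name are taken from the comment above and should be treated as assumptions, not a verified recipe:

```python
from vllm import LLM, SamplingParams

# Sketch mirroring the quoted server flags (2x A100, FP8-Dynamic quant,
# 131072 context); not a verified configuration.
llm = LLM(
    model="Ithanil/Llama-3_3-Nemotron-Super-49B-v1-FP8-Dynamic",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    trust_remote_code=True,
    max_model_len=131072,
    enable_prefix_caching=True,
)

# Nvidia's suggested sampling: temp 0.6 / top_p 0.95 with reasoning on,
# temperature 0 with reasoning off.
reasoning_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Explain KV cache paging in one paragraph."], reasoning_params)
print(outputs[0].outputs[0].text)
```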
With a system message turning on reasoning, we get the LLM response, but the response should be parsed. For the text <think>abc</think>xyz:
- 'abc' goes to reasoning_content
- 'xyz' goes to content
Reference
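A minimal sketch of that split on plain strings (not the actual vLLM reasoning-parser implementation, which works on token streams and handles streaming output):

```python
import re

def split_reasoning(text: str) -> tuple[str | None, str]:
    """Split '<think>abc</think>xyz' into (reasoning_content, content).

    Illustrative only; the real reasoning parsers in vLLM operate on token
    IDs and support streaming/partial responses.
    """
    match = re.match(r"\s*<think>(.*?)</think>(.*)", text, flags=re.DOTALL)
    if match is None:
        return None, text
    return match.group(1), match.group(2)

print(split_reasoning("<think>abc</think>xyz"))  # ('abc', 'xyz')
```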
Yes, but depending on the frontend that doesn't matter, except that the reasoning tokens count towards the regular response tokens.
True. But when switching between DeepSeek and NemotronNAS we have to keep changing the frontend; I think an OpenAI-compatible response format simplifies that. On a related note, the parser assumes the start and end markers are each a single unique token, whereas for NemotronNAS, despite using the same format, the underlying tokenizer doesn't map them to single prompt tokens but to multiple tokens. So the parser won't work out of the box even if the response text looks alike.
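A quick way to check that single-token assumption, sketched with the Hugging Face tokenizer API (the model ID is the one discussed in this thread; adjust as needed):

```python
from transformers import AutoTokenizer

# Check whether the reasoning markers encode to a single token for a given
# model; illustrative only, assuming the tokenizer is available on HF.
tok = AutoTokenizer.from_pretrained(
    "nvidia/Llama-3_3-Nemotron-Super-49B-v1", trust_remote_code=True
)
for tag in ("<think>", "</think>"):
    ids = tok.encode(tag, add_special_tokens=False)
    print(tag, ids, "single token" if len(ids) == 1 else "multiple tokens")
```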
The comments so far make sense. I was curious if anyone added a reasoning parser so we don't have to sort through the <think> tokens on the frontend?
Force-pushed from cd78380 to 734dab0
Since @WoosukKwon is busy, I'll just merge this for now. We can make further changes in another PR if necessary. Sorry for the delay!
Head branch was pushed to by a user without write access
Force-pushed from 734dab0 to f088e48
@DarkLight1337 - had to reformat a commit for DCO. Zero code changes. Could you click the merge button please? Thanks!
Add support for the latest (as of March 18, 2025) `nemotron-nas` type models. These models are currently of architecture `DeciLMForCausalLM`. Practically:
- Support `has_noops` models that may have decoder blocks with NOOP attention or MLPs (see the sketch below).
- Moved `DeciLMForCausalLM` to be under `nemotron_nas` (the newer model type) instead of `deci`.
- Deprecated older `DeciLMForCausalLM` models. These were throughput-oriented models with variable GQA. The implementation over-allocated KV caches and thus made them less interesting anyways.
- Added a `nemotron-nas` model to the model regression tests.

FIX #15068
FIX #15779
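For context on the `has_noops` bullet above, here is a rough sketch of what detecting NOOP blocks from a nemotron-nas style config could look like. The field names (`block_configs`, `attention.no_op`, `ffn.no_op`) follow the public HF configs for these models but are assumptions here, not the exact vLLM implementation:

```python
# Hypothetical sketch: decide whether a nemotron-nas style config contains
# NOOP attention or MLP decoder blocks. Not the actual vLLM code path.
def config_has_noops(hf_config) -> bool:
    block_configs = getattr(hf_config, "block_configs", None) or []
    for block in block_configs:
        attention = getattr(block, "attention", None)
        ffn = getattr(block, "ffn", None)
        if getattr(attention, "no_op", False) or getattr(ffn, "no_op", False):
            return True
    return False
```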