[Frontend] Support reasoning content for deepseek r1 #12473
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 54b910e to 946179e.
To make sure this is working correctly, can you add some tests to the codebase? Would also be great to add a section to the docs explaining its usage!
cc @mgoin @K-Mistele
vllm/entrypoints/openai/reasoning_parsers/deepseek_r1_reasoning_parser.py
Yeah, I think we can just use |
Will work on this.
How does this work together with Guided/Structured Outputs?
@cksac Hi, please check out #12468 (comment). I don’t think they can work together. Also, the DeepSeek API doesn’t do function calls or structured output for DeepSeek R1, just FYI. If you enable this feature together with structured output, the output will contain only the structured output.
    expected_reasoning: str,
    expected_content: str,
):
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
I'm not sure whether we can accept this: I used a tokenizer to mimic the streaming output process, which may require some memory.
The DeepSeek tokenizer can be instantiated locally from the zip file they provide here: https://api-docs.deepseek.com/quick_start/token_usage
Though maybe this is worth upstreaming to Transformers?
I think we can use this:
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")
output = tokenizer.tokenize(param_dict["output"])
I used opt-125m to conserve memory during testing because I was unsure about the resources available on the test machine. If we can tolerate around 500MB of memory usage, we can use the R1 tokenizer directly.
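For context, a minimal sketch of how a tokenizer can be used to mimic streaming deltas in a test; the helper name and the exact way the parser consumes these pairs are assumptions, not the code in this PR:

from transformers import AutoTokenizer

def iter_stream_deltas(full_output: str, model_name: str = "facebook/opt-125m"):
    """Yield (previous_text, delta_text) pairs, replaying the output token by token."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokens = tokenizer.tokenize(full_output)
    previous = ""
    for token in tokens:
        delta = tokenizer.convert_tokens_to_string([token])
        yield previous, delta
        previous += delta

Each (previous, delta) pair can then be fed to the streaming reasoning parser, so the test never needs a running model.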
@DarkLight1337 I added some unit tests, but I did not find how to add e2e tests. Do we need e2e cases for this feature? Could you please point me in the right direction?
Thanks for your time.
Yes, OpenAI hides the reasoning content.
I hadn’t realized they support reasoning. Thanks for bringing that to my attention! I will take a look too.
examples/online_serving/openai_chat_completion_with_reasoning.py
examples/online_serving/openai_chat_completion_with_reasoning_streaming.py
Comments are addressed; please take a look.
Excellent work incorporating feedback and producing nicely documented code, LGTM! I'd like @DarkLight1337 or @K-Mistele to sign off as well before landing
parser.add_argument(
    "--reasoning-parser",
    type=str,
    metavar="{" + ",".join(valid_reasoning_parsers) + "}",
Why couldn't this just be choices=valid_reasoning_parsers?
Here’s the trick adapted from the tool parser argument.
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/cli_args.py#L216
I think it is to keep consistency with the other CLI arguments:
$ vllm serve --help
...
--scheduling-policy {fcfs,priority}
...
--task {auto,generate,embedding,embed,classify,score,reward}
...
--tokenizer-mode {auto,slow,mistral}
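For illustration, a minimal sketch of that metavar pattern; the registry contents here are an assumption:

import argparse

# Sketch: metavar only controls how the option is rendered in --help,
# matching the "{a,b,c}" style of the other vLLM CLI arguments shown above.
valid_reasoning_parsers = ["deepseek_r1"]  # assumed registry contents

parser = argparse.ArgumentParser()
parser.add_argument(
    "--reasoning-parser",
    type=str,
    metavar="{" + ",".join(valid_reasoning_parsers) + "}",
    default=None,
    help="Select the reasoning parser depending on the reasoning model.")
print(parser.format_help())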
The Vercel AI SDK has a design akin to this PR. It employs a regex to parse the entire output for synchronous generation requests, and we use the same method:

const regexp = new RegExp(`${openingTag}(.*?)${closingTag}`, 'gs');
const matches = Array.from(text.matchAll(regexp));
if (!matches.length) {
  return { text, ...rest };
}
const reasoning = matches.map(match => match[1]).join(separator);

For streaming requests, we also have a similar design that checks for the start and end tokens in both the previous text and the delta text.
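For illustration, a rough Python sketch of that streaming-side check; the tag strings and the return shape are assumptions, not the exact parser code in this PR:

THINK_START = "<think>"
THINK_END = "</think>"

def split_delta(previous_text: str, delta_text: str):
    """Classify a streamed delta as reasoning content vs. final content."""
    if THINK_END in previous_text:
        # Reasoning finished in an earlier chunk; everything new is content.
        return None, delta_text
    if THINK_END in delta_text:
        # The closing tag arrives inside this delta: split around it.
        reasoning, _, content = delta_text.partition(THINK_END)
        return reasoning or None, content or None
    if THINK_START in previous_text + delta_text:
        # Still inside the reasoning block.
        return delta_text, None
    return None, delta_text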
Thanks for your review 🥰
Hey guys, just now getting to this; glad to see it's been merged. For what it's worth, there have already been some discussions (ref. #11522) about refactoring tool parsers (possibly to use FSMs, which seems like a good approach for managing tool and reasoning parsing) instead of the current implementation, which tends to be buggy. It seems like the direction we want to move towards is having a single parser module for any given (supported) model that provides a unified interface for both tool parsing and reasoning parsing, if one or both are supported by the model. This would be more straightforward from a UX and a DX standpoint, and should result in cleaner code as well.
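For illustration only, a unified per-model parser interface along those lines might look roughly like this; all names here are hypothetical, not an existing vLLM API:

from abc import ABC, abstractmethod
from typing import Optional

class ModelOutputParser(ABC):
    """Hypothetical single entry point for both reasoning and tool parsing."""

    @abstractmethod
    def extract_reasoning(self, model_output: str) -> tuple[Optional[str], str]:
        """Return (reasoning_content, remaining_content)."""

    @abstractmethod
    def extract_tool_calls(self, content: str) -> list[dict]:
        """Return any tool calls found in the non-reasoning content."""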
I think it's reasonable, after a glance at the issue. Reasoning can be understood as a special tool, and they can share the implementation. During my implementation, I did find the tool parser code to be quite messy.
I was wondering whether JSON schema for the answer only (without suppressing the thinking tokens) is added in this PR?
It is not supported, see #12468 (comment).
Fix #12468
Two CLI arguments are introduced for extensibility: --enable-reasoning and --reasoning-parser.

Test for full request:
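The exact script isn't reproduced here, but a minimal sketch of a full (non-streaming) request against the OpenAI-compatible server would look roughly like this, assuming the server was started with --enable-reasoning --reasoning-parser deepseek_r1 on localhost:8000:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
)

message = response.choices[0].message
# reasoning_content is an extra field on the message; read it defensively.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)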
Test for stream request:
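Similarly, a hedged sketch of the streaming variant; as noted below, the stock openai client may need patching to surface reasoning_content on the delta:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="", flush=True)
    elif delta.content:
        print(delta.content, end="", flush=True)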
These scripts are from https://api-docs.deepseek.com/guides/reasoning_model#api-example. However, I encountered an issue with stream requests:
I had to hack the openai Python package to get it to pass. The RESTful API works well.