CVS-175736-[OVEP] Enable stateful mode for Phi-silica models #821
Conversation
```cpp
if (gpu_or_npu) {
  prefill_use_full_chat_history = true;
}
// bool gpu_or_npu = ((device.find("NPU") != std::string::npos) || (device.find("GPU") != std::string::npos));
```
Need to discuss with the ORT-GenAI team how to handle this logic.
Co-author: Beheshti, Nazanin
Force-pushed 0fe0302 to 1e132f3
```cpp
// check if there is input_ids tensors and if the tensor type is int64,
// because logic prefill_use_full_chat_history is only for specific inputs and data type
auto input_ids_opt = FindTensor("input_ids");
if (gpu_or_npu && input_ids_opt.has_value() && input_ids_opt->get_element_type() != ov::element::i64) {
```
The comment contradicts the code: the comment says the tensor type should be int64, while the condition checks that it is not. Please check whether this is a bug.
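If the intent matches the code comment (apply the full-chat-history prefill only when `input_ids` exists and is int64), the condition would test equality rather than inequality. A minimal stand-alone sketch of that reading, using a hypothetical `ElementType` enum in place of `ov::element::Type` so it compiles without OpenVINO:

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for ov::element::Type, just to illustrate the branch.
enum class ElementType { i64, i32, f32 };

// Sketch: enable the full-chat-history prefill path only when the device is
// GPU/NPU, an input_ids tensor exists, and its element type is int64.
bool should_use_full_chat_history(bool gpu_or_npu,
                                  std::optional<ElementType> input_ids_type) {
  return gpu_or_npu && input_ids_type.has_value() &&
         *input_ids_type == ElementType::i64;  // '==', not '!='
}
```

Whether `==` or `!=` is correct depends on the author's intent; this sketch only shows the variant the comment describes.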
```cpp
}

if (ModelHasInputOutputNames(ov_model, "/model/embed_tokens/Gather_output_0")) {
  main_input_name = "/model/embed_tokens/Gather_output_0";
```
I'm not a fan of hardcoding specific input names, which can change over time and break things. Could you please see whether there's a possibility to avoid that?
```cpp
  main_input_name = "input_ids";
}

if (ModelHasInputOutputNames(ov_model, "input_hidden_states")) {
  main_input_name = "input_hidden_states";
}

if (ModelHasInputOutputNames(ov_model, "/model/embed_tokens/Gather_output_0")) {
  main_input_name = "/model/embed_tokens/Gather_output_0";
}
```
We have a lot of code duplication here. Please make a helper function that takes an array of strings as input and checks whether they're present in the model.
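A hedged sketch of the helper the reviewer suggests: take the candidate names in priority order and return the first one the model actually exposes. The names `PickMainInputName` and `model_names` are illustrative; in the real code the lookup would go through `ModelHasInputOutputNames`.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the first candidate input name present in the
// model, or the fallback if none match. 'model_names' stands in for whatever
// set of names ModelHasInputOutputNames consults.
std::string PickMainInputName(const std::unordered_set<std::string>& model_names,
                              const std::vector<std::string>& candidates,
                              const std::string& fallback) {
  for (const auto& name : candidates) {
    if (model_names.count(name)) return name;
  }
  return fallback;
}
```

Callers would then pass `{"input_hidden_states", "/model/embed_tokens/Gather_output_0"}` once, replacing the three duplicated `if` blocks with a single call.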
```cpp
key_value_input_names.push_back(name);
const auto& params = model->get_parameters();
bool found = false;
for (auto i = 0; i < params.size(); i++) {
```
Reminder: watch for signed/unsigned mismatches. `auto i = 0` deduces `int`, which does not match the container's `.size()` type (`size_t`).
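A minimal illustration of the reviewer's point, with a hypothetical `count_matches` function: declaring the index as `size_t` (or using a range-based for loop) keeps the comparison against `.size()` within one type and silences `-Wsign-compare`.

```cpp
#include <string>
#include <vector>

// Counts parameter names containing 'needle'. The loop index is size_t,
// matching std::vector::size(), so there is no signed/unsigned comparison.
int count_matches(const std::vector<std::string>& params,
                  const std::string& needle) {
  int found = 0;
  for (size_t i = 0; i < params.size(); ++i) {  // size_t, not int
    if (params[i].find(needle) != std::string::npos) ++found;
  }
  return found;
}
```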
Force-pushed 1e132f3 to e50603d
Please attach a JIRA for this feature request in the PR description.
Force-pushed e50603d to 25c6976
Done, let me know if you have further questions.

Hi @ankitm3k, any feedback on this PR?
Pull Request Overview
This PR enables stateful mode for Phi-silica models by enhancing the OpenVINO provider's ability to recognize different LLM model types and properly handle their key-value cache inputs. This optimization reduces memory usage from 16 GB to 3.7 GB and improves performance from 1 fps to 16 fps for Phi-Silica workloads on the OVEP GPU backend.
- Enhanced input name detection with prioritized candidate names for different model types
- Improved key-value cache input recognition to handle "keys" and "values" patterns beyond just "key_values"
- Added conditional logic for prefill optimization based on input tensor characteristics
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ov_stateful_patch_utils.cc | Added flexible input name detection and expanded key-value cache pattern matching |
| ov_interface.cc | Enhanced stateful request initialization with input type validation |
```cpp
const auto& params = model->get_parameters();
bool found = false;
for (size_t i = 0; i < params.size(); i++) {
  auto param_name = params.at(i)->output(0).get_any_name();
```
Copilot (AI), Oct 28, 2025:
Using params.at(i) is less efficient than direct indexing with params[i]. Consider using range-based for loop or direct indexing for better performance.
```diff
- auto param_name = params.at(i)->output(0).get_any_name();
+ auto param_name = params[i]->output(0).get_any_name();
```
```cpp
if (!found) {
  not_kv_inputs.push_back(input.get_any_name());
  not_kv_inputs.push_back(param_name);
}
```
Copilot (AI), Oct 28, 2025:
The `found` variable is never reset to `false` between iterations, causing incorrect classification of subsequent parameters. Reset `found = false` at the beginning of each loop iteration.
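A stand-alone sketch of the corrected control flow, with `found` reset once per outer parameter; the function and variable names here are illustrative, not the PR's actual helpers:

```cpp
#include <string>
#include <vector>

// Splits parameter names into key-value cache inputs and everything else.
// 'found' is scoped inside the outer loop, so one match cannot poison the
// classification of later parameters -- the fix the review suggests.
void classify_params(const std::vector<std::string>& params,
                     std::vector<std::string>& kv_inputs,
                     std::vector<std::string>& not_kv_inputs) {
  static const std::vector<std::string> kv_patterns = {"key_values", "keys",
                                                       "values"};
  for (const auto& name : params) {
    bool found = false;  // reset per parameter
    for (const auto& pat : kv_patterns) {
      if (name.find(pat) != std::string::npos) {
        kv_inputs.push_back(name);
        found = true;
        break;
      }
    }
    if (!found) not_kv_inputs.push_back(name);
  }
}
```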
@Kotomi-Du -- Can you resolve the existing feedback on this PR? Thanks!
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```cpp
} else if (name.find("keys") != std::string::npos) {
  key_value_input_names.push_back(name);
  found = true;
  break;
} else if (name.find("values") != std::string::npos) {
  key_value_input_names.push_back(name);
  found = true;
  break;
```
Copilot (AI), Oct 30, 2025:
The logic for detecting 'keys' and 'values' patterns treats them separately with identical code blocks. This could lead to only finding one type of cache input when both should be collected. Consider restructuring to collect all matching key-value inputs rather than breaking after the first match.
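Since the "keys" and "values" branches perform the same action, one way to remove the duplication is to fold the patterns into a single predicate. A sketch (illustrative, not the PR's actual code):

```cpp
#include <string>

// True if the parameter name looks like a key-value cache input. Collapsing
// the three substring checks into one predicate keeps the shared action
// (pushing onto key_value_input_names) in a single branch at the call site.
bool is_kv_cache_name(const std::string& name) {
  return name.find("key_values") != std::string::npos ||
         name.find("keys") != std::string::npos ||
         name.find("values") != std::string::npos;
}
```

Whether all matches should be collected (as Copilot suggests) or the loop should still break on the first hit depends on the surrounding logic, which this sketch leaves to the caller.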
@RyanMetcalfeInt8 addressed; also requested a new Copilot review, which was not helpful.
LGTM
Description
Recognize additional LLM model variants, specifically Phi-Silica models, so they trigger the path that converts the model to stateful mode.
Motivation and Context
Combined with #830 and #831, these changes improve the memory footprint and performance of the Phi-Silica workload. Without them, the workload consumed 16 GB of memory and ran at 1 fps on the OVEP GPU backend; with them, memory usage drops to 3.7 GB and performance reaches 16 fps.
Does this feature go to the new ABI?
Yes
Jira Ticket:
https://jira.devtools.intel.com/browse/CVS-175736