Conversation

@Kotomi-Du Kotomi-Du commented Oct 3, 2025

Description

Recognize additional LLM model types, specifically Phi-Silica models, so that they trigger the stateful-model transformation path.

Motivation and Context

Combined with #830 and #831, these changes improve the memory footprint and performance of the Phi-Silica workload. Without them, the workload consumed 16 GB of memory and ran at 1 fps on the OVEP GPU backend; with them, memory usage drops to 3.7 GB and performance reaches 16 fps.

Does this feature go to the new ABI?

Yes

Jira Ticket:

https://jira.devtools.intel.com/browse/CVS-175736

@Kotomi-Du Kotomi-Du changed the base branch from master to ovep-develop October 3, 2025 00:24
@Kotomi-Du Kotomi-Du marked this pull request as draft October 3, 2025 00:26
if (gpu_or_npu) {
prefill_use_full_chat_history = true;
}
// bool gpu_or_npu = ((device.find("NPU") != std::string::npos) || (device.find("GPU") != std::string::npos));
Author

need to discuss with ORT-GenAI team how to handle this logic

@Kotomi-Du Kotomi-Du force-pushed the make_stateful_phisilica branch from 0fe0302 to 1e132f3 Compare October 11, 2025 00:29
@Kotomi-Du Kotomi-Du marked this pull request as ready for review October 11, 2025 00:30
Comment on lines 365 to 368
// check if there is input_ids tensors and if the tensor type is int64,
// because logic prefill_use_full_chat_history is only for specific inputs and data type
auto input_ids_opt = FindTensor("input_ids");
if (gpu_or_npu && input_ids_opt.has_value() && input_ids_opt->get_element_type() != ov::element::i64) {

The comment contradicts the code: the comment says the tensor type is int64, while the condition checks that it is not. Please check whether there's a bug.

}

if (ModelHasInputOutputNames(ov_model, "/model/embed_tokens/Gather_output_0")) {
main_input_name = "/model/embed_tokens/Gather_output_0";

I'm not a fan of hardcoding specific input names, which can change over time and break things. Could you please see whether it's possible to avoid that?

Comment on lines 72 to 81
main_input_name = "input_ids";
}

if (ModelHasInputOutputNames(ov_model, "input_hidden_states")) {
main_input_name = "input_hidden_states";
}

if (ModelHasInputOutputNames(ov_model, "/model/embed_tokens/Gather_output_0")) {
main_input_name = "/model/embed_tokens/Gather_output_0";
}


We have a lot of code duplication here. Please make a helper function that takes an array of strings as input and checks whether they're present in the model.

key_value_input_names.push_back(name);
const auto& params = model->get_parameters();
bool found = false;
for (auto i = 0; i < params.size(); i++) {


Reminder: watch for signed/unsigned mismatches. auto i = 0 deduces int, which doesn't match the container's .size() type (size_t).

@ankitm3k ankitm3k force-pushed the make_stateful_phisilica branch from 1e132f3 to e50603d Compare October 13, 2025 11:29
@ankitm3k

ankitm3k commented Oct 13, 2025

Please attach a JIRA for this feature request in the PR description.

@Kotomi-Du Kotomi-Du force-pushed the make_stateful_phisilica branch from e50603d to 25c6976 Compare October 17, 2025 19:00
@Kotomi-Du
Author

Please attach a JIRA for this feature request in the PR description.

done, let me know if you have further questions

@MayureshV1 MayureshV1 changed the title [OVEP] Enable stateful mode for Phi-silica models CVS-175736-[OVEP] Enable stateful mode for Phi-silica models Oct 27, 2025
@Kotomi-Du
Author

Hi, @ankitm3k any feedback on this PR?


Copilot AI left a comment


Pull Request Overview

This PR enables stateful mode for Phi-silica models by enhancing the OpenVINO provider's ability to recognize different LLM model types and properly handle their key-value cache inputs. This optimization reduces memory usage from 16 GB to 3.7 GB and improves performance from 1 fps to 16 fps for Phi-Silica workloads on the OVEP GPU backend.

  • Enhanced input name detection with prioritized candidate names for different model types
  • Improved key-value cache input recognition to handle "keys" and "values" patterns beyond just "key_values"
  • Added conditional logic for prefill optimization based on input tensor characteristics

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
ov_stateful_patch_utils.cc Added flexible input name detection and expanded key-value cache pattern matching
ov_interface.cc Enhanced stateful request initialization with input type validation


const auto& params = model->get_parameters();
bool found = false;
for (size_t i = 0; i < params.size(); i++) {
auto param_name = params.at(i)->output(0).get_any_name();

Copilot AI Oct 28, 2025


Using params.at(i) is less efficient than direct indexing with params[i]. Consider using range-based for loop or direct indexing for better performance.

Suggested change
auto param_name = params.at(i)->output(0).get_any_name();
auto param_name = params[i]->output(0).get_any_name();

Comment on lines 154 to 156
if (!found) {
not_kv_inputs.push_back(input.get_any_name());
not_kv_inputs.push_back(param_name);
}

Copilot AI Oct 28, 2025


The 'found' variable is never reset to false between iterations, causing incorrect classification of subsequent parameters. Reset 'found = false' at the beginning of each loop iteration.

@MayureshV1 MayureshV1 removed the request for review from preetha-intel October 28, 2025 20:25
@RyanMetcalfeInt8

@Kotomi-Du -- Can you resolve the existing feedback on this PR? Thanks!

@Kotomi-Du Kotomi-Du requested a review from Copilot October 30, 2025 22:19

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



Comment on lines +149 to +156
} else if (name.find("keys") != std::string::npos) {
key_value_input_names.push_back(name);
found = true;
break;
} else if (name.find("values") != std::string::npos) {
key_value_input_names.push_back(name);
found = true;
break;

Copilot AI Oct 30, 2025


The logic for detecting 'keys' and 'values' patterns treats them separately with identical code blocks. This could lead to only finding one type of cache input when both should be collected. Consider restructuring to collect all matching key-value inputs rather than breaking after the first match.

@Kotomi-Du
Author

@Kotomi-Du -- Can you resolve the existing feedback on this PR? Thanks!

@RyanMetcalfeInt8 addressed; I also requested a new Copilot review, which was not helpful.


@RyanMetcalfeInt8 RyanMetcalfeInt8 left a comment


LGTM

@Kotomi-Du Kotomi-Du merged commit b9a73f3 into intel:ovep-develop Nov 3, 2025
3 of 5 checks passed