[OpenVINO backend] supporting inference for gemma and mistral with ov backend #2310


Draft: Mohamed-Ashraf273 wants to merge 3 commits into master from supporting_gemma_inference_with_ov_backend

Conversation

@Mohamed-Ashraf273 commented Jun 22, 2025

Description of the change

As part of my GSoC 2025 project to support inference with the OpenVINO backend for Gemma and Mistral, this PR adds support for the Gemma and Mistral generation pipelines.

Reference

https://docs.openvino.ai/2025/index.html
https://keras.io/api/
https://keras.io/keras_hub/

Colab Notebook

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

@github-actions bot added the Gemma (Gemma model specific issues) label on Jun 22, 2025
@Mohamed-Ashraf273 force-pushed the supporting_gemma_inference_with_ov_backend branch 26 times, most recently from 6576b03 to 074f0c2, on June 23, 2025 at 17:26
@Mohamed-Ashraf273 force-pushed the supporting_gemma_inference_with_ov_backend branch 2 times, most recently from d748dd5 to f5470cd, on June 24, 2025 at 13:36
@Mohamed-Ashraf273 force-pushed the supporting_gemma_inference_with_ov_backend branch 4 times, most recently from 2053a10 to b4f8905, on June 24, 2025 at 14:44
@Mohamed-Ashraf273 force-pushed the supporting_gemma_inference_with_ov_backend branch from b4f8905 to 2044408 on June 24, 2025 at 19:40
@mattdangerw (Member) left a comment

Left some initial comments! But probably the first question is around the changes to causal_lm and gemma_causal_lm. Why is this so backend-specific? This is much more involved than the changes for jax/torch/tensorflow.

@@ -211,6 +237,10 @@ def test_all_presets(self):
             input_data=self.input_data,
         )

+    @pytest.mark.skipif(
+        keras.config.backend() == "openvino",
+        reason="OpenVINO is for inference only",
Member:
score is inference only, is there a reason this needs to be disabled?

Author:
It needs the roll operation, which is not supported yet. But if I can get generate working for now, I'll enable this test afterwards.
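For context, roll can be emulated with slicing and concatenation, which is one way such a gap is usually bridged. A minimal NumPy sketch of that idea (illustrative only; the helper name is made up and this is not code from this PR):

import numpy as np

def roll_last_axis(x, shift):
    # Emulate np.roll along the last axis with slice + concatenate,
    # a pattern that backends without a native roll kernel can express.
    shift = shift % x.shape[-1]
    if shift == 0:
        return x
    return np.concatenate([x[..., -shift:], x[..., :-shift]], axis=-1)

assert (roll_last_axis(np.arange(5), 2) == np.roll(np.arange(5), 2)).all()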

@@ -158,7 +158,8 @@ def convert_preprocessing_inputs(x):
     # If we have a string input, use tf.tensor.
     if isinstance(x, np.ndarray) and x.dtype.type is np.str_:
         return tf.convert_to_tensor(x)
-    x = ops.convert_to_tensor(x)
+    if keras.config.backend() != "openvino":
+        x = ops.convert_to_tensor(x)
Member:
what does convert_to_tensor do on the openvino backend, should it just convert to numpy?

@Mohamed-Ashraf273 (Author) commented Jun 24, 2025:

At the end of the generate wrapper for OpenVINO, the output is already NumPy. To avoid unnecessary reconversion in post_process, I disabled that step for the OpenVINO backend.

@@ -89,6 +97,9 @@ def test_generate(self):
         causal_lm.preprocessor = None
         outputs = causal_lm.generate(prompt_ids, stop_token_ids=None)
         # Assert prompt is in output in token id space.
+        if keras.config.backend() == "openvino":
Member:
Would extending this for our asserts mean we could ditch some of these switch cases?

def convert_to_comparible_type(x):
    """Convert tensors to comparable types.

    Any string are converted to plain python types. Any jax or torch tensors
    are converted to numpy.
    """
    if getattr(x, "dtype", None) == tf.string:
        if isinstance(x, tf.RaggedTensor):
            x = x.to_list()
        if isinstance(x, tf.Tensor):
            x = x.numpy() if x.shape.rank == 0 else x.numpy().tolist()
        return tree.map_structure(lambda x: x.decode("utf-8"), x)
    if isinstance(x, (tf.Tensor, tf.RaggedTensor)):
        return x
    if hasattr(x, "__array__"):
        return ops.convert_to_numpy(x)
    return x
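One way to read that suggestion: normalize both sides of an assertion to NumPy up front, so the per-backend branches in the tests become unnecessary. A hedged sketch (the helper names below are illustrative, not the actual test_case.py API):

import numpy as np
from keras import ops

def to_comparable_numpy(x):
    # Anything array-like, including the NumPy arrays the OpenVINO backend
    # already returns, is normalized to a plain NumPy array.
    if hasattr(x, "__array__"):
        return ops.convert_to_numpy(x)
    return np.asarray(x)

def assert_prompt_in_output(prompt_ids, outputs):
    # Example assertion that behaves the same under every backend.
    prompt_ids = to_comparable_numpy(prompt_ids)
    outputs = to_comparable_numpy(outputs)
    np.testing.assert_array_equal(outputs[:, : prompt_ids.shape[1]], prompt_ids)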

        if keras.config.backend() == "openvino":
            """Set all logits to a large negative number
            to avoid NaNs produced by ov.einsum"""
            logits = ops.ones_like(logits) * ops.convert_to_tensor(
Member:
this seems suspicious, should we be fixing the code and not the test? maybe we need an openvino fix before calling einsum in some places? what is failing?

@Mohamed-Ashraf273 (Author) commented Jun 24, 2025:

You are right. Once I dropped the splitting approach, it worked!

@@ -60,6 +64,10 @@ def test_causal_lm_basics(self):
             expected_output_shape=(2, 8, 11),
         )

+    @pytest.mark.skipif(
+        keras.config.backend() == "openvino",
+        reason="OpenVINO is for inference only",
Member:
this is also inference only, no gradient descent. should this be enabled?

cache=current_cache,
cache_update_index=cache_update_index,

use_openvino = keras.config.backend() == "openvino"
Member:
this seems quite hard to maintain and like something we should probably avoid. why do we need a pass that is so different from the other backends? shouldn't the point of keras' multi-backend setup be that we can just call layers normally?
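For reference, the backend-agnostic shape this comment points at looks roughly like the following (a paraphrase of GemmaCausalLM.call_with_cache from memory, so details may differ): each layer is simply called, and keras.ops dispatches to whichever backend is active.

from keras import ops

def call_with_cache(self, token_ids, cache, cache_update_index):
    x = self.backbone.token_embedding(token_ids)
    x = x * ops.cast(ops.sqrt(self.backbone.hidden_dim), x.dtype)
    caches = []
    for i, transformer_layer in enumerate(self.backbone.transformer_layers):
        current_cache = cache[:, i, ...]
        x, next_cache = transformer_layer(
            x,
            cache=current_cache,
            cache_update_index=cache_update_index,
        )
        caches.append(next_cache)
    cache = ops.stack(caches, axis=1)
    hidden_states = x = self.backbone.layer_norm(x)
    logits = self.backbone.token_embedding(x, reverse=True)
    return logits, hidden_states, cache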

import numpy as np
import openvino as ov
import openvino.runtime.opset14 as ov_opset
from keras.src.backend.openvino.core import OPENVINO_DTYPES
Member:
we'd like to avoid private imports from keras if we can, they are hard to maintain if keras changes its code structure.

Author:
But I need them to make an OpenVINO model graph.
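For what it is worth, a graph can also be built and compiled with only public openvino APIs, using ov.Type directly instead of the private OPENVINO_DTYPES map. A minimal sketch (illustrative, not the PR's code):

import numpy as np
import openvino as ov
import openvino.runtime.opset14 as ov_opset

# Tiny graph: y = x * 2, built from public APIs only.
x = ov_opset.parameter([1, 4], dtype=ov.Type.f32, name="x")
two = ov_opset.constant(np.full((1, 4), 2.0, dtype=np.float32))
y = ov_opset.multiply(x, two)
model = ov.Model([y], [x], "tiny_graph")

compiled = ov.Core().compile_model(model, "CPU")
result = compiled(np.ones((1, 4), dtype=np.float32))
print(result[0])  # [[2. 2. 2. 2.]]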

@@ -132,6 +137,113 @@ def make_generate_function(self):
             return self.generate_function

         self.generate_function = self.generate_step
+        if keras.config.backend() == "openvino":
Member:
let's try to think if there's a way we can keep this resembling other backends a little more, this is fairly heavyweight.

Author:
I’ve done my best to minimize the backend-specific code paths.
I'd appreciate it if you could take another look.
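A rough sketch of the dispatch shape under discussion, mirroring how generate_step is already wrapped per backend, with OpenVINO as one more branch (the wrapper body and the _run_compiled_ov_generate helper are assumptions about the approach, not the PR's exact code):

def make_generate_function(self):
    if self.generate_function is not None:
        return self.generate_function

    self.generate_function = self.generate_step
    if keras.config.backend() == "torch":
        import torch

        def wrapped_generate_function(inputs, stop_token_ids=None):
            with torch.no_grad():
                return self.generate_step(inputs, stop_token_ids)

        self.generate_function = wrapped_generate_function
    elif keras.config.backend() == "openvino":

        def wrapped_generate_function(inputs, stop_token_ids=None):
            # Hypothetical: reuse compiled ov.Model objects cached on `self`
            # instead of rebuilding and recompiling the graph on every call.
            return self._run_compiled_ov_generate(inputs, stop_token_ids)

        self.generate_function = wrapped_generate_function
    # ... the tensorflow and jax branches wrap with tf.function / jax.jit ...
    return self.generate_function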

@Mohamed-Ashraf273 (Author) commented Jun 24, 2025

> Left some initial comments! But probably the first question is around the changes to causal_lm and gemma_causal_lm. Why is this so backend-specific? This is much more involved than the changes for jax/torch/tensorflow.

Hi @mattdangerw,
I’ve been working on enabling inference for Gemma with OpenVINO by implementing the missing operations. The main challenge I ran into is that building the entire graph as a single model makes execution difficult (it takes too long and RAM may overflow); I’m still investigating why.
To address this, I introduced a subgraph-based approach that splits the full graph at key points. I also added logic to store the compiled subgraphs on CausalLM so they can be reused across generations instead of being rebuilt each time.
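A minimal sketch of that caching idea (class and attribute names here are illustrative, not the PR's actual structure): each subgraph is compiled once and stored under a key describing where the full graph was split, so later generate() calls reuse the compiled model.

import openvino as ov

class OVSubgraphCache:
    # Compile each OpenVINO subgraph once and reuse it across generate() calls.

    def __init__(self, device="CPU"):
        self.core = ov.Core()
        self.device = device
        self._compiled = {}

    def get(self, key, build_model_fn):
        # build_model_fn() -> ov.Model; only invoked on the first cache miss.
        if key not in self._compiled:
            self._compiled[key] = self.core.compile_model(
                build_model_fn(), self.device
            )
        return self._compiled[key]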

@Mohamed-Ashraf273 force-pushed the supporting_gemma_inference_with_ov_backend branch from 164e2fd to 8f58d0b on June 24, 2025 at 23:33
@Mohamed-Ashraf273 force-pushed the supporting_gemma_inference_with_ov_backend branch from 8f58d0b to 2e7bece on June 25, 2025 at 00:59
@Mohamed-Ashraf273 changed the title from "[OpenVINO backend] supporting inference for gemma with ov backend" to "[OpenVINO backend] supporting inference for gemma and mistral with ov backend" on Jun 25, 2025
Labels: Gemma (Gemma model specific issues)
2 participants