
[Bug]: OVModelForSpeechSeq2Seq fails to extract_token_timestamps #22794

Closed

barolo opened this issue Feb 12, 2024 · 2 comments

Labels: bug (Something isn't working), category: GPU (OpenVINO GPU plugin), Stale, support_request

barolo commented Feb 12, 2024

OpenVINO Version: tag 2023.3.0
Operating System: Other (please specify in description)
Device used for inference: GPU
Framework: ONNX
Model used: distil-whisper/distil-small.en

Issue description

When trying to extract word/token timestamps from audio, OpenVINO fails: passing either return_timestamps="word" or return_timestamps=True to the pipeline results in the error below, while transcription without it finishes successfully.
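
In short, the failure reduces to the return_timestamps argument. A condensed sketch of the toggle (assuming the IR export produced by the full reproducer below already exists at distil-whisper_distil-small.en; it loads and compiles with defaults for brevity, while the full script targets GPU):

from transformers import pipeline, WhisperProcessor
from optimum.intel.openvino import OVModelForSpeechSeq2Seq

# Assumes the export from the full reproducer below already exists on disk.
ov_model = OVModelForSpeechSeq2Seq.from_pretrained("distil-whisper_distil-small.en")
processor = WhisperProcessor.from_pretrained("distil-whisper/distil-small.en")

pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
pipe("./4.wav")                            # transcription succeeds
pipe("./4.wav", return_timestamps="word")  # raises the AttributeError below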

Step-by-step reproduction

I'm using the following code:

import torch

from transformers import pipeline, WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
from pathlib import Path
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
import openvino as ov


model_id = "distil-whisper/distil-small.en"

pt_model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
pt_model.eval()


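# Turn a dataset sample's audio array into Whisper log-mel input features.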
def extract_input_features(sample_long):
    input_features = processor(
        sample_long["audio"]["array"],
        sampling_rate=sample_long["audio"]["sampling_rate"],
        return_tensors="pt",
    ).input_features
    return input_features

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")

sample_long = dataset[0]

input_features = extract_input_features(sample_long)
predicted_ids = pt_model.generate(input_features)

model_path = Path(model_id.replace('/', '_'))
ov_config = {"CACHE_DIR": ""}


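# Export the PyTorch checkpoint to OpenVINO IR on the first run; reuse the saved export afterwards.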
if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )


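# Select the GPU device and compile the model.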
core = ov.Core()

device = "GPU"

ov_model.to(device)
ov_model.compile()

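# Reuse the PyTorch model's generation config for the OV model.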
ov_model.generation_config = pt_model.generation_config

pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    return_timestamps="word",
    batch_size=12,
)



result = pipe("./4.wav")

import json
with open("sample.json", "w") as outfile:
    json.dump(result, outfile)
print(result["text"])

Relevant log output

File "/run/media/greggy/1a4fd6d7-1f9d-42c6-9324-661804695013/D/owisp/./drain_w.py", line 70, in <module>
    result = pipe("./4.wav")
             ^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 292, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1154, in __call__
    return next(
           ^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'
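
For context, the missing method does exist upstream: in transformers of this vintage, _extract_token_timestamps is defined on WhisperForConditionalGeneration, which optimum-intel's _OVModelForWhisper does not inherit. A possible workaround sketch, not verified against optimum-intel internals (it assumes the OV model's generate() output still carries the cross-attentions that the method reads), is to bind the upstream method onto the OV wrapper before building the pipeline:

import types
from transformers import WhisperForConditionalGeneration

# Untested sketch: borrow the timestamp-extraction method from the PyTorch
# Whisper class and bind it to the OV wrapper. Assumes generate() on the OV
# side returns the cross_attentions this method consumes.
ov_model._extract_token_timestamps = types.MethodType(
    WhisperForConditionalGeneration._extract_token_timestamps, ov_model
)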

Issue submission checklist

  • I'm reporting an issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
github-actions bot commented Nov 29, 2024

This issue will be closed in a week because of 9 months of no activity.

github-actions bot added the Stale label Nov 29, 2024

github-actions bot commented Dec 6, 2024

This issue was closed because it has been stalled for 9 months with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 6, 2024