
[Bug]: OVModelForSpeechSeq2Seq fails to extract_token_timestamps #22794

Closed

barolo opened this issue Feb 12, 2024 · 2 comments

Labels: bug (Something isn't working), category: GPU (OpenVINO GPU plugin), Stale, support_request

barolo commented Feb 12, 2024

OpenVINO Version: tag 2023.3.0
Operating System: Other (please specify in description)
Device used for inference: GPU
Framework: ONNX
Model used: distil-whisper/distil-small.en

Issue description

When trying to extract word/token timestamps from audio, OpenVINO fails: passing either return_timestamps="word" or return_timestamps=True to the pipeline results in the error below, while transcription without it finishes successfully.
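
In short, the failure reduces to the return_timestamps argument. A condensed sketch of the toggle (assuming the IR export produced by the full reproducer below already exists at distil-whisper_distil-small.en; it loads and compiles with defaults for brevity, while the full script targets GPU):

from transformers import pipeline, WhisperProcessor
from optimum.intel.openvino import OVModelForSpeechSeq2Seq

# Assumes the export from the full reproducer below already exists on disk.
ov_model = OVModelForSpeechSeq2Seq.from_pretrained("distil-whisper_distil-small.en")
processor = WhisperProcessor.from_pretrained("distil-whisper/distil-small.en")

pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
pipe("./4.wav")                            # transcription succeeds
pipe("./4.wav", return_timestamps="word")  # raises the AttributeError below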

Step-by-step reproduction

I'm using the following code:

import torch

from transformers import pipeline, WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
from pathlib import Path
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
import openvino as ov


model_id = "distil-whisper/distil-small.en"

pt_model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
pt_model.eval()


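# Turn a dataset sample's audio array into Whisper log-mel input features.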
def extract_input_features(sample_long):
    input_features = processor(
        sample_long["audio"]["array"],
        sampling_rate=sample_long["audio"]["sampling_rate"],
        return_tensors="pt",
    ).input_features
    return input_features

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")

sample_long = dataset[0]

input_features = extract_input_features(sample_long)
predicted_ids = pt_model.generate(input_features)

model_path = Path(model_id.replace('/', '_'))
ov_config = {"CACHE_DIR": ""}


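# Export the PyTorch checkpoint to OpenVINO IR on the first run; reuse the saved export afterwards.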
if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )


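# Select the GPU device and compile the model.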
core = ov.Core()

device = "GPU"

ov_model.to(device)
ov_model.compile()

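# Reuse the PyTorch model's generation config for the OV model.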
ov_model.generation_config = pt_model.generation_config

pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    return_timestamps="word",
    batch_size=12,
)



result = pipe("./4.wav")

import json
with open("sample.json", "w") as outfile:
    json.dump(result, outfile)
print(result["text"])

Relevant log output

File "/run/media/greggy/1a4fd6d7-1f9d-42c6-9324-661804695013/D/owisp/./drain_w.py", line 70, in <module>
    result = pipe("./4.wav")
             ^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 292, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1154, in __call__
    return next(
           ^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'
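
For context, the missing method does exist upstream: in transformers of this vintage, _extract_token_timestamps is defined on WhisperForConditionalGeneration, which optimum-intel's _OVModelForWhisper does not inherit. A possible workaround sketch, not verified against optimum-intel internals (it assumes the OV model's generate() output still carries the cross-attentions that the method reads), is to bind the upstream method onto the OV wrapper before building the pipeline:

import types
from transformers import WhisperForConditionalGeneration

# Untested sketch: borrow the timestamp-extraction method from the PyTorch
# Whisper class and bind it to the OV wrapper. Assumes generate() on the OV
# side returns the cross_attentions this method consumes.
ov_model._extract_token_timestamps = types.MethodType(
    WhisperForConditionalGeneration._extract_token_timestamps, ov_model
)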

Issue submission checklist

  • I'm reporting an issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
github-actions bot commented Nov 29, 2024

This issue will be closed in a week because of 9 months of no activity.

github-actions bot added the Stale label Nov 29, 2024

github-actions bot commented Dec 6, 2024

This issue was closed because it has been stalled for 9 months with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 6, 2024