
Add faster-whisper (ctranslate2) as option for Whisper annotation workflow #1017

Open · entn-at wants to merge 11 commits into master from feature/whisper-ctranslate2

Conversation

@entn-at (Contributor) commented Apr 6, 2023

This PR adds a second Whisper annotation workflow that uses faster-whisper, powered by CTranslate2 (see https://github.com/entn-at/lhotse/tree/feature/whisper-ctranslate2). It's a lot faster and uses far less memory.

This implementation also obtains word start and end times. I'm still investigating whether they are accurate enough in general to be used as alignments.

@desh2608 (Collaborator) commented Apr 6, 2023

Would it be possible to combine whisper and faster-whisper into a single CLI/method, and then add faster-whisper as an optional flag (enabled or disabled by default)? The internals of the function calls can be kept separate, but from a user's perspective it makes more sense, since they have the same functionality. I'm thinking of it as two backends behind the same user-facing wrapper.
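
For illustration, a minimal sketch of that two-backend wrapper pattern; the function and helper names below are hypothetical, not the PR's actual code:

def annotate_with_whisper(cuts, model_name, device="cpu", faster_whisper=False, **backend_kwargs):
    # Single user-facing entry point; the backend internals stay separate.
    if faster_whisper:
        # hypothetical helper wrapping the CTranslate2-based implementation
        return _annotate_faster_whisper(cuts, model_name, device=device, **backend_kwargs)
    # hypothetical helper wrapping the original OpenAI Whisper implementation
    return _annotate_openai_whisper(cuts, model_name, device=device, **backend_kwargs)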

@entn-at force-pushed the feature/whisper-ctranslate2 branch from 1610802 to 0f5a2e1 on April 8, 2023 01:53
@entn-at (Contributor, Author) commented Apr 8, 2023

Thanks for the quick initial review! I combined whisper and faster-whisper into a single CLI/method with a --faster-whisper flag. I also added off-by-default feature flags for faster-whisper (see the API sketch after this list):

  • --faster-whisper-add-alignments: Whether to use faster-whisper's built-in method for obtaining word alignments (using cross-attention patterns and dynamic time warping; generally not as accurate as forced alignment).
  • --faster-whisper-use-vad: Whether to apply speech activity detection (SileroVAD) before Whisper to reduce repetitions/spurious transcriptions (what is often referred to as "hallucinations").
  • --faster-whisper-num-workers: Number of workers for parallelization across multiple GPUs.
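
For reference, here is roughly how those flags map onto faster-whisper's API; a minimal sketch based on my reading of faster_whisper.WhisperModel and transcribe(), not the PR's exact code:

from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v2",
    device="cuda",
    compute_type="float16",
    num_workers=4,             # --faster-whisper-num-workers
)
segments, info = model.transcribe(
    "audio.wav",               # illustrative path
    language="en",
    word_timestamps=True,      # --faster-whisper-add-alignments
    vad_filter=True,           # --faster-whisper-use-vad (SileroVAD)
)
for seg in segments:
    print(seg.start, seg.end, seg.text)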

Quick benchmark on mini-librispeech dev-clean-2:

OpenAI Whisper, RTX2080Ti:

$ time lhotse workflows annotate-with-whisper -n large-v2 -l en -m librispeech_recordings_dev-clean-2.jsonl.gz --device "cuda" librispeech_cuts_dev-clean-2.jsonl.gz
real    44m31.647s
user    46m5.540s
sys     0m10.869s

faster-whisper/ctranslate2, float16 on RTX2080Ti:

$ time lhotse workflows annotate-with-whisper --faster-whisper -n large-v2 -l en -m librispeech_recordings_dev-clean-2.jsonl.gz --device "cuda" librispeech_cuts_dev-clean-2.jsonl.gz
real    18m15.743s
user    34m47.594s
sys     30m18.775s

faster-whisper allows parallelization across multiple GPUs. With --faster-whisper-num-workers 4 on 4x RTX2080Ti:

$ time lhotse workflows annotate-with-whisper --faster-whisper -n large-v2 --faster-whisper-num-workers 4 -l en -m librispeech_recordings_dev-clean-2.jsonl.gz --device "cuda" librispeech_cuts_dev-clean-2.jsonl.gz
real    6m34.545s
user    35m50.779s
sys     25m48.421s
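
For those curious how the multi-GPU parallelization works: faster-whisper's WhisperModel accepts a list of device indices, and transcribe() calls issued from multiple Python threads are then dispatched across the replicas. A rough sketch (file paths are illustrative):

from concurrent.futures import ThreadPoolExecutor
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v2",
    device="cuda",
    device_index=[0, 1, 2, 3],  # one model replica per GPU
    compute_type="float16",
    num_workers=4,
)

def transcribe_one(path):
    segments, _ = model.transcribe(path, language="en")
    return list(segments)  # segments is a lazy generator; consume it inside the worker

paths = ["utt1.wav", "utt2.wav", "utt3.wav", "utt4.wav"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(transcribe_one, paths))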

The only incompatibility with the current Whisper method is that faster-whisper doesn't expose a way to set the download location for the models. I submitted a PR to faster-whisper; once that's merged and published in a new version, the currently commented-out line 116 in faster_whisper.py can be changed to enable it. Update: the PR to faster-whisper has been merged.

@entn-at marked this pull request as ready for review on April 8, 2023 06:19
@entn-at changed the title from "[WIP] Add annotation workflow for faster-whisper (ctranslate2)" to "Add faster-whisper (ctranslate2) as option for Whisper annotation workflow" on Apr 8, 2023
@pzelasko (Collaborator) left a comment

Thanks for a great contribution; this is super interesting. The code looks good to me. I would love to enable it by default.

lhotse/bin/modes/workflows.py (outdated, resolved)
lhotse/bin/modes/workflows.py (outdated, resolved)
@pzelasko (Collaborator) commented Apr 11, 2023

I quickly compared the results between the old and new whisper implementations on a 60s clip from AMI. In that clip, I noticed that faster-whisper tends to skip short, isolated, and noisy utterances such as "Okay" or "Thank you", probably due to VAD (which is OK, I guess). However, the time boundaries seem off when you compare them to the original implementation; please see the screenshot. Do you think it's possible to fix this? Maybe more accurate information is exposed somewhere in faster-whisper and it's just not being used here? Otherwise, there's a lot of silence/non-speech included in the supervisions.

Note: the top plot is from the original Whisper, and the bottom plot is from faster-whisper.

[Screenshot: supervision time boundaries; top: original Whisper, bottom: faster-whisper]
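
If the loose boundaries come from whole-segment timestamps, one option might be to tighten supervisions to the word-level times that faster-whisper already exposes when word_timestamps=True; each segment then carries a words list with per-word start/end. A hedged sketch, not the PR's code:

# `model` as constructed earlier in this thread
segments, _ = model.transcribe("clip.wav", word_timestamps=True, vad_filter=True)
for seg in segments:
    if seg.words:  # populated only when word_timestamps=True
        start, end = seg.words[0].start, seg.words[-1].end  # trim to first/last word
    else:
        start, end = seg.start, seg.end
    print(f"{start:.2f} {end:.2f} {seg.text}")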

@pzelasko (Collaborator) left a comment

I ran into a few issues running it; can you make the suggested changes that fix them?

@@ -55,6 +56,36 @@ def workflows():
@click.option(
"-d", "--device", default="cpu", help="Device on which to run the inference."
)
@click.option(
@pzelasko (Collaborator) commented:

Please change to:

@click.option(
    "--faster-whisper/--normal-whisper",
    default=True,
    help="If True, use faster-whisper's implementation based on CTranslate2.",
)

Otherwise it can't be turned off.
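
With click's paired boolean syntax, the second name gives users an explicit off switch, e.g. (illustrative invocation):

$ lhotse workflows annotate-with-whisper --normal-whisper -n large-v2 ...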

)
@click.option(
"--faster-whisper-compute-type",
default="float16",
@pzelasko (Collaborator) commented:

Suggested change:

-    default="float16",
+    default="auto",

Otherwise it won't work on (some?) CPUs.

device_index=device_index,
compute_type=compute_type,
num_workers=num_workers,
download_root=download_root,
@pzelasko (Collaborator) commented:

Since the change that enables this option is not yet released on pip, I suggest a small workaround here; otherwise it cannot be run:

+    opt_kwargs = {}
+    if download_root is not None:
+        opt_kwargs["download_root"] = download_root
     model = WhisperModel(
         model_name,
         device=device,
         device_index=device_index,
         compute_type=compute_type,
         num_workers=num_workers,
-        download_root=download_root,
+        **opt_kwargs,
     )
-    model.logger.setLevel(logging.WARNING)
+    if hasattr(model, "logger"):
+        model.logger.setLevel(logging.WARNING)

Note that I also suggested a check for logger; on my installation, model did not have the logger attribute defined.

@entn-at (Contributor, Author) commented May 4, 2023

Sorry for the delay, I've been quite busy. I'll pick this up shortly and address the requested changes.

@desh2608 (Collaborator) commented:
@entn-at any updates on this?
