
[MM-55475] Performance tests #47

Merged
merged 1 commit into from
Nov 21, 2023
Conversation

streamer45
Contributor

Summary

Attaching the results of preliminary performance tests. I selected the same instance type we use in production for recordings (c6i.2xlarge) and ran the tests on all the models we include (tiny, base, small) with a base sample of 10 minutes.

For the default thread configuration (NumCPU / 2), I also ran tests on a full hour of meeting audio.

The call samples were extracted from real developer meetings, so the tests should be as close as possible to a real use case, with the caveat that they used a single track. In general, though, I wouldn't expect multiple tracks to cause significant overhead, since it's unlikely for speech from different tracks to overlap for long periods.

What's likely causing some overhead is the number of speech segments we get out of these tracks (due to the speech detection process). We can probably tune this further to try to minimize the number of contiguous samples. Right now we split after 2 seconds of silence.
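A toy sketch of the splitting rule described above (`splitOnSilence`, the per-frame speech flags, and the frame counts are hypothetical illustration, not the actual detection code): start a new segment whenever a run of silence reaches the threshold, e.g. 200 frames of 10 ms for the 2-second cutoff.

```go
package main

import "fmt"

// splitOnSilence returns [start, end) frame ranges of speech segments,
// closing a segment once maxSilenceFrames consecutive silent frames occur.
func splitOnSilence(speech []bool, maxSilenceFrames int) [][2]int {
	var segments [][2]int
	start, silence := -1, 0
	for i, s := range speech {
		if s {
			if start == -1 {
				start = i // first speech frame of a new segment
			}
			silence = 0
			continue
		}
		if start == -1 {
			continue // leading silence, nothing to close
		}
		silence++
		if silence >= maxSilenceFrames {
			// Close the segment at the last speech frame.
			segments = append(segments, [2]int{start, i - silence + 1})
			start, silence = -1, 0
		}
	}
	if start != -1 {
		segments = append(segments, [2]int{start, len(speech)})
	}
	return segments
}

func main() {
	// Toy example with a 3-frame threshold instead of 200.
	speech := []bool{true, true, false, false, false, true, true}
	fmt.Println(splitOnSilence(speech, 3)) // two segments: [0,2) and [5,7)
}
```

Lowering the threshold would produce more, shorter segments; raising it merges nearby speech at the cost of transcribing more silence.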

Overall the results show almost linear performance gains with the number of threads of execution.

Please let me know if you have any questions or concerns.

Ticket Link

https://mattermost.atlassian.net/browse/MM-55475

@streamer45 streamer45 added the 2: Dev Review Requires review by a core committer label Nov 18, 2023
@streamer45 streamer45 added this to the v0.5.0 milestone Nov 18, 2023
@streamer45 streamer45 self-assigned this Nov 18, 2023
Member

@cpoile cpoile left a comment


Nice!
One q: Is there an accuracy measure for these? If I were a customer trying to decide which model to use, would I just have to try each out and see?

@streamer45
Contributor Author

> Nice! One q: Is there an accuracy measure for these? If I were a customer trying to decide which model to use, would I just have to try each out and see?

That's a good question. We can plan some accuracy tests, but I'd expect that to be an effort on its own, as we need to find some good samples (not just audiobooks or well-known speeches).

At this point I'd probably refer them to the results for the original models (from the paper itself), since there doesn't seem to be anything "official" from whisper.cpp. But of course whisper.cpp isn't as accurate as the original implementation; there's a very good technical breakdown of why that's the case at ggerganov/whisper.cpp#1163 if you're interested.

Overall, I think a customer would most likely start with the default and move up or down as needed. If you think that's confusing, there's still time to stick with a single model, but it felt nice to offer some degree of performance/accuracy customization.

@streamer45 streamer45 added 3: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer labels Nov 20, 2023
@streamer45 streamer45 merged commit e8e15bd into MM-53432 Nov 21, 2023
3 checks passed
@streamer45 streamer45 deleted the MM-55475 branch November 21, 2023 20:45