
should we automatically update the model files that whisper uses? if so, at what frequency and with what mechanism? #23

Open
jmartin-sul opened this issue Oct 1, 2024 · 6 comments
Labels: implementation question, question

Comments

@jmartin-sul
Member

this list has the URLs used to retrieve the models when building the container: https://github.com/sul-dlss/speech-to-text/blob/main/whisper_models/urls.txt

see also https://github.com/sul-dlss/speech-to-text?tab=readme-ov-file#build

@jmartin-sul
Member Author

  • we don't currently have infrastructure that does exactly this -- all our automatic update tooling is built for ruby gems and python packages, things with their own dependency management tools.
  • we're not sure we always want the latest models anyway -- we'd likely want to do some vetting to make sure that the new models perform as well as or better than the old ones.

possible storytime fodder

@edsu
Contributor

edsu commented Oct 16, 2024

Perhaps we could add a unit test that compares the list of models:

https://github.com/openai/whisper/blob/main/whisper/__init__.py#L17-L32

with the ones in https://github.com/sul-dlss/speech-to-text/blob/main/whisper_models/urls.txt

Then when we update whisper, and there is a new model, the test will start to fail? We will need to remember that fixing the test requires rebuilding the Docker container...
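Roughly, a minimal sketch of what that test could look like, assuming whisper still exposes its download URLs in the module-level `_MODELS` dict and that `urls.txt` has one URL per line (the test name and path are placeholders):

```python
# rough sketch only: compare whisper's packaged model URLs against our urls.txt
from pathlib import Path

import whisper


def test_urls_txt_matches_whisper_models():
    # whisper._MODELS maps model names (e.g. "large-v3") to their download URLs
    packaged = set(whisper._MODELS.values())
    ours = {
        line.strip()
        for line in Path("whisper_models/urls.txt").read_text().splitlines()
        if line.strip()
    }
    # a newly released model (or a removed one) shows up as a set difference,
    # which is our reminder to update urls.txt and rebuild the Docker image
    assert ours == packaged
```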

@jmartin-sul
Member Author

discussed regression testing for model upgrades today as a tangent while troubleshooting some speech_to_text.py container crashes with @edsu and @peetucket. one path to explore might be:

  • add an optional infra integration test that pushes through a few large videos, and/or a large multi-lingual video, and maybe some more straightforward english content -- in short, a representative sample of both common and corner-case media.
  • have it not run by default when the test suite is run, but add some guidance (e.g. in the PR template) that encourages running it when the models are upgraded.
  • use something like a word error rate (WER) comparison, and/or look for specific passages that we expect to come out a certain way, to flag output that deviates too much from expectations. for long transcripts run through a non-deterministic machine learning model, we probably can't assume that any difference in output indicates a regression. (a rough sketch follows below.)

it's also possible that we'll discover that an unexpected deviation is an improvement, in which case we'd probably want to update the test expectations going forward?
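to make the "not run by default" and "specific passages" ideas concrete, one possible shape for it, assuming pytest and an opt-in environment variable -- the variable name, fixture path, and expected phrases are all hypothetical placeholders:

```python
# sketch of an opt-in regression test; RUN_REGRESSION_TESTS, the fixture path,
# and the expected phrases are placeholders, not settled conventions
import os

import pytest

# skipped unless explicitly requested, e.g. RUN_REGRESSION_TESTS=1 pytest
regression = pytest.mark.skipif(
    not os.environ.get("RUN_REGRESSION_TESTS"),
    reason="regression tests only run when RUN_REGRESSION_TESTS is set",
)


@regression
def test_multilingual_sample_contains_expected_passages():
    transcript = open("tests/fixtures/multilingual-sample.vtt").read().lower()
    # passages we expect the model to get right regardless of minor output drift
    for phrase in ["welcome to stanford", "thank you for listening"]:
        assert phrase in transcript
```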

@edsu
Contributor

edsu commented Dec 7, 2024

I like the sound of this. I think we could assemble some or all of the Pilot test data so it could easily be run by an integration test. In addition to doing some basic checks, we could add some tools to this repository that use the jiwer library to compare the results against an expected baseline? I suspect that some human-level evaluation will still be needed.
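For example, a minimal sketch of the jiwer comparison, assuming we keep vetted baseline transcripts alongside the test fixtures; the file paths and the 10% threshold are placeholders, not agreed-upon values:

```python
# rough sketch: flag a candidate transcript whose word error rate against a
# vetted baseline exceeds a (placeholder) threshold
import jiwer


def transcript_regressed(baseline_path: str, candidate_path: str, max_wer: float = 0.10) -> bool:
    # lowercase both sides so harmless casing differences don't count as errors
    baseline = open(baseline_path).read().lower()
    candidate = open(candidate_path).read().lower()
    wer = jiwer.wer(baseline, candidate)
    return wer > max_wer


# usage (paths are hypothetical):
# if transcript_regressed("baselines/lecture-01.txt", "output/lecture-01.txt"):
#     print("WER above threshold -- needs human review")
```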

@jmartin-sul
Member Author

jmartin-sul commented Dec 11, 2024

came here to note some of what we discussed after standup today about regression testing. seems like all of what we talked about was already captured by the above comments. but it's maybe worth repeating that it's unlikely that evaluation of regression test results will be totally automatable in the way that CI runs or infra integration tests get a mechanical pass/fail. as ed says, it's likely that some amount of human evaluation will be needed to interpret test results.

we should also probably break out a separate regression testing ticket, as we've realized there are updates we need to regression test besides model updates: CUDA version, underlying GPU hardware, pytorch version, default settings changes, and many other things can each on their own possibly introduce significant changes to output (some of which will improve output, and some of which will make it worse).

see also today's post-standup discussion around segfaults, #68, and https://stackoverflow.com/questions/78196316/pytorch-segementation-fault-core-dumped-when-moving-pytorch-tensor-to-gpu

@edsu
Contributor

edsu commented Jan 17, 2025

I noticed that the image with large-v3 gets pulled in about 20-30 seconds in AWS Batch. We could ensure that the image stays on disk so that it is available for another docker run, which would mean subsequent jobs processed by the same EC2 instance would not need to pull down the model file again? That way the model file would always be up to date?
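A related (but different) way to avoid re-downloading weights per job would be to point whisper's own download cache at a host-mounted directory instead of baking the model into the image. Just an illustration under that assumption, using load_model's download_root parameter; the env var and path are hypothetical, not part of the current container setup:

```python
# illustration only: reuse model weights across docker runs by keeping whisper's
# download cache on a host-mounted volume; WHISPER_CACHE_DIR and its default
# path are hypothetical
import os

import whisper

cache_dir = os.environ.get("WHISPER_CACHE_DIR", "/mnt/whisper-cache")
# whisper downloads the weights into cache_dir on first use and reuses them afterwards
model = whisper.load_model("large-v3", download_root=cache_dir)
```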
