
should we automatically update the model files that whisper uses? if so, at what frequency and with what mechanism? #23

Open
jmartin-sul opened this issue Oct 1, 2024 · 6 comments
Labels: implementation question, question

Comments

@jmartin-sul
Member

this list has the URLs used to retrieve the models when building the container: https://github.com/sul-dlss/speech-to-text/blob/main/whisper_models/urls.txt

see also https://github.com/sul-dlss/speech-to-text?tab=readme-ov-file#build

@jmartin-sul
Member Author

  • we don't currently have infrastructure that does exactly this -- all our automatic update tooling is built for ruby gems and python packages, things with their own dependency management tools.
  • we're not sure we always want the latest models anyway -- we'd likely want to do some vetting to make sure that the new models perform as well as or better than the old ones.

possible storytime fodder

@edsu
Contributor

edsu commented Oct 16, 2024

Perhaps we could add a unit test that compares the list of models:

https://github.com/openai/whisper/blob/main/whisper/__init__.py#L17-L32

with the ones in https://github.com/sul-dlss/speech-to-text/blob/main/whisper_models/urls.txt

Then when we update whisper, and there is a new model, the test will start to fail? We will need to remember that fixing the test requires rebuilding the Docker container...
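Roughly, a minimal sketch of what that test could look like, assuming whisper still exposes its download URLs in the module-level `_MODELS` dict and that `urls.txt` has one URL per line (the test name and path are placeholders):

```python
# rough sketch only: compare whisper's packaged model URLs against our urls.txt
from pathlib import Path

import whisper


def test_urls_txt_matches_whisper_models():
    # whisper._MODELS maps model names (e.g. "large-v3") to their download URLs
    packaged = set(whisper._MODELS.values())
    ours = {
        line.strip()
        for line in Path("whisper_models/urls.txt").read_text().splitlines()
        if line.strip()
    }
    # a newly released model (or a removed one) shows up as a set difference,
    # which is our reminder to update urls.txt and rebuild the Docker image
    assert ours == packaged
```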

@jmartin-sul
Member Author

discussed regression testing for model upgrades today as a tangent while troubleshooting some speech_to_text.py container crashes with @edsu and @peetucket. one path to explore might be:

  • add an optional infra integration test that pushes through a few large videos, and/or a large multi-lingual video, and maybe some more straightforward english content -- in short, a representative sample of both common and corner-case media.
  • have it not run by default when the test suite is run, but add some guidance (e.g. in the PR template) that encourages running it when the models are upgraded.
  • use something like a word error rate (WER) comparison, and/or look for specific passages that we expect to come out a certain way, to flag output that deviates too much from expectations. for long transcripts run through a non-deterministic machine learning model, we probably can't assume that any difference in output indicates a regression. (a rough sketch follows below.)

it's also possible that we'll discover that an unexpected deviation is an improvement, in which case we'd probably want to update the test expectations going forward?
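to make the "not run by default" and "specific passages" ideas concrete, one possible shape for it, assuming pytest and an opt-in environment variable -- the variable name, fixture path, and expected phrases are all hypothetical placeholders:

```python
# sketch of an opt-in regression test; RUN_REGRESSION_TESTS, the fixture path,
# and the expected phrases are placeholders, not settled conventions
import os

import pytest

# skipped unless explicitly requested, e.g. RUN_REGRESSION_TESTS=1 pytest
regression = pytest.mark.skipif(
    not os.environ.get("RUN_REGRESSION_TESTS"),
    reason="regression tests only run when RUN_REGRESSION_TESTS is set",
)


@regression
def test_multilingual_sample_contains_expected_passages():
    transcript = open("tests/fixtures/multilingual-sample.vtt").read().lower()
    # passages we expect the model to get right regardless of minor output drift
    for phrase in ["welcome to stanford", "thank you for listening"]:
        assert phrase in transcript
```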

@edsu
Contributor

edsu commented Dec 7, 2024

I like the sound of this. I think we could assemble some or all of the Pilot test data so it could easily be run by an integration test. In addition to doing some basic checks, we could add some tools to this repository that use the jiwer library to compare the results against an expected baseline? I suspect that some human-level evaluation will still be needed.
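For example, a minimal sketch of the jiwer comparison, assuming we keep vetted baseline transcripts alongside the test fixtures; the file paths and the 10% threshold are placeholders, not agreed-upon values:

```python
# rough sketch: flag a candidate transcript whose word error rate against a
# vetted baseline exceeds a (placeholder) threshold
import jiwer


def transcript_regressed(baseline_path: str, candidate_path: str, max_wer: float = 0.10) -> bool:
    # lowercase both sides so harmless casing differences don't count as errors
    baseline = open(baseline_path).read().lower()
    candidate = open(candidate_path).read().lower()
    wer = jiwer.wer(baseline, candidate)
    return wer > max_wer


# usage (paths are hypothetical):
# if transcript_regressed("baselines/lecture-01.txt", "output/lecture-01.txt"):
#     print("WER above threshold -- needs human review")
```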

@jmartin-sul
Member Author

jmartin-sul commented Dec 11, 2024

came here to note some of what we discussed after standup today about regression testing. seems like all of what we talked about was already captured by the above comments. but it's maybe worth repeating that it's unlikely that evaluation of regression test results will be totally automatable in the way that CI runs or infra integration tests get a mechanical pass/fail. as ed says, it's likely that some amount of human evaluation will be needed to interpret test results.

we should also probably break out a separate regression testing ticket, as we've realized there are updates we need to regression test besides model updates: CUDA version, underlying GPU hardware, pytorch version, default settings changes, and many other things can each on their own possibly introduce significant changes to output (some of which will improve output, and some of which will make it worse).

see also today's post-standup discussion around segfaults, #68, and https://stackoverflow.com/questions/78196316/pytorch-segementation-fault-core-dumped-when-moving-pytorch-tensor-to-gpu

@edsu
Contributor

edsu commented Jan 17, 2025

I noticed that the image with large-v3 gets pulled in about 20-30 seconds in AWS Batch. We could ensure that the image stays on disk so that it is available for another docker run, which would mean subsequent jobs processed by the same EC2 instance would not need to pull down the model file again? That way the model file would always be up to date?
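A related (but different) way to avoid re-downloading weights per job would be to point whisper's own download cache at a host-mounted directory instead of baking the model into the image. Just an illustration under that assumption, using load_model's download_root parameter; the env var and path are hypothetical, not part of the current container setup:

```python
# illustration only: reuse model weights across docker runs by keeping whisper's
# download cache on a host-mounted volume; WHISPER_CACHE_DIR and its default
# path are hypothetical
import os

import whisper

cache_dir = os.environ.get("WHISPER_CACHE_DIR", "/mnt/whisper-cache")
# whisper downloads the weights into cache_dir on first use and reuses them afterwards
model = whisper.load_model("large-v3", download_root=cache_dir)
```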
