TODO job should just include ID #20

Closed
edsu opened this issue Sep 27, 2024 · 1 comment
edsu commented Sep 27, 2024

Blocked by #3

To make coordination easier, the SQS message sent to the TODO queue should include only an id and any options for controlling whisper, not the list of files:

{
  "id": "abc123",
  "options": {
    "model": "large"
  }
} 

If there are no files in the S3 bucket at s3://speech-to-text/media/abc123/, an error message will be included when the job is put into the DONE queue.
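
A minimal sketch of how the worker side might handle such a message, assuming the media prefix shown above. The function name `build_done_message`, the `list_keys` callable, and the `error` and `files` fields of the DONE message are hypothetical, chosen only for illustration; the actual DONE message format isn't specified here.

```python
import json
from typing import Callable, Iterable

# Assumption: media for a job lives under media/<id>/ in the bucket,
# mirroring s3://speech-to-text/media/abc123/ from the example above.
MEDIA_PREFIX = "media"


def build_done_message(todo_body: str, list_keys: Callable[[str], Iterable[str]]) -> dict:
    """Turn a TODO message body into a (hypothetical) DONE message.

    list_keys is any callable that returns the object keys under a prefix,
    so the S3 client can be swapped out or faked in tests.
    """
    job = json.loads(todo_body)
    prefix = f"{MEDIA_PREFIX}/{job['id']}/"

    # Skip a key equal to the prefix itself (an empty "folder" placeholder).
    files = [key for key in list_keys(prefix) if key != prefix]

    done = {"id": job["id"], "options": job.get("options", {})}
    if files:
        done["files"] = files  # hypothetical field
    else:
        # Hypothetical field: report the empty prefix instead of failing silently.
        done["error"] = f"no files found at s3://speech-to-text/{prefix}"
    return done
```

In a real worker, `list_keys` could wrap boto3's `list_objects_v2` paginator; passing it in as a callable just keeps the logic easy to test without AWS access.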

@jmartin-sul

after some mostly inconclusive discussion between @edsu, @peetucket, and me over the last couple of days on whether to go forward with this or not, we think we've decided to close it for now?

in favor of closing:

  • no need to rework the file list logic that's already implemented
  • if file list logic changes, it'll be in common-accessioning, which more of the team is familiar with, so it should be easier for more folks to deal with bugs or feature requests there than in the speech-to-text python code.
  • more stuff explicitly stated in job messages that we can look at, which might make debugging easier if it looks like the wrong files are getting processed
    • related, didn't come up in discussion, but i just realized: if a file is included in the file list but never gets written to the bucket for processing, we'll get an error instead of a silent skip. but i suspect we'd get a loud error on any sort of typical upload failure to S3 anyway, since that's what we've seen in e.g. preservation, so this point may actually be moot 🤷
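
to illustrate that last point, here's a rough sketch of the check an explicit file list makes possible. the function name `missing_files` and the example filenames are made up for illustration, and this isn't the actual speech-to-text code:

```python
from typing import Iterable, List


def missing_files(expected: Iterable[str], present: Iterable[str]) -> List[str]:
    """Return the files named in the job message that are not in the bucket."""
    present_set = set(present)
    return [name for name in expected if name not in present_set]


# Hypothetical example: the job message lists two files, but only one was uploaded.
listed_in_message = ["interview.mp4", "outtakes.mp4"]
found_in_bucket = ["interview.mp4"]

absent = missing_files(listed_in_message, found_in_bucket)
if absent:
    # With an explicit file list this can be reported as an error;
    # with an id-only message, the missing file would simply never be seen.
    print(f"missing from bucket: {absent}")
```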

in favor of keeping open:

  • simpler job messages (but we don't expect to have huge file lists for STTing for any one object, so not sure this was a practical advantage)

but no one seemed to feel strongly on any of the above reasons, and this isn't a huge change, so it's easy to reopen and run with it if we later think of something compelling that hasn't occurred to us yet.

let me know if i got any of that wrong!
