Conversation
@michaelkedar michaelkedar commented Aug 20, 2025

Added a new recoverer (open to renaming) GKE worker in charge of recovering from database write errors. It subscribes to the existing failed-tasks topic (the dead-letter topic for the importer/worker tasks topic - I'm pretty sure those messages were being discarded before, since the topic had no subscriptions).

  • Currently, tasks coming from the importer/ClusterFuzz are just logged by the recoverer (this will show up in prod).
  • Failures from reading/writing to the GCS bucket backend on staging now publish corresponding messages to failed-tasks. The recoverer will attempt to rectify these errors.
  • Bunch of terraform/cloud deploy/cloud build changes to support this.
  • I also did a little refactor - created a utils file with a get_google_cloud_project() function that multiple things end up using.
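As a rough sketch of the recoverer's dispatch side (the `type` attribute and handler names here are assumptions for illustration, not the actual implementation):

```python
import logging


def route_failed_task(attributes: dict) -> str:
  """Decide how to recover a dead-lettered task from its message attributes.

  The 'type' attribute and the handler names are illustrative assumptions;
  unrecognised tasks are only logged, matching the current behaviour.
  """
  task_type = attributes.get('type', '')
  if task_type.startswith('gcs'):
    return 'handle_gcs_retry'
  logging.info('unhandled failed task: %s', attributes)
  return 'log_only'


# Wiring this to the failed-tasks dead-letter subscription would look roughly
# like (names assumed):
#   subscriber = pubsub_v1.SubscriberClient()
#   subscription = subscriber.subscription_path(project, 'failed-tasks')
#   subscriber.subscribe(subscription, callback=handle_message)
```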

Honestly, I have no idea how much of this code is actually going to run in practice - we haven't really seen any of these GCS errors on staging as of yet. Maybe this will never end up running 🙃

In another PR, I'll make a cron job that goes through all the recently-updated Vulnerability entities to make sure they have GCS records associated with them (sending problem entities to the recoverer to fix)


def handle_gcs_retry(message: pubsub_v1.types.PubsubMessage) -> bool:
  """Handle a failed GCS write."""
  # Check that the record hasn't been written/updated in the meantime.
Contributor
Move the comment down to the line where the check actually takes place

  if not vuln_id:
    logging.error('gcs_missing: message missing id attribute: %s', message)
    return True
  # Re-put the Bug to regenerate the GCS & Datastore entities
Contributor
Should we check if it has been added to GCS before reputting?

Member Author

In this case, it's ok if we don't, since the re-put will completely regenerate the bug with fresh data.
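To make the re-put path concrete, here is a hedged sketch of the recovery handler; `get_bug` and `put_bug` are stand-ins for the real Datastore calls (assumptions, not the actual API used in the PR):

```python
import logging


def handle_gcs_retry(message, get_bug, put_bug) -> bool:
  """Sketch of the recovery path discussed above.

  get_bug/put_bug stand in for the real Datastore calls (assumptions).
  Returning True acks the message.
  """
  vuln_id = message.attributes.get('id')
  if not vuln_id:
    logging.error('gcs_missing: message missing id attribute: %s', message)
    return True  # Nothing identifiable to retry.
  bug = get_bug(vuln_id)
  if bug is None:
    return True  # Entity no longer exists; nothing to regenerate.
  # Re-put the Bug: this regenerates the GCS & Datastore entities from
  # fresh data, which is why a prior "already written" check isn't needed.
  put_bug(bug)
  return True
```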

Comment on lines +129 to +143
  alias_group = osv.AliasGroup.query(
      osv.AliasGroup.bug_ids == vuln_id).get()
  if alias_group is None:
    aliases = []
    aliases_modified = datetime.datetime.now(datetime.UTC)
  else:
    aliases = sorted(set(alias_group.bug_ids) - {vuln_id})
    aliases_modified = alias_group.last_modified
  # Only update the modified time if it's actually being modified
  if vuln_proto.aliases != aliases:
    vuln_proto.aliases[:] = aliases
    if aliases_modified > modified:
      modified = aliases_modified
    else:
      modified = datetime.datetime.now(datetime.UTC)
Contributor

Probably fine to leave here for now, but is this code the same as the code in the alias group service? We probably should extract it.

Member Author

Yeah - it's pretty similar to alias/upstream.
It's hard to share code between these two at the moment due to how the dockerfiles are built.

@michaelkedar michaelkedar merged commit f0e713f into google:master Aug 26, 2025
16 checks passed
michaelkedar added a commit that referenced this pull request Sep 18, 2025
Adds a cron job to validate the consistency of records between GCS and
datastore with the new database format (#3850). This sends messages to
the recoverer (#3821) to attempt to repair.

The record-checker stores information from its previous runs in
datastore in a new JobData entity - where the datastore Key is the name
of the metadata, and the value is stored in a `value` property.
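A minimal model of that entity (a dataclass stand-in for illustration; the concrete property type and the example key name are assumptions):

```python
from dataclasses import dataclass


@dataclass
class JobData:
  """Stand-in for the JobData Datastore entity described above."""
  name: str   # Datastore key name, identifying the piece of metadata.
  value: str  # The stored value; the real property type is an assumption.
```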

I've written this in Go because it's fairly simple and to set a precedent for writing / migrating other components away from Python. I made a copy of the `logging` submodule from `vulnfeeds` into the new toplevel `go` directory - we'll want to consolidate these soon. There's also room for refactoring the record-checker code to have some reusable components for e.g. setting up Datastore, GCS, or Pub/Sub.

I added a validation test script to make sure Datastore entities written in Python are compatible with the Go definitions, and vice versa.
