Conversation
@michaelkedar michaelkedar commented Aug 20, 2025

Added a new recoverer (open to renaming) GKE worker in charge of recovering from database write errors. It subscribes to the existing failed-tasks topic (the dead-letter topic for the importer/worker tasks topic - I'm pretty sure those messages were being discarded before, since the topic had no subscriptions).

  • Currently, tasks coming from the importer/ClusterFuzz are just logged by the recoverer (this will show up in prod).
  • Failures from reading/writing to the GCS bucket backend on staging now publish corresponding messages to failed-tasks. The recoverer will attempt to rectify these errors.
  • Bunch of terraform/cloud deploy/cloud build changes to support this.
  • I also did a little refactor - created a utils file with a get_google_cloud_project() function that multiple things end up using.
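As a rough sketch of the recoverer's dispatch side (the `type` attribute and handler names here are assumptions for illustration, not the actual implementation):

```python
import logging


def route_failed_task(attributes: dict) -> str:
  """Decide how to recover a dead-lettered task from its message attributes.

  The 'type' attribute and the handler names are illustrative assumptions;
  unrecognised tasks are only logged, matching the current behaviour.
  """
  task_type = attributes.get('type', '')
  if task_type.startswith('gcs'):
    return 'handle_gcs_retry'
  logging.info('unhandled failed task: %s', attributes)
  return 'log_only'


# Wiring this to the failed-tasks dead-letter subscription would look roughly
# like (names assumed):
#   subscriber = pubsub_v1.SubscriberClient()
#   subscription = subscriber.subscription_path(project, 'failed-tasks')
#   subscriber.subscribe(subscription, callback=handle_message)
```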

Honestly, I have no idea how much of this code is actually going to run in practice - we haven't really seen any of these GCS errors on staging as of yet. Maybe this will never end up running 🙃

In another PR, I'll make a cron job that goes through all the recently-updated Vulnerability entities to make sure they have GCS records associated with them (sending problem entities to the recoverer to fix)


def handle_gcs_retry(message: pubsub_v1.types.PubsubMessage) -> bool:
  """Handle a failed GCS write."""
  # Check that the record hasn't been written/updated in the meantime.
Contributor
Move the comment down to the line where the check actually takes place

  if not vuln_id:
    logging.error('gcs_missing: message missing id attribute: %s', message)
    return True
  # Re-put the Bug to regenerate the GCS & Datastore entities
Contributor
Should we check if it has been added to GCS before reputting?

Member Author

In this case, it's ok if we don't, since the re-put will completely regenerate the bug with fresh data.
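To make the re-put path concrete, here is a hedged sketch of the recovery handler; `get_bug` and `put_bug` are stand-ins for the real Datastore calls (assumptions, not the actual API used in the PR):

```python
import logging


def handle_gcs_retry(message, get_bug, put_bug) -> bool:
  """Sketch of the recovery path discussed above.

  get_bug/put_bug stand in for the real Datastore calls (assumptions).
  Returning True acks the message.
  """
  vuln_id = message.attributes.get('id')
  if not vuln_id:
    logging.error('gcs_missing: message missing id attribute: %s', message)
    return True  # Nothing identifiable to retry.
  bug = get_bug(vuln_id)
  if bug is None:
    return True  # Entity no longer exists; nothing to regenerate.
  # Re-put the Bug: this regenerates the GCS & Datastore entities from
  # fresh data, which is why a prior "already written" check isn't needed.
  put_bug(bug)
  return True
```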

Comment on lines +129 to +143
  alias_group = osv.AliasGroup.query(
      osv.AliasGroup.bug_ids == vuln_id).get()
  if alias_group is None:
    aliases = []
    aliases_modified = datetime.datetime.now(datetime.UTC)
  else:
    aliases = sorted(set(alias_group.bug_ids) - {vuln_id})
    aliases_modified = alias_group.last_modified
  # Only update the modified time if it's actually being modified
  if vuln_proto.aliases != aliases:
    vuln_proto.aliases[:] = aliases
    if aliases_modified > modified:
      modified = aliases_modified
    else:
      modified = datetime.datetime.now(datetime.UTC)
Contributor

Probably fine to leave here for now, but is this code the same as the code in the alias group service? We probably should extract it.

Member Author

Yeah - it's pretty similar to alias/upstream.
It's hard to share code between these two at the moment due to how the dockerfiles are built.

@michaelkedar michaelkedar merged commit f0e713f into google:master Aug 26, 2025
16 checks passed
michaelkedar added a commit that referenced this pull request Sep 18, 2025
Adds a cron job to validate the consistency of records between GCS and
datastore with the new database format (#3850). This sends messages to
the recoverer (#3821) to attempt to repair.

The record-checker stores information from its previous runs in
datastore in a new JobData entity - where the datastore Key is the name
of the metadata, and the value is stored in a `value` property.
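A minimal model of that entity (a dataclass stand-in for illustration; the concrete property type and the example key name are assumptions):

```python
from dataclasses import dataclass


@dataclass
class JobData:
  """Stand-in for the JobData Datastore entity described above."""
  name: str   # Datastore key name, identifying the piece of metadata.
  value: str  # The stored value; the real property type is an assumption.
```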

I've written this in Go because it's fairly simple and to set a precedent for writing / migrating other components away from Python. I made a copy of the `logging` submodule from `vulnfeeds` into the new toplevel `go` directory - we'll want to consolidate these soon. There's also room for refactoring the record-checker code to have some reusable components for e.g. setting up Datastore, GCS, or Pub/Sub.

I added a validation test script to make sure Datastore entities written in Python are compatible with the Go definitions, and vice versa.
