
Conversation

@michaelkedar (Member) commented on Jul 22, 2025:

This PR begins our database table migrations!
This adds a few new entities into Datastore:

  • Vulnerability: contains a small amount of information about a vulnerability, including the (overall) modified date, as well as some raw fields that are overwritten by our enrichment.
  • ListedVulnerability: the minimal vulnerability information needed for our website's /list page.
  • AffectedVersions: an index for API matching, which should help simplify our API logic and improve performance.

Additionally, full vulnerabilities are stored in the GCS bucket, which will eventually become the new source of truth for the vulnerability records (deprecating the Bug entity). A sketch of what these models might look like follows below.
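For a concrete picture, here's a minimal sketch of what these ndb models might look like — the field names below are my assumptions for illustration, not the PR's actual definitions:

```python
from google.cloud import ndb


class Vulnerability(ndb.Model):
  """Small summary record; field names are hypothetical."""
  # Overall modified date across the record and its enrichment.
  modified = ndb.DateTimeProperty()
  # Raw imported fields that enrichment later overwrites.
  raw_aliases = ndb.StringProperty(repeated=True)
  raw_upstream = ndb.StringProperty(repeated=True)


class ListedVulnerability(ndb.Model):
  """Just enough to render the website's /list page; fields hypothetical."""
  summary = ndb.StringProperty()
  ecosystems = ndb.StringProperty(repeated=True)
  modified = ndb.DateTimeProperty()


class AffectedVersions(ndb.Model):
  """Version-matching index for the API; fields hypothetical."""
  vuln_id = ndb.StringProperty()
  ecosystem = ndb.StringProperty()
  package = ndb.StringProperty()
  versions = ndb.StringProperty(repeated=True)
```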

To populate these, I've added a new _post_put_hook on the Bug entity, and added some logic to the alias & upstream cron jobs to update the entities/GCS as well.
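Roughly, the hook mechanism works like this (the mirroring helper is a hypothetical stand-in; the PR's real hook body isn't shown in this thread):

```python
from google.cloud import ndb


def _mirror_bug(bug):
  """Hypothetical helper: rebuild the derived entities and GCS blobs."""
  # In the PR this would convert the Bug to a Vulnerability proto, upsert
  # the new Datastore entities, and upload the .pb/.json blobs.


class Bug(ndb.Model):
  db_id = ndb.StringProperty()  # stand-in for the real Bug properties

  def _post_put_hook(self, future):
    # ndb invokes this after every put(); surface put() errors first.
    future.check_success()
    _mirror_bug(self)
```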

I've only set this to run on the test instance (and when running tests with the emulator) because I don't want this to block releases on prod if it turns out to have significant performance impacts (especially with the blocking writes to the bucket in the alias & upstream crons).

Some other misc things to note:

  • I'm writing both .pb and .json files to the bucket, because the proto files are a bit easier to work with in code, but the JSON is more usable externally. It's probably better to just write the .pb and have the exporter create the .json files.
  • I had to remove the transactions from the reput tools, because ndb doesn't like nested transactions
  • I have written my own (minimal) mocker for the google.cloud.storage operations, and added it to the datastore emulator starter code (a rough sketch of such a mock follows this list).
  • I've made a few changes to some tests that were creating incomplete Bugs (which caused issues when trying to convert them for writing)
  • I have a TODO to create another worker/cron job that retries failed GCS writes - that's probably the next step.
  • Before we merge this, we should pause the alias/upstream cron jobs, then re-put every Bug (which will take some time) to avoid noisy error logs.
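For the mocker mentioned above, a minimal in-memory stand-in for the google.cloud.storage pieces this PR touches might look like this (my illustration, not the PR's actual code):

```python
class FakeBlob:
  """In-memory stand-in for google.cloud.storage.Blob."""

  def __init__(self, bucket, name):
    self._bucket = bucket
    self.name = name

  def upload_from_string(self, data):
    self._bucket.data[self.name] = data

  def download_as_bytes(self):
    return self._bucket.data[self.name]


class FakeBucket:
  """In-memory stand-in for google.cloud.storage.Bucket."""

  def __init__(self, name):
    self.name = name
    self.data = {}  # blob name -> contents

  def blob(self, name):
    return FakeBlob(self, name)

  def get_blob(self, name):
    # Mirror the real API: return None for missing blobs.
    return FakeBlob(self, name) if name in self.data else None


class FakeStorageClient:
  """Stand-in for google.cloud.storage.Client, patched in during tests."""

  def __init__(self):
    self._buckets = {}

  def bucket(self, name):
    return self._buckets.setdefault(name, FakeBucket(name))
```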

@michaelkedar marked this pull request as ready for review on July 30, 2025.
@another-rex (Contributor) commented:

> Before we merge this, we should pause the alias/upstream cron jobs, then re-put every Bug (which will take some time) to avoid noisy error logs.

The reputting is after merging, right?

@another-rex (Contributor) left a comment:

Looking great!

I'm not sure how to review models_test.py though...

Comment on lines 108 to 119
```python
bucket = gcs.get_osv_bucket()
pb_blob = bucket.get_blob(os.path.join(gcs.VULN_PB_PATH, vuln_id + '.pb'))
if pb_blob is None:
  if osv.Vulnerability.get_by_id(vuln_id) is not None:
    logging.error('vulnerability not in GCS - %s', vuln_id)
    # TODO(michaelkedar): send pub/sub message to reimport
  return
try:
  vuln_proto = osv.vulnerability_pb2.Vulnerability.FromString(
      pb_blob.download_as_bytes())
except Exception:
  logging.exception('failed to download %s protobuf from GCS', vuln_id)
```
@another-rex (Contributor):

Can all of this be abstracted into a single get-by-ID function? I feel like this would be used in quite a few places, and if we ever decide to put the .pb directly in Datastore or another DB, we can just change the implementation.

@another-rex (Contributor):

Maybe in the new gcs module.

@michaelkedar (Member, author):

Created functions in the gcs module for reading and writing.
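Something along these lines, perhaps — get_osv_bucket and VULN_PB_PATH appear in the diff above, but the function names, bucket configuration, and signatures here are my guesses:

```python
import os

from google.cloud import storage

from osv import vulnerability_pb2

# Assumed configuration, for illustration only.
OSV_BUCKET = os.environ.get('OSV_VULN_BUCKET', 'osv-test-vulns')
VULN_PB_PATH = 'pb'  # assumed prefix for .pb blobs


def get_osv_bucket():
  """Return the configured GCS bucket for vulnerability records."""
  return storage.Client().bucket(OSV_BUCKET)


def upload_vulnerability(vuln: vulnerability_pb2.Vulnerability):
  """Serialize and write a vulnerability proto to GCS."""
  bucket = get_osv_bucket()
  blob = bucket.blob(os.path.join(VULN_PB_PATH, vuln.id + '.pb'))
  blob.upload_from_string(vuln.SerializeToString())


def download_vulnerability(vuln_id: str):
  """Read a vulnerability proto from GCS, or None if it's not there."""
  bucket = get_osv_bucket()
  blob = bucket.get_blob(os.path.join(VULN_PB_PATH, vuln_id + '.pb'))
  if blob is None:
    return None
  return vulnerability_pb2.Vulnerability.FromString(blob.download_as_bytes())
```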

```python
  # TODO: Raise exception
  return
if alias_group is None:
  modified = datetime.datetime.now(datetime.UTC)
```
@another-rex (Contributor):

Should modified unconditionally be now? If the aliases don't change, we probably don't want to update the modified date, right?

@another-rex (Contributor):

I guess the function doc does have the assumption that something was just deleted, but if it's easy and there are no additional side effects, we should only update when there was an alias group in the first place.

@michaelkedar (Member, author):

I'm pretty sure the only way _update_vuln_with_group will be called in main with alias_group = None is if the vuln was in an alias group already.
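For context, a sketch of the guard under discussion — AliasGroup's fields and the surrounding logic are my assumptions:

```python
import datetime


def _update_vuln_with_group(vuln, alias_group):
  """Hypothetical sketch of refreshing a vuln's aliases from its group."""
  if alias_group is None:
    # Only reached when the vuln previously belonged to a group that has
    # since been deleted, so bumping `modified` to now is intentional.
    modified = datetime.datetime.now(datetime.UTC)
    aliases = []
  else:
    modified = alias_group.last_modified
    aliases = sorted(alias_group.bug_ids)
  vuln.aliases = aliases
  vuln.modified = modified
  vuln.put()
```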

osv/models.py (outdated):


```python
class AffectedVersions(ndb.Model):
  """AffectedVersions entry."""
```
@another-rex (Contributor):

Same here, please expand this a bit more, like the ListedVulnerability model.

@michaelkedar (Member, author):

Added what it's to be used for to the docstring.

```python
        (vulnerability_pb2.Severity.Type.Name(sev.type), sev.score))

  search_indices = set()
  search_indices.update(_tokenize(vulnerability.id))
```
@another-rex (Contributor):

Hmm... since we are doing this big reput anyway, we could just .lower() all of this and make our search case-insensitive, which we should have done.

But it's probably a good idea not to change behavior too much for now.

@michaelkedar (Member, author):

It is mostly lower()ed: _tokenize() does lower-case the tokens. The only things not made lowercase are the repo URLs, but they're probably already lowercase anyway.
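Illustrating the point about _tokenize, here's a minimal sketch of the kind of lower-casing tokenizer being described (not OSV's exact implementation):

```python
import re


def _tokenize(value: str) -> set[str]:
  """Split an ID like 'CVE-2025-1234' into lowercase search tokens."""
  if not value:
    return set()
  value_lower = value.lower()
  # Keep both the individual word chunks and the whole lowercased string,
  # so exact and partial matches can hit the index.
  return set(re.split(r'\W+', value_lower)) | {value_lower}


# e.g. _tokenize('CVE-2025-1234') -> {'cve', '2025', '1234', 'cve-2025-1234'}
```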

@michaelkedar merged commit a881d9e into google:master on Aug 5, 2025 (16 checks passed).
michaelkedar added a commit that referenced this pull request on Aug 6, 2025:

> In #3708, writing both JSON and protos is proving to be fairly expensive. Just write the `.pb`s - we can have other jobs create the JSON files later.
