
Conversation

@michaelkedar (Member) commented on Jul 22, 2025:

This PR begins our database table migrations!
This adds a few new entities into Datastore:

  • Vulnerability: contains a small amount of information about a vulnerability, including the (overall) modified date, as well as some raw fields that are overwritten by our enrichment.
  • ListedVulnerability: the minimal vulnerability information needed for our website's /list page.
  • AffectedVersions: an index for API matching, which should help simplify our API logic and improve performance.

Additionally, full vulnerabilities are stored in the GCS bucket, which will eventually become the new source of truth for the vulnerability records (deprecating the Bug entity). A sketch of what these models might look like follows below.
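For a concrete picture, here's a minimal sketch of what these ndb models might look like — the field names below are my assumptions for illustration, not the PR's actual definitions:

```python
from google.cloud import ndb


class Vulnerability(ndb.Model):
  """Small summary record; field names are hypothetical."""
  # Overall modified date across the record and its enrichment.
  modified = ndb.DateTimeProperty()
  # Raw imported fields that enrichment later overwrites.
  raw_aliases = ndb.StringProperty(repeated=True)
  raw_upstream = ndb.StringProperty(repeated=True)


class ListedVulnerability(ndb.Model):
  """Just enough to render the website's /list page; fields hypothetical."""
  summary = ndb.StringProperty()
  ecosystems = ndb.StringProperty(repeated=True)
  modified = ndb.DateTimeProperty()


class AffectedVersions(ndb.Model):
  """Version-matching index for the API; fields hypothetical."""
  vuln_id = ndb.StringProperty()
  ecosystem = ndb.StringProperty()
  package = ndb.StringProperty()
  versions = ndb.StringProperty(repeated=True)
```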

To populate these, I've added a new _post_put_hook on the Bug entity, and added some logic to the alias & upstream cron jobs to update the entities/GCS as well.
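Roughly, the hook mechanism works like this (the mirroring helper is a hypothetical stand-in; the PR's real hook body isn't shown in this thread):

```python
from google.cloud import ndb


def _mirror_bug(bug):
  """Hypothetical helper: rebuild the derived entities and GCS blobs."""
  # In the PR this would convert the Bug to a Vulnerability proto, upsert
  # the new Datastore entities, and upload the .pb/.json blobs.


class Bug(ndb.Model):
  db_id = ndb.StringProperty()  # stand-in for the real Bug properties

  def _post_put_hook(self, future):
    # ndb invokes this after every put(); surface put() errors first.
    future.check_success()
    _mirror_bug(self)
```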

I've only set this to run on the test instance (and when running tests with the emulator) because I don't want this to block releases on prod if it turns out to have significant performance impacts (especially with the blocking writes to the bucket in the alias & upstream crons).

Some other misc things to note:

  • I'm writing both .pb and .json files to the bucket, because the proto files are a bit easier to work with in code, but the JSON is more usable externally. It's probably better to just write the .pb and have the exporter create the .json files.
  • I had to remove the transactions from the reput tools, because ndb doesn't like nested transactions
  • I have written my own (minimal) mocker for the google.cloud.storage operations, and added it to the datastore emulator starter code (a rough sketch of such a mock follows this list).
  • I've made a few changes to some tests that were creating incomplete Bugs (which caused issues when trying to convert them for writing)
  • I have a TODO to create another worker/cron job that retries failed GCS writes - that's probably the next step.
  • Before we merge this, we should pause the alias/upstream cron jobs, then re-put every Bug (which will take some time) to avoid noisy error logs.
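For the mocker mentioned above, a minimal in-memory stand-in for the google.cloud.storage pieces this PR touches might look like this (my illustration, not the PR's actual code):

```python
class FakeBlob:
  """In-memory stand-in for google.cloud.storage.Blob."""

  def __init__(self, bucket, name):
    self._bucket = bucket
    self.name = name

  def upload_from_string(self, data):
    self._bucket.data[self.name] = data

  def download_as_bytes(self):
    return self._bucket.data[self.name]


class FakeBucket:
  """In-memory stand-in for google.cloud.storage.Bucket."""

  def __init__(self, name):
    self.name = name
    self.data = {}  # blob name -> contents

  def blob(self, name):
    return FakeBlob(self, name)

  def get_blob(self, name):
    # Mirror the real API: return None for missing blobs.
    return FakeBlob(self, name) if name in self.data else None


class FakeStorageClient:
  """Stand-in for google.cloud.storage.Client, patched in during tests."""

  def __init__(self):
    self._buckets = {}

  def bucket(self, name):
    return self._buckets.setdefault(name, FakeBucket(name))
```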

@michaelkedar marked this pull request as ready for review on July 30, 2025.
@another-rex (Contributor) commented:

> Before we merge this, we should pause the alias/upstream cron jobs, then re-put every Bug (which will take some time) to avoid noisy error logs.

The reputting is after merging, right?

@another-rex (Contributor) left a comment:

Looking great!

I'm not sure how to review models_test.py though...

Comment on lines 108 to 119
```python
bucket = gcs.get_osv_bucket()
pb_blob = bucket.get_blob(os.path.join(gcs.VULN_PB_PATH, vuln_id + '.pb'))
if pb_blob is None:
  if osv.Vulnerability.get_by_id(vuln_id) is not None:
    logging.error('vulnerability not in GCS - %s', vuln_id)
    # TODO(michaelkedar): send pub/sub message to reimport
  return
try:
  vuln_proto = osv.vulnerability_pb2.Vulnerability.FromString(
      pb_blob.download_as_bytes())
except Exception:
  logging.exception('failed to download %s protobuf from GCS', vuln_id)
```
@another-rex (Contributor):

Can all of this be abstracted into a single get-by-ID function? I feel like this would be used in quite a few places, and if we ever decide to put the .pb directly in Datastore or another DB, we can just change the implementation.

@another-rex (Contributor):

Maybe in the new gcs module.

@michaelkedar (Member, author):

Created functions in the gcs module for reading and writing.
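Something along these lines, perhaps — get_osv_bucket and VULN_PB_PATH appear in the diff above, but the function names, bucket configuration, and signatures here are my guesses:

```python
import os

from google.cloud import storage

from osv import vulnerability_pb2

# Assumed configuration, for illustration only.
OSV_BUCKET = os.environ.get('OSV_VULN_BUCKET', 'osv-test-vulns')
VULN_PB_PATH = 'pb'  # assumed prefix for .pb blobs


def get_osv_bucket():
  """Return the configured GCS bucket for vulnerability records."""
  return storage.Client().bucket(OSV_BUCKET)


def upload_vulnerability(vuln: vulnerability_pb2.Vulnerability):
  """Serialize and write a vulnerability proto to GCS."""
  bucket = get_osv_bucket()
  blob = bucket.blob(os.path.join(VULN_PB_PATH, vuln.id + '.pb'))
  blob.upload_from_string(vuln.SerializeToString())


def download_vulnerability(vuln_id: str):
  """Read a vulnerability proto from GCS, or None if it's not there."""
  bucket = get_osv_bucket()
  blob = bucket.get_blob(os.path.join(VULN_PB_PATH, vuln_id + '.pb'))
  if blob is None:
    return None
  return vulnerability_pb2.Vulnerability.FromString(blob.download_as_bytes())
```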

```python
  # TODO: Raise exception
  return
if alias_group is None:
  modified = datetime.datetime.now(datetime.UTC)
```
@another-rex (Contributor):

Should modified unconditionally be now? If the aliases don't change, we probably don't want to update the modified date, right?

@another-rex (Contributor):

I guess the function doc does have the assumption that something was just deleted, but if it's easy and there are no additional side effects, we should only update when there was an alias group in the first place.

@michaelkedar (Member, author):

I'm pretty sure the only way _update_vuln_with_group will be called in main with alias_group = None is if the vuln was in an alias group already.
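For context, a sketch of the guard under discussion — AliasGroup's fields and the surrounding logic are my assumptions:

```python
import datetime


def _update_vuln_with_group(vuln, alias_group):
  """Hypothetical sketch of refreshing a vuln's aliases from its group."""
  if alias_group is None:
    # Only reached when the vuln previously belonged to a group that has
    # since been deleted, so bumping `modified` to now is intentional.
    modified = datetime.datetime.now(datetime.UTC)
    aliases = []
  else:
    modified = alias_group.last_modified
    aliases = sorted(alias_group.bug_ids)
  vuln.aliases = aliases
  vuln.modified = modified
  vuln.put()
```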

osv/models.py (outdated):


```python
class AffectedVersions(ndb.Model):
  """AffectedVersions entry."""
```
@another-rex (Contributor):

Same here, please expand this a bit more, like the ListedVulnerability model.

@michaelkedar (Member, author):

Added what it's to be used for to the docstring.

```python
        (vulnerability_pb2.Severity.Type.Name(sev.type), sev.score))

  search_indices = set()
  search_indices.update(_tokenize(vulnerability.id))
```
@another-rex (Contributor):

Hmm... since we are doing this big reput anyway, we could just .lower() all of this and make our search case-insensitive, which we should have done.

But it's probably a good idea not to change behavior too much for now.

@michaelkedar (Member, author):

It is mostly lower()ed: _tokenize() does lower-case the tokens. The only things not made lowercase are the repo URLs, but they're probably already lowercase anyway.
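Illustrating the point about _tokenize, here's a minimal sketch of the kind of lower-casing tokenizer being described (not OSV's exact implementation):

```python
import re


def _tokenize(value: str) -> set[str]:
  """Split an ID like 'CVE-2025-1234' into lowercase search tokens."""
  if not value:
    return set()
  value_lower = value.lower()
  # Keep both the individual word chunks and the whole lowercased string,
  # so exact and partial matches can hit the index.
  return set(re.split(r'\W+', value_lower)) | {value_lower}


# e.g. _tokenize('CVE-2025-1234') -> {'cve', '2025', '1234', 'cve-2025-1234'}
```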

@michaelkedar merged commit a881d9e into google:master on Aug 5, 2025 (16 checks passed).
michaelkedar added a commit that referenced this pull request on Aug 6, 2025:

> In #3708, writing both JSON and protos is proving to be fairly expensive. Just write the `.pb`s - we can have other jobs create the JSON files later.
