feat: Have importer write Bugs before sending to worker #4344

michaelkedar · 2025-11-12T23:21:06Z

Added some logic to make the importer update the database as soon as it receives updates from the data sources, instead of having to wait for the worker to finish enumerating versions / git commits.

To prevent churn, the importer first checks whether the affected packages have changed in the upstream source. If they haven't, it won't overwrite them (so we can keep our previous worker enrichment). The worker will still run for every record, however.
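The overwrite decision described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `StoredBug`, `simple_checksum`, and `apply_import` are hypothetical names, the records are plain dicts rather than protos or ndb entities, and the checksum is a simplified placeholder.

```python
import copy
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class StoredBug:
  """Hypothetical stand-in for a stored Bug entity."""
  affected: list = field(default_factory=list)
  affected_checksum: str = ''


def simple_checksum(affected: list) -> str:
  """Simplified placeholder checksum over the raw affected list."""
  data = json.dumps(affected, sort_keys=True).encode()
  return hashlib.sha256(data).hexdigest()


def apply_import(bug: StoredBug, upstream_affected: list) -> bool:
  """Overwrites bug.affected only if the upstream affected data changed.

  Returns True if the stored data was overwritten.
  """
  new_checksum = simple_checksum(upstream_affected)
  if bug.affected_checksum == new_checksum:
    # Upstream is unchanged: keep the (possibly worker-enriched) data.
    return False
  bug.affected = copy.deepcopy(upstream_affected)
  bug.affected_checksum = new_checksum
  return True
```

In this sketch the checksum is always computed from the raw upstream data, never from the stored (enriched) data, which is what lets an unchanged upstream record leave the enrichment alone.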

Bugs put by the importer will use the modified date from the source, while bugs modified by the worker now use the time the worker finished (instead of the source time like before).
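As a rough sketch of the two timestamp rules above (not the actual implementation): `BugRecord`, `put_from_importer`, and `put_from_worker` are hypothetical names, and a dataclass stands in for the Bug entity.

```python
import datetime
from dataclasses import dataclass
from typing import Optional


@dataclass
class BugRecord:
  """Hypothetical stand-in for a stored Bug entity."""
  bug_id: str
  last_modified: Optional[datetime.datetime] = None


def put_from_importer(bug: BugRecord,
                      source_modified: datetime.datetime) -> None:
  """Importer writes use the modified date reported by the source."""
  bug.last_modified = source_modified


def put_from_worker(bug: BugRecord,
                    now: Optional[datetime.datetime] = None) -> None:
  """Worker writes use the time the worker finished."""
  if now is None:
    now = datetime.datetime.now(datetime.timezone.utc)
  bug.last_modified = now
```

The optional `now` parameter is only there so the behaviour can be exercised with a fixed clock in tests.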

I've also modified the REST test to reduce an absolutely massive vulnerability record, and made osv.utcnow() return a time in the future instead of in 2021, which was causing weirdness with time going backwards.

This whole logic is a bit temporary - I eventually want to rewrite the importer (in Go) to stop relying on the Bug entities, and remove much of the OSS-Fuzz-specific behaviour.

michaelkedar · 2025-11-12T23:40:49Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a significant change to the import process by having the importer write Bug entities directly to the database. This is a great initiative to reduce the latency for new vulnerability data. The implementation includes preprocessing and checksum logic to prevent overwriting worker enrichments, and it's supported by a comprehensive set of new tests. My main feedback focuses on opportunities to improve maintainability by reducing code duplication. Several new functions in importer.py replicate logic from worker.py or could be better placed in shared modules. Addressing this duplication will make the codebase more robust and easier to manage in the long run.


gemini-code-assist · 2025-11-12T23:42:35Z

gcp/workers/importer/importer.py

+def preprocess_vuln(vuln: vulnerability_pb2.Vulnerability):
+  """Do preprocessing steps on vulnerability that the worker does."""
+  # Duplicating parts of _do_update() in worker.py
+  # maybe_normalize_package_names:
+  for affected in vuln.affected:
+    if not affected.package.ecosystem:
+      continue
+    affected.package.name = osv.ecosystems.maybe_normalize_package_names(
+        affected.package.name, affected.package.ecosystem)
+  # skipping fix_invalid_ghsa because I don't think it's a problem anymore.
+  # filter_unknown_ecosystems:
+  filtered = []
+  for affected in vuln.affected:
+    if not affected.HasField('package'):
+      filtered.append(affected)
+    elif osv.ecosystems.is_known(affected.package.ecosystem):
+      filtered.append(affected)
+    else:
+      logging.error('%s contains unknown ecosystem "%s"', vuln.id,
+                    affected.package.ecosystem)
+  del vuln.affected[:]
+  vuln.affected.extend(filtered)


As noted in the comment on line 1160, this function duplicates parts of _do_update() in worker.py. To avoid code duplication and ensure consistency between the importer and the worker, this preprocessing logic should be extracted into a shared utility function or module that both can import and use.

michaelkedar · 2025-11-13

Sharing code between the worker and importer is a little bit cumbersome.
I don't really want to spend too much effort refactoring when I want to rewrite it all anyway.
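For readers without the protobuf types at hand, the two preprocessing steps in `preprocess_vuln` can be illustrated on plain dicts. This is a sketch under stated assumptions: `KNOWN_ECOSYSTEMS`, `normalize_name`, and `preprocess` are hypothetical stand-ins for `osv.ecosystems.is_known`, `osv.ecosystems.maybe_normalize_package_names`, and the function in the diff, and the normalization rule shown (PEP 503-style normalization of PyPI names) is chosen purely for illustration.

```python
import logging
import re

KNOWN_ECOSYSTEMS = {'PyPI', 'npm', 'Go'}  # Hypothetical allowlist.


def normalize_name(name: str, ecosystem: str) -> str:
  """Hypothetical normalizer: PEP 503-style normalization for PyPI only."""
  if ecosystem == 'PyPI':
    return re.sub(r'[-_.]+', '-', name).lower()
  return name


def preprocess(vuln: dict) -> None:
  """Normalizes package names, then drops entries with unknown ecosystems."""
  filtered = []
  for affected in vuln.get('affected', []):
    package = affected.get('package')
    if package is None:
      # Entries without a package are kept untouched.
      filtered.append(affected)
      continue
    ecosystem = package.get('ecosystem', '')
    if ecosystem:
      package['name'] = normalize_name(package.get('name', ''), ecosystem)
    if ecosystem in KNOWN_ECOSYSTEMS:
      filtered.append(affected)
    else:
      logging.error('%s contains unknown ecosystem "%s"', vuln.get('id'),
                    ecosystem)
  vuln['affected'] = filtered
```

Like the function in the diff, the sketch mutates the record in place, so the same normalized data would feed both the database write and any checksum computed afterwards.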

gemini-code-assist · 2025-11-12T23:42:35Z

gcp/workers/importer/importer.py

+def compute_raw_affected_checksum(vuln: vulnerability_pb2.Vulnerability):
+  """Computes a checksum of the affected array from a vulnerability."""
+  # Grab the minimum information out of affected and sort.
+  # TODO(michaelkedar): This should be in models.py
+  aff = []
+  # (ecosystem, package, [
+  #   (type, repo, [(introduced, fixed, last_affected, limit)])
+  # ], [versions])
+  for affected in vuln.affected:
+    ecosystem = affected.package.ecosystem
+    package = affected.package.name
+    versions = sorted(affected.versions)
+    ranges = []
+    for r in affected.ranges:
+      t = r.type
+      repo = r.repo
+      events = sorted(
+          (e.introduced, e.fixed, e.last_affected, e.limit) for e in r.events)
+      ranges.append((t, repo, events))
+    ranges.sort()
+    aff.append((ecosystem, package, ranges, versions))
+  aff.sort()
+  b = json.dumps(aff).encode()
+  return osv.sha256_bytes(b)


I agree with the TODO on line 1247. This function computes a checksum based on the raw Vulnerability data structure. It would be more appropriate to locate this function within osv/models.py alongside the data models it operates on, or in a dedicated utility module for vulnerability data processing. This would improve code organization and make the function's purpose clearer.
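The key property of this checksum is that it does not depend on the order of the entries. A self-contained approximation on plain dicts makes that easy to check. Note the assumptions: `affected_checksum_from_dicts` is a hypothetical name, dicts stand in for the `Vulnerability` proto, and `hashlib` stands in for `osv.sha256_bytes`.

```python
import hashlib
import json


def affected_checksum_from_dicts(affected_list: list) -> str:
  """Order-insensitive checksum of packages, ranges, events and versions."""
  aff = []
  for affected in affected_list:
    package = affected.get('package', {})
    versions = sorted(affected.get('versions', []))
    ranges = []
    for r in affected.get('ranges', []):
      # Each event becomes a fixed-shape tuple so events sort consistently.
      events = sorted((e.get('introduced', ''), e.get('fixed', ''),
                       e.get('last_affected', ''), e.get('limit', ''))
                      for e in r.get('events', []))
      ranges.append((r.get('type', ''), r.get('repo', ''), events))
    ranges.sort()
    aff.append((package.get('ecosystem', ''), package.get('name', ''),
                ranges, versions))
  aff.sort()
  return hashlib.sha256(json.dumps(aff).encode()).hexdigest()
```

Because every level is sorted before serialization, reordering affected entries, versions, or events yields the same digest, while changing any of the included values yields a different one. Anything outside those fields is ignored by this sketch.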

gemini-code-assist · 2025-11-12T23:42:35Z

osv/models.py

+  # A checksum of the Affected packages and versions for the importer.
+  # TODO(michaelkedar): move this to Vulnerability entity
+  affected_checksum: str = ndb.TextProperty()


The TODO comment here is spot on. The affected_checksum is calculated from the raw vulnerability data before any worker enrichment. Placing it on the Bug model, which represents the enriched and processed data, is a bit counter-intuitive. It would be more logical to have this on a Vulnerability entity that represents the raw data. While this might be a pragmatic choice for now, it's a good idea to plan for moving this as part of future refactoring.

michaelkedar · 2025-11-13

This will be added when the importer is refactored to use the Vulnerability entities.

another-rex

Nice!


michaelkedar added 5 commits November 12, 2025 14:13

omg this file is huge

540db46

I've been to the year 3000

cac6096

do the thing

7a880e8

don't go back

9254b2e

🪵

be57146

gemini-code-assist bot reviewed Nov 12, 2025


deduplicate

b90d6f4

michaelkedar marked this pull request as ready for review November 13, 2025 02:54

Merge branch 'master' into fast-import

3b187d2

michaelkedar requested review from another-rex, cuixq and jess-lowe November 13, 2025 02:57

michaelkedar mentioned this pull request Nov 13, 2025

Improve the OSV.dev database situation #3850

Open

another-rex reviewed Nov 14, 2025


gcp/workers/importer/importer.py (outdated, resolved)

gcp/workers/importer/importer.py (resolved)

gcp/workers/importer/importer.py (resolved)

gcp/workers/importer/importer.py (outdated, resolved)

review comments

7c96827

another-rex approved these changes Nov 18, 2025


Merge branch 'master' into fast-import

dfbd3be

michaelkedar merged commit c9c78ed into google:master Nov 19, 2025
17 checks passed

michaelkedar deleted the fast-import branch November 19, 2025 23:50
