
Conversation

@michaelkedar (Member) commented Nov 12, 2025

Added some logic to make the importer update the database when it receives updates from the data sources, instead of having to wait for the worker to finish enumerating versions / git commits.

To prevent churn, the importer first checks whether the affected packages have changed in the upstream source. If they haven't, it won't overwrite them (so we can keep our previous worker enrichment). The worker will still run for every record, however.

Bugs put by the importer will use the modified date from the source, while bugs modified by the worker now use the time the worker finished (instead of the source time as before).
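
Roughly, the new importer path looks like the sketch below. preprocess_vuln(), compute_raw_affected_checksum() and the affected_checksum field are the pieces added in this PR; the function name, the lookup and the exact Bug fields/methods are illustrative, not the actual importer code:

def import_vulnerability(vuln, source_modified):
  """Illustrative sketch of the importer's direct-write path."""
  preprocess_vuln(vuln)                           # same steps the worker does
  checksum = compute_raw_affected_checksum(vuln)  # new helper in this PR
  existing = osv.Bug.get_by_id(vuln.id)           # placeholder lookup
  if existing and existing.affected_checksum == checksum:
    # Affected packages unchanged upstream: keep the worker's enrichment.
    return
  bug = existing or osv.Bug(id=vuln.id)
  bug.update_from_vulnerability(vuln)  # assumed helper; real field mapping differs
  bug.affected_checksum = checksum
  bug.last_modified = source_modified  # importer writes use the source's time
  bug.put()  # the worker still re-enumerates and stamps its own finish time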

I've also modified the REST test to shrink an absolutely massive vulnerability record, and made osv.utcnow() return a time in the future instead of one in 2021, which was causing weirdness with time going backwards.

This whole logic is a bit temporary - I eventually want to rewrite the importer (in Go) to stop relying on the Bug entities and to remove much of the OSS-Fuzz-specific behaviour.

@michaelkedar (Member, Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a significant change to the import process by having the importer write Bug entities directly to the database. This is a great initiative to reduce the latency for new vulnerability data. The implementation includes preprocessing and checksum logic to prevent overwriting worker enrichments, and it's supported by a comprehensive set of new tests. My main feedback focuses on opportunities to improve maintainability by reducing code duplication. Several new functions in importer.py replicate logic from worker.py or could be better placed in shared modules. Addressing this duplication will make the codebase more robust and easier to manage in the long run.

Comment on lines +1158 to +1179
def preprocess_vuln(vuln: vulnerability_pb2.Vulnerability):
  """Do preprocessing steps on vulnerability that the worker does."""
  # Duplicating parts of _do_update() in worker.py
  # maybe_normalize_package_names:
  for affected in vuln.affected:
    if not affected.package.ecosystem:
      continue
    affected.package.name = osv.ecosystems.maybe_normalize_package_names(
        affected.package.name, affected.package.ecosystem)
  # skipping fix_invalid_ghsa because I don't think it's a problem anymore.
  # filter_unknown_ecosystems:
  filtered = []
  for affected in vuln.affected:
    if not affected.HasField('package'):
      filtered.append(affected)
    elif osv.ecosystems.is_known(affected.package.ecosystem):
      filtered.append(affected)
    else:
      logging.error('%s contains unknown ecosystem "%s"', vuln.id,
                    affected.package.ecosystem)
  del vuln.affected[:]
  vuln.affected.extend(filtered)
@gemini-code-assist bot (severity: medium):

As noted in the comment on line 1160, this function duplicates parts of _do_update() in worker.py. To avoid code duplication and ensure consistency between the importer and the worker, this preprocessing logic should be extracted into a shared utility function or module that both can import and use.

@michaelkedar (Member, Author) replied:

Sharing code between the worker and importer is a little bit cumbersome.
I don't really want to spend too much effort refactoring when I want to rewrite it all anyway.

Comment on lines +1244 to +1267
def compute_raw_affected_checksum(vuln: vulnerability_pb2.Vulnerability):
  """Computes a checksum of the affected array from a vulnerability."""
  # Grab the minimum information out of affected and sort.
  # TODO(michaelkedar): This should be in models.py
  aff = []
  # (ecosystem, package, [
  #   (type, repo, [(introduced, fixed, last_affected, limit)])
  # ], [versions])
  for affected in vuln.affected:
    ecosystem = affected.package.ecosystem
    package = affected.package.name
    versions = sorted(affected.versions)
    ranges = []
    for r in affected.ranges:
      t = r.type
      repo = r.repo
      events = sorted(
          (e.introduced, e.fixed, e.last_affected, e.limit) for e in r.events)
      ranges.append((t, repo, events))
    ranges.sort()
    aff.append((ecosystem, package, ranges, versions))
  aff.sort()
  b = json.dumps(aff).encode()
  return osv.sha256_bytes(b)
@gemini-code-assist bot (severity: medium):

I agree with the TODO on line 1247. This function computes a checksum based on the raw Vulnerability data structure. It would be more appropriate to locate this function within osv/models.py alongside the data models it operates on, or in a dedicated utility module for vulnerability data processing. This would improve code organization and make the function's purpose clearer.
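
Because everything is sorted before hashing, the checksum is insensitive to upstream reordering. A minimal sketch of that behaviour, assuming the protos import as shown (the ID and package name are made up):

# Two records with the same affected data in a different order hash the same,
# so the importer treats them as unchanged and keeps the worker's enrichment.
from osv import vulnerability_pb2

def make_vuln(versions):
  vuln = vulnerability_pb2.Vulnerability(id='EXAMPLE-2025-0001')
  affected = vuln.affected.add()
  affected.package.ecosystem = 'PyPI'
  affected.package.name = 'example-package'
  affected.versions.extend(versions)
  return vuln

a = compute_raw_affected_checksum(make_vuln(['1.0.0', '1.2.0']))
b = compute_raw_affected_checksum(make_vuln(['1.2.0', '1.0.0']))
assert a == b  # versions are sorted before hashing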

Comment on lines +326 to +328
# A checksum of the Affected packages and versions for the importer.
# TODO(michaelkedar): move this to Vulnerability entity
affected_checksum: str = ndb.TextProperty()
@gemini-code-assist bot (severity: medium):

The TODO comment here is spot on. The affected_checksum is calculated from the raw vulnerability data before any worker enrichment. Placing it on the Bug model, which represents the enriched and processed data, is a bit counter-intuitive. It would be more logical to have this on a Vulnerability entity that represents the raw data. While this might be a pragmatic choice for now, it's a good idea to plan for moving this as part of future refactoring.

@michaelkedar (Member, Author) replied:

This will be added when the importer is refactored to use the Vulnerability entities.

@michaelkedar marked this pull request as ready for review on November 13, 2025 at 02:54.
@another-rex (Contributor) left a comment:

Nice!

@michaelkedar merged commit c9c78ed into google:master on Nov 19, 2025.
17 checks passed
@michaelkedar deleted the fast-import branch on November 19, 2025 at 23:50.