-
Notifications
You must be signed in to change notification settings - Fork 274
feat: Have importer write Bugs before sending to worker #4344
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a significant change to the import process by having the importer write Bug entities directly to the database. This is a great initiative to reduce the latency for new vulnerability data. The implementation includes preprocessing and checksum logic to prevent overwriting worker enrichments, and it's supported by a comprehensive set of new tests. My main feedback focuses on opportunities to improve maintainability by reducing code duplication. Several new functions in importer.py replicate logic from worker.py or could be better placed in shared modules. Addressing this duplication will make the codebase more robust and easier to manage in the long run.
| def preprocess_vuln(vuln: vulnerability_pb2.Vulnerability): | ||
| """Do preprocessing steps on vulnerability that the worker does.""" | ||
| # Duplicating parts of _do_update() in worker.py | ||
| # maybe_normalize_package_names: | ||
| for affected in vuln.affected: | ||
| if not affected.package.ecosystem: | ||
| continue | ||
| affected.package.name = osv.ecosystems.maybe_normalize_package_names( | ||
| affected.package.name, affected.package.ecosystem) | ||
| # skipping fix_invalid_ghsa because I don't think it's a problem anymore. | ||
| # filter_unknown_ecoystems: | ||
| filtered = [] | ||
| for affected in vuln.affected: | ||
| if not affected.HasField('package'): | ||
| filtered.append(affected) | ||
| elif osv.ecosystems.is_known(affected.package.ecosystem): | ||
| filtered.append(affected) | ||
| else: | ||
| logging.error('%s contains unknown ecosystem "%s"', vuln.id, | ||
| affected.package.ecosystem) | ||
| del vuln.affected[:] | ||
| vuln.affected.extend(filtered) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in the comment on line 1160, this function duplicates parts of _do_update() in worker.py. To avoid code duplication and ensure consistency between the importer and the worker, this preprocessing logic should be extracted into a shared utility function or module that both can import and use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sharing code between the worker and importer is a little bit cumbersome.
I don't really want to spend too much effort refactoring when I want to rewrite it all anyway.
| def compute_raw_affected_checksum(vuln: vulnerability_pb2.Vulnerability): | ||
| """Computes a checksum of the affected array from a vulnerability.""" | ||
| # Grab the minimum information out of afftected and sort. | ||
| #TODO(michaelkedar): This should be in models.py | ||
| aff = [] | ||
| # (ecocystem, package, [ | ||
| # (type, repo, [(introduced, fixed, last_affected, limit)]) | ||
| # ], [versions]) | ||
| for affected in vuln.affected: | ||
| ecosystem = affected.package.ecosystem | ||
| package = affected.package.name | ||
| versions = sorted(affected.versions) | ||
| ranges = [] | ||
| for r in affected.ranges: | ||
| t = r.type | ||
| repo = r.repo | ||
| events = sorted( | ||
| (e.introduced, e.fixed, e.last_affected, e.limit) for e in r.events) | ||
| ranges.append((t, repo, events)) | ||
| ranges.sort() | ||
| aff.append((ecosystem, package, ranges, versions)) | ||
| aff.sort() | ||
| b = json.dumps(aff).encode() | ||
| return osv.sha256_bytes(b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the TODO on line 1247. This function computes a checksum based on the raw Vulnerability data structure. It would be more appropriate to locate this function within osv/models.py alongside the data models it operates on, or in a dedicated utility module for vulnerability data processing. This would improve code organization and make the function's purpose clearer.
| # A checksum of the Affected packages and versions for the importer. | ||
| # TODO(michaelkedar): move this to Vulnerability entity | ||
| affected_checksum: str = ndb.TextProperty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO comment here is spot on. The affected_checksum is calculated from the raw vulnerability data before any worker enrichment. Placing it on the Bug model, which represents the enriched and processed data, is a bit counter-intuitive. It would be more logical to have this on a Vulnerability entity that represents the raw data. While this might be a pragmatic choice for now, it's a good idea to plan for moving this as part of future refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be added when the importer is refactored to use the Vulnerability entities
another-rex
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Added a some logic to make the importer update the database when it receives updates for the data sources, instead of having to wait for the worker to finish enumerating versions / git commits.
To prevent churn, the importer checks if the affected packages have changed in the upstream source first. If they haven't it won't overwrite them (so we can keep our previous worker enrichment). The worker will still run for every record, however.
Bugs put by the importer will use the modified date from the source, while bugs modified by the worker now use the time the worker finished (instead of the source time like before).
I've also modified the REST test to reduce an absolutely massive vulnerability record, and made the
osv.utcnow()return a time in the future instead of in 2021, which was causing weirdness with time going backwards.This whole logic is a bit temporary - I eventually want to rewrite the importer (in go) to stop relying on the Bug entities, and remove much of the OSS-Fuzz-specific behaviour.