feat!: refactor annotator as ABC, add NDJSON annotation + support optional side effects #502

Draft
wants to merge 7 commits into
base: main
from

Conversation

jsstevenson
Contributor

@jsstevenson jsstevenson commented Feb 7, 2025

close #474

Refactor the Annotator class to better support configurable outputs and side effects beyond an annotated VCF. Specifically, the class is now an Abstract Base Class, AbstractVcfAnnotator, and implementations must define three methods:

  • on_vrs_object: perform filtering, transformation, or side effects on a VRS object that has been translated from VCF coordinates, for example attaching additional extensions or mappings, or uploading to a database. Called each time the translator successfully produces a VRS allele.
  • on_vrs_object_collection: act on the full set of VRS alleles collected during VCF annotation, for example by dumping them to a file. Called once after VCF ingestion is complete, but only if the class variable collect_alleles is True.
  • raise_for_output_args: check that at least one kind of output has been requested in annotate(). annotate() previously included a similar check; the idea is to fail fast if the call would not produce any output. I don't feel particularly tied to keeping this, though.
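
To make the control flow above concrete, here is a minimal sketch of how the three hooks might fit together. It is not the code in this PR: the annotate() signature, the use of plain dicts in place of VRS pydantic models, the convention that returning None from on_vrs_object drops the object, and the ExampleAnnotator class are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, ClassVar, Optional


class AbstractVcfAnnotator(ABC):
    """Illustrative stand-in only; the real class reads and translates a VCF."""

    # Off by default, as described above; implementations opt in.
    collect_alleles: ClassVar[bool] = False

    @abstractmethod
    def on_vrs_object(self, vrs_object: dict, **kwargs: Any) -> Optional[dict]:
        """Called once per successfully translated allele."""

    @abstractmethod
    def on_vrs_object_collection(self, vrs_objects: list, **kwargs: Any) -> None:
        """Called once at the end, only if collect_alleles is True."""

    @abstractmethod
    def raise_for_output_args(self, **kwargs: Any) -> None:
        """Raise if the call would not produce any output."""

    def annotate(self, translated: list, **kwargs: Any) -> None:
        # Hypothetical signature: takes already-translated objects directly.
        self.raise_for_output_args(**kwargs)  # fail fast before doing any work
        collection = []
        for vrs_object in translated:
            result = self.on_vrs_object(vrs_object, **kwargs)
            # Assumed convention: None means "filtered out".
            if self.collect_alleles and result is not None:
                collection.append(result)
        if self.collect_alleles:
            self.on_vrs_object_collection(collection, **kwargs)


class ExampleAnnotator(AbstractVcfAnnotator):
    """Hypothetical implementation that counts objects and records their IDs."""

    collect_alleles = True

    def __init__(self) -> None:
        self.seen = 0
        self.dumped: Optional[list] = None

    def on_vrs_object(self, vrs_object: dict, **kwargs: Any) -> Optional[dict]:
        self.seen += 1
        return vrs_object

    def on_vrs_object_collection(self, vrs_objects: list, **kwargs: Any) -> None:
        self.dumped = [obj["id"] for obj in vrs_objects]

    def raise_for_output_args(self, **kwargs: Any) -> None:
        # "output_path" is a made-up keyword argument for this sketch.
        if not kwargs.get("output_path"):
            raise ValueError("No output requested; pass output_path")
```

Note how every hook receives the same kwargs, which is the clunkiness mentioned under "potential issues" below.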

The existing pickle file dump is modified slightly to use VRS IDs as keys (I'm unsure why the previous keys were used; this could be changed back if necessary). The pickle dump is now an optional add-on, and a new option to write an NDJSON dump is added. A basic implementation incorporating all of this is defined in the class VcfAnnotator, and the CLI is updated to use this class.
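
As a rough illustration of the two dump formats, the sketch below writes a collection of alleles as NDJSON (one JSON object per line) and as a pickle keyed by VRS ID. The function names and the use of plain dicts are assumptions; with the actual pydantic objects, each allele would first be serialized with something like allele.model_dump(exclude_none=True).

```python
import json
import pickle
from pathlib import Path


def dump_ndjson(alleles: list, path: str) -> None:
    """Write one JSON object per line (hypothetical helper, not from this PR)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for allele in alleles:
            f.write(json.dumps(allele) + "\n")


def dump_pickle(alleles: list, path: str) -> None:
    """Pickle a dict keyed by VRS ID (hypothetical helper, not from this PR)."""
    keyed = {allele["id"]: allele for allele in alleles}
    with Path(path).open("wb") as f:
        pickle.dump(keyed, f)
```

NDJSON can be read back line by line without loading the whole file, which may matter for large VCFs; the pickle has to be loaded in one piece.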

Some potential issues:

  • In general, the reliance on kwargs to pass arguments to the child class methods is a little clunky, both to use and to document.
  • Since retaining all constructed alleles might be costly in time and memory, an implementation can enable or disable it with the class variable collect_alleles. It's disabled by default, so if you add functionality like dumping to a file but forget to change the class variable, on_vrs_object_collection is never called and nothing happens. Ideally there would be some way to raise an ABC-style error for this, but I'm not sure how.
  • Previously, the VRS allele collection retained while ingesting VCFs (it was named vrs_data before; I'm trying a name like allele_collection) held stringified dict dumps of alleles (i.e. str(allele.model_dump(exclude_none=True))). I'm not sure why this was the case, so I changed it to hold the pydantic objects and defer decisions about serialization etc. to on_vrs_object_collection. I am not sure whether this causes memory issues with extremely large VCFs.
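
On the collect_alleles concern, one possible approach (not implemented in this PR, and only a sketch) is to give the base class no default and check in __init_subclass__ that each implementation sets the flag explicitly. Forgetting it then raises a TypeError when the subclass is defined, rather than silently skipping the collection hook. The class name StrictAnnotatorBase is hypothetical.

```python
from abc import ABC
from typing import Any, ClassVar


class StrictAnnotatorBase(ABC):
    """Hypothetical base class that forces an explicit collect_alleles choice."""

    # Declared but deliberately not assigned: there is no silent default.
    collect_alleles: ClassVar[bool]

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # getattr also finds values inherited from an intermediate subclass.
        if not isinstance(getattr(cls, "collect_alleles", None), bool):
            raise TypeError(
                f"{cls.__name__} must set the class variable "
                "'collect_alleles' to True or False"
            )
```

The trade-off is that every implementation has to state the flag, even ones that never collect alleles, so this swaps a silent no-op for a little boilerplate.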

@jsstevenson jsstevenson added the priority:medium Medium priority label Feb 7, 2025
Labels
priority:medium Medium priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add callback support for VCF annotator
1 participant