feat!: refactor annotator as ABC, add NDJSON annotation + support optional side effects #502
close #474
Refactor the Annotator class to better support configurable outputs and side effects beyond an annotated VCF. Specifically, the class is now an Abstract Base Class, `AbstractVcfAnnotator`, and implementations must define three methods:

- `on_vrs_object`: perform filtering/transformation/side effects on VRS objects that have been translated from VCF coords. For example, attach additional extensions or mappings, or upload to a DB. Called every time the translator produces a successful VRS allele.
- `on_vrs_object_collection`: do something with the aggregation of all VRS alleles collected during VCF annotation, such as dump them to a file. Called once after VCF ingestion is complete, but only if the class variable `collect_alleles` is `True`.
- `raise_for_output_args`: double-check that some kind of output has been declared in `annotate()`. This is here because there was a similar check in `annotate()` previously. The idea is to force a fast failure if you aren't going to be producing any kind of output. I don't feel particularly tied to keeping this, though.

The existing pickle file dump is modified slightly to use VRS IDs as keys (unsure why the other thing was being used previously, could change back if necessary). It's refactored to be an optional add-on, and an additional option to output an NDJSON dump is added. A basic implementation incorporating all of this is defined in the class `VcfAnnotator`. The CLI is updated to use this class.

Some potential issues:

- Uses `kwargs` to pass arguments to the child class methods. It's a little clunky, obviously, both in use and in documentation.
- The allele collection (called `vrs_data` before; I'm trying a name like `allele_collection`?) was retaining stringified dict dumps of alleles (i.e. `str(allele.model_dump(exclude_none=True))`). I'm not totally sure why this was the case, so I changed it to just hold onto the pydantic objects and defer decisions about serialization etc. to `on_vrs_object_collection`. I am not sure if this causes memory issues with extremely large VCFs.
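
For reviewers who want the overall shape at a glance, here is a rough, standalone sketch of the three-hook pattern described above. It is not the code in this PR: `SketchAnnotator`, `NdjsonSketchAnnotator`, the `ndjson_out` keyword, the method signatures, and the convention that returning `None` drops an object are all assumptions made for illustration. Plain dicts stand in for the pydantic VRS alleles, and `annotate()` takes already-translated objects instead of a VCF path.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Iterable, Optional


class SketchAnnotator(ABC):
    """Hypothetical stand-in for AbstractVcfAnnotator (signatures are assumed)."""

    # When True, objects returned by on_vrs_object are retained and passed to
    # on_vrs_object_collection once ingestion is finished.
    collect_alleles: bool = False

    @abstractmethod
    def on_vrs_object(self, vrs_object: dict, **kwargs: Any) -> Optional[dict]:
        """Filter, transform, or act on one object. Returning None drops it (assumed)."""

    @abstractmethod
    def on_vrs_object_collection(self, collection: list, **kwargs: Any) -> None:
        """Handle everything collected during the run, e.g. dump it to a file."""

    @abstractmethod
    def raise_for_output_args(self, **kwargs: Any) -> None:
        """Raise if the caller has not declared any kind of output."""

    def annotate(self, vrs_objects: Iterable[dict], **kwargs: Any) -> None:
        # Fail fast before doing any work if nothing would be produced.
        self.raise_for_output_args(**kwargs)
        collected = []
        for obj in vrs_objects:
            result = self.on_vrs_object(obj, **kwargs)
            if result is not None and self.collect_alleles:
                collected.append(result)
        # The collection hook only fires when collection is switched on.
        if self.collect_alleles:
            self.on_vrs_object_collection(collected, **kwargs)


class NdjsonSketchAnnotator(SketchAnnotator):
    """Keeps objects unserialized in memory, then writes one JSON document per line."""

    collect_alleles = True

    def on_vrs_object(self, vrs_object: dict, **kwargs: Any) -> Optional[dict]:
        return vrs_object  # no filtering in this sketch

    def on_vrs_object_collection(self, collection: list, **kwargs: Any) -> None:
        # Serialization is decided here, not at collection time.
        with Path(kwargs["ndjson_out"]).open("w") as f:
            for obj in collection:
                f.write(json.dumps(obj) + "\n")

    def raise_for_output_args(self, **kwargs: Any) -> None:
        if not kwargs.get("ndjson_out"):
            raise ValueError("no output declared: pass ndjson_out")
```

Calling `NdjsonSketchAnnotator().annotate(objs, ndjson_out="alleles.ndjson")` would write one line per object, while omitting `ndjson_out` raises before any object is processed. The sketch also makes the `kwargs` concern visible: the subclass has to pull its own arguments out of an untyped bag, and nothing in the base signature documents them.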