Stats issue to be reconsidered for the beacon integration #132

Open
wm75 opened this issue Oct 28, 2022 · 0 comments


wm75 commented Oct 28, 2022

Copied this over from #131.
Once the beacon integration goes live and sees some use, the limitations described here should be revisited.


The import method uses the following INFO fields:

  • AN maps to callCount in the beacon DB. Falls back to 2 * num_called (num_called is calculated from the VCF).
  • AF maps to frequency in the beacon DB. Falls back to AC / AN.
  • VT maps to varianttype in the beacon DB. The database field is nullable, so the import still works without it.
  • AC maps to alleleCount in the beacon DB. The import breaks when it is missing (for this dataset), so I added a line stating that AC is required.
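
A minimal sketch of the mapping and fallbacks listed above, not the actual beacon-python code. The function name `resolve_info_fields` and the plain-dict `info` argument are assumptions made for illustration, and AC/AF are treated as single numbers rather than per-allele lists:

```python
def resolve_info_fields(info, num_called):
    """Map VCF INFO values to beacon DB columns (hypothetical helper).

    `info` is assumed to be a plain dict of already-parsed INFO values.
    """
    # AC has no fallback, so fail early with a clear message.
    if "AC" not in info:
        raise ValueError("INFO field AC is required for the import")
    allele_count = info["AC"]

    # AN falls back to 2 * num_called.
    call_count = info.get("AN", 2 * num_called)

    # AF falls back to AC / AN (guarding against a zero call count).
    if "AF" in info:
        frequency = info["AF"]
    else:
        frequency = allele_count / call_count if call_count else None

    return {
        "alleleCount": allele_count,
        "callCount": call_count,
        "frequency": frequency,
        "variantType": info.get("VT"),  # nullable column, None is fine
    }
```

Raising on a missing AC up front would turn the current silent batch failure into an explicit error.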

There is a min_ac option for filtering out variants that were seen fewer than a minimum number of times (1 by default). I currently set this to 0; setting it to anything higher than 0 also breaks the import for any data that does not contain VT (and maybe other fields too).

The import is a bit "Python-esque" 😅

It has an _unpack method that reads the INFO fields into nested lists.
While inserting variants, list entries are accessed purely by index, which leads to "index out of range" exceptions whenever something is not set.
A try/except block wraps the whole "for each variant in variants" loop and catches these exceptions, cancelling the import of 1000 variants at a time.
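
One possible way around the batch-wide cancellation, sketched under assumptions: `import_variants` and the `insert` callback are hypothetical names, not part of beacon-python. The idea is to catch the exception per variant so one malformed record only skips itself:

```python
def import_variants(variants, insert):
    """Insert variants one by one and report the ones that failed.

    `insert` is a hypothetical callable that writes a single variant and
    may raise IndexError/KeyError when an expected value is missing.
    """
    skipped = []
    for position, variant in enumerate(variants):
        try:
            insert(variant)
        except (IndexError, KeyError) as exc:
            # Only this variant is dropped; the rest of the batch continues.
            skipped.append((position, repr(exc)))
    return skipped
```

The returned list makes it possible to log which variants were dropped instead of silently losing a whole batch.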


There is also a whole separate block that reads the SVTYPE and MATEID INFO fields. I just never had any data with variant.is_sv == True.


On another note, the same variant is never added twice due to ON CONFLICT (datasetId, chromosome, start, reference, alternate) DO NOTHING.
In an ideal world we would increment sample and allele counts and recalculate the allele frequency.

But I'd argue that it's not that big of a deal: the datasets uploaded by users are arbitrary, so allele frequency across this data does not have much meaning anyway.
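
For reference, the "ideal world" merge on conflict could look roughly like the sketch below. `merge_counts` is a hypothetical helper, and the `sampleCount` key is an assumed name for the sample count; only alleleCount and callCount are taken from the field mapping above:

```python
def merge_counts(existing, incoming):
    """Combine the counts of a variant that is already stored with a new
    observation of the same variant (hypothetical helper)."""
    allele_count = existing["alleleCount"] + incoming["alleleCount"]
    call_count = existing["callCount"] + incoming["callCount"]
    sample_count = existing["sampleCount"] + incoming["sampleCount"]
    return {
        "alleleCount": allele_count,
        "callCount": call_count,
        "sampleCount": sample_count,  # assumed column name
        # Recalculate the frequency from the summed counts.
        "frequency": allele_count / call_count if call_count else None,
    }
```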


TL;DR

Had to add the AC INFO field as a requirement.

The import routine that comes with beacon-python was written for a specific kind of dataset. It does the job for now, but if the feature sees some use we will write our own importer to handle all kinds of data (as suggested in the docs).
