Stats issue to be reconsidered for the beacon integration #132

wm75 · 2022-10-28T12:56:59Z

Copied this over from #131.
Once the beacon integration goes live and sees some use, the limitations described here should be revisited.

The import method uses the following INFO fields:

AN maps to callCount in the beacon DB. Falls back to 2 * num_called (num_called is calculated from the VCF).
AF maps to frequency in the beacon DB. Falls back to AC / AN.
VT maps to varianttype in the beacon DB. The database field is nullable, so the import still works without it.
AC maps to alleleCount in the beacon DB. The import breaks when it is missing (for this dataset), so I added a line stating that AC is required.
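
The fallbacks above can be sketched roughly as follows. This is not the beacon-python code: `resolve_stats` is a hypothetical helper, the INFO fields are assumed to arrive as a plain dict with one value per field (single alternate allele), and the 2 * num_called fallback implicitly assumes diploid calls.

```python
def resolve_stats(info, num_called):
    """Map VCF INFO fields to beacon columns, applying the fallbacks listed above.

    Hypothetical helper: `info` is assumed to be a dict holding one value per
    INFO key, and `num_called` the number of called samples from the VCF.
    """
    if "AC" not in info:
        # AC has no fallback, so fail early with a readable message.
        raise ValueError("AC info field is required")
    allele_count = int(info["AC"])

    # AN falls back to 2 * num_called.
    call_count = int(info["AN"]) if "AN" in info else 2 * num_called

    # AF falls back to AC / AN (guarding against a zero call count).
    if "AF" in info:
        frequency = float(info["AF"])
    else:
        frequency = allele_count / call_count if call_count else 0.0

    return {
        "alleleCount": allele_count,
        "callCount": call_count,
        "frequency": frequency,
        # VT is optional; the column is nullable, so None is fine.
        "variantType": info.get("VT"),
    }
```

For example, `resolve_stats({"AC": 3}, num_called=5)` would take both fallbacks and leave the variant type unset.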

There is an option min_ac for filtering out variants that were seen fewer than a minimum number of times (1 by default). I currently set this to 0. Setting it to anything higher than 0 will also break the import for anything that does not contain VT (and maybe other fields too).

The import is a bit "python-esque" 😅

It has an _unpack method that reads the INFO fields into nested lists.
While inserting variants, list entries are simply accessed by index, leading to "index out of bounds" exceptions whenever something is not set.
There is a try/catch block around the whole "for each variant in variants" loop that catches these exceptions, so a single failure cancels the import of 1000 variants at a time.
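
One way to soften this, sketched under assumptions: guard the index access and catch errors per variant rather than per batch. `field_at` and `import_variants` are hypothetical names, and `insert` stands in for whatever actually writes a variant to the database; none of this is taken from beacon-python.

```python
def field_at(values, index, default=None):
    """Return values[index], or `default` when the list is missing or too short."""
    if values is None or index >= len(values):
        return default
    return values[index]


def import_variants(variants, insert):
    """Insert variants one at a time, skipping only the ones that fail.

    `insert` is a hypothetical callable that writes a single variant.
    Returns the list of (variant, exception) pairs that were skipped.
    """
    skipped = []
    for variant in variants:
        try:
            insert(variant)
        except (IndexError, KeyError) as exc:
            # Only this variant is lost, not the whole batch.
            skipped.append((variant, exc))
    return skipped
```

Returning the skipped variants keeps the failures visible instead of silently dropping a whole batch.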

There is also a whole other block that reads the SVTYPE and MATEID info fields. I just never had any data with variant.is_sv == true, so that path is untested.

On another note, the same variant is never added twice due to ON CONFLICT (datasetId, chromosome, start, reference, alternate) DO NOTHING.
In an ideal world we would increment sample and allele counts and recalculate the allele frequency.

But I'd argue that it's not that big of a deal, since the datasets uploaded by users are arbitrary and therefore allele frequency across this data does not have much meaning anyway.
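
For reference, the "ideal world" merge would amount to something like the sketch below. `merge_variant` is a hypothetical helper operating on plain dicts that use the column names mentioned above; it only illustrates the arithmetic and is not how the importer or the database currently behaves.

```python
def merge_variant(existing, incoming):
    """Combine two records of the same variant instead of dropping the second.

    Hypothetical helper: both arguments are dicts with alleleCount and
    callCount keys. Counts are summed and the frequency is recalculated
    from the merged totals.
    """
    allele_count = existing["alleleCount"] + incoming["alleleCount"]
    call_count = existing["callCount"] + incoming["callCount"]
    frequency = allele_count / call_count if call_count else 0.0
    return {
        **existing,
        "alleleCount": allele_count,
        "callCount": call_count,
        "frequency": frequency,
    }
```

Note that the merged frequency comes from the summed counts, not from averaging the two input frequencies.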

TL;DR

Had to add AC info field as a requirement.

The import routine that comes with beacon-python was written for a specific kind of dataset. It does the job for now, but if the feature sees some use we will write our own importer to handle all kinds of data (as suggested in the docs).
