-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch from pyvcf to cyvcf2 for VCF parsing #146
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this!
Can see you've stuck to the original logic with how to use or ignore VCF records, which totally makes sense. I'm thinking this is a good time to change some of that logic - see comments.
src/mykrobe/_vcf/models.py
Outdated
valid = False | ||
try: | ||
if sample["GT_CONF"] <= 1: | ||
gt_conf = record.format("GT_CONF")[i][0] | ||
if gt_conf <= 1: # todo: magic number. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not even sure we should be doing this at all. I'm wondering if historically there was a reason for it because some tool was being used to make VCFs where a conf of 1 was significant? I'd vote to not even look for GT_CONF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some value in ignoring GT_CONF==0 positions, as in those cases we have no idea what the genotype is i guess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, see your point. The counter proposal is: document that all you need is GT field, and all variants with non-ref GT get used. Leave it up to the user to make a filtered VCF as input to mykrobe? Plus, is it better to accidentally include an extra background snp as opposed to silently exclude snps because the user didn't put all the fields mykrobe's looking for in their VCF?
I don't mind though, path of least resistance is leave it how it is now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm inclined to agree with Martin on this (remove any gt conf checks). This VCF should realistically be pre-curated. If we are going to do any kind of curating, we should be opening up all of these options to users. And that is somewhere I don't think we want/need to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I agree
Ok, so the logic simplifies down to if the record isn't homozygous alt, then it's invalid. I also realised that numpy wasn't listed in the As for the failing appveyor: this is a windows/python3.9 issue I think. We pin numpy v1.15.0 in appveyor Line 45 in 14621ad
Not sure if this is necesary or if we can relax it? Basically numpy don't have python 3.9 wheels for "older" versions of numpy (yet), which means pip tries to compile it from source, which then requires some BLAS library which doesn't come with windows by default. |
Relax, or not even specify, the version? Expect it will be ok. numpy is only being used for poisson and binomial distributions, and its log10 function. |
Ok, the appveyor file has been reset. @martinghunt you can squash (will remove all those rubbish appveyor commits) this PR when you're ready. |
Closes #142
I also found (and fixed) a bug with the ONT preset. Basically, it was being set after the genotyper and coverage parsers were constructed. As a result, the genotyper was being set with the default error rate and ploidy, which were Illumina.
I noticed this after testing this PR with a fastq from my TB data and noticed a heap of HET calls. After the fix, the results of v0.10.0 and this PR are identical for predict.