Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from pyvcf to cyvcf2 for VCF parsing #146

Merged
merged 43 commits into from
Mar 29, 2022

Conversation

mbhall88
Copy link
Member

@mbhall88 mbhall88 commented Feb 8, 2022

Closes #142

I also found (and fixed) a bug with the ONT preset. Basically, it was being set after the genotyper and coverage parsers were constructed. As a result, the genotyper was being set with the default error rate and ploidy, which were Illumina.

I noticed this after testing this PR with a fastq from my TB data and noticed a heap of HET calls. After the fix, the results of v0.10.0 and this PR are identical for predict.

@mbhall88 mbhall88 requested a review from martinghunt February 8, 2022 07:22
Copy link
Member

@martinghunt martinghunt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!

Can see you've stuck to the original logic with how to use or ignore VCF records, which totally makes sense. I'm thinking this is a good time to change some of that logic - see comments.

src/mykrobe/_vcf/models.py Outdated Show resolved Hide resolved
src/mykrobe/_vcf/models.py Outdated Show resolved Hide resolved
valid = False
try:
if sample["GT_CONF"] <= 1:
gt_conf = record.format("GT_CONF")[i][0]
if gt_conf <= 1: # todo: magic number.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not even sure we should be doing this at all. I'm wondering if historically there was a reason for it because some tool was being used to make VCFs where a conf of 1 was significant? I'd vote to not even look for GT_CONF.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some value in ignoring GT_CONF==0 positions, as in those cases we have no idea what the genotype is i guess?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, see your point. The counter proposal is: document that all you need is GT field, and all variants with non-ref GT get used. Leave it up to the user to make a filtered VCF as input to mykrobe? Plus, is it better to accidentally include an extra background snp as opposed to silently exclude snps because the user didn't put all the fields mykrobe's looking for in their VCF?
I don't mind though, path of least resistance is leave it how it is now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to agree with Martin on this (remove any gt conf checks). This VCF should realistically be pre-curated. If we are going to do any kind of curating, we should be opening up all of these options to users. And that is somewhere I don't think we want/need to go.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I agree

@mbhall88
Copy link
Member Author

mbhall88 commented Feb 9, 2022

Ok, so the logic simplifies down to if the record isn't homozygous alt, then it's invalid.

I also realised that numpy wasn't listed in the install_requires. Biopython requires it so we were getting it via that, but better to be explicit.

As for the failing appveyor: this is a windows/python3.9 issue I think. We pin numpy v1.15.0 in appveyor

- '%BASH% -lc "pip3 install tox requests numpy==1.15"'

Not sure if this is necesary or if we can relax it? Basically numpy don't have python 3.9 wheels for "older" versions of numpy (yet), which means pip tries to compile it from source, which then requires some BLAS library which doesn't come with windows by default.
Two options would be to relax the version, or use on of the prebuilt wheels mentioned in the linked stackoverflow answer.

@martinghunt
Copy link
Member

Relax, or not even specify, the version? Expect it will be ok. numpy is only being used for poisson and binomial distributions, and its log10 function.

@mbhall88
Copy link
Member Author

Ok, the appveyor file has been reset. @martinghunt you can squash (will remove all those rubbish appveyor commits) this PR when you're ready.

@martinghunt martinghunt merged commit afa500e into Mykrobe-tools:master Mar 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move away from pyvcf
3 participants