Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--freq-file not using all available SNPs present in both study and freq-file VCF #29

Open
stephenturner opened this issue Oct 22, 2018 · 5 comments

Comments

@stephenturner
Copy link
Contributor

I'm having an issue supplying allele frequencies as answered in #28. Unfortunately I can't provide example data, but I'll try to provide as much as I can without supplying sensitive data.

background

I have a study VCF (bgzip+tabix indexed). I'm using your set of "reliable" SNPs that happen to be on my platform. Some stats on my study VCF:

$ bcftools stats study.vcf.gz | grep ^SN | head -n2
SN      0       number of samples:      65
SN      0       number of records:      17508

Some stats on my frequencies VCF (also bgzip'd and tabix indexed):

$ bcftools stats frequencies.vcf.gz | grep "^SN" | head -n2
SN      0       number of samples:      0
SN      0       number of records:      17160

Just to be sure, I'm using bcftools isec to check the number of SNPs intersecting between these files.

bcftools isec study.vcf.gz frequencies.vcf.gz --prefix isec

According to the readme produced with isec, I'm primarily interested in how many SNPs are in 0002 and 0003 - the SNPs shared between the files.

$ cat isec/README.txt
This file was produced by vcfisec.
The command line was:   bcftools isec  --prefix isec study.vcf.gz frequencies.vcf.gz

Using the following file names:
isec/0000.vcf   for records private to  study.vcf.gz
isec/0001.vcf   for records private to  frequencies.vcf.gz
isec/0002.vcf   for records from study.vcf.gz shared by both    study.vcf.gz frequencies.vcf.gz
isec/0003.vcf   for records from frequencies.vcf.gz shared by both      study.vcf.gz frequencies.vcf.gz

I have 16,137 overlapping.

$ bcftools stats isec/0002.vcf | grep "^SN" | head -n2
SN      0       number of samples:      65
SN      0       number of records:      16137
$ bcftools stats isec/0003.vcf | grep "^SN" | head -n2
SN      0       number of samples:      0
SN      0       number of records:      16137

problem demonstration

However, when I run akt kin supplying this VCF with frequency data, its telling me that it's only using 9 SNPs in total.

$ akt kin -M 1 --freq-file frequencies.vcf.gz study.vcf.gz > /dev/null
Taking allele frequencies from frequencies.vcf.gz using INFO/AF
WARNING: threading is disabled
WARNING: threading is disabled
WARNING: threading is disabled
Using 0 threads
WARNING: threading is disabled
65 samples
Reading genotypes...done.
Kept 9 markers out of 9 in panel.
9/9 of study markers were in the sites file
Calculating kinship values...done.

I'm using akt 0.3.2.

$ akt

Program:        akt (Ancestry and Kinship Tools)
Version:        0.3.2

I also tried converting everything to BCF:

bcftools view study.vcf.gz -Ob -o study.bcf
bcftools index study.bcf
bcftools view frequencies.vcf.gz -Ob -o frequencies.bcf
bcftools index frequencies.bcf

And running akt kin again, but this time, I run into a different error. Not sure what this is about.

$ akt kin -M 1 --freq-file frequencies.bcf study.bcf > /dev/null
Taking allele frequencies from frequencies.bcf using INFO/AF
WARNING: threading is disabled
WARNING: threading is disabled
WARNING: threading is disabled
Using 0 threads
WARNING: threading is disabled
[E::tbx_index_load2] Invalid index header for frequencies.bcf
[E::bcf_sr_regions_init] Could not parse the file frequencies.bcf, using the columns 1,2[,-1]
ERROR: Failed to read the regions: frequencies.bcf
Exiting...

I did ensure that the frequencies VCF and the study VCF are both sorted, but that didn't appear to be the issue. Any idea what could be happening here?

@jaredo
Copy link
Contributor

jaredo commented Oct 23, 2018

hmmm...hard to say what the problem is without an example. I did try this on the test data and it runs okay:

$ ../akt kin -M 1 -F ../data/wgs.grch37.vcf.gz ALL.cgi_multi_sample.20130725.pruned.snps.bcf > kin.txt
Taking allele frequencies from ../data/wgs.grch37.vcf.gz using INFO/AF
WARNING: threading is disabled
WARNING: threading is disabled
Using WARNING: threading is disabled
0 threads
WARNING: threading is disabled
433 samples
Reading genotypes...done.
Kept 16485 markers out of 16485 in panel.
16485/16485 of study markers were in the sites file
Calculating kinship values...done.

That indexing error is worrying. What does bcftools stats study.bcf frequencies.bcf | grep SN return?

@stephenturner
Copy link
Contributor Author

$ bcftools stats study.bcf | grep ^SN
SN      0       number of samples:      65
SN      0       number of records:      17508
SN      0       number of no-ALTs:      0
SN      0       number of SNPs: 17508
SN      0       number of MNPs: 0
SN      0       number of indels:       0
SN      0       number of others:       0
SN      0       number of multiallelic sites:   0
SN      0       number of multiallelic SNP sites:       0
$ bcftools stats frequencies.bcf | grep ^SN
SN      0       number of samples:      0
SN      0       number of records:      16137
SN      0       number of no-ALTs:      0
SN      0       number of SNPs: 16137
SN      0       number of MNPs: 0
SN      0       number of indels:       0
SN      0       number of others:       0
SN      0       number of multiallelic sites:   0
SN      0       number of multiallelic SNP sites:       0

I'm thinking the error had something to do with duplicate entries in the frequencies VCF. When I normalized with bcftools norm -d snps the problem went away.

@jaredo
Copy link
Contributor

jaredo commented Oct 23, 2018

FYI you can run stats on two files at once (bcftools stats file1.vcf.gz file2.vcf.gz) and it will calculate the intersect/complement counts.

Duplicate entries would likely cause problems for akt (although I am not quite sure what the behavior would be). Did you try re-running after removing the duplicates?

@stephenturner
Copy link
Contributor Author

Thanks for the tip on bcftools stats / multiple files.

RE after removing dupes - yes, I did, and that worked.

Also a note. I'm using bcftools +fill-tags to fill the AF/MAF tags in the VCF. Problem is with a monomorphic SNP, there's no AF=0 -- there's just no AF tag at all. akt doesn't appear to run at all when given a vcf to --freq-file where one or more SNPs is missing the AF tag. My quick and dirty hack was to bcftools view withMonomorphs.vcf.gz | grep "^#\|AF=" > polymorphic.vcf.

@jaredo
Copy link
Contributor

jaredo commented Oct 25, 2018

Thanks for troubleshooting this. It seems like the best course of action here is to warn or maybe just exit with error. It is on my to do list!

bpow added a commit to bpow/akt that referenced this issue May 22, 2020
There doesn't seem to be any reason not to allow analysis from a
freq_file while also restricting to certain regions.

This also removes the auto-usage of the frq_file as the regions
in bcf_sr, which works around some unused SNPs as described in Illumina#29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants