Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with find-sites #96

Open
ashotmarg opened this issue Mar 3, 2022 · 15 comments
Open

Error with find-sites #96

ashotmarg opened this issue Mar 3, 2022 · 15 comments

Comments

@ashotmarg
Copy link

Hi Brent,

Thanks for Somalier, it's quite fast and straightforward to use!
At the moment I am also trying to extract sites more tailored to my dataset with "find-sites". While it worked fine for one of my vcf files, it failed for another, which was produced from a different pipeline with slightly different INFO and FORMAT fields.

For the failed one I am getting the error below: I am running somalier from it's singularity image and this is just a chr21 extract of my main vcf.


$ singularity exec ~/bin/somalier.sif somalier find-sites --min-AN=2 --min-AF=0.2 --snp-dist=10 test.vcf
INFO: Converting SIF file to temporary sandbox...
somalier version: 0.2.15
on chrom:chr21
290 candidate variants
tables.nim(262) []
Error: unhandled exception: key not found: chr21 [KeyError]
INFO: Cleaning up image...


Could you elaborate for which fields in the vcf is "find-sites" looking for? My vcf seems to have the relevant ones such as AC, AF, AN...
E.g., INFO and FORMAT fields:
AC=2;AF=0.333;AN=6;DP=16;ExcessHet=1.0474;FS=0;MLEAC=1;MLEAF=0.167;QD=22;SOR=2.303 GT:AD:DP:GQ:PL

Thanks
Ashot

@brentp
Copy link
Owner

brentp commented Mar 3, 2022

Hi, thanks for reporting. Will you give it a try with this binary? It will give us more information about exactly where the error is occuring

somalier_dbg.gz

@brentp
Copy link
Owner

brentp commented Mar 3, 2022

My apologies. I didn't overwrite a previous somalier_dbg I had. Please try this one.
somalier_dbg.gz

@ashotmarg
Copy link
Author

Thanks for the prompt feedback! Here is what I get with the whole vcf file.

$ ./somalier_dbg find-sites batch1.Sentieon.biSNPs.vcf.gz
somalier version: 0.2.15
on chrom:chr1
on chrom:chr2
on chrom:chr3
on chrom:chr4
on chrom:chr5
on chrom:chr6
on chrom:chr7
on chrom:chr8
on chrom:chr9
on chrom:chr10
on chrom:chr11
on chrom:chr12
on chrom:chr13
on chrom:chr14
on chrom:chr15
on chrom:chr16
on chrom:chr17
on chrom:chr18
on chrom:chr19
on chrom:chr20
on chrom:chr21
on chrom:chr22
on chrom:chrX
on chrom:chrY
on chrom:chrM
160251 candidate variants
/home/brentp/src/somalier/src/somalier.nim(266) somalier
/home/brentp/src/somalier/src/somalier.nim(253) main
/home/brentp/src/somalier/src/somalierpkg/findsites.nim(249) findsites_main
/nim/lib/pure/collections/tables.nim(858) []
/nim/lib/pure/collections/tables.nim(262) []
Error: unhandled exception: key not found: chrX [KeyError]


Thanks in advance!

@brentp
Copy link
Owner

brentp commented Mar 3, 2022

I see the problem. It is fixed in this binary
somalier_dbg2.gz

I was assuming that the table would have entries for each chromosome which is obviously not going to happen in every case, including yours.

@ashotmarg
Copy link
Author

So now the error message is gone but not sure it's working properly, most likely these things were also present in my original run, I just didn't check thoroughly. I seem to have SNPs which are not far (10000 bp) apart from each other after giving "--snp-dist=10000", there are many cases like this below. In addition I don't have any SNPs from e.g., chr8, and I know that there are many SNPs matching the parameters --min-AN=1 --min-AF=0.4 --snp-dist=10000
Also, in the sites.vcf.gz the program is still giving output from chrX/Y as well but they are not filtered...

chr7 4620726 . A C 100 . AC=3;AF=0.5
chr7 4620753 . T G 100 . AC=3;AF=0.5

chr7 4625241 . A G 100 . AC=3;AF=0.5
chrX 2781514 . C A 100 . AC=3;AF=0.5
chrX 2781635 . G A 100 . AC=1;AF=0.167
chrX 2781927 . A G 100 . AC=3;AF=0.5
chrX 2781986 . T C 100 . AC=3;AF=0.5
chrX 2782161 . A G 100 . AC=1;AF=0.167
chrX 2782261 . T A 100 . AC=1;AF=0.167

chrX 2782567 . C T 100 . AC=2;AF=0.333
chrX 2782572 . A G 100 . AC=3;AF=0.5

Here is the run command/output:


$ ./somalier_dbg2 find-sites --min-AN=1 --min-AF=0.4 --snp-dist=10000 batch1.Sentieon.biSNPs.vcf.gz
somalier version: 0.2.15
on chrom:chr1
on chrom:chr2
on chrom:chr3
on chrom:chr4
on chrom:chr5
on chrom:chr6
on chrom:chr7
on chrom:chr8
on chrom:chr9
on chrom:chr10
on chrom:chr11
on chrom:chr12
on chrom:chr13
on chrom:chr14
on chrom:chr15
on chrom:chr16
on chrom:chr17
on chrom:chr18
on chrom:chr19
on chrom:chr20
on chrom:chr21
on chrom:chr22
on chrom:chrX
on chrom:chrY
on chrom:chrM
319120 candidate variants
sorted and filtered to 152805 variants. now dropping INFOs and writing
[somalier] wrote 65535 variants to:sites.vcf.gz


As a more general approach, as an alternative, I can also e.g., filter my vcf file for AC values I choose and then use "plink --indep" to remove linked sites, right?

@ashotmarg
Copy link
Author

I will also try to just use a vcf with only autosomes, not sure if the sex chromosomes are interfering with the overall calculations.

@brentp
Copy link
Owner

brentp commented Mar 3, 2022

Hi, sorry for your trouble.
Don't use plink to remove sites, somalier should do that for you. Also, only the autosomes are used in the relatedness calculations, but the X and Y are used to validate sex.
I see that I had some uncommitted changes in addition to the fix. Those previous changes are causing the problem you see. I think I see the easy fix, but want to try to verify so you're not running my debug cycle for me. If you give me a few hours, I think this will be resolved.

@ashotmarg
Copy link
Author

Thanks Brent, no worries and sounds good ;)

@brentp
Copy link
Owner

brentp commented Mar 3, 2022

OK. I think this will work correctly for you.
I would lower the snp-dist and the min-af if you need to get more sites.
somalier_dbg3.gz

brentp added a commit that referenced this issue Mar 3, 2022
see #96

this also relplaces a vector of variants with a vector of integers
making for faster linear-search and less memory.
@ashotmarg
Copy link
Author

Thanks, we are getting close but probably not fully there yet!
Please see the output below, in short I think we still have all the unfiltered sex chromosomes. I.e., the output should only have ca. 26000 sites, but there are ca. 176000 sites in the sites.vcf.gz. In total we have ca. 150000 sex chromosomes and they are unfiltered.


chrX 2782161 . A G 100 . AC=1;AF=0.167
chrX 2782261 . T A 100 . AC=1;AF=0.167



$ ./somalier_dbg3 find-sites --min-AN=1 --min-AF=0.4 --snp-dist=50000 batch1.Sentieon.biSNPs.vcf.gz
somalier version: 0.2.16
on chrom:chr1
on chrom:chr2
on chrom:chr3
on chrom:chr4
on chrom:chr5
on chrom:chr6
on chrom:chr7
on chrom:chr8
on chrom:chr9
on chrom:chr10
on chrom:chr11
on chrom:chr12
on chrom:chr13
on chrom:chr14
on chrom:chr15
on chrom:chr16
on chrom:chr17
on chrom:chr18
on chrom:chr19
on chrom:chr20
on chrom:chr21
on chrom:chr22
on chrom:chrX
on chrom:chrY
on chrom:chrM
319120 candidate variants
sorted and filtered to 27062 variants. now dropping INFOs and writing
[somalier] wrote 27062 variants to:sites.vcf.gz
[ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep -v "^#"|wc
176682 1413456 7003099
[ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep "^chrX"|wc
134008 1072064 5350439
[ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep "^chrY"|wc
15612 124896 606859
[ashmar@i-07-l0003 results.batch1]$ less sites.vcf.gz |grep "^chr1"|wc
12048 96384 469129


@brentp
Copy link
Owner

brentp commented Mar 3, 2022

Hi, it's intentionally getting more values on X to evaluate sex. This is too many, in your case, and I'll make a change for that, but it will not affect your relatedness calculations.

I'll update this message:

sorted and filtered to 27062 variants. now dropping INFOs and writing

to indicate autosomal.

Would these changes address your concerns?

@brentp
Copy link
Owner

brentp commented Mar 3, 2022

The other thing to clarify is that the sex chromosome have different filtering because it's only for quality and linkage doesn't matter.

brentp added a commit that referenced this issue Mar 3, 2022
indicate number of autosomal variants.
limit number of X and y chromosomes as we don't need so many
brentp added a commit that referenced this issue Mar 3, 2022
indicate number of autosomal variants.
limit number of X and y chromosomes as we don't need so many
@brentp
Copy link
Owner

brentp commented Mar 3, 2022

And here's another binary with those changes in case you want to try it. It only updates the message and caps the number of X and y variants it will use.
somalier_dbg4.gz

@ashotmarg
Copy link
Author

Super, thanks a lot Brent for the prompt response, I think it looks great! Yes, the X/Y chromosomes are now also filtered for quality.

@ashotmarg
Copy link
Author

Hi Brent, I've been using the binary of the Somalier (somalier_dbg4.gz) that you shared with me (in this thread above), and it's the Somalier's version: 0.2.16. However, I realised that the latest version in the releases and the docker file is actually the v0.2.15. I know that there are only minor changes in the versions, but I was wondering whether you will also push that latest v0.2.16 to docker and github? Thanks, Ashot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants