Some rsIDs in original VCF not in imputed VCF #51

Open
CholoTook opened this issue Mar 8, 2022 · 7 comments

@CholoTook

Looking only at chromosome 3, some 2k rsIDs (from 23andMe data) are not found among the 2.8G SNPs in the imputed chromosome 3. Why would SNPs that are in the input be dropped from the output?

Many thanks.

@richarddurbin
Owner

We can only impute at sites shared with the reference panel. The 2k rsIDs you refer to must not be in the reference panel you are using; you can easily check that by comparing the lists of sites. Indeed, I believe pbwt tells you how many sites are being used.
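A minimal sketch of the site-list comparison suggested here, assuming both the input and the reference panel sites are available as uncompressed, tab-separated VCF text. The function names `read_sites` and `sites_missing_from_panel` are hypothetical (they are not part of pbwt), and sites are matched on chromosome and position only, ignoring alleles.

```python
def read_sites(vcf_lines):
    """Return {(chrom, pos): rsid} for every data line of a VCF."""
    sites = {}
    for line in vcf_lines:
        # Skip blank lines and VCF header lines ("##..." and "#CHROM ...").
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, rsid = line.rstrip("\n").split("\t")[:3]
        sites[(chrom, int(pos))] = rsid
    return sites


def sites_missing_from_panel(input_lines, panel_lines):
    """Return the input sites whose (chrom, pos) is absent from the panel."""
    input_sites = read_sites(input_lines)
    panel_sites = read_sites(panel_lines)
    return {site: rsid for site, rsid in input_sites.items()
            if site not in panel_sites}
```

Any iterable of lines works as input, so the same functions can be fed open file handles. Chromosome names must use the same convention in both files ("3" versus "chr3") for the comparison to be meaningful.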

@CholoTook
Author

Thanks for this info.

This was the explanation I had guessed. However, the SNPs missing from the imputation result are in my data and don't need to be imputed, so it's strange that they don't find their way into the result, don't you think?

These are the 'anchors' that the other data is derived from, so dropping any of them is a problem (I'm guessing).

Obviously I can merge the files, but that's a bit of a pain.

Interestingly, I'm processing the 23andMe (v3) file along with the imputation results, and I find that 1,806 rsIDs can be added back to the imputation results by matching on chromosome and position, i.e. they are in the imputation panel after all, but they don't have the correct rsID, or the rsID is somehow dropped at one stage or another.

So of the 2,479 'missing' SNPs, 1,806 can be 'found' leaving 673 'anchors' missing from chromosome 3.
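A rough sketch of the position-based matching described above, not the script linked later in this thread. It assumes the 23andMe raw file is tab-separated with the columns rsid, chromosome, position, genotype, and that the imputed output is uncompressed VCF text; `parse_23andme` and `classify_missing` are hypothetical names.

```python
def parse_23andme(lines):
    """Yield (rsid, chrom, pos) from 23andMe raw-data lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        rsid, chrom, pos = line.rstrip("\n").split("\t")[:3]
        yield rsid, chrom, int(pos)


def classify_missing(array_lines, imputed_vcf_lines):
    """Split array rsIDs absent from the imputed VCF into two groups:
    those recoverable by (chrom, pos) and those with no matching site."""
    imputed_ids = set()
    imputed_by_pos = {}
    for line in imputed_vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, rsid = line.rstrip("\n").split("\t")[:3]
        imputed_ids.add(rsid)
        imputed_by_pos.setdefault((chrom, int(pos)), []).append(rsid)

    recovered = {}  # array rsID -> IDs found at the same position
    missing = []    # array rsIDs with no record at that position
    for rsid, chrom, pos in parse_23andme(array_lines):
        if rsid in imputed_ids:
            continue  # present under the same rsID, nothing to do
        key = (chrom, pos)
        if key in imputed_by_pos:
            recovered[rsid] = imputed_by_pos[key]
        else:
            missing.append(rsid)
    return recovered, missing
```

With the counts reported above, `len(recovered)` would correspond to the 1,806 rsIDs found by position and `len(missing)` to the remaining 2,479 - 1,806 = 673.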

On a related note, I see some rsIDs with multiple positions in the results (this is different from the variations with > 1 alt allele we discussed elsewhere). e.g.

3	4942430	rs71634747	G	A	.	PASS	RefPanelAF=0.445673;AN=2;AC=1;INFO=1	GT:ADS:DS:GP	1|0:1,0:1:0,1,0
3	4942432	rs71634747	C	G	.	PASS	RefPanelAF=0.248907;AN=2;AC=0;INFO=1	GT:ADS:DS:GP	0|0:0,0:0:1,0,0

When I check that rsID, I can see that it was merged with two other rsIDs at these locations:
https://www.ncbi.nlm.nih.gov/snp/rs71634747
https://www.ncbi.nlm.nih.gov/snp/rs724135
https://www.ncbi.nlm.nih.gov/snp/rs3762784

That sort of makes sense and sort of doesn't: the two 'new' locations for this rsID (4942430 and 4942432) have been picked up from somewhere, but the new rsIDs have not, i.e. it's the same old rsID at the new locations.

I find about 30 of these cases in chromosome 3.
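One way to list such cases, sketched under the assumption that the imputed results are read as uncompressed, tab-separated VCF lines. The function name `rsids_with_multiple_positions` is hypothetical; the example input is the pair of records quoted above, truncated to the first five columns.

```python
from collections import defaultdict


def rsids_with_multiple_positions(vcf_lines):
    """Return {rsid: sorted [(chrom, pos), ...]} for IDs seen at >1 site."""
    positions = defaultdict(set)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, rsid = line.rstrip("\n").split("\t")[:3]
        if rsid != ".":  # "." means the record has no ID
            positions[rsid].add((chrom, int(pos)))
    return {rsid: sorted(sites)
            for rsid, sites in positions.items() if len(sites) > 1}


example = [
    "3\t4942430\trs71634747\tG\tA\n",
    "3\t4942432\trs71634747\tC\tG\n",
]
duplicated = rsids_with_multiple_positions(example)
```

Subtracting the smallest position from the largest for each entry gives the distance between the sites sharing an rsID (2 bp for the pair above).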

Sorry if this is overly pedantic... I'm honestly not sure how else to work!

Many thanks,
Dan.

@CholoTook
Author

So I just checked, and all but 2 of the 31 rsIDs that appear at more than one position in the imputation results for chromosome 3 have been merged into two separate rsIDs, and all the distances between the positions are less than 10bp. Although this isn't a big problem, the combination of new position and old rsID is still confusing.

I guess this may be a bug / version mix up somewhere?

The two rsIDs with two positions and no apparent cause in dbSNP are:

Here is how they appear in the file:

3	89056114	rs71632245	C	T	.	PASS	RefPanelAF=0.519495;AN=2;AC=0;INFO=1	GT:ADS:DS:GP	0|0:0,0:0:1,0,0
3	89056116	rs71632245	A	G	.	PASS	RefPanelAF=0.543132;AN=2;AC=2;INFO=1	GT:ADS:DS:GP	1|1:1,1:2:0,0,1
--
3	89056114	rs71632245	C	T	.	PASS	RefPanelAF=0.519495;AN=2;AC=0;INFO=1	GT:ADS:DS:GP	0|0:0,0:0:1,0,0
3	89056116	rs71632245	A	G	.	PASS	RefPanelAF=0.543132;AN=2;AC=2;INFO=1	GT:ADS:DS:GP	1|1:1,1:2:0,0,1

I assume this is a version inconsistency with the data somewhere.

@CholoTook
Author

More possibly related 'weirdness':

3	103279	rs555415488	G	A	.	PASS	RefPanelAF=0.000323375;AN=2;AC=0;INFO=1	GT:ADS:DS:GP	0|0:0,0:0:1,0,0
3	103279	.	G	T	.	PASS	RefPanelAF=0.000107792;AN=2;AC=0;INFO=1	GT:ADS:DS:GP	0|0:0,0:0:1,0,0

@CholoTook
Author

And just for completeness...

3	50253604	rs750257636	C	A	.	PASS	RefPanelAF=7.69941e-05;AN=2;AC=0;INFO=1	GT:ADS:DS:GP	0|0:0,0:0:1,0,0
3	50253604	rs587737274	C	G	.	PASS	RefPanelAF=0.000323375;AN=2;AC=0;INFO=1	GT:ADS:DS:GP	0|0:0,0:0:1,0,0
--
3	178156578	rs10212245	T	G	.	PASS	RefPanelAF=0.00203265;AN=2;AC=0;INFO=1	GT:ADS:DS:GP	0|0:0,0:0:1,0,0
3	178156578	rs200836430	T	C	.	PASS	RefPanelAF=0.117062;AN=2;AC=0;INFO=1	GT:ADS:DS:GP	0|0:0,0:0:1,0,0
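The records above share a chromosome and position but differ in ALT allele and sometimes in ID. A small sketch for collecting every such site, assuming uncompressed, tab-separated VCF lines; `sites_with_multiple_records` is a hypothetical name, and the example input is two records quoted earlier in this thread, truncated to the first five columns. Whether these come from a multi-allelic site stored as separate records in the reference panel is an assumption, not something the output confirms.

```python
from collections import defaultdict


def sites_with_multiple_records(vcf_lines):
    """Return {(chrom, pos): [(rsid, ref, alt), ...]} for sites that
    appear on more than one data line."""
    records = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, rsid, ref, alt = line.rstrip("\n").split("\t")[:5]
        records[(chrom, int(pos))].append((rsid, ref, alt))
    return {site: recs for site, recs in records.items() if len(recs) > 1}


example = [
    "3\t103279\trs555415488\tG\tA\n",
    "3\t103279\t.\tG\tT\n",
]
repeated = sites_with_multiple_records(example)
```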

@CholoTook
Author

@richarddurbin Sorry, I know the above details are a pain to 'parse', but hopefully the problems are clear enough. Please let me know if any of them are unclear. Here is the Python code that I used to pull everything out:
https://github.com/Geromics/covcheck/blob/wip/report-v2/Research/debug_imputation_results.py

Please let me know if I should log these problems elsewhere.

I also have some questions about phasing / imputation in general. I wonder if you or one of your colleagues could spare some time to talk me through some details?

Many thanks,
Dan.
