Duplicate records in the released VCF file #139

Open
lacek opened this issue Sep 29, 2023 · 3 comments


@lacek

lacek commented Sep 29, 2023

Following up on Ensembl/VEP_plugins#638: in spliceai_scores.masked.snv.hg38.vcf.gz, some variants appear more than once with different scores, even though the records share the same gene symbol, e.g.:

...
2	241813895	.	A	T	.	.	SpliceAI=T|NEU4|0.00|0.00|0.11|0.00|31|0|-2|33
2	241813895	.	A	T	.	.	SpliceAI=T|NEU4|0.00|0.00|0.70|0.00|-28|0|-2|33
...
19	39885875	.	G	C	.	.	SpliceAI=C|FCGBP|0.18|0.94|0.00|0.00|25|-3|25|-23
19	39885875	.	G	C	.	.	SpliceAI=C|FCGBP|0.25|0.00|0.00|0.00|25|-3|25|-23
...

Are these cases expected? If so, how should we interpret such records?

@kishorejaganathan
Contributor

This is because we did all the scoring in hg19, and the hg38 scores were provided via liftover. When two different positions in hg19 map to the same position in hg38, you see a duplication. If you just stick to the list of genes here (https://github.com/Illumina/SpliceAI/blob/master/spliceai/annotations/grch38.txt), you will not run into this issue. My recommendation would be to rerun the scores using the tool for such examples to get the correct hg38 score and bypass liftover-related issues.
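For readers who want to try the gene-list restriction described above, here is a minimal Python sketch. It is not part of SpliceAI. It assumes the annotation file is tab-separated with the gene symbol in the first column and `#`-prefixed header lines, and that each VCF record carries a single `SpliceAI=ALLELE|SYMBOL|...` INFO entry as in the records quoted in this thread. The function names and the two-column annotation snippet are hypothetical.

```python
import io


def load_gene_names(handle):
    """Collect gene symbols from the first tab-separated column, skipping '#' lines.

    Assumption: the symbol is the first column of the annotation file.
    """
    names = set()
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(line.split("\t")[0])
    return names


def gene_of(record):
    """Return the gene symbol from a 'SpliceAI=ALLELE|SYMBOL|...' INFO column."""
    info = record.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
    return info.split("=", 1)[1].split("|")[1]


def keep_listed_genes(records, genes):
    """Yield only VCF data lines whose gene symbol is in `genes`."""
    for record in records:
        if not record.strip() or record.startswith("#"):
            continue  # skip blank lines and VCF header lines
        if gene_of(record) in genes:
            yield record


# Hypothetical, truncated stand-in for the annotation file.
annotation = io.StringIO("#NAME\tCHROM\nSNTG2\t2\n")
genes = load_gene_names(annotation)

records = [
    "2\t1223255\t.\tA\tC\t.\t.\tSpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|49",
    "17\t1153657\t.\tA\tC\t.\t.\tSpliceAI=C|ABR|0.00|0.00|0.00|0.00|4|14|4|-25",
]
kept = list(keep_listed_genes(records, genes))
```

For the full released file, the same functions could be fed from `gzip.open(path, "rt")` rather than an in-memory list.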

@lacek
Author

lacek commented Oct 3, 2023

@kishorejaganathan When filtering by the list of grch38 genes, there are still multiple records for the same variant and gene with different scores, e.g.:

2	1223255	.	A	C	.	.	SpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|49
2	1223255	.	A	C	.	.	SpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|-21
2	1223255	.	A	G	.	.	SpliceAI=G|SNTG2|0.00|0.00|0.00|0.00|50|-20|-21|49
2	1223255	.	A	G	.	.	SpliceAI=G|SNTG2|0.00|0.00|0.00|0.00|-3|-20|-21|5
2	1223255	.	A	G	.	.	SpliceAI=G|SNTG2|0.00|0.00|0.00|0.00|-3|-20|-21|18
2	1223255	.	A	T	.	.	SpliceAI=T|SNTG2|0.00|0.00|0.09|0.00|12|-20|-2|49
2	1223255	.	A	T	.	.	SpliceAI=T|SNTG2|0.00|0.00|0.62|0.00|12|50|-2|-21
2	1223255	.	A	T	.	.	SpliceAI=T|SNTG2|0.00|0.00|0.66|0.00|12|-20|-2|-21
17	1153657	.	A	C	.	.	SpliceAI=C|ABR|0.00|0.00|0.00|0.00|4|14|4|-25
17	1153657	.	A	C	.	.	SpliceAI=C|ABR|0.00|0.00|0.01|0.00|-44|3|4|-21
17	1153657	.	A	G	.	.	SpliceAI=G|ABR|0.00|0.00|0.00|0.00|43|-15|4|33
17	1153657	.	A	G	.	.	SpliceAI=G|ABR|0.00|0.00|0.01|0.00|3|-44|4|33
17	1153657	.	A	T	.	.	SpliceAI=T|ABR|0.00|0.00|0.04|0.00|43|-44|4|33
17	1153657	.	A	T	.	.	SpliceAI=T|ABR|0.00|0.00|0.19|0.00|43|-44|4|33

@kishorejaganathan
Contributor

Ah, thanks for bringing this to my attention. I accepted all genes that had the same number of exons and matching exon lengths between the two annotations. These genes meet those criteria but have some liftover issues in introns. You can ignore such genes, or run SpliceAI with hg38 annotations from scratch to avoid liftover issues.
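One way to act on the "ignore such genes" suggestion is sketched below in Python. It is an illustration under stated assumptions, not the maintainers' method: a gene is treated as affected when the same (CHROM, POS, REF, ALT, gene) key appears with more than one distinct set of SpliceAI values, and every record for that gene is then dropped. The function names are hypothetical, and `GENE_X` is a made-up record included only to show a gene that survives.

```python
from collections import defaultdict


def parse_record(line):
    """Split a VCF data line into a (chrom, pos, ref, alt, gene) key and its SpliceAI values."""
    chrom, pos, _id, ref, alt, _qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    fields = info.split("=", 1)[1].split("|")  # ALLELE|SYMBOL|DS_*|DP_*
    return (chrom, pos, ref, alt, fields[1]), tuple(fields[2:])


def genes_with_conflicts(lines):
    """Return genes having at least one variant with more than one distinct value set."""
    seen = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        key, values = parse_record(line)
        seen[key].add(values)  # exact duplicates collapse, so they are not conflicts
    return {key[4] for key, values in seen.items() if len(values) > 1}


def drop_genes(lines, genes):
    """Keep header lines and the data lines whose gene is not in `genes`."""
    kept = []
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("#") or parse_record(line)[0][4] not in genes:
            kept.append(line)
    return kept


example = [
    "2\t1223255\t.\tA\tC\t.\t.\tSpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|49",
    "2\t1223255\t.\tA\tC\t.\t.\tSpliceAI=C|SNTG2|0.00|0.00|0.01|0.00|-3|-20|-2|-21",
    "17\t1153657\t.\tA\tT\t.\t.\tSpliceAI=T|ABR|0.00|0.00|0.04|0.00|43|-44|4|33",
    "17\t1153657\t.\tA\tT\t.\t.\tSpliceAI=T|ABR|0.00|0.00|0.19|0.00|43|-44|4|33",
    # Hypothetical record, not from the released file.
    "1\t100\t.\tA\tG\t.\t.\tSpliceAI=G|GENE_X|0.00|0.00|0.00|0.00|1|2|3|4",
]
affected = genes_with_conflicts(example)
clean = drop_genes(example, affected)
```

Dropping the whole gene is deliberately conservative: a liftover problem in one intron suggests that other positions in the same gene may also be unreliable, even where no duplicate is visible. Rescoring against hg38 annotations remains the more thorough option.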
