Inconsistent ancestry predictions between Somalier and Peddy #103

Open
mpj5142 opened this issue Aug 17, 2022 · 5 comments
Comments

@mpj5142

mpj5142 commented Aug 17, 2022

Hello,

I am currently using your Somalier software to check for relatedness and to uniformly calculate ancestry PCs across several cohorts for a meta-analysis. Our cohorts are mostly individuals with known European ancestry; however, Somalier's ancestry function calls most samples as the AMR super-group based on the 1000 Genomes dataset.
[Image: AMPAD_affy_preimpute somalier_ancestry]

Other members of my lab have previously used your Peddy software for the same calculations, so I went back and checked the results with that software using the same underlying dataset (the only change was to remove the "chr" prefix from the VCF file). Here, the results look as expected, with most samples labeled as the EUR super-group.
[Image: AMPAD_affy_preimpute pca_check]
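The "chr" removal mentioned above is not shown in the thread. As a minimal sketch of one way it could be done, assuming an uncompressed VCF read as text lines, the hypothetical helper `strip_chr_prefix` below drops the prefix from the contig header lines and from the CHROM column. It does not handle other naming differences (for example chrM versus MT) or contig names embedded in ALT or INFO fields, and it is not necessarily what was run here.

```python
def strip_chr_prefix(vcf_lines):
    """Yield VCF text lines with a leading 'chr' removed from contig names.

    Hypothetical helper for illustration only; assumes plain-text VCF lines.
    """
    for line in vcf_lines:
        if line.startswith("##contig=<ID=chr"):
            # Header declaration: ##contig=<ID=chr1,...> becomes ##contig=<ID=1,...>
            yield line.replace("##contig=<ID=chr", "##contig=<ID=", 1)
        elif line.startswith("#"):
            # Other header lines are passed through unchanged.
            yield line
        elif line.startswith("chr"):
            # Data line: CHROM is the first column, so drop its 3-character prefix.
            yield line[3:]
        else:
            yield line


example = [
    "##fileformat=VCFv4.2\n",
    "##contig=<ID=chr1,length=248956422>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\t.\t.\n",
]
renamed = list(strip_chr_prefix(example))
```

For bgzipped files, `bcftools annotate --rename-chrs` with a two-column mapping file is a commonly used alternative.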

For reference, here is the code I used for each software:
./somalier extract -d AMPAD_affy_preimpute/ --sites sites.hg38.vcf.gz -f Homo_sapiens_assembly38.fasta ROSMAP_affy_preimpute_hg38.vcf.gz

./somalier ancestry --labels ancestry-labels-1kg.tsv --n-pcs=10 -o AMPAD_affy_preimpute 1kg-somalier/*.somalier ++ AMPAD_affy_preimpute/*.somalier

python -m peddy --sites hg38 --plot --prefix AMPAD_affy_preimpute ROSMAP_affy_preimpute_hg38_nochr.vcf.gz ROSMAP_affy_genotypes_hg38_final.fam

I was wondering if you have come across this issue before, or would have any insights into the different results? (I can send over an example VCF if you would like to troubleshoot; it will be from a different cohort than the plots above, as those are restricted data.) Thanks!

@brentp
Owner

brentp commented Aug 17, 2022

Hi, I know there are some issues with the somalier ancestry setup. You can trust the peddy results (as you note) much more.
For somalier, I would use --n-pcs 4 or fewer. It treats each PC equally, even though the first few explain much more variance.
That should work for easy cases (which yours appears to be). The problem is that somalier will confidently predict an ancestry even if the true ancestry is one it has never been trained on.
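To illustrate the general point about treating every PC equally, here is a toy sketch. It is a plain nearest-centroid rule, not somalier's classifier, and every number in it (`CENTROIDS`, `EXPLAINED`, `SAMPLE`) is a made-up assumption rather than a value from 1000 Genomes or from this thread. It only shows how drift on low-variance PCs can flip a call when all PCs count the same.

```python
import math

# Hypothetical PC coordinates on a common scale (not real reference values).
CENTROIDS = {
    "EUR": [2.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "AMR": [1.0, 0.5, 0.5, 0.5, 0.5, 0.5],
}
# Hypothetical fraction of variance explained by each PC.
EXPLAINED = [0.5, 0.3, 0.05, 0.05, 0.05, 0.05]
# Matches EUR on PC1 and PC2 but drifts on the low-variance PCs.
SAMPLE = [2.0, 1.0, 0.6, 0.6, 0.6, 0.6]


def distance(a, b, weights=None):
    """Euclidean distance between two PC vectors, optionally weighting each PC."""
    if weights is None:
        weights = [1.0] * len(a)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def nearest(sample, centroids, weights=None, n_pcs=None):
    """Return the label of the closest centroid, using only the first n_pcs PCs."""
    n = len(sample) if n_pcs is None else n_pcs
    w = None if weights is None else weights[:n]
    return min(
        centroids,
        key=lambda label: distance(sample[:n], centroids[label][:n], w),
    )


print(nearest(SAMPLE, CENTROIDS))                     # AMR (all six PCs count equally)
print(nearest(SAMPLE, CENTROIDS, weights=EXPLAINED))  # EUR (PCs weighted by variance)
print(nearest(SAMPLE, CENTROIDS, n_pcs=2))            # EUR (only the first two PCs)
```

A closed-set rule like this also always returns one of its known labels, however far the sample is from every centroid, which mirrors the caveat about ancestries missing from the training data.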

@mpj5142
Author

mpj5142 commented Aug 17, 2022

Thanks Brent! Unfortunately, reducing the number of PCs to 4 still resulted in most samples being called as AMR.

I will note that I did not input any of the known ancestries for our samples when running Somalier. I can try this to see if the samples with missing ancestry will be imputed better, although Peddy is already giving results more in line with our expectations, so I may just stick with those results.

Thanks again for your help!

[Image: AMPAD_affy_preimpute somalier_ancestry]

@brentp
Owner

brentp commented Aug 17, 2022

Is your data from sequencing? Exome? WGS? Or from a chip?

@mpj5142
Author

mpj5142 commented Aug 17, 2022

This is array data. I applied some basic QC filters (e.g., genotype call rate) before running Somalier. However, I obtained similar results when running the software on SNP array data imputed from TOPMed as well as on WES- and WGS-based datasets.
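The exact QC filters are not given in the thread. As a rough sketch of what a per-sample genotype call-rate filter looks like, assuming missing calls are represented as `None` and using a hypothetical 0.95 threshold (not a value stated above):

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls; None marks a missing call."""
    if not genotypes:
        return 0.0
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def filter_by_call_rate(samples, min_rate=0.95):
    """Keep samples whose call rate is at least min_rate (threshold is an assumption)."""
    return {
        name: gts for name, gts in samples.items() if call_rate(gts) >= min_rate
    }


# Hypothetical genotypes coded as alternate-allele counts (0, 1, 2).
samples = {
    "sample_a": [0, 1, 2, 0, 1, 0, 0, 2, 1, 0],
    "sample_b": [0, None, 2, None, 1, 0, None, 2, 1, 0],
}
kept = filter_by_call_rate(samples)
```

Here `sample_b` has 3 of 10 calls missing, so it falls below the assumed threshold and is dropped.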

@cgroza

cgroza commented Jul 2, 2024

Hi,

I am experiencing the same issues.

I am using the 1KGP dataset as the reference, and somalier labels most of my samples as AMR.
I have been using 5 PCs, but even when plotting just PC1 and PC2, the samples appear to be clustered in the wrong location (they should be clustering with AFR).
