Inconsistent ancestry predictions between Somalier and Peddy #103

mpj5142 · 2022-08-17T16:18:26Z

Hello,

I am currently using your Somalier software to check for relatedness and uniformly calculate ancestry PCs across several cohorts for a meta-analysis. Our cohorts are mostly individuals of known European ancestry; however, Somalier's ancestry function assigns most samples to the AMR super-group based on the 1000 Genomes dataset.

Other members of my lab have previously used your Peddy software for the same calculations, so I went back and checked the results with that software using the same underlying dataset (the only change was to remove the "chr" prefix from the contig names in the VCF file). Here, the results look as expected, with most samples labeled as the EUR super-group.
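For readers reproducing the "chr" removal step, here is a minimal Python sketch of one way to do it line by line. It is an illustration only: the thread does not say how the prefix was actually removed, and the function name `strip_chr_prefix` is hypothetical.

```python
import re

# Matches the ID field of a VCF header contig line, e.g. ##contig=<ID=chr1,length=...>
_CONTIG_ID = re.compile(r"^(##contig=<ID=)chr")


def strip_chr_prefix(line: str) -> str:
    """Return one VCF line with a leading "chr" removed from the contig name.

    Hypothetical helper for illustration; not part of somalier or peddy.
    """
    if line.startswith("##"):
        # Meta-information lines: only contig declarations carry a contig name.
        return _CONTIG_ID.sub(r"\1", line)
    if line.startswith("#"):
        # The column header line (#CHROM, POS, ...) is left untouched.
        return line
    # Data lines: CHROM is the first tab-separated field.
    chrom, sep, rest = line.partition("\t")
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return chrom + sep + rest


if __name__ == "__main__":
    print(strip_chr_prefix("chr1\t12345\t.\tA\tG\t.\tPASS\t."))
```

Note that this simple rule turns chrM into M rather than MT, so mitochondrial records may need separate handling. In practice, `bcftools annotate --rename-chrs` with a mapping file is a common alternative for compressed and indexed VCFs.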

For reference, here are the commands I used for each tool:
./somalier extract -d AMPAD_affy_preimpute/ --sites sites.hg38.vcf.gz -f Homo_sapiens_assembly38.fasta ROSMAP_affy_preimpute_hg38.vcf.gz

./somalier ancestry --labels ancestry-labels-1kg.tsv --n-pcs=10 -o AMPAD_affy_preimpute 1kg-somalier/*.somalier ++ AMPAD_affy_preimpute/*.somalier

python -m peddy --sites hg38 --plot --prefix AMPAD_affy_preimpute ROSMAP_affy_preimpute_hg38_nochr.vcf.gz ROSMAP_affy_genotypes_hg38_final.fam

I was wondering if you have come across this issue before, or have any insight into why the results differ? (I can send over an example VCF if you would like to troubleshoot; it will be from a different cohort than the plots above, as those are restricted data.) Thanks!

brentp · 2022-08-17T16:23:47Z

Hi, I know there are some issues with the somalier ancestry setup. You can trust the peddy results (as you note) much more.
For somalier, I would use --n-pcs 4 or fewer. It treats each PC equally, even though the first few explain much more variance.
That should work for easy cases (which yours appears to be). The problem is that somalier will confidently predict an ancestry even if the true ancestry is one it has never been trained on.
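To make the point about equal PC weighting concrete, here is a toy Python sketch. It is not somalier's classifier: it uses a plain nearest-centroid rule, and the centroids, sample coordinates, and explained-variance weights are all made-up numbers chosen for illustration.

```python
import math


def nearest_label(sample, centroids, weights=None):
    """Return the label whose centroid is closest to `sample`.

    `weights` holds one weight per PC; None means every PC counts equally.
    Toy nearest-centroid rule for illustration, not somalier's method.
    """
    if weights is None:
        weights = [1.0] * len(sample)

    def distance(centroid):
        return math.sqrt(
            sum(w * (s - c) ** 2 for w, s, c in zip(weights, sample, centroid))
        )

    return min(centroids, key=lambda label: distance(centroids[label]))


# Hypothetical centroids in a 4-PC space (invented values).
centroids = {
    "EUR": [1.0, 0.0, 0.0, 0.0],
    "AMR": [0.0, 0.0, 1.0, 1.0],
}
# A sample that sits with EUR on PC1 and PC2 but drifts on PC3 and PC4.
sample = [0.9, 0.1, 0.8, 0.9]
# Hypothetical fraction of variance explained by each PC.
explained_variance = [0.6, 0.25, 0.1, 0.05]

if __name__ == "__main__":
    print("equal weights:   ", nearest_label(sample, centroids))
    print("variance weights:", nearest_label(sample, centroids, explained_variance))
```

With these invented numbers, the equal-weight rule picks AMR because the low-variance PCs dominate the distance, while weighting by explained variance picks EUR. Both rules always return one of the known labels, which also illustrates why a sample from an unrepresented ancestry still gets a confident-looking call.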

mpj5142 · 2022-08-17T16:53:35Z

Thanks Brent! Unfortunately, reducing the number of PCs to 4 still resulted in most samples being called as AMR.

I will note that I did not provide any of the known ancestries for our samples when running Somalier--I can try this to see whether the samples with missing ancestry are predicted better. However, Peddy is already giving results more in line with our expectations, so I may just stick with those.

Thanks again for your help!

brentp · 2022-08-17T17:13:53Z

Is your data from sequencing? Exome? WGS? Or from a chip?

mpj5142 · 2022-08-17T17:28:18Z

This is array data--I applied some basic QC filters (e.g. genotype call rate) before running Somalier. However, I obtained similar results when running the software on SNP array data imputed with TOPMed, as well as on WES- and WGS-based datasets.

cgroza · 2024-07-02T12:07:24Z

Hi,

I am experiencing the same issues.

I am using the 1KGP dataset as the reference, and somalier labels most of my samples as AMR.
I have been using 5 PCs, but even when plotting just PC1 and PC2, the samples appear to be clustered in the wrong location (they should be clustering with AFR).
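For anyone comparing runs, a quick way to quantify "most of my samples" is to tally the predicted labels in the ancestry output table. The Python sketch below assumes a tab-separated file with a column named `predicted_ancestry`; that column name and the example rows are assumptions, so check them against the header of your own output file.

```python
import csv
import io
from collections import Counter


def tally_predictions(tsv_text, column="predicted_ancestry"):
    """Count how many samples received each predicted label.

    `column` is an assumed column name; adjust it to match the real header.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)


# Made-up rows for illustration only.
example = (
    "#sample_id\tpredicted_ancestry\n"
    "S1\tAMR\n"
    "S2\tAMR\n"
    "S3\tEUR\n"
)

if __name__ == "__main__":
    for label, count in tally_predictions(example).most_common():
        print(label, count)
```

Running the same tally on the somalier and peddy outputs (with the appropriate column name for each) gives a direct side-by-side of how the two tools split the cohort.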
