Request for Assistance with Replicating INDEL Imputation Accuracy Trends Using GLIMPSE2 from "Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes" (DOI: 10.1038/s41588-023-01438-3) #249

hardworking555 · 2024-12-23T09:07:35Z

Dear Dr. Rubinacci,

I hope this message finds you well.

Your May 2023 publication, "Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes" (DOI: 10.1038/s41588-023-01438-3), has been highly influential for my research. I am particularly interested in Extended Data Fig. 5(a), which reports the imputation accuracy at INDEL sites.

To replicate the trends in your figure, I implemented the following approach:

Reference Panel Construction:
I compiled a reference population consisting of 1,602 pigs with both SNP and INDEL genotype data. This high-quality reference set served as the basis for subsequent imputation analyses.

Target Dataset Preparation:
From 30x whole-genome sequencing (WGS) data of 10 pigs, I downsampled the sequencing reads to generate datasets with coverage levels of 0.1x, 0.3x, 0.5x, 0.7x, and 1x. My objective was to evaluate how imputation accuracy at INDEL sites changes with increasing coverage.
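For reference, the read-subsampling fractions implied by these target depths can be sketched as below. This assumes each source sample is exactly 30x, which is only the nominal figure; the realised depth per sample would need to be measured, and the names `SOURCE_DEPTH` and `subsample_fraction` are illustrative only. The resulting fraction is the kind of value a read subsampler such as `samtools view -s` expects, although this post does not state which tool was used.

```python
# Sketch: fraction of reads to keep for each target depth.
# Assumption: the source data are exactly 30x (nominal depth from the post).
SOURCE_DEPTH = 30.0
TARGET_DEPTHS = [0.1, 0.3, 0.5, 0.7, 1.0]


def subsample_fraction(target_depth, source_depth=SOURCE_DEPTH):
    """Return the fraction of reads to keep so expected depth is target_depth."""
    if not 0 < target_depth <= source_depth:
        raise ValueError("target depth must be in (0, source depth]")
    return target_depth / source_depth


fractions = {depth: subsample_fraction(depth) for depth in TARGET_DEPTHS}

for depth, frac in fractions.items():
    print(f"{depth}x -> keep {frac:.4f} of reads")
```

At these depths the fractions are small (roughly 0.3% to 3.3% of reads), so sample-to-sample variation in the true source depth translates directly into variation in the realised target depth.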

Imputation Pipeline:
I adopted a stepwise imputation strategy:

Pre-phasing with SHAPEIT5:
I pre-phased the reference panel using SHAPEIT5 with --Ne 150 as recommended in standard guidelines.

Imputation with GLIMPSE2:
I processed the low-coverage target datasets with GLIMPSE2, using the following workflow: chunking the genome, splitting the reference, imputing each target region, and ligating the output. Throughout the pipeline, I applied recommended parameters, including Ne = 150. Additionally, I set the --call-indels parameter. Finally, I merged all imputed chunks into a consolidated dataset.

Performance Evaluation:
I evaluated the imputed genotypes at INDEL sites by comparing them against the true genotypes derived from the original 30x WGS data. The following metrics were used:

Concordance Rate: The proportion of correctly imputed genotypes.
Pearson Correlation Coefficient: Correlation between true and imputed genotypes, encoded as 0, 1, and 2.
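A minimal sketch of how these two metrics can be computed, to make the definitions concrete. It assumes genotypes are already extracted as lists of 0/1/2 values with `None` for missing calls; the function names and the toy data are illustrative and are not taken from my actual scripts.

```python
import math


def _paired(truth, imputed):
    """Keep only sites where both the true and imputed genotype are called."""
    return [(t, i) for t, i in zip(truth, imputed)
            if t is not None and i is not None]


def concordance(truth, imputed):
    """Proportion of sites where the imputed genotype equals the true one."""
    pairs = _paired(truth, imputed)
    if not pairs:
        raise ValueError("no sites with both genotypes called")
    return sum(t == i for t, i in pairs) / len(pairs)


def pearson_r(truth, imputed):
    """Pearson correlation between true and imputed genotypes coded 0/1/2."""
    pairs = _paired(truth, imputed)
    if not pairs:
        raise ValueError("no sites with both genotypes called")
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_i = sum(i for _, i in pairs) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_i = sum((i - mean_i) ** 2 for _, i in pairs)
    if var_t == 0 or var_i == 0:
        return float("nan")  # undefined when either vector is constant
    return cov / math.sqrt(var_t * var_i)


# Toy data only (not real genotypes): mostly homozygous-reference sites.
truth = [0, 0, 0, 0, 1, 2, 0, 1, 0, 0]
imputed = [0, 0, 0, 0, 0, 2, 0, 1, 0, 1]
print(concordance(truth, imputed))  # 8 of 10 sites match -> 0.8
print(pearson_r(truth, imputed))
```

Note that when most true genotypes are homozygous reference, overall concordance is dominated by those sites, so it can stay high and nearly flat even when the non-reference genotypes are imputed poorly.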

However, the results I obtained did not reflect the increasing trend in imputation accuracy shown in your Extended Data Fig. 5(a). For example:

At 0.1x, concordance rate ≈ 0.8600, correlation ≈ 0.5956.
At 0.3x, concordance rate ≈ 0.8596, correlation ≈ 0.5812.
At 0.5x, concordance rate ≈ 0.8666, correlation ≈ 0.5597.
At 0.7x, concordance rate ≈ 0.8665, correlation ≈ 0.6004.
At 1x, concordance rate ≈ 0.8755, correlation ≈ 0.6158.

Concordance rises only marginally (from 0.8600 at 0.1x to 0.8755 at 1x), and the correlation actually falls between 0.1x and 0.5x before recovering at 0.7x and 1x. Neither metric shows the clear upward trend with coverage that your figure illustrates.

I would be very grateful for your insights on the following points:

Parameters and Optimizations:
Were there any specific parameters or optimizations in GLIMPSE2 or the pre-phasing step that were essential for achieving the trends observed in your figure?

Preprocessing and Filtering:
Did you apply any additional preprocessing, filtering, or variant selection criteria prior to imputation, particularly for INDEL sites?

Data Thresholds and Context:
Since the data underlying Extended Data Fig. 5(a) are not fully detailed in the supplementary materials, could you kindly share any further context or thresholds critical for replicating the observed trend?

I have carefully followed standard imputation workflows, but I suspect that there may be subtle factors I have not yet considered.
Thank you very much for your time and consideration. I greatly appreciate your expertise and any advice you may offer. I look forward to your response and learning from your experience.

srubinacci · 2025-02-12T09:37:34Z

Hi,

INDEL sites are challenging to call in your reference panel due to higher genotyping error rates, potential STRs, and other factors. They are also difficult to phase accurately. I would recommend focusing your performance analysis on SNPs rather than INDELs. Additionally, there is significant variation in the size and quality of reference panels, especially across different species, which can further affect results.
