Jeremy Li, Director of Data Science - Aug 15, 2023

Investigating the impact of reference panel errors on imputation quality and genotype accuracy

Genotype imputation has become a vital tool for geneticists, enabling cost-effective inference of complete genome sequences from partial data like SNP arrays or low-pass WGS. This technique boosts statistical power in association studies, refines fine-mapping, and resolves compatibility issues across different platforms.

Genotype imputation relies on two key elements: the haplotype reference panel (HRP) and the imputation algorithm. The literature on genotype imputation has historically focused on the algorithm, with a typical comparative study measuring the relative performance of various imputation algorithms while holding the reference panel constant. However, the reference panel itself is at least as important to overall imputation performance as the choice among the many high-performing algorithms. Although it is intuitive that the quality of a reference panel should affect imputation accuracy, it has remained unclear to what extent common errors introduced during panel creation (e.g., genotyping and phase errors) lead to suboptimal imputation performance.

A recent preprint from the Gencove data science team examines how common errors in reference panels affect imputation accuracy, and how that effect varies across a range of sequencing coverages.

To understand this impact, the team perturbed the reference panel in three ways: 1) inducing varying degrees of phase error, 2) inducing varying degrees of genotyping error, and 3) randomly pruning variants from the panel. They then imputed genetic data from diverse individuals at several sequencing coverages (0.5x, 1.0x, and 2.0x) using the perturbed reference panels. Accuracy was evaluated with the r^2 metric for the entire cohort as well as for subsets stratified by ancestry.
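The three kinds of perturbation can be pictured with a toy example. The sketch below is only an illustration, not the preprint's actual procedure: it assumes haplotypes are stored as lists of 0/1 alleles, that errors are applied independently per site at a fixed rate, and that a phase error is modeled as a switch that swaps the two haplotypes from that site onward. All function names are hypothetical.

```python
import random


def induce_switch_errors(hap_a, hap_b, rate, rng):
    """Return copies of one sample's two haplotypes with phase switches added.

    Assumption: at each site a switch happens with probability `rate`, and
    after a switch the two haplotypes stay swapped until the next switch.
    The unphased genotype (allele sum) at every site is left unchanged.
    """
    out_a, out_b = [], []
    swapped = False
    for a, b in zip(hap_a, hap_b):
        if rng.random() < rate:
            swapped = not swapped
        if swapped:
            a, b = b, a
        out_a.append(a)
        out_b.append(b)
    return out_a, out_b


def induce_genotype_errors(hap, rate, rng):
    """Return a copy of a haplotype with each allele flipped with probability `rate`."""
    return [1 - allele if rng.random() < rate else allele for allele in hap]


def prune_sites(n_sites, keep_fraction, rng):
    """Return the sorted indices of a random subset of sites to keep in the panel."""
    n_keep = round(n_sites * keep_fraction)
    return sorted(rng.sample(range(n_sites), n_keep))
```

Note that the switch-error function changes only which haplotype carries each allele, while the genotype-error function changes the alleles themselves; that distinction is what separates the first two perturbations.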

Genotype error has the greatest impact on imputation accuracy

Imputation accuracy was highly sensitive to phase and genotype errors, particularly at lower allele frequencies, with genotype error having the greatest impact. By contrast, the density of the reference panel had a relatively marginal effect on accuracy within the evaluated range. Additionally, the sensitivity of accuracy to perturbations decreased with increasing sequencing coverage, because higher-coverage samples carry additional read evidence.

Figure 1: Imputation accuracy by minor allele frequency for sequence data at 0.5x, 1.0x, and 2.0x coverages
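For readers who want to reproduce this kind of plot on their own data, here is a minimal sketch of accuracy stratified by minor allele frequency. It rests on assumptions the post does not spell out: r^2 is taken to be the squared Pearson correlation between true genotypes (0, 1, or 2) and imputed dosages, pooled over all sites in a frequency bin, and the minor allele frequency is estimated from the truth genotypes of the evaluated cohort. The function names and bin edges are hypothetical.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return float("nan")
    return (sxy * sxy) / (sxx * syy)


def minor_allele_frequency(genotypes):
    """Minor allele frequency of one site from diploid genotypes coded 0, 1, or 2."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def r2_by_maf_bin(truth, imputed, bin_edges):
    """Pool sites into (low, high] frequency bins and compute r^2 per bin.

    truth[site][sample] is a genotype in {0, 1, 2};
    imputed[site][sample] is a dosage in [0, 2].
    Bins that receive no sites are omitted from the result.
    """
    n_bins = len(bin_edges) - 1
    pooled = [([], []) for _ in range(n_bins)]
    for true_site, dosage_site in zip(truth, imputed):
        maf = minor_allele_frequency(true_site)
        for i in range(n_bins):
            if bin_edges[i] < maf <= bin_edges[i + 1]:
                pooled[i][0].extend(true_site)
                pooled[i][1].extend(dosage_site)
                break
    return {
        (bin_edges[i], bin_edges[i + 1]): pearson_r2(*pooled[i])
        for i in range(n_bins)
        if pooled[i][0]
    }
```

Running the same function once per coverage level and per perturbed panel would give one curve per condition.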

Genome-wide imputation accuracy improves substantially with strict genotype filtering

To validate the impact of reducing genotyping error, the team conducted an experiment using a strict genotype quality filter on the reference panel. The filtered panel's imputation accuracy was compared to the unfiltered panel at intersecting sites and all sites. The results demonstrated a substantial increase in genome-wide imputation accuracy (8% at 0.5x coverage and 13% at 2.0x) when using the filtered panel. Moreover, the comparison of filtered and unfiltered sites indicated that low genotype quality sites not only resulted in reduced imputation accuracy at those sites, but also had a trans-effect on performance, decreasing imputation accuracy at the remaining, unfiltered sites.

Figure 2: Imputation r^2 values across minor allele frequencies and imputation r^2 deltas (Δr^2) across minor allele frequencies
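The comparison at intersecting sites versus all sites comes down to simple bookkeeping. The sketch below assumes per-site r^2 values have already been computed for each panel and are keyed by a (chromosome, position) tuple; the function name and this data layout are hypothetical, and the numbers in the usage note are made-up toy values, not results from the study.

```python
def summarize_panels(unfiltered_r2, filtered_r2):
    """Compare per-site r^2 between an unfiltered and a filtered panel.

    Each argument maps a site key, e.g. ("chr1", 100), to that site's r^2.
    "Shared" sites are those present in both panels (the intersecting sites).
    """
    shared = sorted(unfiltered_r2.keys() & filtered_r2.keys())
    if not shared:
        raise ValueError("the two panels have no sites in common")

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    return {
        "unfiltered_all_sites": mean(unfiltered_r2.values()),
        "unfiltered_shared_sites": mean(unfiltered_r2[s] for s in shared),
        "filtered_shared_sites": mean(filtered_r2[s] for s in shared),
        # Positive values mean the filtered panel imputed that site better.
        "delta_r2_shared": {s: filtered_r2[s] - unfiltered_r2[s] for s in shared},
    }
```

With toy inputs such as `{("chr1", 100): 0.80, ("chr1", 200): 0.50, ("chr1", 300): 0.90}` for the unfiltered panel and `{("chr1", 100): 0.85, ("chr1", 300): 0.92}` for the filtered one, the site at position 200 counts only toward the all-sites average, while the two shared sites show whether filtering helped at sites that were never removed.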

This study underscores the critical role of rigorous methodology in haplotype reference panel creation, with particular emphasis on accurate genotyping, especially for rare variant imputation. The findings show that reference panel quality, including phase and genotype accuracy, is pivotal to achieving reliable imputation results. They also offer insight into how reference panel quality and sequencing coverage interact in the context of genotype imputation.

The results underscore the crucial role of reference panel quality in genotype imputation, and the study as a whole represents a comprehensive investigation into the effects of reference panel perturbations on downstream imputation accuracy. As genomics research continues to advance, understanding and addressing reference panel quality will be paramount for obtaining accurate and reliable insights from imputed genetic data.