Genotype imputation has become a vital tool for geneticists, enabling cost-effective inference of complete genome sequences from partial data like SNP arrays or low-pass WGS. This technique boosts statistical power in association studies, refines fine-mapping, and resolves compatibility issues across different platforms.
Genotype imputation relies on two key elements: the haplotype reference panel (HRP) and the imputation algorithm. The literature on genotype imputation has historically focused mainly on the algorithm, with a typical comparative study measuring the relative performance of various imputation algorithms while holding the reference panel constant. However, the reference panel itself is as important to overall imputation performance as the choice among the many high-performing algorithms, if not more so. Even though it is intuitive that the quality of a reference panel should affect imputation accuracy, it has remained unclear to what extent common errors introduced during panel creation (e.g., genotyping and phase errors) lead to suboptimal imputation performance.
A recent preprint from the Gencove data science team dives into how common errors in reference panels impact imputation accuracy and the interplay between the two across a spectrum of sequencing coverages.
To understand this impact, the team perturbed the reference panel in three ways: 1) inducing varying degrees of phase error, 2) inducing varying degrees of genotyping error, and 3) randomly pruning variants from the panel. They then imputed genetic data from diverse individuals at varying sequencing coverages (0.5x, 1.0x, and 2.0x) using the perturbed reference panels. Accuracy was evaluated using the r^2 metric for the entire cohort as well as for subsets stratified by ancestry.
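To make the three perturbations and the r^2 metric concrete, here is a minimal Python sketch on a toy panel. It is an illustration only, not the preprint's actual procedure: the error models (independent allele flips, a per-site switch probability, uniform random pruning) and all function names are assumptions chosen for simplicity, and r^2 is taken here to be the squared Pearson correlation between true genotypes and imputed dosages.

```python
import random


def flip_genotypes(haps, rate, rng):
    """Genotype error (assumed model): flip each 0/1 allele independently
    with probability `rate`."""
    return [[1 - a if rng.random() < rate else a for a in hap] for hap in haps]


def switch_phase(hap_a, hap_b, rate, rng):
    """Phase (switch) error (assumed model): between adjacent sites, swap the
    two haplotypes of one individual with probability `rate`. The unphased
    genotype at every site is left unchanged."""
    out_a, out_b, swapped = [], [], False
    for i, (a, b) in enumerate(zip(hap_a, hap_b)):
        if i > 0 and rng.random() < rate:
            swapped = not swapped
        out_a.append(b if swapped else a)
        out_b.append(a if swapped else b)
    return out_a, out_b


def prune_sites(haps, keep_fraction, rng):
    """Density reduction: keep a uniformly random subset of variant sites.
    Returns the pruned haplotypes and the indices of the kept sites."""
    n_sites = len(haps[0])
    kept = sorted(rng.sample(range(n_sites), round(n_sites * keep_fraction)))
    return [[hap[i] for i in kept] for hap in haps], kept


def r_squared(truth, imputed):
    """Squared Pearson correlation between true genotypes (0/1/2) and imputed
    dosages. Returns NaN if either vector has zero variance."""
    n = len(truth)
    mean_t, mean_i = sum(truth) / n, sum(imputed) / n
    cov = sum((t - mean_t) * (x - mean_i) for t, x in zip(truth, imputed))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_i = sum((x - mean_i) ** 2 for x in imputed)
    if var_t == 0 or var_i == 0:
        return float("nan")
    return cov * cov / (var_t * var_i)


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy panel: 4 haplotypes (2 diploid individuals) x 10 biallelic sites.
    panel = [[rng.randint(0, 1) for _ in range(10)] for _ in range(4)]
    noisy = flip_genotypes(panel, 0.05, rng)            # 5% genotype error
    a, b = switch_phase(panel[0], panel[1], 0.1, rng)   # 10% switch rate
    sparse, kept = prune_sites(panel, 0.5, rng)         # keep half the sites
    print(len(sparse[0]), "of", len(panel[0]), "sites kept")
```

In a real experiment each perturbed panel would be passed to an imputation tool, and `r_squared` would be computed per variant (typically binned by allele frequency) against high-coverage truth genotypes.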
Genotype error has the greatest impact on imputation accuracy
Imputation accuracy was highly sensitive to phase and genotype errors, particularly at lower allele frequencies, with genotype error having the greatest impact. In contrast, the density of the reference panel, within the evaluated range, had only a marginal effect on accuracy. Additionally, the sensitivity of accuracy to perturbations decreased with increasing sequencing coverage, owing to the additional read evidence present in higher-coverage samples.