Blog

Lex Flagel, Staff Data Scientist - Sep 19, 2023

An updated evaluation of a recent open-access human genome reference panel for low-pass imputation

Over the past decade, human genomic research has benefited tremendously from the availability of open-access human genome diversity panels like the 1000 Genomes Project (1KG) and the Human Genome Diversity Project (HGDP). Both projects released high-quality genome sequences from a diverse set of individuals. Though this data is readily available, it has proven to be a challenge to combine and harmonize variant calls between these panels to make a single high-quality dataset.

In a blog post earlier this year, we used low-pass imputation performance to evaluate a combined HGDP + 1KG panel (HGDP1KG) produced by a team led by Dr. Alicia Martin at the Broad Institute. This combined panel was compared to the expanded 1KG panel released by the New York Genome Center (NYGC1KG). Though the combined HGDP1KG panel was larger, we found that the NYGC1KG panel had higher imputation accuracy, especially at indel sites.

Recently however, the team at the Broad released an update to the HGDP1KG panel as part of the gnomAD_v3.1.2 release that included several tweaks and changes meant to increase the accuracy of the combined panel. In this blog post we once again compare imputation performance of this updated panel to NYGC1KG panel.

Panel Comparisons

The new version of the HGDP1KG comprises 4091 individuals with ~76.4M variants (~67.2M SNPs and ~9.2M indels) among the 22 autosomes, while the NYGC1KG dataset comprises 3202 individuals and contains ~73.6M variants (~64.1M SNPs and ~9.5M indels) among the autosomes and X chromosome. This latest release of the HGDP1KG is considerably smaller than past releases. For example, we found that the past release of the HGDP1KG contained roughly twice as many SNPs as the NYGC1KG. Due to filtering performed on the gnomAD_v3.1.2 release, it now has a comparable number of variants to the NYGC1KG panel.

Results

Two panel comparisons were performed. First we selected 60 individuals that are found in both the NYCG1KG and HGDP1KG panels and imputed them using GLIMPSE2 with either panel in a “leave-one-out” manner, meaning the target individual was removed from the reference panel, and downsampled to 1x coverage for imputation. For efficiency we only performed this analysis on chromosome 20. This same design (leave-one-out on the same 60 individuals for chromosome 20) was used in our recent preprint. Figure 1 contrasts panel performance. The metric we evaluated was the imputation r2, which is the squared Pearson correlation coefficient between the imputed genotype and the truth genotypes provided by the high coverage genotype calls. The aggregate imputation r2 among all 60 individuals is stratified by minor allele frequency (MAF) of the site in the respective reference panel. In the upper panel of Figure 1 the blue lines represent the HGDP1KG panel while the red lines represent the NYGC1KG panel. Solid lines represent the imputation r2 performance when considering all sites in the panel, while dashed lines represent imputation r2 performance at sites shared between both reference panels (~1.4M sites) on chromosome 20. The lower panel of the figure represents the imputation r2 difference between the HGDP1KG panel and the NYCG1KG panel results, again stratified by MAF.

Figure 1: Aggregate imputation r2 performance comparison between HGDP1KG and NYGC1KG panels using leave-one-out analysis on 60 individuals at 1x genomic coverage. Top panel: raw imputation r2, bottom panel: r2 differences between panels.

The results from chromosome 20 were encouraging, so we went on to genome-wide tests of imputation performance, now using 116 individuals from the Simons Genome Diversity Project (SGDP), which are in neither the NYGC1KG or HGDP1KG panels. The SGDP individuals represent a diverse global sample of human genetic variation. These individuals were also downsampled to 1x genomic coverage and imputed with both panels. Finally we compared these imputed results against the “true” calls from high-coverage (~30x) genotype calls of the SGDP individuals again using imputation r2. Figure 2 shows these imputation r2 results aggregated across all 116 individuals. The layout of Figure 2 is identical to Figure 1, except that the MAFs used to stratify the results come from the SGDP. Genome-wide the two panels shared ~63.8M variants.

Figure 2: Aggregate imputation r2 performance comparison between HGDP1KG and NYGC1KG panels among 116 individuals from the SGDP downsampled to 1x genomic coverage. Top panel: raw imputation r2, bottom panel: r2 differences between panels.

Discussion

As is evident in the Figures 1 & 2, the new gnomAD_v3.1.2 release version of the HGDP1KG panel marks a significant improvement over the current gold-standard NYGC1KG panel. We see especially large imputation accuracy improvements at lower minor allele frequencies. Moreover, we see significant imputation performance gains even after filtering both panels to a common set of variants (dashed lines). From these results we conclude that the HGDP1KG panel offers a large set of high-quality genotype calls, and is a comprehensive improvement over the NYCG1KG panel for genotype imputation from low-pass sequences.

The new HGDP1KG panel is now available on the Gencove platform as our standard human reference panel.