Jeremy Li, Head of Data Science - Mar 16, 2021

New imputation pipeline using newly resequenced data from 3202 individuals

The 1000 Genomes Project was completed in 2015 and culminated in its “phase 3 release” of phased genotypes from 2504 individuals at over 80 million variants. It remains the largest fully open-sourced genomic dataset [0]. The final variant callset was based on a combination of low-coverage whole-genome sequencing (at an average of around 7x), whole-exome sequencing, and results from genotyping array assays of these individuals.

Earlier this year, the New York Genome Center (NYGC) published a preprint describing their recent resequencing of the samples in the 1000 Genomes Project, together with several hundred additional samples from individuals related to the 2504 in the original set [1]. This effort resequenced a total of 3202 individuals at high coverage (34x on average) and performed variant discovery using state-of-the-art methods.

One of the outputs of this study is a statistically phased haplotype reference panel comprising these 3202 individuals at roughly 72 million variants. In the manuscript, the authors demonstrate a considerable boost in imputation accuracy, as measured by imputation r², when using this new dataset as an imputation reference panel rather than the haplotype reference panel accompanying the 1000 Genomes phase 3 release.
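For readers unfamiliar with the metric, imputation r² is commonly taken to be the squared Pearson correlation between known genotypes (coded as 0, 1, or 2 copies of the alternate allele) and imputed dosages. The sketch below is a minimal illustration under that assumption. The function name and the toy values are ours, and this is not the evaluation code used in the manuscript.

```python
import math


def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    true_genotypes: alternate-allele counts (0, 1, or 2).
    imputed_dosages: expected alternate-allele counts in [0, 2].
    Returns NaN when either input has zero variance (r² is undefined).
    """
    if len(true_genotypes) != len(imputed_dosages):
        raise ValueError("inputs must have the same length")
    n = len(true_genotypes)
    if n < 2:
        raise ValueError("need at least two observations")

    mean_t = math.fsum(true_genotypes) / n
    mean_d = math.fsum(imputed_dosages) / n

    cov = math.fsum(
        (t - mean_t) * (d - mean_d)
        for t, d in zip(true_genotypes, imputed_dosages)
    )
    var_t = math.fsum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = math.fsum((d - mean_d) ** 2 for d in imputed_dosages)

    if var_t == 0 or var_d == 0:
        return float("nan")
    return (cov * cov) / (var_t * var_d)


# Toy data (hypothetical values, for illustration only).
example_truth = [0, 1, 2, 1, 0]
example_dosages = [0.1, 0.9, 1.8, 1.2, 0.0]
example_r2 = imputation_r2(example_truth, example_dosages)
print(f"imputation r2 = {example_r2:.3f}")
```

In practice this quantity is usually computed per variant across many samples and then summarized by allele-frequency bin, since rare variants are typically harder to impute than common ones.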

We recently incorporated this new haplotype reference panel into a production pipeline on our Gencove systems and performed some light benchmarking to see how much it improves imputation accuracy over existing imputation reference panels on build 38. In particular, we took an individual of primarily European ancestry with known genotypes from a genotyping array and used those genotypes to evaluate imputation with this panel against two alternatives, both on hg38: a panel comprising the liftover of the 1000 Genomes Phase 3 release, and a panel comprising an earlier release of the present dataset by the NYGC. For this particular individual, we observed an increase in imputation accuracy of 0.2% over the former and 2.7% over the latter.
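A comparison of this kind can be sketched as follows. This is a minimal illustration that assumes accuracy is measured as genotype concordance at sites shared between the array and the imputed calls, which this post does not specify. The panel names, site coordinates, and genotypes are hypothetical toy values, not our benchmark data.

```python
def concordance(truth, imputed):
    """Fraction of shared sites where the imputed genotype matches the truth.

    Both arguments map a (chromosome, position) site to an alternate-allele
    count (0, 1, or 2). Only sites present in both are compared.
    """
    shared_sites = truth.keys() & imputed.keys()
    if not shared_sites:
        raise ValueError("no sites in common")
    matches = sum(1 for site in shared_sites if truth[site] == imputed[site])
    return matches / len(shared_sites)


# Known genotypes from the genotyping array (toy values).
array_genotypes = {
    ("chr1", 1000): 0,
    ("chr1", 2000): 1,
    ("chr1", 3000): 2,
    ("chr1", 4000): 1,
}

# Best-guess imputed genotypes for the same individual, one set per
# reference panel (hypothetical panel names and toy values).
imputed_by_panel = {
    "panel_a": {
        ("chr1", 1000): 0,
        ("chr1", 2000): 1,
        ("chr1", 3000): 1,  # disagrees with the array
        ("chr1", 4000): 1,
    },
    "panel_b": {
        ("chr1", 1000): 0,
        ("chr1", 2000): 1,
        ("chr1", 3000): 2,
        ("chr1", 4000): 1,
        ("chr1", 5000): 0,  # not on the array, so it is ignored
    },
}

concordance_by_panel = {
    name: concordance(array_genotypes, calls)
    for name, calls in imputed_by_panel.items()
}

for name, value in sorted(concordance_by_panel.items()):
    print(f"{name}: {value:.2%}")
```

Restricting the comparison to shared sites keeps the panels on an equal footing, since different reference panels impute different sets of variants.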

Given the above, we’re excited to announce that this pipeline is now publicly available on our systems as our default imputation pipeline for samples on build 38 of the human reference genome. Existing customers can access this pipeline by choosing the project configuration “Human low-pass GRCh38 v1.2” when creating a new Gencove project. As always, please reach out with any suggestions or comments!

Bibliography

[0] 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74.

[1] Byrska-Bishop, M., Evani, U. S., Zhao, X., Basile, A. O., Abel, H. J., Regier, A. A., … & Human Genome Structural Variation Consortium. (2021). High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. bioRxiv.