Lex Flagel, Staff Data Scientist - Oct 04, 2023

Accurate HLA genotyping using low-pass sequence data

The human leukocyte antigen (HLA) locus is a cluster of genes on chromosome 6 of the human genome that play a crucial role in self/non-self recognition in the immune system. Understanding genetic variation at the HLA locus is essential in human genetics, as these genes are frequently associated with GWAS hits for common diseases and are crucial for organ transplantation [1]. However, accurately genotyping the HLA locus has proven challenging because it is complex and highly polymorphic.

A common approach for genotyping the HLA locus is to use imputation. Traditionally, an individual with an unknown HLA genotype is first genotyped using a backbone of biallelic markers in the HLA region, for example, from a genotyping array. These marker genotypes are then compared to a reference panel with known HLA genotypes, and the unknown individual’s HLA genotype is imputed by various means (e.g., machine learning [2] or a hidden Markov model [3]).

To adapt this method for use with low-pass sequence data, one would first need to impute the raw low-pass read data to a backbone set of markers and then use those markers to impute the HLA genotype. This “double imputation” approach has been shown to achieve high accuracy [4]. In this study, we use CookHLA [3] for double imputation. CookHLA is a modern HLA genotyping algorithm that uses a hidden Markov model to achieve high performance.

Recently, a new approach called QUILT [5] has been developed. Using a hidden Markov model, QUILT imputes the HLA region directly from raw low-pass reads, simplifying the double imputation method above.

Here, we compare these two HLA imputation approaches to learn which path leads to greater accuracy. Moreover, we evaluate two versions of the GRCh38 human reference genome. First, the primary release includes >500 HLA “ALT contigs” that represent diverse HLA alleles. Second, there is a special release of GRCh38, which lacks all ALT contigs, including those from the HLA region. Since these two assemblies differ substantially in how they represent the HLA region, we also looked at how the choice of GRCh38 reference genome impacts HLA imputation performance.

Results

Our test samples included 136 individuals from the 1000 Genomes Project phase 3 release with known HLA genotypes [6]. These individuals represented all five major population groups in the 1000 Genomes Project.

Raw sequence data from each individual was randomly downsampled to 1x genomic coverage, and reads were aligned to both reference genomes noted above. We imputed the low-pass data using our Gencove Human low-pass imputation v4.2 pipeline in a “leave-one-out” manner, meaning that the individual being imputed was dropped from the reference panel before imputation. From there, we followed one of two paths. For double imputation, we fed the imputed genotypes into CookHLA [3] to impute HLA genes using a 1000 Genomes Project HLA reference panel constructed by its authors. For direct imputation, we fed the raw sequence alignment into QUILT [5] to impute HLA genes, also using a 1000 Genomes Project derived HLA reference panel constructed by the software’s authors. In both cases, HLA imputation was also performed in a “leave-one-out” manner.
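The random downsampling step can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the function names, the 30x starting coverage in the usage lines, and the idea of sampling read-pair names with a fixed seed are all assumptions made for the example.

```python
import random


def downsample_fraction(current_coverage, target_coverage=1.0):
    """Fraction of read pairs to keep to go from current to target coverage."""
    if current_coverage <= 0:
        raise ValueError("current_coverage must be positive")
    # Never keep more than 100% of the reads.
    return min(1.0, target_coverage / current_coverage)


def downsample_read_pairs(read_pair_names, fraction, seed=0):
    """Keep each read pair independently with probability `fraction`.

    Sampling by pair name (hypothetical input) keeps both mates together.
    A fixed seed makes the subset reproducible.
    """
    rng = random.Random(seed)
    return [name for name in read_pair_names if rng.random() < fraction]


# Usage: a hypothetical 30x sample downsampled to roughly 1x.
fraction = downsample_fraction(current_coverage=30.0, target_coverage=1.0)
pairs = [f"pair{i}" for i in range(1000)]
kept = downsample_read_pairs(pairs, fraction, seed=1)
```

Because each pair is kept independently, the resulting coverage is only approximately 1x; real tools that subsample alignments behave similarly.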

The HLA locus has its own nomenclature for representing genetic variation. The locus is first broken down into its constituent genes (e.g., HLA-A, HLA-B, and HLA-C). Then, allelic variants within each gene are given a four-component numeric code, where each component is increasingly specific (allele family, allele sub-family, etc.). Both CookHLA and QUILT impute the first two components of this code, which is sufficient resolution for many applications. Both programs output only the five “classic” HLA genes: HLA-A, -B, -C, -DQB1, and -DRB1.
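To make the code structure concrete, here is a small sketch that trims a full allele name such as HLA-A*02:01:01:01 down to the first two components, the resolution both programs report. The helper name `to_two_field` and the regular expression are assumptions for illustration and are not part of CookHLA or QUILT.

```python
import re

# Optional "HLA-" prefix, gene name, "*", one to four colon-separated
# numeric fields, and an optional single-letter suffix (e.g. "N").
_ALLELE_RE = re.compile(
    r"^(?:HLA-)?(?P<gene>[A-Z0-9]+)\*(?P<fields>\d+(?::\d+){0,3})[A-Z]?$"
)


def to_two_field(allele):
    """Reduce an HLA allele name to gene plus its first two fields."""
    match = _ALLELE_RE.match(allele.strip())
    if match is None:
        raise ValueError(f"unrecognized HLA allele name: {allele!r}")
    fields = match.group("fields").split(":")
    if len(fields) < 2:
        raise ValueError(f"allele has fewer than two fields: {allele!r}")
    return f"{match.group('gene')}*{fields[0]}:{fields[1]}"


# Usage: both calls give a two-field name without the "HLA-" prefix.
print(to_two_field("HLA-A*02:01:01:01"))  # A*02:01
print(to_two_field("DRB1*15:01"))         # DRB1*15:01
```

Normalizing truth and imputed calls to the same two-field form before comparing them avoids counting a higher-resolution truth allele as a mismatch.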

Below are the average imputation accuracies stratified by major population groups in the 1000 Genomes Project and colored by HLA gene (Figure 1). CookHLA was able to impute all 136 individuals, but QUILT did not produce results for 13 individuals when ALT contigs were included and 36 when they were excluded. For QUILT, the results reflect averages on the subset of samples that did not fail.

Figure 1: The upper two panels are the results for CookHLA, while the lower two are QUILT. The two panels on the left use the GRCh38 reference genome version with HLA ALT contigs, while those on the right use the reference genome without ALT contigs.
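The post does not spell out how accuracy is scored, so the sketch below shows one common convention as an assumption: for each individual and gene, count how many of the two known alleles are matched by the two imputed alleles regardless of order, then average over the samples that produced a result. The function names and the use of `None` to mark a failed sample are hypothetical.

```python
from collections import Counter


def genotype_accuracy(truth, imputed):
    """Fraction of the two true alleles matched by the imputed pair.

    Order does not matter; a homozygous truth needs both copies matched.
    """
    overlap = Counter(truth) & Counter(imputed)  # multiset intersection
    return sum(overlap.values()) / 2


def mean_accuracy(truth_by_sample, imputed_by_sample):
    """Average accuracy over samples that did not fail (imputed is not None)."""
    scores = [
        genotype_accuracy(truth_by_sample[sample], calls)
        for sample, calls in imputed_by_sample.items()
        if calls is not None
    ]
    if not scores:
        raise ValueError("no successfully imputed samples")
    return sum(scores) / len(scores)


# Usage with made-up genotypes for one gene; s3 stands for a failed sample.
truth = {
    "s1": ("A*02:01", "A*01:01"),
    "s2": ("A*03:01", "A*03:01"),
    "s3": ("A*24:02", "A*11:01"),
}
imputed = {
    "s1": ("A*01:01", "A*02:01"),
    "s2": ("A*03:01", "A*24:02"),
    "s3": None,
}
print(mean_accuracy(truth, imputed))  # 0.75
```

Excluding failed samples, as done for QUILT here, means its averages describe only the samples it could process.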

Mean accuracy values across all five HLA genes are below:

Reference genome        CookHLA    QUILT
With ALT contigs        0.484      0.762
Without ALT contigs     0.980      0.741


Discussion

From our results above, it is clear that CookHLA without ALT contigs has the best overall performance for HLA genotype imputation accuracy, with 98% accuracy across the five HLA genes. Additionally, we found that CookHLA is very sensitive to the version of the reference genome being used: mean accuracy falls from 0.980 without ALT contigs to 0.484 with them. We suspect this is because the underlying low-pass imputation struggles when reads map to multiple allelic contigs representing the same HLA region.

On the other hand, QUILT has consistent accuracy on both reference genomes, albeit lower than CookHLA’s highest performance. QUILT was designed to work with HLA ALT contigs, and its authors recommend using that version of the reference genome. Finally, QUILT has a relatively high sample failure rate: about 10% with ALT contigs (13 of 136 samples) and about 26% without them (36 of 136). We investigated the error messages produced (there were three distinct error messages among the failed samples) but were unable to diagnose the core issues. In contrast, CookHLA produced results for all samples.
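The failure rates follow directly from the counts reported in the Results section. A short check, using only those counts (the variable names are ours):

```python
TOTAL_SAMPLES = 136

# QUILT samples with no result, as reported in the Results section.
quilt_failures = {
    "with ALT contigs": 13,
    "without ALT contigs": 36,
}


def failure_rate(n_failed, n_total):
    """Fraction of samples for which no HLA call was produced."""
    return n_failed / n_total


for reference, n_failed in quilt_failures.items():
    rate = failure_rate(n_failed, TOTAL_SAMPLES)
    print(f"{reference}: {n_failed}/{TOTAL_SAMPLES} = {rate:.1%}")
# with ALT contigs: 13/136 = 9.6%
# without ALT contigs: 36/136 = 26.5%
```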

We also observe moderate variation in accuracy across populations. Using the best reference genome for CookHLA (without ALT contigs) and the best for QUILT (with ALT contigs), the gap between the highest- and lowest-concordance populations is about 3% for CookHLA and about 11% for QUILT when performance is averaged across all five HLA genes.

Conclusion

While QUILT is conceptually more attractive as an HLA imputation approach, as it was built to work directly on low-pass read data and to utilize HLA ALT contigs, we conclude that using CookHLA in a “double imputation” setup is the best-performing option. When combined with Gencove’s low-pass imputation engine, the HLA genotype accuracies we achieve with CookHLA are state-of-the-art.

References

  1. Sakaue et al. Tutorial: a statistical genetics guide to identifying HLA alleles driving complex disease. 2023. Nature Protocols. doi:10.1038/s41596-023-00853-4

  2. Zheng et al. HIBAG—HLA genotype imputation with attribute bagging. 2014. The Pharmacogenomics Journal. doi:10.1038/tpj.2013.18

  3. Cook et al. CookHLA: Accurate Imputation of Human Leukocyte Antigens. 2021. Nature Communications. doi:10.1038/s41467-021-21541-5

  4. Wasik et al. Comparing low-pass sequencing and genotyping for trait mapping in pharmacogenetics. 2021. BMC Genomics. doi:10.1186/s12864-021-07508-2

  5. Davies et al. Rapid genotype imputation from sequence with reference panels. 2021. Nature Genetics. doi:10.1038/s41588-021-00877-0

  6. http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HLA_types/