Gillian Belbin, Senior Data Scientist - Sep 30, 2024

Genome-wide insights from the exome through leveraging off-target reads

Whole exome sequencing (WES) is a popular and cost-effective method for capturing high-quality genotype calls in the protein-coding portion of the genome, and has been key in facilitating the discovery of functionally relevant variants across a range of research and clinical settings.¹⁻⁷ However, WES is designed to capture only exonic regions, so the on-target calls obtained with this technology typically cover only around 1% of the genome. This can hamper discovery efforts that rely solely on WES, because the intronic and intergenic regions, which make up the majority of the genome, remain unassayed. Notably, over 90% of signals discovered through genome-wide association studies (GWAS) fall within non-coding regions of the genome,⁸ and many analytical methods in genomics rely on dense, genome-wide sampling of genotypes across samples for statistical inference. Consequently, researchers who employ WES for large-scale discovery efforts often have to supplement exome sequencing with array-based genotyping in order to capture a more complete picture of their study population.

We wanted to explore an alternative approach to obtaining genomic information in non-coding regions of the genome using WES alone, by leveraging information from off-target reads.⁹ Off-target reads are sequencing reads captured from regions outside of the intended exonic targets. Although WES is specifically designed to enrich and sequence the exonic regions of the genome, the hybrid capture method used is not perfectly selective, leading to the incidental capture of sequences from non-exonic regions. Off-target reads make up 20-40% of the total read yield in a standard WES study.¹⁰ Although they are typically not incorporated into downstream analysis in standard WES workflows, they have the potential to be repurposed for assaying genotypes across the whole genome.

We sought to combine information from off-target WES reads with state-of-the-art imputation methods, typically used for low-pass whole genome sequencing (lp-WGS), to see if we could obtain high quality genotypes at non-coding regions, providing both high coverage exome calls and genome-wide genotypes in one assay.

To do this we analyzed N=64 samples from the 1000 Genomes Project (1KG) that are present in the Gencove imputation reference panel and for which high coverage WES data is available. We analyzed the WES samples using the Gencove imputation platform in a leave-one-out manner (that is, ensuring the exclusion of the test sample from the reference panel at the time of imputation). We were then able to benchmark the quality of the genotype calls obtained by comparing imputed calls to the “ground truth” high coverage whole-genome sequencing data that is publicly available for that sample.
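
The leave-one-out design described above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name and the sample identifiers are hypothetical, and it does not represent how the Gencove imputation platform is implemented.

```python
def leave_one_out_panels(panel_samples, test_samples):
    """Yield (test_sample, reduced_panel) pairs.

    For each test sample, the reduced panel contains every reference
    sample except the test sample itself, so the sample being imputed
    never contributes its own haplotypes to the reference.
    """
    panel = list(panel_samples)
    for sample in test_samples:
        reduced_panel = [s for s in panel if s != sample]
        yield sample, reduced_panel


# Hypothetical sample identifiers, for illustration only.
panel = ["sample_A", "sample_B", "sample_C"]
for sample, reduced_panel in leave_one_out_panels(panel, panel):
    # The test sample is never present in the panel used to impute it.
    assert sample not in reduced_panel
```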

To better understand the characteristics of the WES samples being analyzed, we calculated some standard quality control metrics. The samples achieved a median effective coverage of 0.24x (interquartile range (IQR)=0.35x) (Figure 1). The median percentage of off-target reads across samples was 32.4% (IQR=13.0%) (Figure 2), in keeping with values reported in previous studies (although this is known to vary in a protocol-dependent manner).¹¹
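
As a rough sketch of how summary statistics like these might be computed, the Python below derives an off-target percentage from read counts and summarizes a list of per-sample values by median and IQR. The function names and the read counts are hypothetical. The quartile method is also an assumption, since different tools compute quartiles slightly differently. Effective coverage is not computed here because its definition is not given in this post.

```python
from statistics import median, quantiles


def off_target_percent(total_reads, on_target_reads):
    """Percentage of reads that fall outside the capture targets."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * (total_reads - on_target_reads) / total_reads


def median_and_iqr(values):
    """Return (median, IQR) for a list of per-sample values.

    Quartiles use the 'inclusive' method of statistics.quantiles,
    which is an assumption made for this sketch.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


# Hypothetical read counts for a single sample.
print(off_target_percent(total_reads=1000, on_target_reads=676))
```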

To begin to explore the overall quality of imputation, we first calculated the proportion of sites in the imputation panel at which the imputed WES genotype was concordant with the ground truth data. The average concordance at all sites was 99.65% (standard deviation (s.d.)=0.24%; Figure 3). Concordance was lower when we restricted to heterozygous sites (mean 95.94%, s.d.=2.96%; Figure 4), and we observed a similar attenuation when analyzing non-reference concordance (NRC), that is, concordance at non-reference sites only (mean 95.82%, s.d.=4.40%; Figure 5). However, the values obtained for each of these metrics were broadly in line with those previously observed when imputing variants from lp-WGS, where NRC was found to be correlated with effective coverage.¹² We examined the relationship between effective coverage and NRC in the WES samples and observed a similar correlation (Figure 6), suggesting that effective coverage may also be a useful predictor of imputation performance for studies utilizing WES data.
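
To make the three concordance metrics concrete, here is a minimal Python sketch. It assumes genotypes are encoded as alternate-allele counts (0, 1 or 2) at a shared list of sites. It also assumes that heterozygous concordance is restricted to sites that are heterozygous in the truth data, and that NRC is computed over sites where either the truth or the imputed genotype is non-reference. These definitions are assumptions for illustration and may differ in detail from the ones used to produce the figures above.

```python
def concordance_metrics(truth, imputed):
    """Compare two equal-length lists of genotypes coded 0, 1 or 2.

    Returns overall, heterozygous and non-reference concordance as
    fractions between 0 and 1 (NaN if a subset is empty).
    """
    if len(truth) != len(imputed):
        raise ValueError("truth and imputed must cover the same sites")
    pairs = list(zip(truth, imputed))

    def fraction_matching(subset):
        if not subset:
            return float("nan")
        return sum(t == c for t, c in subset) / len(subset)

    return {
        # All sites in the panel.
        "overall": fraction_matching(pairs),
        # Assumed definition: sites heterozygous in the truth data.
        "het": fraction_matching([(t, c) for t, c in pairs if t == 1]),
        # Assumed definition: sites where truth or call is non-reference.
        "nrc": fraction_matching([(t, c) for t, c in pairs if t != 0 or c != 0]),
    }


# Toy genotypes, not real data.
truth = [0, 0, 1, 1, 2, 0]
imputed = [0, 0, 1, 0, 2, 1]
print(concordance_metrics(truth, imputed))
```

In this toy example, overall concordance is higher than NRC because matching homozygous-reference sites count toward the former but are excluded from the latter. This is one reason the two metrics can diverge when most sites are homozygous reference.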

Finally, we examined concordance at sites assayed on the Global Screening Array (GSA), a widely used genotyping array in human genetics studies. Of the n=654,027 sites assayed on the GSA, n=554,447 were also present in our imputation reference panel. The NRC at these GSA sites was 93.8% (s.d.=6.2%) (Figure 7). Although this is slightly lower than our overall concordance rate, it suggests that this approach has the potential to capture high-quality genotype calls at common variants of interest genome-wide.
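
The site counts above imply that about 84.8% of GSA sites are present in the imputation reference panel. The sketch below shows one way such an overlap could be computed. It assumes each site is keyed by a (chromosome, position, reference allele, alternate allele) tuple, which is a choice made for this sketch and not a description of the actual pipeline. The example sites are made up.

```python
def panel_overlap(array_sites, panel_sites):
    """Return the shared sites and the fraction of array sites in the panel.

    Each site is assumed to be a hashable key such as
    (chromosome, position, ref_allele, alt_allele).
    """
    array_set = set(array_sites)
    if not array_set:
        raise ValueError("array_sites must not be empty")
    shared = array_set & set(panel_sites)
    return shared, len(shared) / len(array_set)


# Made-up sites, for illustration only.
array_sites = [("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("2", 500, "G", "A")]
panel_sites = [("1", 1000, "A", "G"), ("2", 500, "G", "A"), ("3", 42, "T", "C")]
shared, fraction = panel_overlap(array_sites, panel_sites)

# The same calculation using the counts reported in the text.
gsa_fraction_in_panel = 554_447 / 654_027
print(f"{100 * gsa_fraction_in_panel:.1f}% of GSA sites are in the panel")
```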

We also explored the extent to which imputation quality might vary by genetic ancestry. Based on the available 1KG super population metadata, of the N=64 samples assayed, N=17 were labeled as from Africa, N=10 from the Americas, N=8 from East Asia, N=18 from Europe and N=11 from South Asia. When we compared concordance rates at all sites stratified by ancestry (Figure 8), we noted that NRC was highest in the samples from Europe (mean NRC=94.7%, s.d.=4.30%), the Americas (96.41%, 4.23%), and Africa (94.19%, 3.61%), with a moderate attenuation observed for the other population groups (East Asia: 93.45%, 3.41%; South Asia: 91.20%, 5.53%), in keeping with trends observed when imputing across ancestry groups with other genotyping technologies.
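
A stratified summary like this one amounts to grouping per-sample NRC values by super population label and reporting the mean and standard deviation within each group. The sketch below illustrates that, assuming the sample standard deviation is used. The function name and the per-sample NRC values are hypothetical. Only the group sizes are taken from the text.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_group(records):
    """Group (label, value) pairs and return {label: (mean, s.d., n)}.

    The sample standard deviation is used (an assumption for this
    sketch) and is reported as NaN for groups with a single value.
    """
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return {
        label: (
            mean(values),
            stdev(values) if len(values) > 1 else float("nan"),
            len(values),
        )
        for label, values in groups.items()
    }


# Group sizes as reported in the text; they should sum to N=64.
group_sizes = {"Africa": 17, "Americas": 10, "East Asia": 8, "Europe": 18, "South Asia": 11}
assert sum(group_sizes.values()) == 64

# Hypothetical per-sample NRC values (percent), not real data.
records = [("Group 1", 90.0), ("Group 1", 94.0), ("Group 2", 96.0)]
summary = summarize_by_group(records)
```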

Summary

By combining off-target read data with advanced imputation methods, we demonstrate that it's possible to achieve high-quality genotype calls across the entire genome from WES alone. This technique not only provides deep coverage of exonic regions but also enables genome-wide genotyping in a single assay, offering a cost-effective alternative for large-scale genetic studies.

References

1. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011;12: 745–755.

2. Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N Engl J Med. 2013;369: 1502–1511.

3. Sullivan JA, Schoch K, Spillmann RC, Shashi V. Exome/Genome Sequencing in Undiagnosed Syndromes. Annu Rev Med. 2023;74: 489–502.

4. Backman JD, Li AH, Marcketta A, Sun D, Mbatchou J, Kessler MD, et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature. 2021;599: 628–634.

5. Lassen FH, Venkatesh SS, Baya N, Hill B, Zhou W, Bloemendal A, et al. Exome-wide evidence of compound heterozygous effects across common phenotypes in the UK Biobank. Cell Genom. 2024;4: 100602.

6. Holstege H, Hulsman M, Charbonnier C, Grenier-Boley B, Quenez O, Grozeva D, et al. Exome sequencing identifies rare damaging variants in ATP8B4 and ABCA1 as risk factors for Alzheimer’s disease. Nat Genet. 2022;54: 1786–1794.

7. Leopold JA. Whole-exome sequencing for the discovery of rare genetic variants that protect from coronary artery disease. Coron Artery Dis. 2016;27: 253–254.

8. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106: 9362–9367.

9. Pasaniuc B, Rohland N, McLaren PJ, Garimella K, Zaitlen N, Li H, et al. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet. 2012;44: 631–635.

10. Samuels DC, Han L, Li J, Sheng Q, Clark TA, Shyr Y, et al. Finding the lost treasures in exome sequencing data. Trends Genet. 2013;29: 593–599.

11. Seaby EG, Pengelly RJ, Ennis S. Exome sequencing explained: a practical guide to its clinical application. Brief Funct Genomics. 2016;15: 374–384.

12. Li JH, Mazur CA, Berisa T, Pickrell JK. Low-pass sequencing increases the power of GWAS and decreases measurement error of polygenic risk scores compared to genotyping arrays. Genome Res. 2021;31: 529–537.
