Gencove Team - Mar 12, 2024

Short Reads, Deep Insights: Imputing Structural Variants From Short-Read Sequencing Data

Detecting structural variants (SVs) in the human genome remains a substantial challenge for most sequencing projects. For example, short-read sequencing methods detect only 10%-70% of SVs, with false positive rates of up to 89% [1]. Yet accurately detecting SVs is critical, not only for building a more comprehensive view of human genetic variation, but also for better understanding the genetics behind complex conditions such as schizophrenia, Alzheimer’s disease, and cancer [2-4].

Structural variants, defined as insertions, deletions, duplications, or inversions that span more than 50 bp, represent a prevalent and underexplored source of human genetic variation. Estimates suggest that each person has between 23,000 and 31,000 structural variants [1]. To both identify these complex variants and predict their functional consequences, researchers need the ability to perform large-scale genome-wide association studies (GWAS). However, the limitations of current sequencing technology make this difficult. Most DNA sequencing projects use short-read NGS platforms, which struggle to accurately resolve long, complex structural variants. While long-read sequencing platforms are available and well suited to structural variant detection, their significant cost has prevented widespread use.

As a result, the vast majority of genomics research is carried out using short-read sequencing. Large repositories of short-read sequencing data have become available to researchers around the world, enabling statistically well-powered studies that continue to uncover subtle gene-phenotype associations. Yet the presence and effect of structural variants in these data sets remain obscured, greatly limiting our understanding of this common variant type.

A tool that allows researchers to infer the presence of structural variants from short-read sequencing data would open the floodgates, enabling researchers to scour existing data sets for new information. To this end, a team of researchers from Gencove and Boehringer Ingelheim came together to create a multi-ancestry structural variation imputation panel based on Oxford Nanopore long-read sequencing data [5]. This panel enables the imputation of structural variation from short-read sequencing data. Such a resource could greatly improve our understanding of the human genome and the diseases that arise from it.

The Value of Imputation

Imputation is a general approach used in statistics that allows you to infer the identity of an unobserved data point based on previously observed patterns. When applied to genetics, genotype imputation allows researchers to account for genetic variants that aren’t directly captured in the sequencing process. Among other benefits, this opens the door to greater efficiency in genetic studies. Rather than deep sequencing an entire genome, for example, researchers can greatly reduce the cost of sequencing by performing low-pass sequencing. Though a smaller fraction of the genome can be resolved with low-pass sequencing, this data can be used to impute unresolved portions, thereby enabling researchers to study genome-wide variation at a fraction of the cost.
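
To make the idea concrete, here is a minimal sketch of imputation on toy data. It fills in the unobserved sites of a sample by copying alleles from the reference haplotype that best matches the sites that were observed. The function name, the panel, and the sample are all hypothetical, and the nearest-match rule is a deliberate simplification: production imputation tools rely on probabilistic haplotype models rather than this heuristic.

```python
# Toy genotype imputation: copy unobserved alleles from the closest reference haplotype.
# All names and data below are hypothetical illustrations, not a real panel.

def impute(sample, reference_panel):
    """Fill None entries in `sample` using the best-matching reference haplotype.

    sample: list of 0/1 alleles, with None where the site was not observed.
    reference_panel: list of fully observed haplotypes (lists of 0/1 alleles).
    """
    observed = [i for i, allele in enumerate(sample) if allele is not None]

    def mismatches(haplotype):
        # Count disagreements only at the sites the sample actually observed.
        return sum(haplotype[i] != sample[i] for i in observed)

    best = min(reference_panel, key=mismatches)
    # Keep observed alleles; borrow the rest from the best-matching haplotype.
    return [best[i] if allele is None else allele for i, allele in enumerate(sample)]


reference_panel = [
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1],
]
sample = [1, None, 0, None, 0]  # sites 1 and 3 were not covered by any read
imputed = impute(sample, reference_panel)
print(imputed)
```

In this toy case the sample agrees with the second reference haplotype at every observed site, so the two missing alleles are copied from it.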

However, imputation is only as good as the patterns that guide it. Accurate imputation must be based on high-quality reference panels built from in-depth sequencing data. Many such reference panels exist for imputing single nucleotide variants (SNVs) from short-read data, but very few are designed for the detection of structural variants. To build a structural variant reference panel, researchers need genome-wide long-read sequencing data that captures both complex variants and the SNVs that are normally found using short-read sequencing technology. Combined, these data can then be used to build a reference panel in which patterns of SNVs, detected using short-read sequencing, allow researchers to infer the presence of structural variants.
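
One common way to quantify how well an SNV pattern predicts a nearby structural variant is the squared correlation (r²) between the two across the haplotypes in a panel. The sketch below computes it on made-up haplotype data. The function name and the input vectors are hypothetical, and this illustrates the general concept rather than the method used to build the panel described here.

```python
# Toy linkage disequilibrium (r^2) between an SNV and a structural variant.
# Each list holds one 0/1 allele per reference haplotype; the data are made up.

def r_squared(snv, sv):
    """Squared correlation between two biallelic sites across haplotypes."""
    n = len(snv)
    p_snv = sum(snv) / n                             # SNV alternate allele frequency
    p_sv = sum(sv) / n                               # SV allele frequency
    p_both = sum(a * b for a, b in zip(snv, sv)) / n  # both carried together
    d = p_both - p_snv * p_sv                        # disequilibrium coefficient D
    denom = p_snv * (1 - p_snv) * p_sv * (1 - p_sv)
    if denom == 0:
        return 0.0  # a monomorphic site carries no information about the other
    return d * d / denom


snv_alleles = [1, 1, 0, 0, 1, 0, 1, 0]
sv_alleles = [1, 1, 0, 0, 1, 0, 0, 0]
ld = r_squared(snv_alleles, sv_alleles)
print(f"r^2 = {ld:.2f}")
```

An r² close to 1 means the SNV is a near-perfect proxy for the structural variant, so observing the SNV in short-read data says a lot about whether the structural variant is present.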

This is exactly what the teams at Gencove and Boehringer Ingelheim set out to do.

Development of a Structural Variant Imputation Panel

To unlock the potential of structural variant research, the team performed whole-genome long-read sequencing on a set of 888 well-characterized individuals from the 1000 Genomes Project.

The team found that, on average, each person possessed more than 16,000 structural variants. Roughly 107,000 distinct structural variants were identified, approximately 79,000 of which were absent in previous sequencing datasets wherein these same individuals were sequenced using short-read technology. Analysis of the structural variants’ location and predicted effects highlighted 4,406 variants that would be expected to have a significant functional impact.

With such a large set of whole-genome sequencing data, including highly resolved structural variants, the Gencove team was able to build a reference panel for the imputation of structural variants from short-read sequencing data.

The Discovery Potential of Structural Variant Imputation

To demonstrate the value of the newly developed imputation panel, the team imputed the structural variants detected in their long-read dataset into the UK Biobank, a widely used biorepository that primarily consists of short-read sequencing data and phenotypic data collected from ~500,000 individuals in the United Kingdom. With structural variants imputed, the team then performed a GWAS, testing for associations with 32 phenotypes (19 continuous, 13 binary) related to respiratory, metabolic, and liver diseases.
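
As a rough illustration of what a single association test for a continuous phenotype can look like, the sketch below regresses a phenotype on the imputed dosage of one structural variant (a value between 0 and 2 copies) and reports the effect size, its standard error, and the t-statistic. The function name and data are hypothetical, and this is not the authors' pipeline: a real GWAS adjusts for covariates such as age, sex, and ancestry, and binary phenotypes are typically analyzed with logistic rather than linear regression.

```python
import math

# Toy single-variant association test: simple linear regression of a continuous
# phenotype on imputed SV dosage. All names and values are hypothetical.

def dosage_association(dosages, phenotypes):
    """Return (beta, standard_error, t_statistic) for phenotype ~ dosage."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx                       # effect per additional SV copy
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)    # standard error of the slope
    return beta, se, beta / se


dosages = [0, 1, 2, 0, 1, 2]               # imputed copies of the SV allele
phenotypes = [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]
beta, se, t_stat = dosage_association(dosages, phenotypes)
print(f"beta={beta:.3f} se={se:.3f} t={t_stat:.1f}")
```

In a genome-wide analysis, a test like this is repeated for every imputed variant and every phenotype, and the resulting p-values are held to a stringent multiple-testing threshold.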

The results revealed 3,858 significant structural variant associations. Further analyses showed 10,518 instances where structural variants were significantly associated with the abundance of specific proteins. By including structural variation as a new data modality in these association studies, the team was also able to replicate and further fine-map previous GWAS hits.

In a previous study, Shrine et al. (2023) analyzed UK Biobank GWAS data with the goal of prioritizing genes and SNVs in phenotype-associated regions of DNA, ranking them according to their calculated probability of influencing the phenotype [3]. In the present study, approximately 70 of the loci analyzed by Shrine et al. were found to contain a structural variant that was significantly associated with the same phenotype. Notably, at 14 of these loci, structural variants were conditionally independent primary or secondary signals, suggesting that these variants are likely responsible for the given phenotype and strengthening the evidence that the affected gene is tied to the measured phenotype.

The authors offer several specific examples, including one at the AAGAB gene locus. This gene was found to be significantly associated with a measure of lung function (forced vital capacity, or FVC). Previous analysis by Shrine et al. (2023) [3] found no significant variants linking AAGAB to FVC; however, phenome-wide analyses and other proteomic studies have suggested that AAGAB has a causal link to FVC. In the present study, two large deletions were found in AAGAB, both of which were significantly associated with lung function measurements.

The Robust Value of Structural Variant Imputation

Collectively, this research contributes an immensely valuable resource to the genetics community by enabling structural variant imputation using short-read sequencing data.

That this resource was built with long-read sequencing data and a diverse cohort is critical to its value. Use of a diverse cohort is important because the frequency and influence of structural variants, like other mutations, can vary depending on a person’s ancestral background. Indeed, this study found substantial variation among ancestral backgrounds, such that only about 40% of the identified structural variants were shared across all populations. By training this imputation panel on a diverse population, the team makes it possible to accurately apply this resource to diverse data sets around the world, thus democratizing access to structural variant imputation.

The present study demonstrated how this imputation panel can be used to draw new findings from existing, well-studied short-read sequencing data sets such as the UK Biobank. In recent years, many population genomics and biobanking projects have been launched and have begun to amass substantial amounts of short-read sequencing data. Despite meticulous study of these data sets, the presence and influence of structural variants have been largely overlooked. With this imputation panel, researchers can transform these large data sets into powerful resources for the study of structural variants at a fraction of the cost of long-read sequencing.

By developing innovative approaches to data analytics and bioinformatics, Gencove is empowering researchers to pursue the most ambitious applications of genomics.

Access to the long-read sequencing imputation panel can be found via the OpnMe initiative of Boehringer Ingelheim GmbH.

References

  1. Sedlazeck, Fritz J., et al. “Accurate Detection of Complex Structural Variations Using Single-Molecule Sequencing.” Nature Methods, vol. 15, no. 6, 30 Apr. 2018, pp. 461–468, https://doi.org/10.1038/s41592...

  2. Lowe, William L., and Timothy E. Reddy. “Genomic Approaches for Understanding the Genetics of Complex Disease.” Genome Research, vol. 25, no. 10, Oct. 2015, pp. 1432–1441, https://doi.org/10.1101/gr.190....

  3. Shrine, Nick, et al. “Multi-Ancestry Genome-Wide Association Analyses Improve Resolution of Genes and Pathways Influencing Lung Function and Chronic Obstructive Pulmonary Disease Risk.” Nature Genetics, vol. 55, no. 3, 1 Mar. 2023, pp. 410–422, https://doi.org/10.1038/s41588...

  4. Ebbert, Mark T. W., et al. “Systematic Analysis of Dark and Camouflaged Genes Reveals Disease-Relevant Genes Hiding in Plain Sight.” Genome Biology, vol. 20, no. 1, 20 May 2019, https://doi.org/10.1186/s13059...

  5. Noyvert, Boris, et al. “Imputation of Structural Variants Using a Multi-Ancestry Long-Read Sequencing Panel Enables Identification of Disease Associations.” MedRxiv (Cold Spring Harbor Laboratory), 22 Dec. 2023, https://www.medrxiv.org/conten...