Jeremy Li, Head of Data Science - Mar 03, 2021

Gencove’s new pig haplotype reference panel

At Gencove, we’re continuously expanding the selection of species that can be used with imputation pipelines based on low-pass sequencing. Unlike cattle, which have the 1000 Bull Genomes Project [1], pigs do not currently have a large-scale public resequencing effort to characterize the genetic diversity of extant breeds used for agricultural purposes, and the existing literature on the performance of genotype imputation in pigs is primarily limited to genotyping array imputation [2,3,4].

To address this current shortcoming, we have recently developed and released a pipeline for low-pass sequencing in pigs on the latest reference genome (Sscrofa11.1/susScr11) and a diverse haplotype reference panel encompassing the range of genetic diversity in the most common breeds found in the swine industry.

This panel enables both low-cost genotyping for genomic selection and an unprecedented level of resolution for a high-throughput assay for research and fine-mapping in swine.

In order to test the performance of our imputation pipeline, we performed a pilot in partnership with the USDA-ARS Meat Animal Research Center (USDA-ARS MARC) wherein we sequenced and imputed 83 boar samples provided by the USDA-ARS and compared the imputed genotypes to those deriving from an orthogonal assay, the GGP Porcine 50K array. Thanks to their historical public sequencing efforts, the MARC populations are relatively well represented in our panel, but as we will see below, even the subset of samples deriving from breeds not well-represented in our panel imputed reasonably well due to the large amount of shared genetics between pig breeds.

Haplotype reference panel construction

To construct our haplotype reference panel, we first collated publicly available swine sequence data on the Sequence Read Archive (SRA) from 414 individuals. We then created a haplotype reference panel using our in-house reference panel building pipelines that have been applied and validated on a number of other agricultural (and human) species. The resulting reference panel comprises 53M SNPs and short indels.

To examine the population structure of the samples comprising our reference panel, we performed PCA on a subset of the markers in our panel. In the following figure, the axes are the first two principal components of the marker subset, and each point on the plot represents an individual in our reference panel. Each point is colored by the sample breed, and animals of the same breed can be seen to cluster together, illustrating the distinctness of population groups. These lines span the full genetic diversity of publicly available data, and therefore ensure that any incoming sample sharing genetics with these individuals will be accurately characterized by imputation to this reference panel.
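As a rough illustration of the kind of projection described above, here is a minimal sketch of PCA on a small genotype matrix. It is not the pipeline we used: the function name `top_components`, the dosage encoding (0, 1 or 2 alternate alleles) and the plain power-iteration approach are assumptions chosen to keep the example dependency-free, and real panels would normally be analyzed with dedicated tools.

```python
import math


def top_components(genotypes, n_components=2, n_iter=200):
    """Return per-sample scores on the leading principal components.

    genotypes: list of rows, one per sample, each a list of
    alternate-allele dosages (0, 1 or 2) at the same markers.
    Hypothetical helper for illustration only.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Centre each marker on its mean dosage across samples.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Sample-by-sample Gram matrix; its eigenvectors give the PC scores.
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]

    scores = []
    for c in range(n_components):
        # Deterministic, generic start vector for power iteration.
        v = [math.sin(i + 1 + c) for i in range(n)]
        for _ in range(n_iter):
            w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
            norm = math.sqrt(sum(t * t for t in w))
            v = [t / norm for t in w]
        # Rayleigh quotient gives the eigenvalue (squared singular value).
        eigval = sum(v[i] * sum(gram[i][k] * v[k] for k in range(n))
                     for i in range(n))
        scores.append([math.sqrt(max(eigval, 0.0)) * t for t in v])
        # Deflate so the next pass finds the next component.
        gram = [[gram[i][k] - eigval * v[i] * v[k] for k in range(n)]
                for i in range(n)]
    # One row per sample, one column per component.
    return [list(s) for s in zip(*scores)]
```

Plotting the first column against the second for each sample, colored by breed, gives a figure of the kind described above. The sign of each component is arbitrary, so only the relative positions of the samples are meaningful.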

USDA-ARS Boar Pilot

To test the performance of the haplotype reference panel on out-of-sample material, we partnered with the USDA-ARS MARC on a pilot project wherein we performed low-pass sequencing on 83 boar samples provided by the USDA-ARS to target coverages of 0.5x-1x. These samples were chosen because they had been previously genotyped using the GGP Porcine 50K array, therefore providing a comparison callset against which we could benchmark our imputation pipeline.

The 83 boar samples derived from a number of distinct breeds, which were represented to varying degrees in our haplotype reference panel, allowing us to examine the effects of representation on imputation performance.

To measure the performance of imputation on these samples, we first took the intersection between the reference panel variants and the variants assayed by the genotyping array. We then computed genotype concordance between the (confidently) imputed sequence genotypes and the array genotypes at these sites. On average, there were ~48k overlapping sites, with 46k of these overlapping sites being confidently imputed.
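The comparison described above can be sketched in a few lines of Python. This is an illustration rather than our production code: the function name `genotype_concordance`, the dictionary layout, and in particular the `min_confidence=0.9` cutoff used to define a "confident" call are all assumptions made for the example.

```python
def genotype_concordance(imputed, array, min_confidence=0.9):
    """Compare imputed genotypes with array genotypes at shared sites.

    imputed: dict mapping (chrom, pos) -> (genotype, confidence)
    array:   dict mapping (chrom, pos) -> genotype
    Genotypes are alternate-allele counts (0, 1 or 2).
    min_confidence is a hypothetical cutoff for a "confident" call.

    Returns (n_overlapping, n_confident, concordance).
    """
    # Intersection of sites present in both callsets.
    shared = imputed.keys() & array.keys()
    # Keep only sites where the imputed call passes the confidence cutoff.
    confident = [site for site in shared if imputed[site][1] >= min_confidence]
    if not confident:
        return len(shared), 0, float("nan")
    # Fraction of confident sites where both assays give the same genotype.
    matches = sum(imputed[site][0] == array[site] for site in confident)
    return len(shared), len(confident), matches / len(confident)
```

Applied per sample and then averaged within each breed, this yields breed-level concordance figures of the kind reported in this post.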

The following table contains the resulting performance metrics broken down by breed. Shown are the average effective coverage for each breed, along with the number of assayed animals and the breed’s representation in the reference panel (“N of breed in reference panel”).

Overall, imputation performance is equal to or better than the state of the art [2,3,4], with average genotype concordance to the array ranging from 93% for Hampshire boars to 97.4% for Yorkshire/Large White boars. Because these are comparisons to genotyping array data (which themselves have a nontrivial error rate), these figures should be treated as a lower bound on performance relative to the true underlying genotypes. A more comprehensive analysis, which remains to be done, would benchmark the imputed sequence data against genotypes called from high-depth whole genome sequence. It is also worth noting that the imputation accuracy here, while consistent with the literature, is somewhat lower than results observed for cattle (see for example [5], where concordance between low-pass and array genotypes across the MAF spectrum for diverse cattle consistently exceeded 98%). The reasons for this are currently not entirely clear. However, the best-performing sample (a Yorkshire boar) had a genotype concordance of 99% to the genotyping array, indicating the potential for very high imputation accuracy for animals that share especially large amounts of genetics with the individuals in the reference panel.

Even with this limited data, the effect of increased representation in the reference panel is easy to observe. For instance, even though the Pietrain samples had a markedly lower average effective coverage than the Hampshire samples (0.48x vs 0.79x, respectively), the Pietrain samples imputed more accurately (96.5% vs 93.0% concordance) due to substantial Pietrain representation in the reference panel (32 purebred Pietrain vs 0 purebred Hampshire).

Based on work in other swine projects and in other species, we predict that adding a few dozen representative Hampshire individuals to the reference panel would be sufficient to bring the breed-level imputation accuracy up to levels comparable to the other breeds evaluated in this project. This illustrates how readily our approach extends imputation to proprietary or private breeding populations: typically, the addition of just a few dozen high-coverage sequenced samples is adequate to represent even more unusual breeds and assure accurate imputation based on low-pass sequencing data.

Conclusion

We have constructed a comprehensive haplotype reference panel encompassing a large range of genetic diversity in pigs, and have validated its performance with out-of-sample individuals provided by the USDA-ARS. This pipeline is now publicly available on our systems, and existing customers can access it by choosing the project configuration “Swine low-pass v1.0” when creating a new Gencove project.

We’re excited to be part of the next generation of genomic selection and breeding in pigs. Please reach out with any suggestions or comments!

Acknowledgments

We wish to thank Gary Rohrer and colleagues at the USDA-ARS MARC for their involvement in this collaboration.

Bibliography

1 Hayes, Ben J., and Hans D. Daetwyler. “1000 bull genomes project to map simple and complex genetic traits in cattle: applications and outcomes.” Annual Review of Animal Biosciences 7 (2019): 89–102.

2 Zhang, Chunyan, et al. “Genomic evaluation of feed efficiency component traits in Duroc pigs using 80K, 650K and whole-genome sequence variants.” Genetics Selection Evolution 50.1 (2018): 1–13.

3 Wu, Pingxian, et al. “GWAS on imputed whole-genome resequencing from genotyping-by-sequencing data for farrowing interval of different parities in pigs.” Frontiers in Genetics 10 (2019): 1012.

4 van den Berg, Sanne, et al. “Imputation to whole-genome sequence using multiple pig populations and its use in genome-wide association studies.” Genetics Selection Evolution 51.1 (2019): 1–13.

5 Snelling, Warren M., et al. “Assessment of Imputation from Low-Pass Sequencing to Predict Merit of Beef Steers.” Genes 11.11 (2020): 1312.