Gillian Belbin, Senior Data Scientist - Jun 10, 2024

Enhancing genome-wide association analyses with whole genome data in the UK Biobank

The recent release of Whole Genome Sequencing (WGS) data for 490,640 participants in the UK Biobank (UKBB) has presented researchers with opportunities for comprehensive assessment of genomic risk factors underlying disease at an unprecedented resolution and scale.

Integrating WGS with data from Electronic Health Records (EHR) in the context of genome-wide discovery allows for the interrogation of both non-coding genetic variants, which exome sequencing does not capture, and rare variation that older array-based genotyping technologies do not readily detect. As a result, insights from Genome-Wide Association Studies (GWAS) that leverage WGS can offer a more complete picture of the role of genetic factors in disease risk across a wide array of clinical outcomes.

With this release of data, we wanted to explore two traits of interest, Body Mass Index (BMI) and Type II Diabetes (T2D), chosen for their high prevalence and significant morbidity risk. Our aim was to understand how the increased resolution impacts genome-wide associations for these traits.

GWAS results


We first performed GWAS for BMI via linear regression in all unrelated “White/British” samples in UKBB, adjusting for age, sex and the first 10 genetic principal components, and using all available non-singleton genetic variants in the WGS data (n=445,985,289 variants). In doing so we detected n=15,953 sites associated with BMI that surpassed a suggestive genome-wide significance threshold (p<1e-8, Figure 1).

Figure 1. Genome-wide association of n=445,985,289 variants captured by Whole Genome Sequencing for Body Mass Index in the UK Biobank.
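The per-variant test described above can be sketched as an ordinary least squares fit of the phenotype on one variant's allele dosage plus covariates. This is a minimal illustration on synthetic data, not the pipeline used for the analysis: the function name `linear_assoc`, the simulated effect size, and the two covariates are assumptions, and the p-value uses a normal approximation that is only reasonable for large samples.

```python
import math
import numpy as np

def linear_assoc(genotype, phenotype, covariates):
    """Regress a phenotype on one variant's allele dosage plus covariates.

    Returns (beta, standard error, two-sided p-value) for the genotype term.
    The p-value uses a normal approximation to the t distribution.
    """
    n = len(phenotype)
    # Design matrix: intercept, genotype dosage, then the covariates.
    X = np.column_stack([np.ones(n), genotype, covariates])
    coef, _, _, _ = np.linalg.lstsq(X, phenotype, rcond=None)
    resid = phenotype - X @ coef
    dof = n - X.shape[1]
    sigma2 = (resid @ resid) / dof            # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
    beta = float(coef[1])
    se = math.sqrt(cov[1, 1])
    p = math.erfc(abs(beta / se) / math.sqrt(2))
    return beta, se, p

# Synthetic example (all values are made up for illustration).
rng = np.random.default_rng(0)
n = 5000
age = rng.normal(50, 8, n)
sex = rng.integers(0, 2, n)
covariates = np.column_stack([age, sex])
genotype = rng.binomial(2, 0.3, n).astype(float)   # 0, 1 or 2 copies
phenotype = 0.2 * genotype + 0.01 * (age - 50) + 0.1 * sex + rng.normal(0, 1, n)

beta, se, p = linear_assoc(genotype, phenotype, covariates)
print(f"beta={beta:.3f} se={se:.3f} p={p:.2e}")
```

In a real analysis the covariate matrix would also carry the genetic principal components, and the same fit would be repeated for every variant, typically with dedicated GWAS software rather than a loop like this.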

The top association was at the FTO locus (rs7187250, p<4.786e-176, Beta=0.071), an extensively characterized signal that has been heavily implicated in the biology of BMI [1,2,3]. We also recapitulate the signal at the MC4R locus (rs6567160, p<1.155e-71, Beta=0.05), another well-characterized locus previously associated with BMI [4,5]. Of all the sites that surpass the genome-wide suggestive threshold, >99.9% (n=15,945) map within 25Kb of an association with BMI previously reported in the GWAS Catalog. The remaining 8 loci reached the same threshold but lay more than 25Kb from any known BMI locus (Table 1). Most of these fall within intergenic regions of the genome (N=7/8), and they may represent novel hits for this trait.

Chr   Position    Risk Allele   Beta       P-Value     Gene   Annotation
20    30908722    A             -0.03969   5.159e-09   -      regulatory_region_variant

Table 1. Genome-wide suggestive associations with Body Mass Index that fell outside of 25Kb of previously reported associations.
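The 25Kb proximity check used to separate known from potentially novel signals can be sketched with a sorted position index per chromosome. This is only an illustration of the idea: the helper names `build_index` and `near_known` are hypothetical, and the coordinates below are toy values rather than entries from the GWAS Catalog.

```python
from bisect import bisect_left

WINDOW = 25_000  # 25Kb on either side of a reported association

def build_index(known_loci):
    """Group known (chrom, pos) associations into sorted position lists."""
    index = {}
    for chrom, pos in known_loci:
        index.setdefault(chrom, []).append(pos)
    for positions in index.values():
        positions.sort()
    return index

def near_known(chrom, pos, index, window=WINDOW):
    """Return True if (chrom, pos) lies within `window` bp of a known locus."""
    positions = index.get(chrom, [])
    # First known position that is >= pos - window, if any.
    i = bisect_left(positions, pos - window)
    return i < len(positions) and positions[i] <= pos + window

# Toy coordinates, made up for illustration.
known = build_index([("16", 53_800_000), ("18", 60_200_000)])
hits = [("16", 53_810_000), ("18", 60_300_000), ("20", 1_000_000)]

flags = [near_known(chrom, pos, known) for chrom, pos in hits]
novel = [hit for hit, flag in zip(hits, flags) if not flag]
print(flags, novel)
```

Sorting once and using binary search keeps each lookup logarithmic in the number of known loci on a chromosome, which matters when many thousands of significant sites are checked.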

To explore whether the increased density of observed variants in the UKBB WGS might help to improve fine-mapping efforts, we compared the signal at the MC4R locus in the UKBB WGS analysis with the signal from an otherwise identical analysis performed using the genotype array data available for the UK Biobank (Figure 2).

Figure 2. Comparison of genotypic density at the MC4R locus between WGS and array-based GWAS for BMI in the UKBB.

Notably, we observe n=477 sites at p<1e-8 at MC4R (GRCh38 chr18:60.0-60.5Mb) via UKBB WGS, compared with only n=30 sites at the same locus using the array data. This suggests that the additional genomic resolution offered by WGS provides information that can help fine-map causal variants at complex GWAS loci.
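A comparison of this kind amounts to counting the variants in a genomic window that pass the significance threshold in each set of summary statistics. The sketch below assumes each result is a dictionary with `chrom`, `pos` and `p` keys, which is a hypothetical layout, and it uses a handful of made-up records rather than the actual WGS or array results.

```python
def count_sites(results, chrom, start, end, p_threshold=1e-8):
    """Count variants on `chrom` within [start, end] with p below the threshold."""
    return sum(
        1
        for r in results
        if r["chrom"] == chrom and start <= r["pos"] <= end and r["p"] < p_threshold
    )

# Made-up summary statistics for illustration only.
wgs_results = [
    {"chrom": "18", "pos": 60_100_000, "p": 3e-12},
    {"chrom": "18", "pos": 60_150_000, "p": 5e-10},
    {"chrom": "18", "pos": 60_400_000, "p": 2e-4},
    {"chrom": "18", "pos": 61_000_000, "p": 1e-20},  # outside the window
]
array_results = [
    {"chrom": "18", "pos": 60_100_000, "p": 4e-12},
]

wgs_count = count_sites(wgs_results, "18", 60_000_000, 60_500_000)
array_count = count_sites(array_results, "18", 60_000_000, 60_500_000)
print(wgs_count, array_count)
```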

Type 2 diabetes

We next performed a similar exercise using T2D as the phenotypic outcome (Figure 3). We defined case-control status for T2D using affirmative answers to the following two UK Biobank entries: “diabetes diagnosed by doctor” (2443), and “ever had diabetes” (120007). We used the aggregate of these two entries as the outcome variable in a logistic regression, with the same covariates, individuals, and set of genetic sites as described for our analysis of BMI.

Figure 3. Genome-wide association of n=445,985,289 variants captured by Whole Genome Sequencing for Type II Diabetes in the UK Biobank.
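The case definition and logistic model can be sketched as follows. Two points here are assumptions rather than details given in the text: that "aggregate" means a participant counts as a case if either entry is affirmative, and that a plain Newton-Raphson fit is an acceptable stand-in for whatever software was actually used. The function names and the simulated data are illustrative only.

```python
import numpy as np

def define_cases(diagnosed_by_doctor, ever_had_diabetes):
    """Case (1.0) if either entry is affirmative; assumed reading of 'aggregate'."""
    a = np.asarray(diagnosed_by_doctor, dtype=bool)
    b = np.asarray(ever_had_diabetes, dtype=bool)
    return (a | b).astype(float)

def fit_logistic(X, y, max_iter=50, tol=1e-8):
    """Fit logistic regression by Newton-Raphson; X must include an intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))       # fitted probabilities
        weights = mu * (1.0 - mu)
        gradient = X.T @ (y - mu)
        information = X.T @ (X * weights[:, None])   # X' W X
        step = np.linalg.solve(information, gradient)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Combining the two questionnaire entries (toy answers).
status_demo = define_cases([1, 0, 0], [0, 0, 1])

# Synthetic association test; the simulated odds ratio is a made-up parameter.
rng = np.random.default_rng(1)
n = 20000
genotype = rng.binomial(2, 0.3, n).astype(float)
age = rng.normal(0, 1, n)                            # standardized covariate
logit = -2.5 + np.log(1.36) * genotype + 0.3 * age
status = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

X = np.column_stack([np.ones(n), genotype, age])
beta = fit_logistic(X, status)
adj_or = float(np.exp(beta[1]))                      # covariate-adjusted odds ratio
print(f"adjusted OR per allele: {adj_or:.2f}")
```

Exponentiating the genotype coefficient gives the per-allele odds ratio adjusted for the other columns of the design matrix, which is the quantity reported as adjOR in the results.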

In this analysis we successfully recapitulate the known association between T2D and the TCF7L2 locus (rs7903146; p<4.53e-175; adjOR=1.36) as our top signal [6]. We further demonstrate that, of the associations surpassing the genome-wide suggestive level of significance, N=5156/5171 fall within 25Kb of a known T2D-associated locus (i.e. mapped to the phenotype “type 2 diabetes mellitus” in the GWAS Catalog).

The remaining 15 genome-wide suggestive loci (the top 10 of which are shown in Table 2) again mostly represent intergenic variants (11/15).

Chr   Position     Risk Allele   OR (adjusted)   P-Value     Gene                   Annotation
5     163855722    C             5.876           4.208e-09   Processed pseudogene   -

Table 2. Genome-wide suggestive associations with Type II Diabetes that fell outside of 25Kb of previously reported associations.

However, several variants fall within the introns of protein-coding genes. Most notably, the top association among these (chr7:158162695:C/T; p<5.242e-10; adjOR=6.12) falls within an intron of the gene PTPRN2. This gene encodes a tyrosine phosphatase that is expressed in pancreatic islet cells and that has previously been mechanistically implicated in the insulin signaling pathway [7], with knockout mice for PTPRN2 exhibiting impaired insulin secretion in vivo [8]. PTPRN2 has previously been shown to play a role in the pathophysiology of Type I Diabetes [9], and has been associated with BMI in prior genomic studies [10]. However, it is presently not associated with “Type II Diabetes Mellitus” in the GWAS Catalog.

We note the caveat that this variant is rare within the UKBB WGS (MAF=0.006%), and has not previously been reported in other large publicly available WGS databases (i.e. gnomAD). Given the rarity of this variant within UKBB, we sought to explore its fine-scale segregation within the UK by examining its carrier rate by place of birth (Figure 4). The majority of carriers report their place of birth as North West England, with several additional carriers observed in London and the South of England.

Figure 4. Fine-scale segregation of PTPRN2 variant in participants born within the UK.
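The two quantities used here, minor allele frequency and carrier rate by place of birth, can be computed directly from per-individual allele dosages. The sketch below is a toy illustration: the function names are hypothetical, the region labels are placeholders, and the data are made up rather than drawn from UKBB.

```python
from collections import Counter

def minor_allele_frequency(dosages):
    """MAF from alternate-allele counts (0, 1 or 2) in diploid individuals."""
    alt_freq = sum(dosages) / (2 * len(dosages))
    return min(alt_freq, 1 - alt_freq)

def carrier_rate_by_region(dosages, regions):
    """Fraction of individuals per region carrying at least one alternate allele."""
    totals = Counter(regions)
    carriers = Counter(r for d, r in zip(dosages, regions) if d > 0)
    return {region: carriers[region] / totals[region] for region in totals}

# Toy data: eight individuals across two placeholder regions.
dosages = [0, 1, 0, 0, 2, 0, 0, 0]
regions = ["A", "A", "A", "A", "B", "B", "B", "B"]

maf = minor_allele_frequency(dosages)
rates = carrier_rate_by_region(dosages, regions)
print(maf, rates)
```

Dividing by the number of participants born in each region, rather than reporting raw carrier counts, keeps densely populated regions from dominating the picture.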

Overall, we demonstrate that GWAS using WGS data can recapitulate the results of previous studies and, owing to the additional site density, can also help refine association signals at the loci it uncovers. Furthermore, the deep coverage of intergenic and intronic regions that WGS adds over pre-existing technologies has the potential to facilitate novel discovery, aid the identification of new therapeutic targets, and advance precision medicine efforts as a whole.


1. Frayling, T. M. et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889–894 (2007).

2. Loos, R. J. F. & Yeo, G. S. H. The bigger picture of FTO: the first GWAS-identified obesity gene. Nat. Rev. Endocrinol. 10, 51–61 (2014).

3. Claussnitzer, M. et al. FTO Obesity Variant Circuitry and Adipocyte Browning in Humans. N. Engl. J. Med. 373, 895–907 (2015).

4. Lotta, L. A. et al. Human Gain-of-Function MC4R Variants Show Signaling Bias and Protect against Obesity. Cell 177, 597–607.e9 (2019).

5. Farooqi, I. S. et al. Clinical spectrum of obesity and mutations in the melanocortin 4 receptor gene. N. Engl. J. Med. 348, 1085–1095 (2003).

6. Grant, S. F. A. et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. 38, 320–323 (2006).

7. Torii, S. et al. The pseudophosphatase phogrin enables glucose-stimulated insulin signaling in pancreatic β cells. J. Biol. Chem. 293, 5920–5933 (2018).

8. Yasui, T. et al. Insulin granule morphology and crinosome formation in mice lacking the pancreatic β cell-specific phogrin (PTPRN2) gene. Histochem. Cell Biol. 161, 223–238 (2024).

9. van den Maagdenberg, A. M. et al. Assignment of Ptprn2, the gene encoding receptor-type protein tyrosine phosphatase IA-2beta, a major autoantigen in insulin-dependent diabetes mellitus, to mouse chromosome region 12F. Cytogenet. Cell Genet. 82, 153–155 (1998).

10. Lee, S. The association of genetically controlled CpG methylation (cg158269415) of protein tyrosine phosphatase, receptor type N2 (PTPRN2) with childhood obesity. Sci. Rep. 9, 4855 (2019).