Research Article: An exploratory phenome wide association study linking asthma and liver disease genetic variants to electronic health records from the Estonian Biobank

Date Published: April 12, 2019

Publisher: Public Library of Science

Author(s): Glen James, Sulev Reisberg, Kaido Lepik, Nicholas Galwey, Paul Avillach, Liis Kolberg, Reedik Mägi, Tõnu Esko, Myriam Alexander, Dawn Waterworth, A. Katrina Loomis, Jaak Vilo, Gyaneshwer Chaubey.


The Estonian Biobank, governed by the Institute of Genomics at the University of Tartu (Biobank), has stored genetic material (DNA) and continuously collected data since 2002 on a total of 52,274 individuals, representing ~5% of the Estonian adult population, and this number is increasing. To explore the utility of data available in the Biobank, we conducted a phenome-wide association study (PheWAS) in two areas of interest to healthcare researchers: asthma and liver disease. We used 11 asthma-associated and 13 liver disease-associated single nucleotide polymorphisms (SNPs), identified from published genome-wide association studies, to test our ability to detect established associations. We confirmed 2 asthma-associated and 5 liver disease-associated variants at nominal significance and directionally consistent with published results. We found 2 associations in the opposite direction to previously published results (rs4374383:AA increases risk of NASH/NAFLD, rs11597086 increases ALT level). Three SNP-diagnosis pairs passed the phenome-wide significance threshold: rs9273349 and E06 (thyroiditis, p = 5.50×10⁻⁸); rs9273349 and E10 (type-1 diabetes, p = 2.60×10⁻⁷); and rs2281135 and K76 (non-alcoholic liver diseases, including NAFLD, p = 4.10×10⁻⁷). We have validated our approach and confirmed the quality of the data for these conditions. Importantly, we demonstrate that the extensive amount of genetic and medical information from the Estonian Biobank can be successfully utilized for scientific research.

Partial Text

Genetic data are an important resource for scientific research and potential drug target identification [1], and genome-wide association studies (GWAS) have identified many disease-associated genetic variants [2]. Complementary to GWAS are phenome-wide association studies (PheWAS), which use a genotype-to-phenotype approach, testing for associations between specific genetic variants and a wide spectrum of phenotypes [3]. Combining genetic data with phenotypes defined and validated in electronic health records (EHRs) permits testing of associations between genetic variants and disease outcomes, including diagnoses and procedures not commonly found in GWAS. To date, despite a relative abundance of EHR and genetic data becoming available, few large-scale PheWAS have linked these data [4]. Understanding the full range of associations, along with the functional mechanisms of causal genetic variants, will have important implications for the design of novel therapies across indications to reduce morbidity and mortality.
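The genotype-to-phenotype scan described above can be sketched in a few lines. This is a minimal illustration only: it uses a simple allelic chi-square test on a 2x2 table of allele counts, which is a simplification chosen for brevity and is not stated to be the method used in the study. The function names (`allelic_chi2_pvalue`, `phewas_scan`) and the input layout (per-person effect-allele counts and per-person sets of diagnosis codes) are assumptions made for this example.

```python
import math


def allelic_chi2_pvalue(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row_totals)
    # A table with an empty row or column carries no information.
    if 0 in row_totals or 0 in col_totals:
        return 1.0
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def phewas_scan(genotypes, diagnoses):
    """Test one variant against every diagnosis code seen in the records.

    genotypes: dict person_id -> number of effect alleles (0, 1 or 2)
    diagnoses: dict person_id -> set of diagnosis codes (hypothetical layout)
    Returns a dict mapping each code to a p-value.
    """
    codes = sorted({code for person_codes in diagnoses.values() for code in person_codes})
    results = {}
    for code in codes:
        case_alt = case_ref = ctrl_alt = ctrl_ref = 0
        for person, alt in genotypes.items():
            if code in diagnoses.get(person, set()):
                case_alt += alt
                case_ref += 2 - alt
            else:
                ctrl_alt += alt
                ctrl_ref += 2 - alt
        results[code] = allelic_chi2_pvalue(case_alt, case_ref, ctrl_alt, ctrl_ref)
    return results
```

A real analysis would typically use a regression model adjusted for covariates such as age and sex; the loop structure (one variant, many phenotypes) is the part this sketch is meant to show.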

In an effort to assess the utility of the genetic and phenotypic data available in the Estonian Biobank, our first two objectives were to show our ability to detect previously reported GWAS associations of (1) asthma-associated SNPs with biomarkers (lab measures) and clinical diagnoses of asthma, and (2) liver disease-associated SNPs with clinical diagnoses of NAFLD/NASH within the Estonian Biobank. Both objectives served, first, to test the Biobank's suitability for validation/replication studies and, second, to confirm that the Biobank data contain no systematic errors before moving on to objective three: to conduct a PheWAS testing the association of the selected 24 SNPs with all other ICD-10 clinical diagnoses. Where significant associations between SNPs and ICD-10 diagnostic codes were found, our fourth objective was to determine associations of the corresponding SNPs with lab/biomarker measures (quantitative traits) recorded prior to the date of diagnosis.
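Objective three involves a large number of tests (24 SNPs against every ICD-10 diagnosis code), so a threshold stricter than nominal significance is needed. A minimal sketch of one common choice, a Bonferroni correction, is given here. The text above does not state which correction the study used, and the number of diagnosis codes in the example is a hypothetical placeholder, not a figure from the study.

```python
def bonferroni_threshold(alpha, n_variants, n_phenotypes):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_variants <= 0 or n_phenotypes <= 0:
        raise ValueError("numbers of variants and phenotypes must be positive")
    return alpha / (n_variants * n_phenotypes)


# 11 asthma-associated plus 13 liver disease-associated SNPs, as stated in the text.
N_SNPS = 11 + 13

# HYPOTHETICAL placeholder: the number of ICD-10 codes tested is not given here.
N_CODES_HYPOTHETICAL = 1000

threshold = bonferroni_threshold(0.05, N_SNPS, N_CODES_HYPOTHETICAL)
print(f"per-test threshold with {N_SNPS * N_CODES_HYPOTHETICAL} tests: {threshold:.2e}")
```

Bonferroni is conservative when phenotypes are correlated, as diagnosis codes often are, which is why some analyses prefer false discovery rate control instead.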

After applying the exclusion criteria, 26,766 (51.2%) Caucasian individuals remained from the original 52,274 (Table 2). The main reason for the drop in the number of samples was missing genotype data (individuals not genotyped with the given genotyping array). In this study, most participants were female (71.8%), 41.6% of individuals had a body mass index (BMI) of 18.5–25 (normal weight), 59.2% were never smokers and 75.9% were of Estonian nationality (Table 3).

This is the first PheWAS using Estonian Biobank data linked to EHRs. Large-scale PheWAS are scarce [44], with other PheWAS utilising mostly small cohorts [42,45–52]. Recently, large-scale PheWAS have become possible in the UK Biobank, but other biobanks will still be required for further validation or replication. To assess the Estonian Biobank’s suitability for such studies and for validating or replicating associations, we focused on asthma- and liver disease-associated SNPs only and, as a first task, investigated whether the effects previously reported in the literature (GWAS) could also be detected in our data. This also served to confirm that the data have no systematic errors and are suitable for PheWAS analysis. We replicated previous GWAS results, reporting a significant association of TSLP and RORA gene variants with asthma [31,33] (Table 4) and of GCKR, PNPLA3, TRIB1 and TM6SF2 gene variants with the risk of developing liver diseases, notably NAFLD/NASH [8,11–15,20,24–26,29,53] (Table 5). We observed decreasing neutrophil levels with each additional effect allele of the BIRC3 gene variant, consistent with previously reported results. In addition, variants in HSD17B13 and ERLIN1 influenced alanine transaminase (ALT) levels. The HSD17B13 association is consistent with a recent study reporting that a loss-of-function variant is associated with decreased levels of ALT and aspartate aminotransferase (AST) and with reduced risk of liver disease and of progression from NAFLD to NASH [22]. Our observed association between rs11597086 (ERLIN1) and ALT level is in the opposite direction to what has been previously reported [21].
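The replication logic discussed above (nominal significance plus agreement in effect direction with the published result) can be written as a small check. This is a sketch under assumptions: nominal significance is taken as p < 0.05, effects are represented as signed estimates such as log odds ratios, and the class and function names are hypothetical. The example values are invented for illustration and are not results from the study.

```python
from dataclasses import dataclass


@dataclass
class Association:
    snp: str
    trait: str
    observed_effect: float   # signed effect estimate in the replication data
    observed_p: float        # p-value in the replication data
    published_effect: float  # signed effect estimate from the published GWAS


def classify_replication(assoc, alpha=0.05):
    """Label an association by significance and direction of effect."""
    if assoc.observed_p >= alpha:
        return "not significant"
    same_direction = (assoc.observed_effect > 0) == (assoc.published_effect > 0)
    return "replicated" if same_direction else "opposite direction"


# Invented illustration values with placeholder SNP identifiers.
examples = [
    Association("rsA", "trait 1", 0.20, 0.010, 0.15),
    Association("rsB", "trait 2", 0.12, 0.030, -0.10),
    Association("rsC", "trait 3", 0.05, 0.400, 0.08),
]
for a in examples:
    print(a.snp, a.trait, classify_replication(a))
```

Separating the "opposite direction" case from "not significant" matters here, because a significant effect in the opposite direction is a distinct finding that needs its own explanation rather than a simple failure to replicate.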