
Use of Electronic Health Records Genetically Validated for Research Into Bipolar Disorder

Key findings

  • Massachusetts General Hospital is part of a multinational consortium that is using automated algorithms to identify bipolar disorder (BD) cases and controls from electronic health record (EHR) systems
  • In this study, it was possible to collect DNA from BD cases and controls by linking discarded blood samples to de-identified EHR data
  • Algorithm-selected samples yielded single nucleotide polymorphism-based heritability comparable to that observed in genome-wide association studies (GWAS)
  • The cohorts ascertained by automated EHR phenotyping also exhibited substantial genetic correlations with large samples ascertained through GWAS
  • High-throughput phenotyping using the large data resources available in EHRs is a feasible way to advance genetic research in psychiatry

The use of electronic health records (EHRs) has the power to accelerate genetic research in bipolar disorder (BD). Currently the major rate-limiting step for genome-wide association studies (GWAS) is the need for ever-larger sample sizes to detect both common modest-effect variants and rarer large-effect variants.

Psychiatrists at Massachusetts General Hospital co-founded the International Cohort Collection for Bipolar Disorder (ICCBD), a consortium that applies high-throughput phenotyping methods to EHRs at sites in the U.S., U.K. and Sweden. The ICCBD previously demonstrated that automated algorithms could feasibly identify BD cases and controls from an EHR system.

Now the consortium has genetically validated its algorithms by using GWAS data to estimate single nucleotide polymorphism (SNP)-based heritability and genetic correlation with other large-scale BD samples. Jordan W. Smoller, MD, ScD, director of the Psychiatric and Neurodevelopmental Genetics Unit (PNGU); PNGU member Roy Perlis, MD; postdoctoral fellow Chia-Yen Chen, ScD; and colleagues report their findings in Translational Psychiatry.

In an earlier study, BD cases and controls were identified by creating a "datamart" of 52,235 patients from the EHR system at Mass General, which spans more than 20 years of data from 4.6 million patients. Eligible patients had at least one diagnostic code for BD or manic disorder.

Creation and Clinical Validation of the Algorithms

The researchers created four automated phenotyping algorithms to identify cases and one to identify controls:

  • 95-NLP: Expert clinicians manually reviewed narrative notes from randomly selected patients and identified gold-standard cases. They used 414 relevant features from those notes to train natural language processing software to predict BD
  • Coded-strict: A rules-based algorithm that required at least three diagnostic codes for BD, a predominance of BD diagnoses in the longitudinal record and either (a) treatment with lithium or valproate within a year of BD diagnosis or (b) treatment at a bipolar specialty clinic
  • Coded-broad: Required at least two ICD codes for BD, a predominance of BD diagnoses and treatment with at least two medications commonly used for BD
  • Coded-broad-SV: Same as coded-broad except that two or more BD diagnoses could occur during the same episode of illness
  • Controls: Patients at least 30 years of age with no diagnostic code or medication history related to a psychiatric or neurological condition
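
The rules-based definitions above can be sketched in code. The following is a minimal illustration, not the consortium's implementation: the `PatientRecord` fields are hypothetical summaries of an EHR record, and "predominance of BD diagnoses" is assumed here to mean that BD codes outnumber other psychiatric codes. The sketch does not model episodes of illness, so it cannot distinguish coded-broad from coded-broad-SV, and it omits the 95-NLP classifier entirely.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    """Hypothetical, simplified summary of one patient's EHR data."""
    age: int
    bd_code_count: int                       # diagnostic codes for BD
    other_psych_code_count: int              # other psychiatric diagnostic codes
    bd_medication_count: int                 # distinct medications commonly used for BD
    lithium_or_valproate_within_year: bool   # treated within a year of BD diagnosis
    bipolar_clinic_visit: bool               # treated at a bipolar specialty clinic
    any_psych_or_neuro_history: bool         # any psychiatric/neurological code or medication


def bd_predominant(p: PatientRecord) -> bool:
    # Assumption: "predominance" means BD codes outnumber other psychiatric codes.
    return p.bd_code_count > p.other_psych_code_count


def coded_strict(p: PatientRecord) -> bool:
    # At least three BD codes, BD predominance, and either qualifying treatment.
    return (
        p.bd_code_count >= 3
        and bd_predominant(p)
        and (p.lithium_or_valproate_within_year or p.bipolar_clinic_visit)
    )


def coded_broad(p: PatientRecord) -> bool:
    # At least two BD codes, BD predominance, and at least two BD medications.
    return p.bd_code_count >= 2 and bd_predominant(p) and p.bd_medication_count >= 2


def is_control(p: PatientRecord) -> bool:
    # At least 30 years old with no psychiatric or neurological history.
    return p.age >= 30 and not p.any_psych_or_neuro_history


# Illustrative record only; not taken from the study.
example = PatientRecord(
    age=42,
    bd_code_count=4,
    other_psych_code_count=1,
    bd_medication_count=2,
    lithium_or_valproate_within_year=True,
    bipolar_clinic_visit=False,
    any_psych_or_neuro_history=True,
)
print(coded_strict(example), coded_broad(example), is_control(example))
```

Keeping each definition as a separate predicate mirrors the way the article treats them as distinct algorithms whose outputs are validated and analyzed separately.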

One hundred ninety patients identified as BD cases or controls had in-person diagnostic interviews with blinded expert clinicians. Except for coded-broad-SV, all algorithms achieved high positive predictive value compared with the interviews.
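
Positive predictive value is the share of algorithm-flagged cases that the reference standard (here, the diagnostic interview) confirms. A minimal sketch of the calculation follows; the function name and the counts in the usage line are illustrative assumptions, not figures from the study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Return TP / (TP + FP): the fraction of flagged cases confirmed by interview."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when the algorithm flags no cases")
    return true_positives / flagged


# Hypothetical counts for illustration only.
ppv = positive_predictive_value(true_positives=45, false_positives=5)
print(f"PPV = {ppv:.2f}")
```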

DNA Sample Collection and Genotyping

For the new study, the algorithms were applied to the EHR system to ascertain case and control DNA samples by linking de-identified phenotypic data to discarded blood samples. Genotyping was performed using a high-throughput genome-wide genotyping array that includes ~250,000 common variants, ~250,000 rare variants and ~50,000 additional markers.

The researchers limited further analysis to DNA samples of European ancestry. The final dataset included 3,330 BD cases and 3,952 controls.

SNP-based Heritability

The highest heritability (0.24, P = .015) was seen with the 95-NLP algorithm. That figure, the researchers note, is nearly identical to the heritability estimated in GWAS that the ICCBD and the Psychiatric Genomics Consortium (PGC) conducted on large, traditionally ascertained BD cohorts.

The coded-strict and coded-broad algorithms also yielded significant, although relatively lower, heritability estimates (0.09–0.12). The coded-broad-SV algorithm did not exhibit significant heritability.

Even so, the overall heritability of the EHR-based BD sample was 0.12 (P = .004). The EHR-based BD definitions were nearly perfectly genetically correlated with each other, with pairwise correlations ranging from 0.98 to 1.0.
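
For readers unfamiliar with the term, SNP-based heritability is conventionally defined as the proportion of phenotypic variance attributable to the additive effects of the genotyped variants. The notation below is the standard textbook form; the article does not state which estimator or scale (observed or liability) the researchers used, so this is offered only as a general definition.

```latex
h^2_{\mathrm{SNP}} = \frac{\sigma^2_{g}}{\sigma^2_{P}}
```

Here $\sigma^2_{g}$ is the additive genetic variance captured by the genotyped SNPs and $\sigma^2_{P}$ is the total phenotypic variance.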

SNP-based Genetic Correlations

Overall, the correlation between the EHR-based BD case/control samples and the ICCBD + PGC samples was 0.83. Thus, the algorithms captured genetic influences that strongly overlap with those acting on BD in traditionally ascertained samples. The finding also suggests, according to the authors, that EHR-defined DNA samples can be combined with other samples to enhance the power of genetic discovery.
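
A genetic correlation between two samples or traits is conventionally defined as their additive genetic covariance scaled by the genetic standard deviations of each. The notation below is the standard definition, given for orientation; the symbols are not taken from the article, and the article does not describe the specific estimation method.

```latex
r_g = \frac{\sigma_{g,AB}}{\sqrt{\sigma^2_{g,A}\,\sigma^2_{g,B}}}
```

Here $\sigma_{g,AB}$ is the additive genetic covariance between samples $A$ and $B$, and $\sigma^2_{g,A}$ and $\sigma^2_{g,B}$ are their additive genetic variances. A value near 1 indicates that largely the same genetic influences act in both samples.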

In summary, the researchers write, high-throughput phenotyping using the large data resources available in EHRs is a feasible way to advance genetic research in psychiatry.

