
Machine Learning Algorithm Leverages Use of EHRs for Lung Cancer Research

Key findings

  • Researchers at Massachusetts General Hospital developed a machine learning algorithm that could assemble a large cohort of patients with lung cancer for longitudinal research
  • The algorithm incorporated structured electronic health record (EHR) data (e.g., demographics, procedure codes, medications) and relevant terms from clinical notes, extracted using natural language processing tools
  • The algorithm identified a cohort of 42,069 lung cancer patients and demonstrated an area under the receiver operating characteristic curve (AUC) of 0.93 and a positive predictive value of 94%
  • The researchers also developed a prognostic model based on the cohort; when applied to 11,724 patients with non–small cell lung cancer, the AUC for five-year overall survival was 0.81
  • Because it offers real-world treatment data and information not typically found in registries, this EHR cohort may be a powerful platform for clinical outcome studies

The growing availability of electronic health record (EHR) data presents a low-cost alternative to conducting traditional cohort studies in cancer research. However, clinical features needed to select a cohort, such as stage, biomarker profile or performance status, may not be available in structured EHR data.

Public health researchers used machine learning and natural language processing to build a lung cancer EHR cohort that is suitable for studying lung cancer progression. They also developed a prognostic model that performed well in estimating the survival of patients with non–small cell lung cancer (NSCLC).

Qianyu Yuan, PhD, of the Department of Environmental Health at Harvard T.H. Chan School of Public Health, David C. Christiani, MD, MPH, physician-investigator in the Pulmonary and Critical Care Medicine Division at Massachusetts General Hospital, and colleagues report their methodology and results in JAMA Network Open.

Developing a Classification Algorithm

First, the researchers identified 76,643 patients from the Mass General Brigham EHR system who had at least one lung cancer diagnosis code assigned between July 1988 and October 2018. Two reviewers separately examined the records of 200 randomly selected patients from that group to determine their lung cancer status (present, not present or indeterminate) and create a small set of labeled reference data.

Next, the team gathered:

  • Structured EHR data (e.g., demographics, procedure codes, medications)
  • Relevant terms from narrative notes (e.g., stage, histologic type, somatic variation), extracted using natural language processing tools

They used that information to develop a machine learning classification algorithm. Machine learning is a form of artificial intelligence in which a model learns patterns from example data rather than following explicitly programmed rules.
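To make the idea of combining the two data sources concrete, here is a minimal sketch of how structured counts and note-derived term counts might be merged into one feature set per patient. It is not the study's pipeline: the term list, the feature names and the plain keyword matching (a simple stand-in for real natural language processing tools) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical term list; the study's actual dictionary is not given here.
NOTE_TERMS = ["adenocarcinoma", "squamous cell carcinoma", "stage iv", "egfr"]


def note_term_counts(note_text, terms=NOTE_TERMS):
    """Count case-insensitive whole-phrase mentions of each term in one note."""
    text = note_text.lower()
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text))
    return counts


def build_features(structured, notes):
    """Merge structured counts with term counts summed over all notes."""
    features = dict(structured)
    totals = Counter()
    for note in notes:
        totals.update(note_term_counts(note))
    for term in NOTE_TERMS:
        features["note:" + term] = totals[term]
    return features


# Hypothetical patient record, invented for this example.
structured = {"n_lung_cancer_dx_codes": 5, "n_chemo_orders": 2}
notes = [
    "Biopsy consistent with adenocarcinoma. EGFR mutation detected.",
    "Stage IV adenocarcinoma of the lung.",
]
features = build_features(structured, notes)
print(features)
```

Keyword counting of this kind ignores negation ("no evidence of adenocarcinoma") and context, which is one reason dedicated natural language processing tools are used in practice. The resulting feature dictionary is what a classifier would then be trained on, using the manually labeled records as targets.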

When applied to the initial dataset, the algorithm identified 42,069 patients as having lung cancer. After excluding individuals with recurrent or secondary lung cancer or <14 days of follow-up after diagnosis, the final cohort remained large (n=35,735).
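The exclusion step can be pictured as a simple filter over patient records. The sketch below assumes a hypothetical record layout (the field names are invented, and how the study derived follow-up time is not described here); it only illustrates the two exclusion rules named above.

```python
from dataclasses import dataclass
from datetime import date

MIN_FOLLOWUP_DAYS = 14  # patients with fewer days of follow-up are excluded


@dataclass
class Patient:
    # Hypothetical fields, chosen for illustration only.
    patient_id: str
    diagnosis_date: date
    last_followup_date: date
    recurrent_or_secondary: bool


def is_eligible(p):
    """Keep primary lung cancers with at least 14 days of follow-up."""
    followup_days = (p.last_followup_date - p.diagnosis_date).days
    return (not p.recurrent_or_secondary) and followup_days >= MIN_FOLLOWUP_DAYS


# Invented example records.
patients = [
    Patient("A", date(2010, 1, 1), date(2012, 6, 1), False),  # kept
    Patient("B", date(2010, 1, 1), date(2010, 1, 10), False),  # 9 days: excluded
    Patient("C", date(2010, 1, 1), date(2015, 1, 1), True),  # recurrent: excluded
]
cohort = [p for p in patients if is_eligible(p)]
print([p.patient_id for p in cohort])
```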

Performance of the Algorithm

Compared with the results of the manual review, the algorithm had an area under the receiver operating characteristic curve (AUC) of 0.93, indicating excellent discrimination between patients with and without lung cancer.
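One way to read an AUC is as the probability that a randomly chosen patient with the disease receives a higher algorithm score than a randomly chosen patient without it. The short sketch below computes AUC directly from that definition. The labels and scores are made up for illustration and are not study data.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: 1 = lung cancer confirmed on chart review, 0 = not.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
print(auc(labels, scores))  # 5 of 6 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so 0.93 means the algorithm ranks a true case above a non-case in about 93% of such pairs.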

When the specificity of the algorithm was set to 90%, it achieved a sensitivity of 75% and a positive predictive value of 94%.
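"Setting the specificity" means choosing the score cutoff above which a patient is classified as having lung cancer. The sketch below shows one straightforward way to do that on labeled data: pick the lowest cutoff that reaches the target specificity, then read off sensitivity and positive predictive value at that cutoff. The scores are invented, and the study's exact procedure is not described here, so treat this as an illustration of the concept only.

```python
def threshold_for_specificity(labels, scores, target=0.90):
    """Lowest cutoff t (predict positive when score > t) with specificity >= target."""
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not neg:
        raise ValueError("need at least one negative label")
    for t in sorted(set(scores)):
        true_negatives = sum(1 for s in neg if s <= t)
        if true_negatives / len(neg) >= target:
            return t
    return max(scores)  # not reached: the highest score always gives specificity 1.0


def metrics_at(labels, scores, t):
    """Sensitivity, specificity and PPV when predicting positive for score > t."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > t)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s <= t)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > t)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s <= t)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, specificity, ppv


# Invented validation set: 10 patients without and 10 with lung cancer.
labels = [0] * 10 + [1] * 10
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.70,
          0.30, 0.42, 0.50, 0.55, 0.60, 0.65, 0.75, 0.80, 0.90, 0.95]

cutoff = threshold_for_specificity(labels, scores, target=0.90)
sensitivity, specificity, ppv = metrics_at(labels, scores, cutoff)
print(cutoff, sensitivity, specificity, round(ppv, 3))
```

Raising the cutoff trades sensitivity for specificity. Positive predictive value also depends on how common lung cancer is in the group being screened, which is why it can be high when most patients carrying a diagnosis code truly have the disease.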

The completeness and accuracy of the data the algorithm extracted from the EHR system compared well with a dataset on 6,225 patients from the Boston Lung Cancer Study, an epidemiologic study Mass General has participated in since 1992.

Prognostic Model

The researchers also used EHR-derived data to develop a model for predicting survival in non–small cell lung cancer. They applied the model to 11,724 patients with NSCLC, ages 18 to 90, who were diagnosed between January 2000 and January 2015. Data were analyzed from March 2019 to July 2020.

The AUC of the model for overall survival was:

  • 0.82 at 1 year
  • 0.81 at 3 years
  • 0.81 at 5 years (primary endpoint)
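For a survival model, an AUC "at 5 years" asks how well the predicted risk separates patients who died within 5 years from those still alive after 5 years. The sketch below uses a deliberately naive version of that idea: patients whose follow-up ended before the horizon without a recorded death are simply dropped. Published analyses typically handle such censored patients with more careful weighting, and the method used in the study is not described here, so this is an assumption-laden illustration with invented data.

```python
def auc_at_horizon(times, died, risk, horizon):
    """Naive time-specific AUC for death by `horizon` (same units as `times`).

    Cases: died at or before the horizon.
    Controls: still under follow-up after the horizon.
    Patients censored before the horizon are excluded (a simplification).
    """
    cases = [r for t, d, r in zip(times, died, risk) if d and t <= horizon]
    controls = [r for t, d, r in zip(times, died, risk) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Invented data: follow-up in years, death indicator, predicted risk score.
times = [0.5, 2.0, 4.0, 6.0, 7.0, 0.8]
died = [True, True, True, False, True, False]
risk = [0.9, 0.7, 0.3, 0.2, 0.4, 0.5]

auc_1y = auc_at_horizon(times, died, risk, horizon=1)
auc_5y = auc_at_horizon(times, died, risk, horizon=5)
print(auc_1y, auc_5y)
```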

A New Direction for Research

Registries such as those of the Surveillance, Epidemiology and End Results Program can be used to conduct large-scale cohort studies in cancer. However, this EHR cohort offers comprehensive information not typically found in registries, including detailed treatment data collected during routine clinical care, genetic and molecular profiling, and laboratory test results.

This cohort may therefore be a powerful platform for clinical outcome studies. For example, it might reveal associations between the use of specific drugs and survival outcomes.

