- Researchers at Massachusetts General Hospital developed a machine learning algorithm that could assemble a large cohort of patients with lung cancer for longitudinal research
- The algorithm incorporated structured electronic health record (EHR) data (e.g., demographics, procedure codes, medications) and relevant terms from clinical notes, extracted using natural language processing tools
- The algorithm identified a cohort of 42,069 lung cancer patients and demonstrated an area under the receiver operating characteristic curve (AUC) of 0.93 and a positive predictive value of 94%
- The researchers also developed a prognostic model based on the cohort; when applied to 11,724 patients with non–small cell lung cancer, the AUC for five-year overall survival was 0.81
- Because it offers real-world treatment data and information not typically found in registries, this EHR cohort may be a powerful platform for clinical outcome studies
The growing availability of electronic health record (EHR) data presents a low-cost alternative to conducting traditional cohort studies in cancer research. However, clinical features needed to select a cohort, such as stage, biomarker profile or performance status, may not be available in structured EHR data.
Public health researchers used machine learning and natural language processing to build a lung cancer EHR cohort suitable for studying lung cancer progression. They also developed a prognostic model that performed well in estimating the survival of patients with non–small cell lung cancer (NSCLC).
Qianyu Yuan, PhD, of the Department of Environmental Health at Harvard T.H. Chan School of Public Health, David C. Christiani, MD, MPH, physician-investigator in the Pulmonary and Critical Care Medicine Division at Massachusetts General Hospital, and colleagues report their methodology and results in JAMA Network Open.
Developing a Classification Algorithm
First, the researchers identified 76,643 patients from the Mass General Brigham EHR system who had at least one lung cancer diagnosis code assigned between July 1988 and October 2018. Two reviewers independently examined the records of 200 of those patients, selected at random, to determine their lung cancer status (present, not present or indeterminate) and create a small set of criterion-labeled data to serve as the reference standard.
Next, the team gathered:
- Structured EHR data (e.g., demographics, procedure codes, medications)
- Relevant terms from narrative notes (e.g., stage, histologic type, somatic variation), extracted using natural language processing tools
They used that information to develop a machine learning classification algorithm, a form of artificial intelligence in which the computer learns patterns from example data rather than following explicitly programmed rules.
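As a rough illustration of how structured fields and note-derived terms might be combined into a single input for a classifier, here is a minimal Python sketch. The term list, field names and function names are hypothetical and are not taken from the study. Real clinical NLP tools also handle negation, abbreviations and misspellings, which simple phrase matching does not.

```python
import re

# Hypothetical vocabulary; the article does not list the terms actually used.
NOTE_TERMS = ["adenocarcinoma", "squamous cell", "stage iv", "egfr"]


def note_term_counts(note_text, terms=NOTE_TERMS):
    """Count case-insensitive, whole-phrase mentions of each term in a note."""
    text = note_text.lower()
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text))
    return counts


def build_feature_vector(structured, note_text, terms=NOTE_TERMS):
    """Merge structured EHR counts with note-derived term counts."""
    features = {f"structured:{name}": float(value) for name, value in structured.items()}
    for term, n in note_term_counts(note_text, terms).items():
        features[f"note:{term}"] = float(n)
    return features


# Made-up patient record, for illustration only.
patient = {"lung_cancer_dx_codes": 5, "chemo_orders": 2}
note = "Biopsy confirms adenocarcinoma, stage IV. EGFR mutation detected."
print(build_feature_vector(patient, note))
```

A vector like this, one per patient, is the kind of input a classification algorithm could be trained on using the manually labeled records.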
When applied to the initial dataset, the algorithm identified 42,069 patients as having lung cancer. After excluding individuals with recurrent or secondary lung cancer or fewer than 14 days of follow-up after diagnosis, the final cohort remained large (n=35,735).
Performance of the Algorithm
Compared with the results of the manual review, the algorithm had an area under the receiver operating characteristic curve (AUC) of 0.93, indicating an excellent ability to distinguish patients with lung cancer from those without it.
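For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen patient with lung cancer receives a higher algorithm score than a randomly chosen patient without it. The sketch below computes it that way on made-up labels and scores; none of the numbers come from the study, and the function name is illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up reference labels (1 = lung cancer) and algorithm scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(roc_auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.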
When the specificity of the algorithm was set to 90%, it achieved a sensitivity of 75% and a positive predictive value of 94%.
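Fixing the specificity means choosing a score cutoff first and then reading off the other metrics at that cutoff. Here is a minimal sketch, assuming each reviewed patient has a reference label (1 = lung cancer, 0 = no lung cancer) and an algorithm score. The data and function names are illustrative only and are not the study's.

```python
def metrics_at_threshold(labels, scores, threshold):
    """Sensitivity, specificity and PPV when score >= threshold counts as positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if y == 1 and flagged:
            tp += 1
        elif y == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # PPV is undefined when no patient is flagged.
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
    }


def threshold_for_specificity(labels, scores, target):
    """Lowest observed cutoff whose specificity is at least `target`.

    Specificity never decreases as the cutoff rises, so the lowest qualifying
    cutoff is also the one that keeps sensitivity as high as possible.
    """
    for t in sorted(set(scores)):
        if metrics_at_threshold(labels, scores, t)["specificity"] >= target:
            return t
    return float("inf")  # no observed cutoff qualifies: flag no one


# Made-up labels and scores, for illustration only.
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.85, 0.7, 0.65, 0.6, 0.5, 0.4, 0.3, 0.2]
cutoff = threshold_for_specificity(labels, scores, target=0.8)
print(cutoff, metrics_at_threshold(labels, scores, cutoff))
```

Unlike sensitivity and specificity, positive predictive value also depends on how common lung cancer is among the patients being evaluated, so the same cutoff can yield a different value in a different population.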
The completeness and accuracy of the data the algorithm extracted from the EHR system compared well with a dataset on 6,225 patients from the Boston Lung Cancer Study, an epidemiologic study Mass General has participated in since 1992.
The researchers also used EHR-derived data to develop a model for predicting NSCLC survival. They applied the model to 11,724 patients with NSCLC, ages 18 to 90, who were diagnosed between January 2000 and January 2015. Data were analyzed from March 2019 to July 2020.
The AUC of the model for overall survival was:
- 0.82 at 1 year
- 0.81 at 3 years
- 0.81 at 5 years (primary endpoint)
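The article does not describe how these time-specific AUCs were computed. One simplified way to think about an AUC at a fixed horizon is to ask how well a risk score separates patients who died by that time from patients known to be alive after it. The sketch below does this naively: it drops patients who were censored before the horizon, whereas more rigorous estimators reweight for censoring instead. All names and numbers are made up for illustration.

```python
def auc_at_horizon(times, events, risk_scores, horizon):
    """Naive AUC for death by `horizon` (same time units as `times`).

    events: 1 = death observed at `time`, 0 = censored at `time`.
    Cases are patients who died at or before the horizon; controls are
    patients followed beyond it. Patients censored at or before the horizon
    are skipped because their status at the horizon is unknown.
    """
    cases, controls = [], []
    for t, e, r in zip(times, events, risk_scores):
        if e == 1 and t <= horizon:
            cases.append(r)
        elif t > horizon:
            controls.append(r)
    if not cases or not controls:
        raise ValueError("need at least one case and one control at this horizon")
    correct = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                correct += 1.0  # higher predicted risk for the patient who died
            elif c == k:
                correct += 0.5
    return correct / (len(cases) * len(controls))


# Made-up follow-up times in years, event indicators and predicted risks.
times = [0.5, 2, 4, 6, 7, 1]
events = [1, 1, 1, 0, 1, 0]
risks = [0.9, 0.7, 0.3, 0.5, 0.1, 0.8]
for horizon in (1, 3, 5):
    print(horizon, round(auc_at_horizon(times, events, risks, horizon), 3))
```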
A New Direction for Research
Registries such as those of the Surveillance, Epidemiology and End Results Program can be used to conduct large-scale cohort studies in cancer. However, this EHR cohort offers comprehensive information not typically found in registries, including detailed treatment data collected during routine clinical care, genetic and molecular profiling, and laboratory test results.
This cohort may therefore be a powerful platform for clinical outcome studies. For example, it might reveal associations between the use of specific drugs and survival outcomes.