
Artificial Intelligence Model for Labeling Chest X-Rays Allows Balance of Accuracy, Efficiency

Key findings

  • Massachusetts General Hospital researchers have developed an artificial intelligence model that automatically labels chest X-rays as positive or negative for cardiomegaly, pleural effusion, pulmonary edema, pneumonia, and atelectasis
  • For each label, the user pre-selects the desired level of confidence as a value between 0 and 1 (the higher the value, the more accurate the labeling of the images, but the lower the capture rate)
  • When the model was applied to three large open-source datasets of chest X-rays, the accuracy of automatic labeling equaled or exceeded that of human experts
  • Using only 100 automatically labeled exams selected from the three large independent datasets, the researchers fine-tuned the model and found its performance was preserved or improved
  • The new system seems capable of providing highly accurate, fully automated labeling regardless of the size of the open-source database being studied

To implement artificial intelligence (AI) in radiology, it will be necessary to apply diagnostic labels to very large sets of images accurately and efficiently. These labels serve as the "ground truth" for training clinically relevant AI models.

Researchers at Massachusetts General Hospital have developed and demonstrated a new method of automatically labeling large imaging databases that allows the user to select the desired tradeoff between accuracy and efficiency. They believe this approach will allow retraining of existing AI models to improve their accuracy and also help standardize the labeling of open-source datasets.

Doyun Kim, PhD, research fellow in the Laboratory of Medical Imaging and Computation, Joowon Chung, MD, visiting scholar there, Synho Do, PhD, director of the laboratory, and colleagues describe their method in Nature Communications.

Expanding an AI Model

In previous work, published in Nature Biomedical Engineering, Mass General researchers developed and validated an "explainable" AI model for detecting acute intracranial hemorrhage. Explainable means the model provides feedback to confirm it's performing at a predetermined level of accuracy.

In the new research, the team adapted the existing model to detect five chest X-ray labels—cardiomegaly, pleural effusion, pulmonary edema, pneumonia, and atelectasis—based on their similarity to the "library" of images in the previous model.

For each clinical label, the user can specify a quantitative threshold for a desired level of confidence, which the team calls the probability-of-similarity (pSim) metric. pSim values range from 0 to 1. The higher the value, the more accurate the labeling of the images, but the lower the capture rate (fewer cases will be identified from an external database as being similar to the model-labeled cases).
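
The tradeoff described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Case` type, the helper functions, and the toy scores are all hypothetical, and each case is simply assumed to carry a pSim score plus a flag saying whether its automatic label matches a reference label.

```python
from dataclasses import dataclass


@dataclass
class Case:
    psim: float    # similarity score in [0, 1] (made-up values below)
    correct: bool  # whether the automatic label matches a reference label


def captured(cases, threshold):
    """Return the cases whose pSim meets the user-selected threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie between 0 and 1")
    return [c for c in cases if c.psim >= threshold]


def capture_rate(cases, threshold):
    """Fraction of all cases that receive an automatic label."""
    return len(captured(cases, threshold)) / len(cases)


def accuracy(cases, threshold):
    """Fraction of captured cases labeled correctly (None if none captured)."""
    kept = captured(cases, threshold)
    if not kept:
        return None
    return sum(c.correct for c in kept) / len(kept)


# Toy data, invented purely for illustration.
TOY_CASES = [
    Case(0.95, True), Case(0.90, True), Case(0.80, True),
    Case(0.60, False), Case(0.40, True), Case(0.20, False),
]

for t in (0.0, 0.5, 0.7):
    print(t, capture_rate(TOY_CASES, t), accuracy(TOY_CASES, t))
```

On this toy data, raising the threshold from 0.5 to 0.7 lifts accuracy from 0.75 to 1.0 while the capture rate falls to 0.5, which is the pattern the pSim description implies.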

Training and Testing the Model

The researchers developed the new model using 151,700 anteroposterior chest X-ray views and 90,023 posteroanterior views obtained at Mass General between February 2015 and February 2019. For each view position, 1,000 images were selected as a test set, and pathological labels were determined by the consensus of three radiologists.

The remaining X-rays were separated into training and validation sets. Their labels were determined by automated natural language processing of corresponding radiology reports.
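
As a quick check on the numbers above, the short sketch below works out how many images per view remain for training and validation once the test sets are held out. The variable and function names are illustrative only, and the article does not state how the remainder was divided between training and validation, so no such split is assumed here.

```python
# Image counts per view position, as reported in the article.
VIEW_COUNTS = {"anteroposterior": 151_700, "posteroanterior": 90_023}
TEST_IMAGES_PER_VIEW = 1_000


def remaining_after_test(view_counts, test_per_view):
    """Images left for training plus validation after holding out the test set."""
    return {view: n - test_per_view for view, n in view_counts.items()}


remaining = remaining_after_test(VIEW_COUNTS, TEST_IMAGES_PER_VIEW)
for view, n in remaining.items():
    print(f"{view}: {n:,} images for training and validation")
print(f"total across views: {sum(remaining.values()):,}")
```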

The model was applied to posteroanterior chest X-ray images from three large publicly accessible datasets (CheXpert, MIMIC, and the National Institutes of Health; 167,953 images in total). The goal was for the system to access its "memory" of labels present in the training set and estimate their presence in the open-source data.


There was a strong correlation between the clinical output labels and:

  1. The percentage of positively auto-labeled X-rays from the three pooled public datasets (the capture rate)
  2. The percentage of cases with complete agreement between the model and all seven expert readers
  3. The lowest pSim value for labeling such that all positive cases captured were true positive
  4. The lowest pSim value for labeling such that all negative cases captured were true negative
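
Items 3 and 4 in the list above describe a threshold search that can be sketched as follows. This is only one way to compute such a value, not necessarily how the authors did it: `lowest_pure_threshold` is a hypothetical helper, and each case is assumed to be a `(psim, is_correct)` pair judged against expert reading.

```python
def lowest_pure_threshold(scored):
    """Lowest observed pSim t such that every case with psim >= t is correct.

    `scored` is a list of (psim, is_correct) pairs. Returns None when no
    threshold taken from the observed scores gives an all-correct capture.
    """
    wrong = [p for p, ok in scored if not ok]
    # Highest pSim at which the automatic label was wrong, if any.
    ceiling = max(wrong) if wrong else float("-inf")
    # Every case strictly above that ceiling is correct by construction.
    above = [p for p, _ in scored if p > ceiling]
    return min(above) if above else None


# Toy scores, invented for illustration.
toy = [(0.95, True), (0.80, True), (0.70, True), (0.60, False), (0.30, True)]
print(lowest_pure_threshold(toy))
```

The same function covers both items: feed it only the cases auto-labeled positive to get the value for item 3, or only those auto-labeled negative for item 4.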

Retraining the Model

The researchers fine-tuned the model using automatically labeled exams for retraining. To assess performance, seven expert radiologists reviewed 100 randomly selected cases: 10 cases the model had judged "positive" for each of the five labels and 10 cases it had judged "negative" for each label. These 100 cases were distributed equally among ten pSim value ranges (0–0.1, 0.1–0.2, …, 0.9–1.0).
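
The composition of the 100 review cases can be checked with a small sketch. It rests on an assumption the article does not confirm: that each label and judgment pair contributes exactly one case per pSim range. The half-open binning in `psim_bin` is likewise an illustrative choice, since the stated ranges share their endpoints.

```python
from collections import Counter
from itertools import product

LABELS = ["cardiomegaly", "pleural effusion", "pulmonary edema",
          "pneumonia", "atelectasis"]
JUDGMENTS = ["positive", "negative"]
N_BINS = 10  # pSim ranges 0-0.1, 0.1-0.2, ..., 0.9-1.0


def psim_bin(psim, n_bins=N_BINS):
    """Index of the pSim range containing `psim` (1.0 goes in the last range)."""
    if not 0.0 <= psim <= 1.0:
        raise ValueError("pSim must lie between 0 and 1")
    return min(int(psim * n_bins), n_bins - 1)


# Assumed design: one review case per (label, judgment, pSim range).
review_slots = list(product(LABELS, JUDGMENTS, range(N_BINS)))

per_bin = Counter(bin_index for _, _, bin_index in review_slots)
print(len(review_slots), "cases;", per_bin[0], "per pSim range")
```

Under that assumption the counts are consistent with the article: 5 labels, 2 judgments, and 10 ranges give 100 cases, 10 in each range.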

Performance was preserved or improved, resulting in a highly accurate, more generalized model. This suggests the new method may be able to provide highly accurate, fully automated labeling regardless of the size of the open-source database being studied—while demonstrating that it is possible to develop AI that grows smarter through retraining.

The authors of the study also note that the technology can be used to develop AI that is easily applied to hospitals or countries with different data characteristics.
