Predicted Survival After Radical Prostatectomy Varied by Race With a Trained Machine Classifier

Key findings

  • This study evaluated the performance of a machine learning algorithm trained to predict survival after radical prostatectomy in a cohort of 68,630 patients with localized disease (84% white, 12% African American, and 4% other races)
  • Training of the algorithm was conducted in five samples: 80% of the full cohort; 80% of the white, African American, and non-white, non–African American (NWNAA) subgroups; and a synthetically race-balanced sample
  • The algorithm's ability to predict survival was best in African American patients (even though this subgroup represented a minority of the training sample), was relatively lower in white patients, and was lowest in NWNAA patients
  • Performance of the algorithm within race subgroups was comparable whether it was trained in a naturally race-imbalanced, race-specific or synthetically race-balanced sample
  • To avoid potential disparities in care, the performance of machine learning algorithms should be evaluated in race subgroups prior to clinical deployment

Assessing life expectancy is a key step in counseling patients with localized prostate cancer because randomized trials have demonstrated a lack of survival benefit for treatment in older patients with comorbidities. A machine classifier that could predict survival accurately would have important applications to shared decision-making.

Using a large U.S. cancer database, researchers at Massachusetts General Hospital developed a machine-learning algorithm to predict survival after radical prostatectomy, training it in naturally race-imbalanced, race-specific, and synthetically race-balanced samples.

Even so, the algorithm's performance varied by race subgroup, as Madhur Nayan, MDCM, PhD, a clinical fellow in surgery in the Department of Urology, Quoc-Dien Trinh, MD, of Brigham and Women's Hospital, and colleagues report in The Prostate.


From the 2016 National Cancer Database, the researchers identified 68,630 patients diagnosed with invasive prostate adenocarcinoma and managed with radical prostatectomy for pT1–4, N0/X, M0/X disease. Of these patients, 84% were white, 12% were African American, and 4% were non-white and non–African American (NWNAA).

The team trained their classifier in 80% of each of five different samples:

  • White
  • African American
  • NWNAA
  • Naturally race-imbalanced (full cohort)
  • Synthetically race-balanced (the NWNAA subgroup plus randomly selected subsamples of white and African American patients)

The remaining 20% of each sample served as the test set.
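The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' code: the function names, the record layout, and the toy cohort are hypothetical, and the sketch assumes the white and African American subsamples were each drawn to match the size of the NWNAA subgroup, which the summary does not specify.

```python
import random
from collections import Counter


def balanced_sample(records, anchor_group, group_key="race", seed=0):
    """Keep the anchor group whole and draw an equally sized random
    subsample from every other group (equal sizing is an assumption)."""
    rng = random.Random(seed)
    by_group = {}
    for rec in records:
        by_group.setdefault(rec[group_key], []).append(rec)
    n = len(by_group[anchor_group])
    sample = []
    for group, members in by_group.items():
        # Anchor group is kept in full; other groups are subsampled to size n.
        sample.extend(members if group == anchor_group else rng.sample(members, n))
    return sample


def split_train_test(records, train_fraction=0.8, seed=0):
    """Shuffle and split into training and test sets (not stratified)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy cohort of 1,000 records with the study's 84/12/4 proportions (not real data).
cohort = (
    [{"id": i, "race": "white"} for i in range(840)]
    + [{"id": i, "race": "African American"} for i in range(840, 960)]
    + [{"id": i, "race": "NWNAA"} for i in range(960, 1000)]
)

balanced = balanced_sample(cohort, anchor_group="NWNAA")
train_set, test_set = split_train_test(balanced)

print(Counter(r["race"] for r in balanced))
print(len(train_set), len(test_set))
```

The same `split_train_test` helper could be applied to the full cohort or to a single race subgroup to obtain the other four samples.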

Within five years, 5,068 patients died (10% of African American patients, 8% of white patients, and 5% of NWNAA patients).

Primary Performance Metric

The primary performance metric was the F1 score, which is the harmonic mean of the positive predictive value, also known as precision, and sensitivity, also known as recall. An F1 score ranges from 0 to 1, with 1 indicating perfect precision and recall.
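As a brief illustration of that definition, the harmonic mean can be computed as below. The function name and the input values are hypothetical and chosen only to show the arithmetic; they are not results from the study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision (PPV) and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0  # convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical inputs, not study results.
even = f1_score(0.5, 0.5)      # precision and recall agree
skewed = f1_score(1.0, 0.25)   # perfect precision, poor recall

print(round(even, 3), round(skewed, 3))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a classifier with perfect precision but poor recall still scores low: the second example gives about 0.4, whereas a simple average of 1.0 and 0.25 would be 0.625.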

Performance within each race subgroup was relatively similar regardless of how the classifier was trained:

  • African American test set—The F1 score was 0.5 when the classifier was trained on the race-imbalanced sample, 0.6 when trained on the race-specific (African American) sample, and 0.5 when trained on the race-balanced sample
  • White test set—0.5, 0.5 and 0.5
  • NWNAA test set—0.4, 0.4 and 0.4
  • Race-imbalanced test set—0.5 when trained in the race-imbalanced sample, and 0.5 when trained in the race-balanced sample

Secondary Metrics

  • Sensitivity—generally low and particularly low in the NWNAA test set; highest in the African American set
  • Specificity—generally high and relatively similar across race subgroups
  • Positive predictive value—relatively similar across race subgroups, except low in the NWNAA subgroup when the classifier was trained in the race-imbalanced set
  • Negative predictive value—highest in the NWNAA test set and lowest in the African American set
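All four secondary metrics come from the same two-by-two table of predicted versus observed outcomes. The sketch below shows how they relate, assuming (the summary does not state this explicitly) that the positive class is death within five years, coded as 1. The function name and the toy labels are hypothetical and are not study data.

```python
def subgroup_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # positives correctly flagged
        "specificity": ratio(tn, tn + fp),  # negatives correctly cleared
        "ppv": ratio(tp, tp + fp),          # flagged cases that are truly positive
        "npv": ratio(tn, tn + fn),          # cleared cases that are truly negative
    }


# Toy labels: 5 positives (2 caught, 3 missed), 15 negatives (1 false alarm).
y_true = [1] * 5 + [0] * 15
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 14

metrics = subgroup_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a rare positive class, as in this toy example, specificity and negative predictive value can stay high even when sensitivity is low, which matches the general pattern reported above.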

Preventing Disparities in Care

This study demonstrates the importance of evaluating a machine learning algorithm's performance in patient subgroups before clinical deployment. The use of this classifier would have resulted in overestimating the five-year survival for a larger proportion of NWNAA patients than African American or white patients.
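A subgroup check of this kind can be quite small in practice. The following sketch computes the F1 score separately for each subgroup in a test set; the function name, group labels, and rows are hypothetical, and the positive class is again assumed to be the event being predicted.

```python
from collections import defaultdict


def f1_by_subgroup(rows):
    """rows: iterable of (subgroup, true_label, predicted_label), labels 0 or 1.
    Returns a dict mapping each subgroup to its F1 score."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for group, truth, pred in rows:
        if truth == 1 and pred == 1:
            counts[group]["tp"] += 1
        elif truth == 0 and pred == 1:
            counts[group]["fp"] += 1
        elif truth == 1 and pred == 0:
            counts[group]["fn"] += 1
        else:
            counts[group]  # true negative: register the group, no counter changes

    scores = {}
    for group, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic-mean form.
        scores[group] = 2 * c["tp"] / denom if denom else 0.0
    return scores


# Hypothetical test-set rows, not study data.
rows = [
    ("group_a", 1, 1), ("group_a", 1, 0), ("group_a", 0, 0), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 1, 0), ("group_b", 1, 0), ("group_b", 1, 1),
    ("group_b", 0, 0),
]

scores = f1_by_subgroup(rows)
print(scores)
```

Reporting one score per subgroup, rather than a single pooled score, is what makes a gap between groups visible before deployment.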

The discrepancy was presumably due in part to the low frequency of NWNAA patients in the National Cancer Database. In this study, the largest race subgroup within the NWNAA category was Filipino patients (n=178), who made up 0.26% of the full cohort.

The performance variation of an algorithm in race subgroups may also relate to the features included, and further research is needed to identify potential race-specific features.
