
Using AI and Provider Notes to Study Anti-Neutrophil Cytoplasmic Antibody–Associated Vasculitis

Key findings

  • Massachusetts General Hospital researchers used topic modeling, a form of automated natural language processing, to review 113,048 clinical notes on 660 patients with anti-neutrophil cytoplasmic antibody–associated vasculitis (AAV)
  • Without requiring the researchers to pre-label the notes or supervise the machine learning, the modeling identified 90 clusters of related words (topics); the team then manually labeled each topic and selected the top 25 words in each cluster
  • Trends in the topics identified strongly correlated with trends identified using corresponding structured data from electronic medical records, highlighting delays in diagnosis and making it possible to pinpoint the time of treatment initiation
  • This study provides proof of concept that automated topic modeling is feasible for using provider notes to study the natural history of AAV

A better understanding of the clinical course of anti-neutrophil cytoplasmic antibody–associated vasculitis (AAV) might allow physicians to identify opportunities for interventions that would reduce morbidity and mortality. Care provider notes are a rich source of information, but reviewing narrative notes manually is extremely time-consuming.

Researchers recently demonstrated that topic modeling, a text-mining tool and type of natural language processing, can discover clinically relevant themes across longitudinal clinical notes on patients with AAV. Liqun Wang, PhD, research fellow at Brigham and Women's Hospital, and physician Zachary S. Wallace, MD, MSc, of the Division of Rheumatology, Allergy and Immunology at Massachusetts General Hospital, and colleagues explain their findings in Seminars in Arthritis and Rheumatism.

Study Methods

Topic modeling applies statistical machine learning methods to discover the abstract topics that occur in a collection of documents. The learning is unsupervised and does not require any prior labeling of the documents.
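As a rough illustration of what "unsupervised" means here, the sketch below implements latent Dirichlet allocation (LDA), one common topic-modeling method, using collapsed Gibbs sampling in plain Python. The article does not say which method or settings the researchers used, so the function name `fit_lda`, the hyperparameters `alpha` and `beta`, and the toy notes are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not the study's code).

    docs is a list of token lists. No labels are supplied: topics emerge
    only from which words tend to appear in the same documents.
    Returns (topic_word, doc_topic) count tables.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per doc
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            topic_word[k][word] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                k = assignments[d][i]
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                topic_word[k][word] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return topic_word, doc_topic


# Hypothetical, already-tokenized toy notes.
toy_notes = [
    ["renal", "dialysis", "creatinine", "renal"],
    ["cough", "lung", "nodule", "cough"],
    ["renal", "creatinine", "dialysis"],
    ["lung", "nodule", "cough"],
]
topic_word, doc_topic = fit_lda(toy_notes, num_topics=2)
```

The returned `topic_word` table gives, for each topic, how many times each word was assigned to it; the most frequent words in a topic are what a human reviewer would read to decide on a label.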

From the Mass General Brigham Research Patient Data Registry, the researchers obtained 113,048 inpatient and outpatient clinical notes, added between March 1, 1990, and August 23, 2018, for 660 patients with AAV.

Identifying Topics

After some preprocessing (e.g., removing commonly used words such as "of"), the team fed the notes into an open-source Java-based package for topic modeling. It generated three topic models, each with 100 topics. The researchers then applied an algorithm that identified 90 topics that occurred stably over all three models.
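The two steps in this paragraph, stop-word removal and keeping only topics that recur across models, might look roughly like the sketch below. The article does not describe the stability algorithm, so matching topics by Jaccard overlap of their top words, the 0.5 threshold, the abbreviated stop-word list and the names `preprocess` and `stable_topics` are all assumptions, not the researchers' method.

```python
import re

# Illustrative subset only; a real stop-word list would be much longer.
STOP_WORDS = {"of", "the", "and", "a", "in", "to", "with", "was", "is", "for"}


def preprocess(note):
    """Lowercase a note, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", note.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def jaccard(words_a, words_b):
    """Overlap between two word lists: |intersection| / |union|."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def stable_topics(models, threshold=0.5):
    """Keep topics from the first model that reappear in every other model.

    models is a list of topic models; each model is a list of topics, and
    each topic is a list of its top words. A topic counts as reappearing
    in a model if some topic there has Jaccard overlap >= threshold.
    """
    reference, others = models[0], models[1:]
    return [
        topic
        for topic in reference
        if all(
            any(jaccard(topic, candidate) >= threshold for candidate in other)
            for other in others
        )
    ]


# Hypothetical top words from three independently fitted models.
model_1 = [["kidney", "renal", "dialysis"], ["rash", "skin", "lesion"]]
model_2 = [["renal", "kidney", "dialysis"], ["cough", "lung", "nodule"]]
model_3 = [["dialysis", "renal", "kidney"], ["skin", "rash", "biopsy"]]
kept = stable_topics([model_1, model_2, model_3])
```

In this toy input only the renal topic survives, because the skin topic has no counterpart in the second model. Filtering this way guards against topics that are artifacts of a single random initialization.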

Manually, team members labeled the topic clusters (e.g., granulomatosis with polyangiitis, Churg–Strauss syndrome, AAV-specific treatment, general treatment, skin lesion, neurology findings, pulmonary findings) and selected the top 25 words in each cluster. Some topic clusters received the same labels.
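One plausible way to obtain the top 25 words of a cluster is to rank its words by their weight in the topic and keep the first 25. The sketch below assumes each topic is available as a word-to-weight mapping; that representation, the name `top_words` and the example weights are hypothetical, and the article does not say how the words were ranked.

```python
def top_words(word_weights, n=25):
    """Return the n highest-weighted words, breaking ties alphabetically."""
    ranked = sorted(word_weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Hypothetical weights for a cluster a reviewer might label "AAV-specific treatment".
example_topic = {
    "rituximab": 0.30,
    "cyclophosphamide": 0.25,
    "infusion": 0.10,
    "dose": 0.10,
}
shortlist = top_words(example_topic, n=3)
```

A reviewer would then read a shortlist like this to choose a label for the cluster.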

Temporal Trends

Trends in the topics identified from clinical notes strongly correlated with trends identified using corresponding structured data from electronic medical records. For example:

  • Reflecting the reality that diagnosis of AAV is often delayed, signs and symptoms related to AAV and non-specific treatments (e.g., glucocorticoids) were mentioned in the months preceding AAV-specific treatment initiation
  • The proportion of notes referencing the diagnosis of AAV began to increase approximately one month before initiation of AAV treatment
  • References to specific AAV treatments became frequent, and remained frequent, after the date of treatment initiation as determined by manual chart review
  • The frequency of references to end-stage renal disease and renal transplantation increased after AAV treatment initiation
  • One surprise was that psychiatric symptoms were mentioned months before treatment initiation
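Trends like those listed above can be computed by aligning every note to its patient's treatment initiation date and then taking, for each month offset, the share of notes that reference a given topic. The sketch below assumes each note has already been tagged with a set of topic labels; the data layout, function names and example records are hypothetical rather than taken from the study.

```python
from collections import defaultdict
from datetime import date


def months_from_initiation(note_date, initiation_date):
    """Whole calendar months between a note and treatment initiation.

    Negative values mean the note was written before initiation.
    """
    return (note_date.year - initiation_date.year) * 12 + (
        note_date.month - initiation_date.month
    )


def topic_proportion_by_month(notes, initiation_dates, topic):
    """Share of notes mentioning topic, keyed by month offset.

    notes is an iterable of (patient_id, note_date, topic_labels) tuples;
    initiation_dates maps patient_id to that patient's initiation date.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for patient_id, note_date, labels in notes:
        offset = months_from_initiation(note_date, initiation_dates[patient_id])
        totals[offset] += 1
        if topic in labels:
            hits[offset] += 1
    return {offset: hits[offset] / totals[offset] for offset in sorted(totals)}


# Hypothetical records for two patients.
initiation = {"p1": date(2015, 6, 10), "p2": date(2016, 1, 5)}
notes = [
    ("p1", date(2015, 5, 20), {"AAV diagnosis"}),
    ("p1", date(2015, 6, 15), {"AAV-specific treatment"}),
    ("p2", date(2015, 12, 1), set()),
    ("p2", date(2016, 1, 20), {"AAV-specific treatment"}),
]
treatment_trend = topic_proportion_by_month(notes, initiation, "AAV-specific treatment")
```

Plotting such a proportion against the month offset, next to the same curve derived from structured data, is one way the two sources could be compared.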

New Support for Research

Structured data fields (e.g., laboratory data, medications and problem lists) are unlikely to reflect the full extent of signs, symptoms, comorbidities and complications associated with AAV. In addition, the patient's providers may be spread out across several health care systems. Claims data are equally unsatisfactory because a patient's insurance coverage can change year to year.

This study provides proof of concept that automated topic modeling is feasible for using provider notes to study the natural history of AAV. Topic modeling might also be feasible for research into other multi-organ rheumatic conditions such as systemic sclerosis and systemic lupus erythematosus.
