
Accuracy of Short-read Sequencing Benchmarked Across M. tuberculosis Genome

Key findings

  • In this study, long-read sequencing data from 36 phylogenetically diverse Mycobacterium tuberculosis (Mtb) isolates were used to benchmark the accuracy of short-read variant detection
  • Illumina whole-genome sequencing (WGS) had high precision but limited recall in repetitive and structurally variable regions of the Mtb genome
  • Filtering variants based on variant quality annotations, such as mean mapping quality, allowed for a greater range of precision and recall than the masking of specific low-confidence regions of the Mtb genome
  • It was possible to analyze a much greater proportion of the Mtb genome with Illumina WGS than previously thought, and the paper provides a revised set of low-confidence regions for future studies
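
The annotation-based filtering described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: the `Variant` record, its `mean_mq` field, and the `filter_by_mq` helper are hypothetical names introduced here, and the example calls are made-up toy data. The default threshold of 40 is only an illustrative starting point and would need tuning per species.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record used only for this illustration."""
    pos: int        # 1-based position on the reference genome
    ref: str        # reference allele
    alt: str        # alternate allele
    mean_mq: float  # mean mapping quality of reads covering the site


def filter_by_mq(variants, min_mq=40.0):
    """Keep only variants whose mean mapping quality is at least min_mq."""
    return [v for v in variants if v.mean_mq >= min_mq]


# Toy calls (invented values, not data from the study)
calls = [
    Variant(1200, "A", "G", 60.0),
    Variant(3400, "C", "T", 12.5),  # poorly mapped, e.g. in a repeat
    Variant(5600, "G", "A", 40.0),
]

kept = filter_by_mq(calls)
print([v.pos for v in kept])
```

Raising `min_mq` trades recall for precision, which is why a single tunable threshold can cover a wider range of operating points than a fixed list of masked regions.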

For the identification of genetic variants—"variant calling"—a common first step is to perform short-read whole-genome sequencing (WGS) using instruments and reagents from Illumina (San Diego, CA), followed by alignment with a reference genome.

These short reads are efficient and cost-effective, but genetic divergence from the reference genome, repetitive sequences, and sequencing bias reduce their performance. Longer reads generated with tools from PacBio (Menlo Park, CA) can assemble complete bacterial genomes and resolve repetitive areas with greater confidence.

To identify the sources of error that arise when short-read data are used for variant detection, researchers at Massachusetts General Hospital benchmarked Illumina WGS against PacBio long-read sequencing data from Mycobacterium tuberculosis (Mtb) isolates. In Bioinformatics, Maha Reda Farhat, MD, MSc, a physician in the Division of Pulmonary and Critical Care Medicine at Mass General, Maximillian Marin, a research assistant in Dr. Farhat's lab, and colleagues present results that have broad implications for clinical and research applications of sequencing.

Background and Methods

The researchers chose Mtb for this investigation because, although its genome is highly clonal and stable, 10.7% of it is regularly excluded from WGS analyses. Certain regions are enriched for repetitive sequences and have been considered too error-prone for short-read variant calling:

  • Proline–glutamate/proline–proline–glutamate (PE/PPE) genes (n=168), which might play important roles in virulence and immune modulation
  • Mobile genetic elements (n=147)
  • Genes homologous to genes elsewhere in the genome (n=69)

The research team sequenced 36 phylogenetically diverse Mtb isolates with both Illumina short reads and PacBio long reads.

Key Results

The performance of reference genome–based Illumina variant calling was evaluated with an F1 score, the harmonic mean of the positive predictive value (also known as precision) and sensitivity (also known as recall):

  • A major limitation of variant calling with Illumina was low variant recall—a maximum of 89% across parameters evaluated
  • On the other hand, the precision of Illumina variant calling was very high, at least 98.5% across the parameters evaluated
  • The approach that maximized variant recall while maintaining high precision was to tune the mapping quality (MQ) filtering threshold (the confidence of the read mapping); the optimal MQ threshold will presumably vary between species, but for Mtb, an MQ threshold ≥40 achieved 85.8% recall and 99.1% precision
  • An alternative strategy was to mask repetitive sequence content; at MQ ≥40, this approach increased precision (99.6%) at the cost of substantially lower recall (70.2%) but would be appropriate for applications where precision needs to be prioritized
  • Of the Mtb genomic regions typically excluded from genomic analysis, 68% were accurately called using Illumina WGS, including 34.5% of PE/PPE genes, which had perfect mappability and near-perfect base-level recall across the gene (≥0.99)
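
To make the trade-off between the two strategies concrete, the F1 score can be computed from a precision and recall pair as 2PR / (P + R). The sketch below applies that formula to the MQ ≥40 percentages quoted above. The function names are introduced here for illustration, and the resulting F1 values are derived from those quoted percentages rather than taken from the study.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Percentages quoted in the text at MQ >= 40, expressed as fractions
f1_mq_filter = f1_score(0.991, 0.858)  # MQ filtering alone
f1_masked = f1_score(0.996, 0.702)     # MQ filtering plus repeat masking

print(f"MQ filter only: {f1_mq_filter:.3f}")  # prints 0.920
print(f"With masking:   {f1_masked:.3f}")     # prints 0.824
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the large drop in recall under masking outweighs the small gain in precision.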

Revised Regions of Low Confidence

Based on these findings, the researchers provide a supplementary file listing the low-confidence regions of the Mtb genome: those that accounted for the largest sources of error and uncertainty in the analysis of Illumina WGS. These regions span 4% of the genome, less than the 10.7% usually excluded from genomic analysis.

The low-confidence regions frequently overlapped with regions with structural variation, low sequence uniqueness, and low sequencing coverage.
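
A set of low-confidence regions like this is typically applied as a mask. The sketch below shows one way to measure how much of a genome a mask covers and to check whether a position falls inside it. It assumes half-open, 0-based `(start, end)` intervals in the style of BED files. The coordinates and the 10,000 bp genome length are toy values chosen for illustration, not the study's regions, and the helper names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def masked_fraction(intervals, genome_length):
    """Fraction of the genome covered by the mask, counting overlaps once."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


def is_masked(pos, intervals):
    """True if the 0-based position lies inside any masked interval."""
    return any(start <= pos < end for start, end in intervals)


# Toy mask on a toy 10,000 bp genome (invented coordinates)
toy_mask = [(100, 300), (250, 400), (9000, 9100)]
print(merge_intervals(toy_mask))
print(masked_fraction(toy_mask, 10_000))
print(is_masked(275, toy_mask), is_masked(5000, toy_mask))
```

Merging before summing matters because region lists built from several annotation sources often overlap, and counting the overlap twice would overstate the excluded fraction.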

The study results pave the way for expanded use of WGS in the study of Mtb biology. Moreover, they suggest that benchmarking against long-read data will allow regions of low confidence to be defined for additional species, improving the accuracy of Illumina WGS and inferences about transmission in public health surveillance systems.

