
Volume 7, Number 3—June 2001


Single Nucleotide Polymorphisms in Mycobacterium tuberculosis Structural Genes


To the Editor: A recent article by Fraser et al. (1) discussed the frequency of single nucleotide polymorphisms (SNPs) in two genomes of Mycobacterium tuberculosis, strains H37Rv (2) and CDC1551 (unpublished). The article contains an inaccurate representation of our published M. tuberculosis data on SNP frequency. The authors state that "detailed comparison of strains H37Rv and CDC1551 indicates a higher frequency of polymorphism, approximately 1 in 3,000 bp, with approximately half the polymorphism [sic] occurring in the intergenic regions. In other words, 50% of the polymorphisms are in 10% of the genome. While this rate is higher than that suggested (3), it still represents a lower nucleotide diversity than found in limited comparisons from other pathogens."

On the basis of comparative sequence analysis of eight M. tuberculosis structural gene loci (open reading frames [orf]), we initially published an estimated average number of synonymous substitutions per synonymous site (Ks value) that indicated that this pathogen had, on average, approximately 1 synonymous difference per 10,000 synonymous sites (4). This finding was unexpected given the relatively large population size of M. tuberculosis and paleopathologic evidence suggesting its presence in humans as early as 3700 B.C. Subsequent sequence analysis of two megabases in 26 structural genes or loci in strains recovered globally confirmed the striking reduction of silent (synonymous) nucleotide substitutions compared with other human bacterial pathogens (3). A large study (approximately 2 Mb of comparative sequence data) of 12 genes potentially involved in ethambutol resistance (5) and 24 genes encoding protein targets of the host immune system (6) provided data consistent with the original estimate of 1 synonymous nucleotide change per 10,000 synonymous sites in structural genes in this pathogen. Our estimate did not include SNPs located in putative regulatory regions of structural genes (intergenic regions), nor did it include nonsynonymous nucleotide changes in structural genes. These classes of polymorphisms were not included in our estimates because of difficulties in ruling out the possibility that they arose as a consequence of selective pressure due to antimicrobial agent treatment or perhaps extensive in vitro passage. Synonymous nucleotide changes (neutral mutations) are commonly used to estimate many values of interest to evolutionary biologists and population geneticists.

The estimate provided by Fraser et al. is based on a genomewide frequency of SNPs (1/3,000 nucleotide sites), 50% of which presumably are located in intergenic regions and 50% in structural genes. On the basis of a genome size of roughly 4.4 Mb, there would be roughly 1,500 total SNPs, with approximately 750 in orfs (90% of genome = 3,960,000 bp) and 750 in intergenic regions (10% of genome = 440,000 bp). On the basis of these estimates, the frequency of all SNPs located in structural genes would be roughly 1/5,280 bp. (An estimate of 1,300 total SNPs [translating to 1/6,000 bp] was presented by the group at a meeting held at the Banbury Center last December.) As expected, these numbers differ from our estimate (1/10,000), in part because they contain both synonymous and nonsynonymous nucleotide polymorphisms.
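The arithmetic in the paragraph above can be checked with a short sketch. All figures come from the letter itself (a genome of roughly 4.4 Mb, 1 SNP per 3,000 bp, a 90%/10% split between orfs and intergenic regions, and an even split of SNPs between the two); the variable and function names are illustrative only.

```python
# Back-of-envelope check of the SNP frequencies quoted in the letter.
GENOME_BP = 4_400_000        # "roughly 4.4 Mb"
BP_PER_SNP = 3_000           # genomewide rate reported by Fraser et al.
CODING_PERCENT = 90          # share of the genome occupied by orfs
SNPS_TOTAL_ROUNDED = 1_500   # the letter rounds 4.4 Mb / 3,000 to ~1,500


def bp_per_snp(region_bp: float, snp_count: float) -> float:
    """Average number of base pairs per SNP in a region."""
    return region_bp / snp_count


unrounded_total = GENOME_BP / BP_PER_SNP          # before rounding to 1,500
coding_bp = GENOME_BP * CODING_PERCENT // 100     # orf sequence
intergenic_bp = GENOME_BP - coding_bp             # intergenic sequence
snps_per_class = SNPS_TOTAL_ROUNDED / 2           # 50% coding, 50% intergenic

coding_spacing = bp_per_snp(coding_bp, snps_per_class)
intergenic_spacing = bp_per_snp(intergenic_bp, snps_per_class)

print(f"Unrounded SNP total: {unrounded_total:,.0f}")
print(f"Coding: 1 SNP per {coding_spacing:,.0f} bp")
print(f"Intergenic: 1 SNP per {intergenic_spacing:,.0f} bp")
```

Under these inputs the coding figure reproduces the 1/5,280 bp quoted above, while the same split implies a much denser rate in the intergenic 10% of the genome.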

We analyzed orfs (available in public databases) dispersed around the chromosome of M. tuberculosis strains CDC1551 and H37Rv. Surprisingly, the number of nonsynonymous SNPs exceeded the number of synonymous SNPs. We found only approximately 323 synonymous SNPs, yielding a synonymous SNP frequency of roughly 1/12,260 bp in orfs.
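As a rough check, the synonymous SNP frequency follows from dividing the orf sequence length used earlier in the letter (3,960,000 bp) by the approximate synonymous SNP count. The names below are illustrative. The comparison with the earlier estimate is only indicative, because that estimate is expressed per synonymous site rather than per base pair of orf sequence.

```python
# Synonymous SNP frequency in orfs, using the figures quoted in the letter.
CODING_BP = 3_960_000          # 90% of a ~4.4 Mb genome
SYNONYMOUS_SNPS = 323          # approximate count reported in the letter
EARLIER_ESTIMATE_SITES = 10_000  # 1 change per 10,000 synonymous sites

bp_per_synonymous_snp = CODING_BP / SYNONYMOUS_SNPS
print(f"1 synonymous SNP per {bp_per_synonymous_snp:,.0f} bp of orf sequence")

# Indicative only: the two denominators (bp vs. synonymous sites) differ.
ratio_to_earlier = bp_per_synonymous_snp / EARLIER_ESTIMATE_SITES
print(f"Ratio to the earlier 1/10,000 figure: {ratio_to_earlier:.2f}")
```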

M. tuberculosis, a pathogen that infects one third of humans, clearly has an unusual if not unique molecular evolution history. Precise data on the frequency of its true SNPs genomewide are critical. At this point, data (3-6) are consistent with our original estimate of 1 synonymous nucleotide change per 10,000 synonymous sites in structural genes in natural populations of this pathogen.

James M. Musser

Author affiliation: National Institutes of Health, Hamilton, Montana


  1. Fraser CM, Eisen J, Fleischmann RD, Ketchum KA, Peterson S. Comparative genomics and understanding of microbial biology. Emerg Infect Dis. 2000;6:505–12.
  2. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998;393:537–44.
  3. Kapur V, Whittam TS, Musser JM. Is Mycobacterium tuberculosis 15,000 years old? J Infect Dis. 1994;170:1348–9.
  4. Sreevatsan S, Pan X, Stockbauer KE, Connell ND, Kreiswirth BN, Whittam TS, et al. Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination. Proc Natl Acad Sci U S A. 1997;94:9869–74.
  5. Ramaswamy SV, Amin AG, Goksel S, Stager CE, Dou S-J, El Sahly H, et al. Molecular genetic analysis of nucleotide polymorphisms associated with ethambutol resistance in human isolates of Mycobacterium tuberculosis. Antimicrob Agents Chemother. 2000;44:326–36.
  6. Musser JM, Amin A, Ramaswamy S. Negligible genetic diversity of Mycobacterium tuberculosis host immune system protein targets: evidence of limited selective pressure. Genetics. 2000;155:7–16.

Suggested Citation for this article: Musser JM. Single Nucleotide Polymorphisms in Mycobacterium tuberculosis Structural Genes [letter]. Emerg Infect Dis [serial on the Internet]. 2001 Jun [date cited].

DOI: 10.3201/eid0703.017334


To the Editor: In his letter on single nucleotide polymorphisms in Mycobacterium tuberculosis, Dr. Musser indicates that genome strain CDC1551 has not been published. Cole et al. (1) described some of the biology of M. tuberculosis based on the genome sequence data. The actual sequence, while not published, is in Genbank (Accession NC00962), the sequence data are available at, and the annotation is available at . We have a manuscript in preparation using a method of whole genome comparison (2) to evaluate the sequence diversity of strains H37Rv and CDC1551 and applying the information to the analysis of >150 clinical isolates. The complete sequence data and annotation for strain CDC1551 have been available for over a year at and, and periodic updates are provided. In addition, we are preparing to submit the strain CDC1551 sequence and annotation to Genbank (Accession AE000516).

We agree that sequencing accuracy in assessing comparative single nucleotide polymorphism (SNP) data is important. The error frequency suggested by Dr. Weinstock ("Error frequency in a finished sequence has never been precisely measured but is thought to be one error [frameshift or base substitution] in 10^3 to 10^5 bases" [3]) is not supported by any evidence. The whole-genome shotgun sequencing method developed by The Institute for Genomic Research (TIGR) (4) and adopted by many others is highly accurate because of the following qualities: 1) high redundancy in shotgun sequencing (average 7.9-fold for the strain CDC1551 project with a minimum of 2-fold coverage for any nucleotide); 2) assignment of quality values to each nucleotide base; 3) adoption of assembly programs that use quality values for consensus building; and 4) manual editing of electropherograms as necessary.

These methods were applied to the M. tuberculosis genome sequencing project. In comparing the CDC1551 and H37Rv strains, it is reasonable to suspect that the SNPs also have the potential to be results of sequencing errors. The sequence differences were verified by two independent methods. One hundred SNPs were chosen at random, and the base calls were independently verified by inspection of the original electropherograms at TIGR (CDC1551) and the Sanger Center (H37Rv). A second method, independent of sequencing, was also used to confirm the base calls of these 100 SNPs. The visual inspection of the electropherograms and the sequencing independent method were in good agreement and indicated that 80 (91%) of 88 successful assays of the nucleotide differences were genuine.

Since our initial report, we have improved our methods for overlaying the annotation of open reading frame coordinates onto our analysis of the coordinates of nucleotide substitutions. Approximately 7% of the genome is noncoding, and approximately 15% of the substitutions are in these regions.
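To see how much the revised figures change the picture, the relative density of substitutions in noncoding versus coding sequence can be computed from the two sets of percentages quoted in this correspondence (50% of polymorphisms in 10% of the genome initially; approximately 15% in approximately 7% after the improved annotation). This is a minimal sketch; the function name is hypothetical and the inputs are the approximate values stated in the text.

```python
def noncoding_enrichment(noncoding_snp_share: float,
                         noncoding_genome_share: float) -> float:
    """Ratio of substitution density in noncoding vs. coding sequence.

    Density here means share of substitutions divided by share of genome.
    """
    noncoding_density = noncoding_snp_share / noncoding_genome_share
    coding_density = (1 - noncoding_snp_share) / (1 - noncoding_genome_share)
    return noncoding_density / coding_density


initial = noncoding_enrichment(0.50, 0.10)   # figures from the initial report
revised = noncoding_enrichment(0.15, 0.07)   # figures after improved annotation

print(f"Initial report: noncoding density is {initial:.1f}x coding")
print(f"Revised figures: noncoding density is {revised:.1f}x coding")
```

On these inputs the apparent excess of substitutions in noncoding regions is considerably smaller under the revised annotation than under the initial one.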

Dr. Musser is correct in pointing out that the substitution frequency expressed in Fraser et al. (5), based on our preliminary annotation of our M. tuberculosis sequence data, is not an equivalent comparison to the synonymous substitution frequency derived by his method of sequencing a select set of genes over a wide range of M. tuberculosis strains. He uses the methods of Li et al. (6), among the most widely accepted, for the calculation of nucleotide substitution frequencies and derives a Ds value of <0.01 synonymous substitutions per 100 synonymous sites. Our preliminary data presented the frequency of total nucleotide substitutions at all positions (coding [synonymous and nonsynonymous] and noncoding) of the two recently sequenced strains, H37Rv and CDC1551. Our manuscript in preparation comparing the two M. tuberculosis strains will contain an analysis of synonymous substitutions. However, while Dr. Musser compared a select group of genes over perhaps several hundred strains, our frequency will be based on a genome-wide comparison between two strains.

Robert Fleischmann
Author affiliation: The Institute for Genomic Research, Rockville, Maryland, USA


  1. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998;393:537–44.
  2. Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL. Alignment of whole genomes. Nucleic Acids Res. 1999;27:2369–76.
  3. Weinstock GM. Genomics and bacterial pathogenesis. Emerg Infect Dis. 2000;6:496–504.
  4. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995;269:496–512.
  5. Fraser CM, Eisen J, Fleischmann RD, Ketchum KA, Peterson S. Comparative genomics and understanding of microbial biology. Emerg Infect Dis. 2000;6:505–12.
  6. Li WH, Wu CI, Luo CC. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol. 2000;2:150512.
