Volume 8, Number 11—November 2002
Tuberculosis Genotyping Network, United States
National Tuberculosis Genotyping and Surveillance Network: Analysis of the Genotype Database
As part of the National Tuberculosis and Genotyping Surveillance Network, isolates obtained from all new cases of tuberculosis occurring in seven geographically separate surveillance sites from 1996 through 2000 were genotyped. A total of 10,883 isolates were fingerprinted by the IS6110-restriction fragment length polymorphism method, yielding 6,128 distinct patterns. Low-copy isolates (those with six or fewer bands) were also spoligotyped. The distribution of specific genotype clusters was examined. Databases were also examined for families of related genotypes. Analysis of IS6110 patterns showed 497 patterns related to the W-Beijing family; these pattens represent 946 (9%) of all isolates in the study. Six new sets of related fingerprint patterns were also proposed for isolates containing 6–15 copies of IS6110. These fingerprint sets contain up to 251 patterns and 414 isolates; together, they contain 21% of isolates in this copy number range. These sets of fingerprints may represent endemic strains distributed across the United States.
The National Tuberculosis Genotyping and Surveillance Network was created by the Centers for Disease Control and Prevention (CDC) to determine the relative frequency of Mycobacterium tuberculosis strains in specific geographic areas, the extent of spread of related strains in communities, and the impact of IS6110 fingerprinting on tuberculosis (TB) control. From 1996 through 2000, the TB genotyping network laboratories fingerprinted 10,883 isolates (one isolate per newly diagnosed case of TB) from seven sentinel surveillance sites in the United States: the states of Arkansas, Maryland, Massachusetts, Michigan, and New Jersey, along with four counties in Texas and six counties in California. Key components of the project included the establishment of standard methods and use of specialized software, the BioImage Whole Band Analyzer version 3.4 (BioImage, Ann Arbor, MI), for pattern analysis. The following were created as part of this study: databases containing the images of IS6110 patterns for all isolates at each sentinel surveillance site; a network database that includes all distinct spoligotype patterns; and an epidemiologic database (Epi-Info) for information about sentinel surveillance site, case report date, IS6110 pattern designation, and secondary typing results for each patient. The final network database of fingerprints contains 6,128 patterns.
We report here an overview of the contents of the TB genotyping network fingerprint database, including the distribution of isolates at sentinel surveillance sites, genotype patterns that occurred with high frequency, the extent of previously described genotype families, and new families of related fingerprints. This analysis should not be considered exhaustive but rather a summary of our observations and an introduction to the types of data that can be derived.
Methods for IS6110 fingerprinting, spoligotyping, and compiling the network databases are described elsewhere (1). The distribution of isolates by sentinel surveillance site, fingerprint pattern, and spoligotype (for isolates with six or fewer copies of IS6110) was derived from the Epi Info database. The spoligotype patterns are reported in octal code by the convention previously described (2). IS6110 fingerprint patterns start with the prefix FP and spoligotype patterns with SP.
We used the BioImage Whole Band Analyzer software package version 3.4 to analyze fingerprint patterns in the genotyping network fingerprint database. The bands in two patterns were compared at two levels. First, bands in two patterns were identified as matched bands if the size of the bands differed by <2.5%. Next, the interband spacing between matched bands in the two patterns was compared; a limit of 95% for variation in interband spacing was used. The Jaccard coefficient of similarity between two patterns, A and B, was used to calculate the percentage match between two patterns: 100 x number of matched bands /(number bands in A + number bands in B - number of matched bands).
Isolates from 10,883 patients from seven sentinel surveillance sites were fingerprinted by using the IS6110 restriction fragment length polymorphism (RFLP) method: Arkansas, 709; California, 2,514; Massachusetts, 986; Maryland, 1,180; Michigan, 1,471; New Jersey, 2,113; and Texas, 1,910. From these isolates, 6,128 distinct fingerprint patterns were identified and included in the final genotyping network database. Analysis of the IS6110 copy number of the isolates confirmed the previously described bimodal distribution (Figure 1) (3). This distribution has been used to separate isolates of M. tuberculosis into two groups: isolates with six or fewer copies of IS6110 are classified as low-copy isolates and those with more than six copies as high-copy isolates. The greatest numbers of patterns occurred in the 9–14 copy-number range (Figure 1). Clustering of isolates on the basis of matching fingerprint patterns is summarized in Table 1. Clustering was very high among the low-copy isolates, which supports the requirement for secondary typing of these isolates. Clustering decreased with increasing copy number; copy numbers 21 and 22, which included large outbreaks, were the exceptions.
The 8,245 isolates with more than six copies of IS6110 yielded 5,640 fingerprint patterns. Of these patterns, 4,846 (85.9%) were identified for a single isolate, and 3,399 isolates were grouped into 794 fingerprint-defined clusters. Of these clusters, 557 contained isolates from a single site. The clusters contained up to 105 isolates, but 683 (86.0%) of the clusters contained only two to five isolates. In fact, only 18 clusters contained 20 or more isolates. The distribution of isolates in these 18 clusters is shown in Table 2, and the fingerprint patterns are shown in Figure 2. For 11 of the 18 largest clusters, >90% of the isolates in the cluster were from a single site.
One of the largest clusters (FP 00237, 100 isolates) corresponds to M. tuberculosis strain 210, a member of the W-Beijing family that was shown in previous studies to be disseminated across the United States (3). In the network, FP 00237 was associated with large clusters in Arkansas and Texas and was also reported by Maryland and New Jersey. Two additional patterns associated with large clusters, FP 00027 (102 isolates in Michigan) and FP 01284 (46 isolates in Texas), were similar to FP 00237. The Beijing family of strains has received considerable attention because of its association with several large outbreaks, frequent association with multidrug resistance, and emergence in selected populations, particularly in the former Soviet Union (4,5). All Beijing isolates share a characteristic spoligotype (000000000003771); however, in this study, spoligotyping was not performed for high-copy isolates. Other molecular criteria that define W-Beijing strains include insertions of IS6110 in the dnaA-dnaN region (A1 insertion) and in the NTF region and an empirical fingerprint pattern that contains 15 to 24 bands and is similar to that of strain W (4). To estimate the occurrence of Beijing isolates in our study, all patterns with 16 to 24 bands were visually compared to FP 00237. The W fingerprint was easily identified among the patterns with 17 or more bands; however, we were less confident about identifying it in those with 16 bands and did not include them in this analysis. Of the 688 patterns analyzed, 497 (72.2%) were similar to FP 00237. Examples can be seen in Figure 3. Nearly all of the individual patterns, 480 (97%), were reported by a single site. These 497 patterns represent 946 isolates, 82% of all isolates with >17 copies of IS6110 and 9% of all isolates in the study (Arkansas, 3%; Maryland, 4%; New Jersey, 7%; Massachusetts, 9%; and California, Michigan, and Texas, 11%). The distribution of these isolates by site is reported in Table 3.
Because only one of the molecular criteria (overall fingerprint pattern) could be applied, isolates with these patterns cannot be definitively called W-Beijing. All of the insertion sites in strain 210, FP 00237, have been defined by sequencing (6). To identify conserved insertion sites, we determined the percentage of the 497 patterns that contained each of the bands in FP 00237 (Figure 4). Nine bands were found in >50% of the patterns, and two were present in >85%. A common feature of these fingerprint patterns is a group of smaller bands (1.0 to 1.5 kb) that are difficult to resolve. Variation in band identification resulted in some of the heterogeneity of the patterns in the database. W-Beijing strains likely account for a large portion of Beijing isolates, but other Beijing strains exist. FP 00242 (reported for 96 isolates in Texas; fingerprint pattern shown in Figure 2) shares only a few bands with FP 00237, but isolates with this pattern have the Beijing spoligotype (Teresa Quitugua, pers. comm.).
To identify other large families in the database, we analyzed all patterns having 6 to 15 bands (4,846 patterns). Since the BioImage software cannot create a dendrogram for more than 1,250 patterns, patterns were compared to each other by using an arbitrarily chosen matching threshold of 50% to identify those that matched a large number of other patterns. For a 50% match, two thirds of the bands in two patterns with equal band number must match. Six sets of related fingerprints, designated A through F, were defined; each consisted of six prototype patterns (Figure 5) along with all of the patterns that matched the prototypes. Data on these sets are summarized in Table 3. The isolates in each set appear widely dispersed across the sites, and the patterns likely represent endemic strains in the United States. Key bands in each set were determined in comparison to the common bands in the prototype patterns as described above for FP 00237 (Figure 4). Sequencing the IS6110 insertion sites corresponding to these key bands would allow isolates belonging to these sets to be rapidly identified with microarray techniques or the reverse dot-blot (“insite”) assay we have described previously (7).
The sets described here are certainly not the only sets of related patterns in the database, nor are they necessarily novel. The patterns in set A are similar to the patterns for M. tuberculosis strains H37Rv and H37Ra (8); the eight common bands in the prototype patterns for set A are also found in the patterns for these two laboratory strains. The patterns in set D appear similar to those of the Haarlem family (9). Interestingly, 1,404 (29.0%) of the patterns with 6 to 15 bands did not match any other pattern at the 50% matching threshold, suggesting a substantial number of orphan strains in this study.
Of the 457 fingerprint patterns identified among the 2,507 low-copy isolates, 314 (68.7%) were reported for a single isolate, and 143 patterns grouped 2,193 isolates into clusters. Clustering was much higher in low-copy isolates (87.5%) than in high-copy isolates (41.2%). Most isolates were in a few large clusters; 14 clusters contained 1,601 (63.9%) low-copy isolates. The distribution of isolates in the largest clusters across the sentinel surveillance sites is shown in Table 4, and the fingerprint patterns are shown in Figure 6.
Spoligotype results were available for most low-copy isolates (2,507 of 2,638 isolates). Isolates collected in Arkansas before 1998 were not spoligotyped (97 of 210 isolates) nor were most isolates collected in Maryland before 1998 that had unique fingerprint patterns (23 of 323 isolates). Of the 495 spoligotypes identified among the low-copy isolates, 322 were reported for a single isolate, and 173 grouped 2,185 isolates into clusters. In this study, the clustering of low-copy isolates by spoligotyping (87.2%) was only slightly lower than clustering by fingerprinting (87.5%). Analysis of the isolates by IS6110 copy number showed that spoligotyping performed better than fingerprinting only for those isolates with fewer than four copies of IS6110 (data not shown). Similar to the results obtained with fingerprinting, most isolates are in large clusters; 1,481 isolates are in the 20 largest clusters. The spoligotypes for these clusters as well as the distribution of these isolates by site and IS6110 copy number are listed in Table 5.
Neither fingerprinting nor spoligotyping provided great discriminatory power among low-copy isolates, but the combination of the two methods gave slightly better results. The number of spoligotypes identified per fingerprint pattern ranged from 1–92 spoligotypes, and the number of fingerprint patterns identified per spoligotype ranged from 1–77 patterns. Combining the fingerprinting and spoligotyping data resulted in the identification of 987 distinct genotypes; 745 genotypes were unique, and 242 grouped 1,762 isolates into clusters. These genotype clusters contained up to 167 isolates. Performing the secondary typing method decreased the number of clustered isolates by nearly 20%, but clustering was still much higher among the low-copy isolates (70.2%) than among the high-copy isolates (41.2%).
In our recent study of low-copy isolates from Michigan, we noted numerous patterns with similarities to FP 00017 (10). In this study, 201 isolates from all seven sites had FP 00017. When the three lower bands (1.39, 2.32, and 3.03 kb) in FP 00017 were matched to all patterns having three to six bands, 54 patterns representing 411 isolates were identified. The distribution of these isolates by site can be seen in Table 3. Of note, M. tuberculosis strain CDC1551 has FP 00017 (11), but none of the study isolates with this fingerprint had the spoligotype corresponding to strain CDC1551 (700076757760771).
Spoligotypes have been divided into clades or families on the basis of commonly observed motifs (Figure 7) (12). First, spoligotypes can be subdivided on the basis of spacers 33–36. Only M. tuberculosis complex genotypic group 1 strains (M. bovis, M. africanum, and some M. tuberculosis strains) have spacers 33–36 (13,14). Four spoligotype motifs have been identified among group 1 isolates: bovis (15), africanum (16), Beijing (5), and East African-Indian (EA-I) (12,17) (Figure 7). The remaining spoligotypes that lack spacers 33–36 can be subdivided into two subgroups on the basis of spacers 29–32. Isolates with at least one of spacers 29–32 are likely to be isolates in M. tuberculosis genotypic groups 2 or 3. Isolates without spacers 29–32 have a deletion in the direct repeat locus that is too large to definitively assign to a genotypic group. Four specific motifs have been identified among the spoligotypes associated with non–genotypic group 1 isolates: Haarlem (9), Latin American and Mediterranean 1 and 2 (12,17), and X (12) (Figure 7). Of the 495 spoligotypes observed for low-copy isolates, 323 contained one of the eight defined motifs. This allowed 2,007 (80.1%) low-copy isolates to be assigned to a spoligotype family; the data for each family are summarized in Table 3. The majority (51.5%) of the low-copy isolates belonged to family X. The only published information regarding this motif indicated that it is highly prevalent in some English-speaking countries (12). In our study, 1,036 (70.3%) of isolates with two to four copies of IS6110 belonged to family X. The second largest spoligotype family was family EA-I. Isolates with this motif belong to group 1 (13) and have up to nine copies of IS6110 (17). Our isolates that belonged to this family had one to six copies of IS6110, but 378 (67.7%) possessed a single copy. In fact, 62.7% of isolates with a single copy of IS6110 belonged to the EA-I family. The remaining spoligotype families grouped only a few isolates, probably because isolates in these families are mostly high copy (9,17), and this occurrence should not suggest that these spoligotype families are uncommon in the United States. Thirty-two isolates were classified as M. bovis and 19 as M. africanum, solely by spoligotype motifs; no additional tests were conducted to confirm this classification.
After isolates were assigned to a spoligotype family, fingerprint clusters of isolates were examined for consistency with the spoligotype family assignment. We were surprised to identify several fingerprint patterns that have isolates with very different spoligotype patterns. For example, FP 00017 (Figure 6) and FP 00104 (a five-band pattern) share four bands in common with a size difference of <1% and also have two spoligopatterns in common. SP 3 (777776777760771) is a very common pattern among M. tuberculosis group 2 and 3 isolates (13), whereas SP 290 (330777777767671) has a motif associated with M. africanum isolates (group 1). The spoligotype patterns are clearly divergent, indicating either that the strains independently acquired three copies of IS6110 at the same insertion sites or that they have different IS6110 insertions that coincidentally yield PvuII fragments of the same length.
Most of the other examples of isolates clustered by IS6110 with divergent spoligotypes are among isolates with one or two copies of IS6110. Mathema et al. (18) investigated differences among 66 isolates with FP 00129 (one band of 1.40 kb); 26 had group 1 spoligotypes, and 40 had group 2 or 3 spoligotypes. In most isolates with a single copy of IS6110, the IS6110 is inserted in the direct repeat locus in the repeat located between spacers 24 and 25. The predicted fragment size for this insertion in isolates with group 1 spoligotypes is 1.30–1.45 kb, depending on the number of spacers between spacers 25 and 36, where the PvuII site is located. The predicted fragment size for this insertion in isolates with group 2 or 3 spoligotypes is 4.51–4.58 kb, depending on the number of spacers between spacers 25 and 43 (the next PvuII site occurs outside of the direct repeat locus). Since the predicted fragment size for these 40 isolates was not consistent with the observed size, the insertion site in these isolates was sequenced. Sequencing showed that the isolates had a different insertion site (DK1) (19), which is very common among isolates with two copies of IS6110. The predicted fragment size for this insertion is 1.38 kb. This size is the one predicted for group 1 isolates and is a clear example of two isolates with different IS6110 insertions yielding PvuII fragments that are indistinguishable by the standard RFLP method.
The TB genotyping network database demonstrates the diversity of strains that cause TB in the United States. The 10,883 patients in the study represented approximately 11.6% of all new cases of TB in the United States from 1996 through 2000. The sentinel sites were reasonably representative of the geographic and demographic diversity in the United States. Compiling this database from results submitted from seven laboratories was a considerable undertaking, and analyzing such a large collection of fingerprint patterns is difficult. From our quality assurance program and personal experience, we know that, even under the most carefully controlled conditions, IS6110 fingerprinting results are not 100% reproducible. We are certain that some of the fingerprint patterns, which were classified as different and received different designations, would have been identical had they been run side by side on the same gel. Also, as we have described, some fingerprint patterns for low-copy isolates appear identical but do not represent the same IS6110 insertions and thus do not represent closely related strains. Some of these difficulties resulted from the application of a rigid standard for defining distinct patterns, a process that is often subjective.
Even though some individual results may have been questionable, several clear conclusions emerged. Large sets of strains with related fingerprint patterns, not previously recognized, are spread across the United States. Given the rather slow rate of change in fingerprints, these must represent endemic strains that have circulated in the United States for decades. Consistent with this conclusion is the presence in the database of fingerprint patterns resembling the pattern of the laboratory strain H37 that was originally isolated in New York in 1905 (8). Many of the patterns in these sets represent single isolates, which suggests that they are the result of reactivation of remote infections acquired years or decades earlier. Analysis of the demographic characteristics of the patients will be required to confirm this observation. Among these large sets, outbreak strains (patterns) were generally restricted to a single sentinel site, as were clustered isolates in general.
We conclude that a large-scale, prospective comparison of fingerprint patterns from wide geographic regions is useful for research studies but is of limited value for TB control purposes. Comparisons of isolates from smaller areas are not only more meaningful but also more feasible. This limitation does not mean that searching multiple databases for specific fingerprint patterns, for example the “W” strain, is not useful in some circumstances.
The difficulties in analyzing IS6110 fingerprint patterns and the often slow turnaround time for obtaining results limit the value of this procedure to TB control programs. As an alternative, rapid, polymerase chain reaction–based testing, such as spoligotyping or mycobacterial interspersed repetitive units variable number of tandem repeats (MIRU-VNTR) analysis, would be a logical first step for universal genotyping of isolates. These methods provide greater reproducibility and give digital results, which simplify analysis. However, this approach has the following limitations. Many common spoligotypes were seen among low-copy-number isolates, although even IS6110 fingerprinting does not greatly improve resolution with these isolates. We also found that 9% of the isolates have W-Beijing fingerprint patterns that are known to have the same spoligotype; all isolates yielding the Beijing spoligotype would require IS6110 typing. Sufficient data are not available to predict the discriminatory power of MIRU-VTNR. However, preliminary results suggest that the combination of spoligotyping and VNTR typing will provide adequate resolution for most uses, thus limiting the need for additional typing by IS6110 fingerprinting.
Dr. Cowan is a service fellow in the Tuberculosis/Mycobacteriology Branch, Division of AIDS, STD, and TB Laboratory Research, National Center for Infectious Diseases, Centers for Disease Control and Prevention. Her current research interests include DNA fingerprinting and the molecular epidemiology of Mycobacterium tuberculosis.
Dr. Crawford is Chief of the Immunology and Molecular Pathogenesis Section, Tuberculosis/Mycobacteriology Branch, Division of AIDS, STC, and TB Laboratory Research, Centers for Disease Control and Prevention. His research interests include application of molecular methods to epidemiology and diagnostics of mycobacterial diseases.
We thank Steve Kammerer, and Charles L. Woodley for assistance in database analysis, and Barbara A. Schable, Christopher R. Braden, and the participants in the National Tuberculosis Genotyping and Surveillance project for their extensive efforts in compiling the databases on which this article is based.
- Crawford JT, Braden CR, Schable BA, Onorato IM. National Tuberculosis Genotyping and Surveillance Network: design and methods. Emerg Infect Dis. 2002;8:1192–6.
- Dale JW, Brittain D, Cataldi AA, Cousins D, Crawford JT, Driscoll J, Spacer oligonucleotide typing of bacteria of the Mycobacterium tuberculosis complex: recommendations for standardized nomenclature. Int J Tuberc Lung Dis. 2001;5:216–9.
- Yang Z, Barnes PF, Chaves F, Eisenach KD, Weis SE, Bates JH, Diversity of DNA fingerprints of Mycobacterium tuberculosis isolates in the United States. J Clin Microbiol. 1998;36:1003–7.
- Bifani PJ, Mathema B, Kurepina NE, Kreiswirth BN. Global dissemination of the Mycobacterium tuberculosis W-Beijing family strains. Trends Microbiol. 2002;10:45–52.
- van Soolingen D, Qian L, de Haas PE, Douglas JT, Traore H, Portaels F, Predominance of a single genotype of Mycobacterium tuberculosis in countries of East Asia. J Clin Microbiol. 1995;33:3234–8.
- Beggs ML, Eisenach KD, Cave MD. Mapping of IS6110 insertion sites in two epidemic strains of Mycobacterium tuberculosis. J Clin Microbiol. 2000;38:2923–8.
- Steinlein LM, Crawford JT. Reverse dot blot assay (insertion site typing) for precise detection of sites of IS6110 insertion in the Mycobacterium tuberculosis genome. J Clin Microbiol. 2001;39:871–8.
- Bifani P, Moghazeh S, Shopsin B, Driscoll J, Ravikovitch A, Kreiswirth BN. Molecular characterization of Mycobacterium tuberculosis H37Rv/Ra variants: distinguishing the mycobacterial laboratory strain. J Clin Microbiol. 2000;38:3200–4.
- Kremer K, van Soolingen D, Frothingham R, Haas WH, Hermans PWM, Martin C, Comparison of methods based on different molecular epidemiological markers for typing of Mycobacterium tuberculosis complex strains: interlaboratory study of discriminatory power and reproducibility. J Clin Microbiol. 1999;37:2607–18.
- Cowan LS, Mosher L, Diem L, Massey JP, Crawford JT. Variable-number tandem repeat typing of Mycobacterium tuberculosis isolates with low-copy numbers of IS6110 by using mycobacterial interspersed repetitive units. J Clin Microbiol. 2002;40:1592–602.
- Plikaytis BB, Kurepina N, Woodley CL, Butler WR, Shinnick TM. Multiplex PCR assay to aid in the identification of the highly transmissible Mycobacterium tuberculosis strain CDC1551. Tuber Lung Dis. 1999;79:273–8.
- Sebban M, Mokrousov I, Rastogi N, Sola C. A data-mining approach to spacer oligonucleotide typing of Mycobacterium tuberculosis. Bioinformatics. 2002;18:235–43.
- Soini H, Pan X, Amin A, Graviss EA, Siddiqui A, Musser JM. Characterization of Mycobacterium tuberculosis isolates from patients in Houston, Texas, by spoligotyping. J Clin Microbiol. 2000;38:669–76.
- Sreevatsan S, Pan X, Stockbauer K, Connell ND, Kreiswirth BN, Whittam TS, Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination. Proc Natl Acad Sci U S A. 1997;94:9869–74.
- Kamerbeek J, Schouls L, Kolk A, van Agterveld M, van Soolingen D, Kuijper S, Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J Clin Microbiol. 1997;35:907–14.
- Viana-Niero C, Gutierrez C, Sola C, Filliol I, Boulahbal F, Vincent V, Genetic diversity of Mycobacterium africanum clinical isolates based on IS6110-restriction fragment length polymorphism analysis, spoligotyping, and variable number of tandem DNA repeats. J Clin Microbiol. 2001;39:57–65.
- Sola C, Filliol I, Legrand E, Mokrousov I, Rastogi N. Mycobacterium tuberculosis phylogeny reconstruction based on combined numerical analysis with IS1081, IS6110, VNTR, and DR-based spoligotyping suggests the existence of two new phylogeographical clades. J Mol Evol. 2001;53:680–9.
- Mathema B, Bifani PJ, Driscoll J, Steinlein L, Kurepina N, Moghazeh SL, Identification and evolution of an IS6110 low-copy Mycobacterium tuberculosis cluster. J Infect Dis. 2002;185:641–9.
- Fomukong N, Beggs M, El Hajj H, Templeton G, Eisenach K, Cave MD. Differences in the prevalence of IS6110 insertion sites in Mycobacterium tuberculosis strains: low and high-copy numbers of IS6110. Tuber Lung Dis. 1998;78:109–16.