Molecular Epidemiology of Tuberculosis in a Sentinel Surveillance Population

We conducted a population-based study to assess demographic and risk-factor correlates for the most frequently occurring Mycobacterium tuberculosis genotypes from tuberculosis (TB) patients. The study included all incident, culture-positive TB patients from seven sentinel surveillance sites in the United States from 1996 to 2000. M. tuberculosis isolates were genotyped by IS6110-based restriction fragment length polymorphism and spoligotyping. Genotyping was available for 90% of 11,923 TB patients. Overall, 48% of cases had isolates that matched those from another patient, including 64% of U.S.-born and 35% of foreign-born patients. By logistic regression analysis, risk factors for clustering of genotypes were being male, U.S.-born, black, homeless, and infected with HIV; having pulmonary disease with cavitations on chest radiograph and a sputum smear with acid-fast bacilli; and excessive drug or alcohol use. Molecular characterization of TB isolates permitted risk correlates for clusters and specific genotypes to be described and provided information regarding cluster dynamics over time.

Since 1990, characterization of Mycobacterium tuberculosis isolates by molecular methods has been useful in confirming suspected laboratory contamination and as an adjunct to epidemiology-based contact investigation (1-3). Most studies used the restriction fragment length polymorphism (RFLP) technique, based on IS6110 and specific to the M. tuberculosis complex. This genetic element may be present in different positions on the chromosome, resulting in a unique genotype useful for characterizing the strain of M. tuberculosis infecting a patient. Although RFLP has disadvantages (e.g., cost, time required to culture the organism, and specialized training and laboratory equipment), IS6110-based RFLP is the established method considered most discriminatory for genetic characterization of M. tuberculosis strains worldwide (4).
In 1996, the Centers for Disease Control and Prevention (CDC) established seven sentinel surveillance sites in the United States (National Tuberculosis Genotyping and Surveillance Network) to assess the utility of molecular genotyping for improving tuberculosis (TB) prevention and control. The TB genotyping network used standardized protocols for molecular characterization of M. tuberculosis isolates from patients in all sentinel sites. The network was designed to address specific epidemiologic questions regarding the natural history, transmission, and potential applicability of molecular genotyping of M. tuberculosis strains to augment TB control activities (5). Two objectives were to identify and determine the prevalence of specific M. tuberculosis genotype clustering in populations of sentinel surveillance TB patients and to describe the demographic characteristics of these populations and the genotypic characteristics of M. tuberculosis strains in clustered and nonclustered TB cases. We describe demographic and risk factor correlates for the most frequently occurring M. tuberculosis genotypes in isolates collected from sentinel TB patients.

Methods
This population-based sentinel study included all incident culture-positive TB patients from sentinel sites from January 1996 to December 2000. In brief, the seven sentinel surveillance sites included the states of Arkansas, Maryland, Massachusetts, Michigan, and New Jersey; Dallas, Tarrant, Cameron, and Hidalgo Counties in Texas; and Alameda, Contra Costa, Marin, San Mateo, Santa Clara, and Solano Counties in California. A detailed description of the study's design, participants, population, and laboratory and epidemiologic methods is provided elsewhere (6).
*Centers for Disease Control and Prevention, Atlanta, Georgia, USA
Dr. Ellis is a senior microbiologist with the National Center for Infectious Diseases, Centers for Disease Control and Prevention. Her research interests focus on the molecular epidemiology of infectious diseases, rodent-borne zoonotic diseases, and bioterrorism preparedness. Her work has included disease ecology studies of rodent-borne hemorrhagic fever viruses, molecular characterization of novel bartonellae, and molecular epidemiologic studies of Mycobacterium tuberculosis.
All patients included in the study were reported to the CDC national TB case registry on the form Report of a Verified Case of Tuberculosis, a standardized electronic form submitted for TB surveillance to CDC by all state public health reporting areas. Data reported include patient demographics, laboratory test results, drug susceptibilities, information on chest radiographs, and treatment outcomes (7).
Investigators from the sentinel surveillance sites submitted patient isolates to the corresponding regional laboratory for genotyping and conducted routine contact investigations. In addition, participants from the surveillance sites performed detailed epidemiologic investigations on groups of persons with M. tuberculosis isolates that had matching genetic patterns or clusters (see below). The regional genotyping laboratories conducted IS6110 RFLP on isolates from sentinel patients. Since low-copy numbers of IS6110 (i.e., six or fewer copies) reduce test specificity, spacer oligonucleotide typing (spoligotyping) was conducted on such isolates. A cluster, which was identified by analysis of the entire TB genotyping network database, was defined as two or more isolates with either identical RFLP patterns (at least seven copies of IS6110) or identical RFLP and spoligotype patterns for isolates with RFLP patterns that had six or fewer copies of IS6110.
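The cluster definition above can be sketched in a few lines of Python. This is only an illustration, not the network's actual software: the field names (`rflp_pattern`, `is6110_copies`, `spoligotype`) and the example isolates are hypothetical, and dropping low-copy isolates that lack a spoligotype is an assumption made here for the sketch.

```python
from collections import defaultdict

LOW_COPY_MAX = 6  # six or fewer IS6110 copies is treated as low-copy


def cluster_key(isolate):
    """Return the key used to match isolates, or None if no key can be built."""
    rflp = isolate["rflp_pattern"]
    if isolate["is6110_copies"] > LOW_COPY_MAX:
        # Seven or more copies: the RFLP pattern alone defines a match.
        return ("rflp", rflp)
    spoligotype = isolate.get("spoligotype")
    if spoligotype is None:
        # Assumption: low-copy isolates without a spoligotype are left out.
        return None
    # Six or fewer copies: RFLP pattern and spoligotype must both match.
    return ("rflp+spoligo", rflp, spoligotype)


def find_clusters(isolates):
    """Group isolate ids by matching key and keep groups of two or more."""
    groups = defaultdict(list)
    for isolate in isolates:
        key = cluster_key(isolate)
        if key is not None:
            groups[key].append(isolate["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= 2}


# Hypothetical isolates, for illustration only.
EXAMPLE_ISOLATES = [
    {"id": "A", "rflp_pattern": "P1", "is6110_copies": 10},
    {"id": "B", "rflp_pattern": "P1", "is6110_copies": 10},
    {"id": "C", "rflp_pattern": "P2", "is6110_copies": 3, "spoligotype": "S1"},
    {"id": "D", "rflp_pattern": "P2", "is6110_copies": 3, "spoligotype": "S2"},
    {"id": "E", "rflp_pattern": "P2", "is6110_copies": 3, "spoligotype": "S1"},
    {"id": "F", "rflp_pattern": "P3", "is6110_copies": 2},
]

if __name__ == "__main__":
    print(find_clusters(EXAMPLE_ISOLATES))
```

In this example, isolates C, D, and E share a low-copy RFLP pattern, but only C and E cluster because D has a different spoligotype.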
Differences in the proportion of TB patients from the TB genotyping network population living in cities with populations of <100,000, 100,001 to 250,000, 250,001 to 500,000, and >500,000 were compared with those of the national TB patients for the year 2000 only. Statistics were obtained from the U.S. Census Bureau (available at: http://www.census.gov/population/cen2000/phc-t6/tab04.pdf).
The correlation between average TB incidence among cases at the seven sentinel sites and the percentage of cases with isolates that clustered genetically was examined by year by using the Spearman rank correlation statistic. Clustering was determined by examining each year's cases independently. A Mantel-Haenszel chi-square or Fisher exact test was used, as appropriate, to ascertain whether the sentinel population was representative of TB patients in the United States in terms of demographic, clinical, behavioral, or outcome characteristics.
We used multiple logistic regression to assess the importance of demographic, clinical, behavioral, or outcome variables in predicting the occurrence of a given genotype for those genetic clusters that occurred most frequently (≥20 isolates). The dependent variable was the presence or absence of a given genotype. The best-fit logistic regression model was determined by the strategy of Hosmer and Lemeshow (8). In brief, a univariate analysis of the categorical independent variables was done by using the Mantel-Haenszel chi-square or Fisher exact test, as appropriate; any variable with a significance value of <0.20 was included in a best subset, multivariate logistic regression model. Collinearity of independent variables was assessed by using the variance/covariance matrix from PROC LOGISTIC (SAS Institute, Inc., Cary, NC) to generate condition indices and a matrix of variance decomposition proportions to detect dependencies among the variables (9). Backward elimination of independent variables was performed if the probability of the independent variable was ≥0.20. Both the Wald statistic and 95% confidence interval were used on each coefficient to assess the significance of variables in each model; the log-likelihood ratio was used to assess the overall significance of the final models, and the Hosmer-Lemeshow statistic was used to evaluate the fit of each of the final models. Data were analyzed by SAS version 8.0 software (SAS Institute, Inc.) (10).

Sentinel Population Characteristics
The incidence of TB cases in the sentinel surveillance sites varied within and among sites over time (Table 1). From 1996 to 2000, the overall incidence of TB in the United States declined from 8.0 to 5.8 per 100,000 inhabitants, and similar downward trends were observed in each of the TB genotyping network sites. The California, New Jersey, Arkansas, and Texas sites had a higher incidence of TB than the overall national rates. The incidence rates in California and Texas (sites that included only six and four counties from each state) were similar to the overall incidence rates for each state (data not shown).
In the surveillance area, 15,035 patients with verified TB represented 16% of the TB patients in the United States during the 5-year study period (Table 2). Overall, 11,923 TB patients were culture-positive (721 from Arkansas, 2,842 from California, 1,192 from Maryland, 1,022 from Massachusetts, 1,481 from Michigan, 2,599 from New Jersey, and 2,066 from Texas). Of TB patients in the surveillance areas, 79.3% (11,923) were culture positive, and RFLP results were available for 91.2% (10,883). However, spoligotyping results were not available for 131 of the isolates that had six or fewer copies of IS6110 (5%; n=2,638); thus, these patients were excluded from our analysis. Of 1,171 isolates not genotyped by RFLP or spoligotyping, 12 (1%) were from Michigan, 35 (3%) from Maryland, 40 (3%) from Massachusetts, 110 (9%)
Characteristics of the TB patient population from the genotyping network sentinel sites were comparable with those from the entire United States, with some exceptions (Table 2). Sentinel surveillance populations had higher proportions of women (42% for the genotyping network vs. 37% for the United States overall) and patients in the 15- to 44-year age category, and were more often homeless or lived in correctional or long-term care facilities. Higher proportions of genotyping network patients used intravenous drugs, but fewer patients used noninjecting drugs or alcohol excessively.
Of the study population, about 4% reported previous episodes of TB (652 of 15,035; Table 2). Of persons with a recent previous history of TB, 28 had TB after completing >1 year of therapy within the study period; genotyping data on isolates from both episodes were available for 22 of these persons. A higher proportion of persons from the TB genotyping network study population lived within city limits (97% vs. 87%). However, when compared with national averages, genotyping network populations were generally from smaller towns and cities: 1,446 (69%) of 2,099 genotyping network patients were from cities and towns with <250,000 inhabitants, compared with 10,093 (62%) of 16,377 TB patients nationwide (Mantel-Haenszel chi-square=41.8; p<0.0001).
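The city-size comparison can be checked with a short calculation. The sketch below assumes the test was run on a single 2 x 2 table (network vs. national patients, cities of <250,000 vs. larger), which is our reading of the text rather than a documented detail; the function name is illustrative. For one 2 x 2 table, the Mantel-Haenszel chi-square is (n - 1)(ad - bc)^2 divided by the product of the four marginal totals, with 1 degree of freedom.

```python
import math


def mantel_haenszel_chi2(a, b, c, d):
    """Mantel-Haenszel chi-square and p-value for one 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = (n - 1) * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts taken from the text: patients in cities and towns of <250,000 inhabitants.
network_small, network_total = 1446, 2099
national_small, national_total = 10093, 16377

chi2, p_value = mantel_haenszel_chi2(
    network_small,
    network_total - network_small,
    national_small,
    national_total - national_small,
)

if __name__ == "__main__":
    print(round(chi2, 1), p_value < 0.0001)  # prints: 41.8 True
```

Under that assumption, the counts reproduce the reported chi-square of 41.8.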
The proportion of foreign-born patients was higher in genotyping network populations than the overall national average (50% for genotyping network vs. 41% for the United States). Numbers of foreign-born TB patients increased over time at about the same rate for both genotyping network populations and national TB patients. From 1996 to 2000, national proportions of foreign-born TB patients increased from 37% (7,725/21,045) to 47% (7,593/16,281); in the genotyping network populations, the proportions of foreign-born TB patients increased from 44% (1,153/2,642) to 58% (1,222/2,092). Characteristics of the genotyping network population between sites were similar, as were culture-positive genotyping network populations compared with the overall genotyping network case population.

Analysis of Genotyping Data
The distribution and diversity of RFLP and spoligotyping pattern results from the genotyping network have been discussed in detail (11). In contrast to that analysis, we used both RFLP and spoligotyping results to define genetic clusters. Overall, 6,609 distinct patterns were identified, including 1,029 that contained ≥2 isolates per cluster. When analyzed by site, 1,018 clusters were identified: 71 clusters were from Arkansas (611 cases genotyped, 2-16 cases per cluster), 233 from California (

Longitudinal Analysis
Most clusters occurred in only a single site (66%; 680/1,029). However, 260 (25%) were found in two sites, 55 (5%) in three sites, 19 (2%) in four, 8 (1%) in five, and 7 (1%) in six sites. As expected, clusters that spanned multiple sites were larger. Clusters found at a single site averaged four persons per cluster (mean=3.65; standard error [SE] ± 0.22; n=680), in contrast to 61 persons per cluster for the genotypes found at six sites (mean=61.14; SE ± 23.6; n=7; Kruskal-Wallis test, p<0.0001). Most (62%) of the 34 clusters that occurred in at least four sites occurred in all 5 years of the study; 26% in 4 years; and 6% each in 3 and 2 years of the study.
Changes in proportions of patients with isolates that clustered were observed over time. In the first 2 years of the study, the percentage of the cumulative total number of cases that clustered increased from 28% to 45%; smaller increases occurred thereafter (Figure 1). Overall, the proportion of clustered cases was 48% (5,171/10,752). The percentages of clustered cases by sites were 28% (276/982) for Massachusetts; 34% (393/1,157) for Maryland; 41% (873/2,112) for New Jersey; 42% (1,046/2,511) for California; 44% (266/611) for Arkansas; 49% (720/1,469) for Michigan; and 57% (1,093/1,910) for Texas. Maximum cluster size and absolute numbers of cases with isolates that clustered continued to increase through the end of the study.
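The site-level and network-level proportions can be recomputed from the counts given above. The sketch below only re-derives the percentages and totals. The explanation offered for the gap between the summed site counts and the network total is our inference, not a statement from the study: site-level figures may count only matches within a site, whereas the network figure also counts matches across sites.

```python
# (clustered cases, genotyped cases) per site, copied from the text.
SITES = {
    "Massachusetts": (276, 982),
    "Maryland": (393, 1157),
    "New Jersey": (873, 2112),
    "California": (1046, 2511),
    "Arkansas": (266, 611),
    "Michigan": (720, 1469),
    "Texas": (1093, 1910),
}
NETWORK_CLUSTERED = 5171  # clustered cases when the whole network is analyzed

# Percentage of genotyped cases that clustered within each site.
site_pct = {
    site: round(100 * clustered / total)
    for site, (clustered, total) in SITES.items()
}

sum_clustered = sum(clustered for clustered, _ in SITES.values())
sum_genotyped = sum(total for _, total in SITES.values())
network_pct = round(100 * NETWORK_CLUSTERED / sum_genotyped)

# Cases counted as clustered at network level but not in any site-level count.
extra_clustered = NETWORK_CLUSTERED - sum_clustered

if __name__ == "__main__":
    print(site_pct)
    print(sum_genotyped, network_pct, extra_clustered)  # prints: 10752 48 504
```

The site denominators add up to the 10,752 genotyped cases, while the site-level clustered counts add up to 4,667, which is 504 fewer than the network-wide 5,171.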
Overall, the percentage of cases with isolates that clustered declined along with the average incidence of TB over the 5-year period (Figure 2). A significant positive association was observed between the percentage of cases with clustered genotypes and TB incidence over time (Spearman rho=0.90; p=0.037).
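To make the Spearman statistic concrete, the sketch below computes rho for five yearly pairs and a two-sided p-value from the usual t approximation. The yearly values are hypothetical placeholders, since the actual values appear only in Figure 2; they were chosen so that the two rankings differ by one swap of adjacent ranks, which gives rho = 0.90 for five points. The closed-form p-value is valid only for 3 degrees of freedom (n = 5), and whether the study used this approximation is an assumption.

```python
import math


def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for untied data: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def two_sided_p_df3(t):
    """Two-sided p-value for Student's t with exactly 3 degrees of freedom."""
    u = abs(t) / math.sqrt(3)
    cdf = 0.5 + (math.atan(u) + u / (1 + u * u)) / math.pi
    return 2 * (1 - cdf)


# Hypothetical yearly values (NOT the study data): five years, one rank swap.
incidence = [5.0, 4.0, 3.0, 2.0, 1.0]
pct_clustered = [50.0, 40.0, 30.0, 10.0, 20.0]

rho = spearman_rho(incidence, pct_clustered)
n = len(incidence)
t_stat = rho * math.sqrt((n - 2) / (1 - rho * rho))
p_value = two_sided_p_df3(t_stat)

if __name__ == "__main__":
    print(round(rho, 2), round(p_value, 3))
```

With five observations, rho = 0.90 under the t approximation gives a two-sided p of about 0.037, which is consistent with the values reported above.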

Risk Factor Analyses of Genetic Clusters
Compared with persons whose isolates had unique genotypes, persons with isolates that clustered were more likely to be non-Hispanic, black men born in the United States. They were more likely to have pulmonary disease and abnormal chest radiographs with cavities; in addition, they more often had positive sputum smears; were HIV-positive, homeless, or residents of a correctional facility; and used drugs or alcohol excessively (Table 3). Patients with unclustered isolates were 5 years older on average than those with isolates that clustered (49.4 years vs. 44.8 years, respectively; Table 3). Multiple logistic regression efforts resulted in models that were not robust (data not shown).

TUBERCULOSIS GENOTYPING NETWORK
Except for 4 genotypes, all 34 clusters with ≥20 isolates per cluster had significant demographic, clinical, and behavioral risk factors (Table 4). Race, ethnicity, and place of birth were frequently significant predictors for a given genotype. Other predictors included gender, age, site of disease, resistance to first-line drugs, and alcohol or drug abuse (Table 4). Twelve (40%) of 30 of these larger clusters were observed in four or more sites over a 5-year period. Lower percentages of foreign-born patients than U.S.-born patients clustered, regardless of the number of IS6110 copies (Figure 3). More than 50% (1,025/1,825) of the foreign-born patients whose isolates clustered had been in the United States for ≥5 years. Clustering of isolates from foreign-born patients ranged from 15% (49/316) in Michigan to 38% (309/816) in Texas.

Discussion
This population-based study is the largest that has been conducted in the United States to assess risk factors related to specific M. tuberculosis genotypes. Generally, clustered isolates have been considered recently acquired infections (12). However, this assumption may not always be correct. Clustering does not prove that transmission occurred, and its demonstration depends on adequate sampling of the population, incidence of TB, and characteristics of the study population (e.g., age structure, population mobility, duration of residence, and immune status) (1,13). Only 25%-42% of patients in genetic clusters were shown to have epidemiologic connections with another member of the cluster (14-16). Conventional epidemiologic investigation of these TB patients (including interviews) was conducted, but inclusion in this analysis was outside the scope of this article. Thus, results that indicate clustered genotypes are representative of recent transmission should be interpreted with caution.
Given this caveat, our results nevertheless demonstrate several consistent patterns. Differences in demographic and other risk factors for persons with isolates that clustered corroborated those from smaller studies conducted in the United States and larger surveys in Europe. Extensive surveys (17) also demonstrated that persons with isolates that clustered genetically were younger than those with unique genotypes. Other risk factors for clustering included being male, born in the United States, non-Hispanic black, or homeless; using drugs and alcohol excessively; and having pulmonary disease and cavitations on chest radiograph, a sputum smear with acid-fast bacilli, and HIV infection. These risk factors have been observed for TB patients in different communities (12,18,19). The heterogeneity and diversity of the study population may account for our failure to produce a multivariate logistic model to predict clustering.
A third of the foreign-born cases were recent immigrants to the United States, and overall, the percentage of clustered isolates from foreign-born persons was lower than the percentage from nonimmigrants (Figure 3), indicating that at least a portion of these cases resulted from reactivation of latent disease or recent infection in the country of origin. In addition, for foreign-born persons, clustering of M. tuberculosis increased with the duration of residence in the United States. These results suggest that recently imported strains of M. tuberculosis from foreign-born persons may not commonly spread to U.S. residents or that transmission may be occurring after a lag time before the imported strains manifest as disease in contacts. Similar observations have been published in studies from San Francisco, New York, Switzerland, and Norway (20-24). These data may also reflect gaps in our knowledge of M. tuberculosis genotypes in circulation; a comparison of the U.S. TB genotyping network results with other databases worldwide may be warranted.
Figure 1. Numbers of tuberculosis cases, cumulative proportion of cases with isolates in genetic clusters, and maximum genetic cluster size from seven sentinel surveillance sites by quarter that verified case was counted, 1996-2000. Numbers of cases with isolates that had unique genotypes and those with isolates that were in genetic clusters are shown separately.
Logistic regression analysis of the most commonly occurring strains demonstrated that different risk factors were associated with specific genotypes. Several genotypes were associated with ethnic origin (e.g., Asian or Pacific Islander and Hispanic patients with six and three genotypes, respectively; Table 4). A recent study in Norway showed that several clusters consisted of patients of the same ethnic origin (23). An association has also been observed between the patient's ethnic origin and IS6110 copy number (25). These results, in conjunction with additional epidemiologic data, may be useful in tracking the geographic origin and spread of M. tuberculosis strains of public health importance (26).
Figure 2. Average annual incidence of tuberculosis for seven sentinel surveillance sites and percentage of cases with isolates in genetic clusters, 1996 to 2000. Spearman correlation coefficient and probability of correlation between incidence and percentage of cases clustered are given.
A small proportion of clustered isolates were from persons from more than four sites spanning 5 years of study (Table 4).

Although an in-depth analysis of epidemiologic links was not possible in this study, we found no evidence of recent transmission between patients with identical genotypes from the different states (data not shown); this lack of transmission was also noted in a smaller study in the United States (27). Since TB transmission is generally considered a local event, these ubiquitous genotypes may be widespread because of social factors (e.g., homelessness or alcohol or drug abuse; Table 4).
In addition, these genotypes may represent older, endemic domestic strains that have been in the United States for centuries and have dispersed more widely throughout the United States than the more recently imported strains. Further molecular characterization of these genotypes may show additional differences not detected by RFLP. Nonetheless, the effect of M. tuberculosis virulence or host factors on the distribution of these genotypes cannot be ascertained.
The proportion of strains that were classified into clusters of identical genotypes (48%) was comparable with proportions in the Netherlands and Denmark (50%) (2,28), but the proportion was considerably higher than in two other countries (17% in Switzerland [29]; 20% in Norway [23]). The cumulative percentage of clustered strains reached a plateau by the end of the study's second year (Figure 1), a finding consistent with other molecular epidemiologic TB studies (2). Increases in maximum cluster size were anticipated because, as sample sizes increase with time, the number of isolates in each cluster would be expected to increase. In addition, higher proportions of clustered cases were observed for low-band number patterns (Figure 3), which had the maximum cluster size and may indicate that the low-copy IS6110 patterns are not specific, even with the addition of spoligotyping. The sensitivity and specificity of IS6110 RFLP in molecular epidemiologic studies have not been quantified and represent a potential limitation of this study. Although the stability of IS6110 is relatively high, the half-life of IS6110 RFLP is estimated to be 3-10 years (29-31) based on typing of serial isolates from individual patients. A study of isolates from patients in confirmed chains of transmission showed little change in IS6110 patterns (32). Calculation of these rates may be influenced by the duration between time of disease onset and time of sampling and may be proportional to the effectiveness of the TB control program (30). Because genotyping results were not available for 10% of TB cases in this study, estimates of the degree of clustering and the size of clusters are conservative. Some unique isolates might have clustered if some of the missing isolates had been available or if other cases with the same strain were present outside the study area (33).
Table 4 footnotes:
Only genetic clusters that had ≥20 isolates were included in the analysis; some sample sizes are <20 because of missing data among independent variables (Wald 95% confidence intervals given in parentheses). Only genetic clusters with significant predictors are listed. Age was modeled as a continuous variable.
c The National Tuberculosis Genotyping Surveillance Network (NTGSN) designation for the IS6110 RFLP pattern is represented; spoligotype octal code designations are presented only for those genetic clusters from isolates that had ≤6 copies of IS6110. RFLP patterns and spoligotypes are detailed elsewhere (11).
d Isolates observed in ≥4 sites over 5 years.
e First-line drug resistance is resistance to at least one of the following: isoniazid, rifampin, ethambutol, or streptomycin. Second-line drug resistance is resistance to one or more of the following: ethionamide, kanamycin, cycloserine, capreomycin, para-amino salicylic acid, amikacin, rifabutin, ciprofloxacin, ofloxacin, or other drugs.
The sentinel surveillance sites included in this study were defined by artificial boundaries (i.e., state lines) and were not entirely representative of TB patients in the United States. More than 90% of the isolates from patients from the surveillance areas were genotyped, and these isolates were representative of those from culture-positive patients in the sentinel surveillance areas. However, only 16% of all TB case-patients reported in the United States during the 5-year study period were included in these sentinel surveillance sites. In addition, the sentinel surveillance population had higher proportions of foreign-born persons than the national average. Because of the propensity of foreign-born persons to have isolates with unique genotypes, the actual rate of clustering may have been underestimated. Nonetheless, sentinel surveillance of TB cases has provided a useful method for documenting genotypes in circulation in the United States and for identifying risk factor correlates of common genotypes.
Annual declines in TB incidence were paralleled by similar declines in the proportion of cases with genotypes in clusters (Figure 2), a finding consistent with the hypothesis that decreased clustering is expected with declining incidence (20). Since effort was similar each year, this association is not likely to be an artifact related to sample size (i.e., as sample size or number of cases becomes smaller, the probability of detecting clusters decreases). These findings underscore the importance of long-term longitudinal molecular studies and the potential usefulness of these methods in evaluating program effectiveness and improving program management.