Sampling Bias in the Molecular Epidemiology of Tuberculosis

Among the goals of the molecular epidemiology of infectious disease are to quantify the extent of ongoing transmission of infectious agents and to identify host- and strain-specific risk factors for disease spread. I demonstrate the potential bias in estimates of recent transmission and the impact of risk factors for clustering by using computer simulations to reconstruct populations of tuberculosis patients and sample from them. The bias consistently results in underestimating recent transmission and the impact of risk factors for recent transmission.

Molecular epidemiology makes use of the genetic diversity within strains of infectious organisms to track the transmission of these organisms in human populations. It is used extensively to differentiate reactivation tuberculosis (TB), which is due to a remote infection, from disease caused by recently transmitted organisms. This approach is based on the concept that epidemiologically related organisms share similar or identical genetic fingerprints, while unrelated organisms differ at some genetic loci. Isolates of Mycobacterium tuberculosis that occur in clusters sharing similar fingerprints are thought to be caused by recently transmitted infection; those with unique fingerprints are thought to result from distantly acquired infection. Since the extent of recent transmission of an infectious disease often directly reflects the success of control measures (1,2), accurately assessing this quantity is of considerable public health importance.
In addition to distinguishing primary TB from reactivation disease, these molecular techniques have been used to identify risk factors for recent transmission in population-based epidemiologic studies (3). The goals of these investigations have been both to quantify the extent of ongoing transmission of M. tuberculosis and to identify host- and strain-specific risk factors for disease spread. Typically, these researchers have studied a specific population at risk for the disease by enrolling a cohort of persons with incident clinical TB, assessing these patients' individual risk factors, and fingerprinting the TB isolates obtained from them (4-11). TB cases are then categorized as either clustered or unique; a cluster is usually defined as two or more patients whose isolates share an identical or near-identical DNA fingerprint, while unique cases are those with unmatched patterns (12). Clustered cases are assumed to share fingerprints as a result of recent spread of the organism among those in the cluster, while cases with unique patterns are assumed to be TB resulting from reactivated latent infection. These studies usually report the proportion of cases that are clustered within the cohort and use this result to infer the relative proportions of clustered and unclustered cases in the community from which the cohort was drawn.
Two different methods have been used to estimate the proportion of clustered cases. The first method, usually referred to as the "n" method, uses the number of all cases that fall into clusters as the estimator of clustered cases. The "n minus one" method assumes that one case per cluster is a case of reactivation TB and thus removes one case per cluster from the counts of "clustered" cases. The "n minus one" approach gives a number of clustered cases that is always less than that calculated by the "n" approach. Covariates associated with clustered fingerprints are taken to be host-specific risk factors for recent transmission of M. tuberculosis. The identification of these risk factors may provide specific targets for interventions designed to interrupt disease transmission.
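The two counting rules can be illustrated with a minimal sketch. The function name and the fingerprint labels below are hypothetical and are not taken from any of the cited studies; the code only assumes that each isolate carries one fingerprint label and that a cluster is two or more isolates with the same label.

```python
from collections import Counter

def clustered_proportions(fingerprints):
    """Return (n method, n-minus-one method) proportions of clustered cases."""
    sizes = list(Counter(fingerprints).values())   # isolates per fingerprint
    total = sum(sizes)
    clusters = [s for s in sizes if s >= 2]        # fingerprints shared by >= 2 cases
    n_method = sum(clusters) / total
    # "n minus one": one presumed reactivation (source) case removed per cluster
    n_minus_one = (sum(clusters) - len(clusters)) / total
    return n_method, n_minus_one

# Hypothetical data: one cluster of 3, one cluster of 2, five unique isolates
example = ["A", "A", "A", "B", "B", "C", "D", "E", "F", "G"]
print(clustered_proportions(example))  # (0.5, 0.3)
```

Here 5 of 10 cases are clustered by the "n" method, and 3 of 10 by the "n minus one" method after one case is removed from each of the two clusters.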
These population-based molecular studies are often based on random or convenience samples drawn from available clinical isolates of M. tuberculosis. Implicit in the "population-based" approach to molecular epidemiology is the assumption that the results of studies based on these samples are reliable estimates of the parameters of interest in the population from which the sample was drawn. The criteria by which an estimate is judged to be reliable require that it be precise and unbiased, or, in other words, free from both major random and systematic error (13). Small samples usually render parameter estimates imprecise, or more vulnerable to the effects of chance, but do not specifically cause them to be systematically biased. When the parameter in question is a measure of clustering, however, the correct classification of each clustered case depends on other cases that share identical fingerprints being included in the sample. If these cases are not included because the sample is too small, clustered cases will be misclassified as unique and the resulting proportion of clustered cases will be underestimated. This, in turn, results in underestimation of the extent of recent transmission and overestimation of the extent of reactivation TB, as well as biased estimation of the effects of risk factors for transmission.
The magnitude of the bias incurred by sampling strategies depends both on the sampling fraction and the frequency distribution of sizes of clusters in the population. A recent simulation study of the influence of sampling on estimates of recent TB transmission demonstrated that an increase in sampling fraction yields an increase in the proportion of isolates identified as clustered (14). These simulations further showed that underestimation of clustering is more marked in populations of isolates that include small clusters than those within which large clusters predominate. For this study, I extended this approach by using analytic methods in addition to simulations to estimate the magnitude of the bias introduced by commonly used sampling strategies in assessing the relative proportion of clustered and unclustered cases and in estimating the relative effect of potential risk factors for recent transmission.

Methods
The purpose of this study is to investigate biases inherent in estimating measures of clustering and risk factors for clustering when common sampling strategies are used to collect the empirical data. Since the true distributions of cluster sizes cannot be directly observed if sampling is not complete, I used a Monte Carlo simulation model to generate a variety of hypothetical cluster distributions based on simple assumptions about TB transmission. These distributions represent a wide range of potential data structures reflecting heterogeneous transmission parameters, contact networks, and sociodemographic variables. Accordingly, my aim here is not to model TB transmission dynamics with precision but to generate a collection of heterogeneous cluster distributions that could be used to demonstrate the effects of sampling, given a variety of potential transmission settings.
Generally, the microsimulation model enumerates a population of discrete individuals, each of whom is characterized by a vector of variables that affect risk for TB infection, for clinical disease, and for transmitting infection once infected. Persons are assigned to a series of social and physical spaces such as households, neighborhoods, and multineighborhood communities. The model also specifies the stochastic processes by which latent disease reactivates, infection progresses to primary TB, immunity is conferred by vaccination or by previous infection, and duration of disease is determined. Persons to whom disease is transmitted during the simulation acquire a variable reflecting the strain number of the source of their infection; thus, chains of disease transmission can be identified as "clusters" of cases sharing a specific strain number. The model is run over a time period during which these stochastic processes may occur. Output of the model includes standard measures of the incidence of infection and disease, the prevalence of infectious TB over time, and a count of cluster sizes. Five different cluster distributions were generated on the basis of running the model for 4 years with input variables specific to the different geographic and social settings in which TB is transmitted. The assumptions and baseline input variables for the model have been described (15).

Estimation of Bias in Proportions of Unique and Clustered Cases
The proportion of unique cases calculated after sampling and the variance of that proportion were estimated as follows. Using the "n" method to estimate the proportion of clustered cases, we assume that the true set of isolates is composed of $n_k$ clusters of size $k$ for $k = 1, 2, \ldots, k_{\max}$. Further, we assume that each subject in the true set of isolates is sampled independently with a common sampling probability $p$.
Let $I_{ijk}$ be the indicator of whether the $i$th subject, $i = 1, \ldots, k$, from the $j$th cluster, $j = 1, \ldots, n_k$, of size $k$ has been sampled. Under our assumptions, the $I_{ijk}$ are i.i.d. Bernoulli($p$) random variables. The total number of subjects sampled is

$$N = \sum_{k=1}^{k_{\max}} \sum_{j=1}^{n_k} \sum_{i=1}^{k} I_{ijk}.$$

Therefore, the expected value of the number of isolates sampled is

$$E(N) = p \sum_{k=1}^{k_{\max}} k\, n_k.$$

The variance of $N$ is

$$\mathrm{var}(N) = p(1-p) \sum_{k=1}^{k_{\max}} k\, n_k.$$

Now let $U_{jk} = 1$ if the number of isolates sampled from the $j$th cluster of size $k$ is precisely 1 and $U_{jk} = 0$ otherwise. Then the total number of unique isolates is

$$U = \sum_{k=1}^{k_{\max}} \sum_{j=1}^{n_k} U_{jk}.$$

Now $U_{jk}$ is a Bernoulli random variable with success probability $kp(1-p)^{k-1}$, equal to the probability of choosing exactly one member from the $j$th cluster of size $k$. Hence,

$$E(U) = \sum_{k=1}^{k_{\max}} n_k\, kp(1-p)^{k-1}$$

and, because the $U_{jk}$ are independent across clusters,

$$\mathrm{var}(U) = \sum_{k=1}^{k_{\max}} n_k\, kp(1-p)^{k-1}\left[1 - kp(1-p)^{k-1}\right].$$
The expectation and large sample variance of the random variable (U/N) are derived in Appendix 1. Using these formulae for E(U/N) and var(U/N), estimates of the biased results for a range of sampling fractions were calculated for each of the five transmission scenarios described above. The results of this analysis were verified against a computer simulation that counted cluster sizes after drawing a proportion p of the true population of isolates at random without replacement. For each true data set, the simulated data collection process was repeated 1,000 times. The mean value of the estimates obtained is reported, in addition to simulated confidence intervals expressed as values that represented the 0.05th and 0.95th largest estimates. The variance of the empirical distribution for the set of 1,000 simulations was nearly identical to that obtained by the large sample variance formula for a ratio expressed above.
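The comparison between the analytic approximation and the simulation can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the cluster-size distribution `dist` is hypothetical (it is not one of the five scenarios), the function names are invented for this example, and E(U/N) is approximated by E(U)/E(N) with E(U) = Σ n_k kp(1-p)^(k-1) and E(N) = p Σ k n_k.

```python
import random
from collections import Counter

def expected_unique_proportion(n_k, p):
    """Large-sample approximation E(U)/E(N); n_k maps cluster size k -> count."""
    e_u = sum(n * k * p * (1 - p) ** (k - 1) for k, n in n_k.items())
    e_n = p * sum(n * k for k, n in n_k.items())
    return e_u / e_n

def simulate_unique_proportion(n_k, p, reps=1000, seed=0):
    """Mean proportion of unique isolates over repeated draws without replacement."""
    rng = random.Random(seed)
    labels, cluster_id = [], 0
    for k, n in n_k.items():
        for _ in range(n):
            labels.extend([cluster_id] * k)   # one label per isolate in the cluster
            cluster_id += 1
    m = round(p * len(labels))                # fixed number of isolates sampled
    estimates = []
    for _ in range(reps):
        counts = Counter(rng.sample(labels, m))
        unique = sum(1 for c in counts.values() if c == 1)
        estimates.append(unique / m)
    return sum(estimates) / reps

# Hypothetical distribution: 400 unique isolates plus clusters of size 2, 3 and 10
dist = {1: 400, 2: 100, 3: 50, 10: 5}
print("true proportion unique:", expected_unique_proportion(dist, 1.0))
print("analytic, p = 0.5:", expected_unique_proportion(dist, 0.5))
print("simulated, p = 0.5:", simulate_unique_proportion(dist, 0.5, reps=200))
```

The simulation draws a fixed number of isolates without replacement, whereas the analytic formula assumes independent sampling of each isolate with probability p; for populations of this size the two should agree closely.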
These simulations were repeated using the "n minus one" approach, in which one case per cluster is removed from the count of clustered cases and added to the count of cases attributed to reactivation (Appendix 2).

Estimation of Bias in the Relative Risks and Odds Ratios of Risk Factors for Recent Transmission
The magnitude of bias in the odds ratios of potential risk factors introduced by the misclassification of clustering due to sampling error was also assessed. Hypothetical risk factors for clustering were postulated and assigned "true" odds ratios of 2, 5, and 10. The prevalence of these risk factors in the absence of clustering was set at 0.1. Each exposure was thus randomly assigned to 10% of the unclustered cases and to the proportion of the clustered cases needed to obtain the specified odds ratio in each of the modeled data sets. The odds ratios were recalculated after sampling by moving the clustered cases that were sampled as unique from the category of recently transmitted cases to the category of reactivated cases and reassessing the respective exposure status for these outcomes.
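The attenuation of the odds ratio can be approximated with expected counts. The sketch below is an assumption-laden illustration rather than the procedure used to produce Table 2: it replaces random counts by their expectations, assumes exposure is independent of cluster size, and uses a hypothetical cluster-size distribution and invented function names.

```python
def exposure_prevalence_for_or(odds_ratio, baseline=0.1):
    """Exposure prevalence among clustered cases that yields the given odds ratio."""
    odds = odds_ratio * baseline / (1 - baseline)
    return odds / (1 + odds)

def expected_observed_or(n_k, p, true_or, baseline=0.1):
    """Approximate odds ratio after sampling each isolate with probability p.

    n_k maps cluster size k -> number of clusters of that size (k = 1 is unique).
    """
    q1 = exposure_prevalence_for_or(true_or, baseline)
    true_unique = p * n_k.get(1, 0)                 # sampled, truly unique cases
    misclassified = sum(n * k * p * (1 - p) ** (k - 1)
                        for k, n in n_k.items() if k >= 2)  # clustered, sampled alone
    # Apparent "unique" group mixes truly unique and misclassified clustered cases
    mixed = (true_unique * baseline + misclassified * q1) / (true_unique + misclassified)
    # Cases still observed as clustered keep exposure prevalence q1
    return (q1 / (1 - q1)) / (mixed / (1 - mixed))

dist = {1: 400, 2: 100, 3: 50, 10: 5}               # hypothetical distribution
for p in (0.1, 0.5, 1.0):
    print(p, expected_observed_or(dist, p, true_or=10))
```

With complete sampling (p = 1) no clustered case is misclassified and the true odds ratio is recovered; at lower sampling fractions the apparent unique group is contaminated with exposed clustered cases and the estimate moves toward 1.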

Bias in Estimates of Proportion of Unique and Clustered Cases
Output from the transmission model (Table 1) includes estimates of the incidence of TB infection and clinical TB disease, as well as a summary of the frequency distribution of cluster size for each scenario. The proportion of unique isolates that would be observed after sampling a given fraction of the isolates in the complete data sets for each of the five scenarios was estimated analytically and verified by computer simulation. These methods produced nearly identical results, demonstrating that there is often substantial bias in the estimated proportions of unique and clustered cases of TB when sampling is based on sampling fractions consistent with those used in common epidemiologic practice. Table 2 summarizes the error in the measurement of the proportion of unique and clustered cases of TB introduced by sampling various fractions of the data from the range of transmission scenarios. These results are given for both the "n" and "n minus one" methods of counting clustered cases; they show that the error in these estimates depends on both the "true" transmission pattern and the fraction of the total data sampled. Transmission scenarios in which there is a higher "true" proportion of unique cases and those in which transmission is concentrated in large clusters tend to demonstrate less error than those in which there are fewer unique cases and more small clusters. In all cases in which the estimate is biased, the estimated proportion of unique cases is an overestimate of the true value, indicating that the error in these estimates tends to inflate the proportion of TB cases due to reactivation and minimize the proportion due to recent transmission. In many simulations, all of the 1,000 estimates obtained were less than the true proportion.

Bias in Odds Ratios for Risk Factors for Clustering
The bias in the proportions of clustered and unclustered cases results from misclassification of cluster status due to inadequate sampling; this misclassification also biases the results of analyses of risk factors for recent transmission in the direction of the null hypothesis of no effect. Table 2 also presents estimates of the odds ratios for the effect of a range of hypothetical risk factors for recent transmission. These results show that the odds ratios of a risk factor for clustering are markedly underestimated in the transmission scenarios in which there are lower proportions of unique cases or in which smaller clusters predominate. This bias is especially marked when odds ratios are high; in the worst-case scenario described in Table 2, an odds ratio of 10 could be estimated as 1.58 when only 10% of the isolates are sampled.

Discussion
The recent development of molecular methods to accurately type infectious organisms has led to a marked proliferation in studies of the molecular epidemiology of infectious diseases, especially of TB. The goals of many of these studies have been to address the longstanding problem of assessing the relative proportions of incident TB cases due to recent transmission and to chronic or reactivated disease and to identify risk factors for recent transmission. A systematic bias that consistently underestimates the proportion of cases due to recent transmission could present a serious impediment to the constructive use of molecular typing techniques for studying the epidemiology of infectious disease.
The results of this study show the extent to which bias can be introduced by sampling strategies commonly used in the molecular epidemiology of TB. Depending on the underlying distribution of cluster sizes, the error involved in estimating the proportion of unique TB isolates in a sample may be sizable, even when up to 70% of the complete data is sampled. The odds ratios for risk factors for clustering are also consistently and markedly underestimated with this approach. The findings of this study support the conclusions of previous investigators (14) who have shown that the extent of error in these estimates is a function of both sampling fraction and underlying cluster distribution in the complete data sets. These results imply that reasonable predictions of the extent of error can be made, given knowledge of both the true distribution of cluster sizes in the population of persons with TB and the size of the population of TB patients from which the sample was drawn. Although the true distribution of cluster sizes cannot be observed in the absence of complete sampling, epidemic models such as this one may elucidate factors that contribute to these distributions and help investigators arrive at prior expectations of cluster distributions in the specific transmission scenarios under study.

Table 2. Monte Carlo means and 95% ranges for the proportion of unique isolates and for odds ratios after sampling a fraction of the complete data set.

I considered how much impact this kind of sampling bias might have had on the studies of the molecular epidemiology of TB published to date. Many researchers report on a convenience sample of cases drawn from one or more clinical sites, without providing an estimate of the number of incident cases in the area in question during the period in which the cases were collected (16-19).
In areas with high TB prevalence, the number of cases in these series is often <1% of the number of cases expected in that region on the basis of national reporting or World Health Organization predictions. These results suggest that the bias expected in these studies is so extreme that the findings are useful only as lower bounds for the proportion of recently transmitted cases and for risk factors for recent transmission. Nonetheless, lower bounds may be informative in situations in which undetected transmission is incorrectly attributed to reactivation disease alone or when a new risk factor for transmission is identified.

Country/Specific Statistics
In industrialized countries with lower rates of incident TB, researchers have tried to enroll a complete cohort of patients by making use of public health reporting systems to identify and fingerprint all new cases of clinical TB in a defined geographic region during a specified time period (4,5,20-22). Although this approach leads to much more complete and systematic sampling, it may not always ensure that the resulting estimates are free from bias. For these series of cases to be complete samples, one would have to assume that none of the cases in the sample had transmission links to cases that did not appear in the study population or were reported before the onset of the study. Furthermore, the most rigorously documented TB fingerprinting studies have reported 15%-40% loss of data as a result of difficulties in culturing, fingerprinting, and interpreting fingerprint patterns (4,5,20-22). Even if the patients excluded from these studies resemble those retained in every other respect, their exclusion will result in a biased outcome of the study.
The "complete" data sets used to estimate bias in this study were generated through stochastic epidemic modeling that outputs cluster distributions in addition to estimates of the incidence of TB infection and disease. Multiple demographic and disease-specific parameters have been found to affect cluster distributions, and many potential "transmission scenarios" could be generated by varying these parameters. In addition, the length of the study period and the stability of the molecular markers used will impact the observed patterns of clustering (23,24). Given that true cluster distributions cannot be known in the absence of complete sampling, the model cannot be validated by using it to derive known cluster distributions. Since the purpose of this study is to explore the bias in measures commonly used in empirical studies of molecular epidemiology, sets of parameters were chosen from a variety of specific areas in which the burden of TB disease has been described or projected based on the information currently available. Although the true transmission patterns in any particular population may be inadequately captured by the epidemic model used, these results do provide some perspective on the potential misinterpretation of molecular data on TB. The simulations may also differ from data sets obtained in the field in that sampling was random and the very real problem of selection bias in the collection of isolates was not addressed. Finally, in the assessment of the bias in the estimates of the effect of risk factors for clustering, I assigned risk status randomly within groups of clustered and unique cases. If cluster size were correlated with a risk factor for clustering, so that, for example, incarceration was more common among cases in large clusters than small ones, the bias in the odds ratio of incarceration would be less than the estimates reported here.
These results demonstrate that estimates of clustering based on molecular fingerprinting of a population of isolates of infectious agents may be severely biased. When these methods are used to estimate the extent of primary and reactivation disease in a community, they consistently underestimate recent transmission. In circumstances in which the error is greatest, the bias may undermine the value of an investigation by providing a community with false reassurance that ongoing transmission is being curtailed and therefore that control measures are adequate.
The findings of this study further suggest that molecular methods in epidemiology require the development of both appropriate epidemiologic study design and analytic tools to yield meaningful assessments of disease transmission. In particular, they imply that estimates of recent transmission obtained by molecular methods cannot be compared across studies which have used different sampling fractions and in which the distribution of cluster size can reasonably be expected to vary. One way for molecular epidemiologists to approach this problem is to provide sensitivity analyses estimating the potential error involved, given prior expectations of cluster distributions and an estimate of the fraction of cases sampled in a particular study. The analytic solution presented here can be easily programmed and used to explore the range of potential error under a variety of hypothetical transmission scenarios.
Dr. Murray is an assistant professor of epidemiology at the Harvard Medical School and an infectious disease clinician at the Massachusetts General Hospital. Her research interests include epidemiologic methods, the molecular epidemiology of tuberculosis, and the transmission dynamics of infectious diseases.

Appendix 1
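We wish to derive the expectation and variance of the random variable U/N, denoting the proportion of all sampled isolates that form a unique cluster of size 1 in the sample. In large samples, the mean of U/N is approximately the ratio of the mean of U to the mean of N:

$$E(U/N) \approx \frac{E(U)}{E(N)},$$

and the variance of U/N is approximated by the large sample variance formula for a ratio:

$$\mathrm{var}(U/N) \approx \left[\frac{E(U)}{E(N)}\right]^2 \left[\frac{\mathrm{var}(U)}{E(U)^2} + \frac{\mathrm{var}(N)}{E(N)^2} - \frac{2\,\mathrm{cov}(U,N)}{E(U)\,E(N)}\right].$$

It only remains to evaluate cov(U,N), which is done in the following lemma.
Lemma.

$$\mathrm{cov}(U,N) = \sum_{k=1}^{k_{\max}} n_k\, kp(1-p)^{k-1}(1 - kp).$$

Proof. Let $N_{jk} = \sum_{i=1}^{k} I_{ijk}$ be the number of isolates sampled from the $j$th cluster of size $k$. Because sampling is independent across clusters, $\mathrm{cov}(U,N) = \sum_k \sum_j \mathrm{cov}(U_{jk}, N_{jk})$. Since $U_{jk} = 1$ only when $N_{jk} = 1$, we have $U_{jk} N_{jk} = U_{jk}$, so that $\mathrm{cov}(U_{jk}, N_{jk}) = E(U_{jk}) - E(U_{jk})E(N_{jk})$, where $E(U_{jk}) = kp(1-p)^{k-1}$ and $E(N_{jk}) = kp$. The result then follows.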
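The ratio approximation can be programmed directly. The sketch below is illustrative only: the function name is invented, and it assumes the covariance takes the form cov(U,N) = Σ n_k kp(1-p)^(k-1)(1-kp), which follows from independent sampling because the product of the unique-indicator and the per-cluster sample count equals the indicator itself.

```python
def ratio_mean_and_variance(n_k, p):
    """Large-sample mean and variance of U/N; n_k maps cluster size k -> count."""
    e_n = var_n = e_u = var_u = cov_un = 0.0
    for k, n in n_k.items():
        q = k * p * (1 - p) ** (k - 1)      # P(exactly one isolate of the cluster sampled)
        e_n += n * k * p
        var_n += n * k * p * (1 - p)
        e_u += n * q
        var_u += n * q * (1 - q)
        cov_un += n * q * (1 - k * p)       # E(U_jk) - E(U_jk) E(N_jk), summed over clusters
    mean = e_u / e_n
    variance = mean ** 2 * (var_u / e_u ** 2 + var_n / e_n ** 2
                            - 2 * cov_un / (e_u * e_n))
    return mean, variance

# Hypothetical cluster-size distribution, sampled at p = 0.5
print(ratio_mean_and_variance({1: 400, 2: 100, 3: 50, 10: 5}, 0.5))
```

A simple check on the covariance term is a population made up only of unique isolates: every sampled isolate is then unique, so U/N equals 1 in every sample and the approximate variance should vanish.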

Appendix 2
The bias in the proportion of reactivated cases after sampling when the clustered cases are counted by using the "n minus one" method is described below. The number of cases considered to be due to reactivation is the sum of the unique cases and the source cases. The "true" number of source cases is equal to the number of clusters in the complete data set, $\sum_{k=2}^{k_{\max}} n_k$.
We are interested in finding the number of source cases after sampling. Since the number of source cases in a sample is equal to the "true" number of source cases minus the source cases that are not sampled or are sampled as unique, we need to estimate the expected value of the number of clusters not sampled and the expected value of the number of clusters sampled as unique. Let $E(CL_0)$ and $E(CL_1)$ be the expected values of the numbers of clusters not sampled or sampled as unique, respectively. Then, by using the nomenclature defined in the text and following the logic there described:

$$E(CL_0) = \sum_{k=2}^{k_{\max}} n_k (1-p)^k \quad \text{and} \quad E(CL_1) = \sum_{k=2}^{k_{\max}} n_k\, kp(1-p)^{k-1}.$$

The expected number of source cases after sampling a fraction $p$ of the complete set of isolates is equal to

$$\sum_{k=2}^{k_{\max}} n_k - E(CL_0) - E(CL_1).$$

The overall estimate of the proportion of reactivated cases can then be obtained by summing the expected number of unique cases after sampling with the expected number of source cases and dividing by the expected number of sampled isolates, $p \sum_{k} k\, n_k$.
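The "n minus one" calculation can be sketched in a few lines. This is an illustration under assumptions, not the author's program: the function name and the cluster-size distribution are hypothetical, clusters are taken to be groups of size two or more, and each isolate is assumed to be sampled independently with probability p.

```python
def expected_reactivation_proportion(n_k, p):
    """Expected share of sampled cases attributed to reactivation ("n minus one").

    n_k maps cluster size k -> number of clusters of that size (k = 1 is unique).
    """
    # Expected isolates that appear unique in the sample (all cluster sizes)
    e_unique = sum(n * k * p * (1 - p) ** (k - 1) for k, n in n_k.items())
    # Expected clusters with at least two sampled members, one source case each
    e_sources = sum(n * (1 - (1 - p) ** k - k * p * (1 - p) ** (k - 1))
                    for k, n in n_k.items() if k >= 2)
    e_sampled = p * sum(n * k for k, n in n_k.items())
    return (e_unique + e_sources) / e_sampled

dist = {1: 400, 2: 100, 3: 50, 10: 5}   # hypothetical distribution, 800 isolates
for p in (0.3, 0.7, 1.0):
    print(p, expected_reactivation_proportion(dist, p))
```

With complete sampling the function returns the true value, here (400 unique + 155 source cases) / 800 isolates; smaller sampling fractions inflate the proportion attributed to reactivation.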