Multivariate Markovian modeling of tuberculosis: forecast for the United States.

We have developed a computer-implemented, multivariate Markov chain model to project tuberculosis (TB) incidence in the United States from 1980 to 2010 in disaggregated demographic groups. Uncertainty in model parameters and in the projections is represented by fuzzy numbers. Projections are made under the assumption that current TB control measures will remain unchanged for the projection period. The projections of the model demonstrate an intermediate increase in national TB incidence (similar to that which actually occurred) followed by continuing decline. The rate of decline depends strongly on geographic, racial, and ethnic characteristics. The model predicts that the rate of decline in the number of cases among Hispanics will be slower than among white non-Hispanics and black non-Hispanics a prediction supported by the most recent data.

The epidemiology of TB in the United States has an underlying Markovian nature (20)(21)(22). Except in unknown or difficult-to-predict influxes of immigrants, future infections and cases of TB may be probabilistically estimated by knowing sufficiently well the present situation of susceptible, infected, and ill persons; rates of transition from healthy to infected and infected to ill; patterns of contact; and rates of fertility and death. Indeed, a Markov chain is completely specified by an initial state and transition probabilities. To model the future course of TB in the United States, we used the structure depicted in Figure 1.
The U.S. population was disaggregated into states multivariately defined by sociodemographic, health, and disease descriptors: age (single-year age groups); sex; race (white, black, Asian or Pacific islander, Native American or Alaskan native, or other race); ethnicity (Hispanic or non-Hispanic); TB infection or disease status; and location. One hundred sixty locations were defined: 60 corresponded to the largest standard metropolitan statistical areas (SMSAs) in the United States, 50 to groupings of smaller SMSAs in the 50 states, and 50 to groupings of non-SMSA counties in the 50 states. The impact of the HIV epidemic on the TB epidemic was addressed by dealing with certain location-specific populations in a way that recognizes that a fraction of them are dually affected. A state in the Markov chain was completely defined by the joint specification of all these descriptors.

Multivariate Markovian Modeling of Tuberculosis: Forecast for the United States
In the absence of longitudinal data, the population database for the model was constructed by assembling population profiles for each of the individual years from historical surveys and population projections (5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)23). These profiles consisted of the sizes of population groups defined multivariately by age, race, sex, ethnicity, and geographic location. The popula-tion projections of the Bureau of the Census incorporate expected immigration patterns.
Transition probabilities for the model were based on rates cited in the literature to describe how these transitions occur for individuals. The probability of transition from healthy to infected has been termed the "annual risk for infection." To estimate this probability from tuberculin skin test surveys, we used the catalytic method of Muench (19,24,25). This method relates incidence to prevalence, given several data points of the latter, under the assumption of an exponential model, the exponential curve being called "catalytic" at the time of Muench, who used graphic methods for estimation. The average number of susceptible persons infected by each infectious (i.e., sputum-positive) case in a geographic location in 1 year was termed the "contagious parameter" by Styblo, who noted the correspondence between the prevalence rate of the disease and the annual risk for infection (25,26). Since we assume that new active cases are treated during their year of occurrence, our use of the contagious parameter is consistent with that of Styblo.
An important assumption made in the model about the spread of infection is that each ill person infects susceptible persons entirely within race ethnicity location-specific subgroups, with the number of infections determined by the contagious parameter. The age-sex distribution of newly infected persons in the model was based on national TB contact investigation data (27). Time trends for contagious parameters for the various race ethnicity location-specific subgroups were based on quadratic regression fits of the total cases of TB for those subgroups over time from 1980 through 1993. For illustration, we present in Table 1 baseline values for the contagious parameter for 17-to 25-year-old white men, by state, derived from the U.S. naval recruits data (28)(29)(30)(31)(32)(33)(34)(35).
The probability of transition from infected to ill is a product of host-parasite interactions and is determined by the biologic features of these interactions. A person with a new Mycobacterium tuberculosis infection (i.e., an infection of <1 year) is at much higher risk for clinical disease than someone with a mature infection, who has been at an ongoing risk in later years (36). These risks vary with age. Although not explicitly modeled, reinfection of cured persons is allowed at the same rate as infections with the same  Table 2). The effects of the HIV epidemic and the emergence of multidrug-resistant TB are modeled through time-dependent exogenous inputs for appropriate geographic locations and, within these locations, for specific demographic subgroups, as dictated by trends in prevalence and dual-infection rates (43)(44)(45). These calculations result in time-dependent changes to the probabilities of transition from new or mature infection to clinical TB.

Research
We established 1980 as the baseline year for all variables used in the model. The initial values of these variables were based on the Reports of Verified Cases of Tuberculosis (RVCT) files and published reports of TB incidence from Centers for Disease Control and Prevention. Because the RVCT program was not implemented nationwide until 1985, reporting for the years 1980 to 1984 was incomplete. To estimate the 1980 baseline values, we inferred values of missing data according to the multivariate structure observed in 1985. Baseline information also included counts of mature infections prevalent in 1980 and counts of new infections in 1980. These values were estimated by applying difference equations relating the numbers of cases, new infections, and mature infections in a given year to the numbers new infections, and mature infections in the previous year, where the relationships between the figures for the 2 years were dependent on model parameters such as survival rates, rates of transition from mature infection to disease, rates of transition from new infection to disease, and contagious parameters, and where each difference equation pertained to a specific multivariately defined subpopulation. Through repeated application of the difference equations, it was possible to express the number of mature infections in 1980 in terms of TB case counts for an arbitrary number of subsequent years for which actual data were available. Estimates thus obtained were examined for consistency and the most stable was selected. Because of the lack of reliable information, we did not attempt to address the underreporting of TB cases or underestimates of some populations. Computer implementation of the model was carried out on a Sun SPARCstation 5 with 80 megabytes of main memory and network access to a SPARCserver 1000E with approximately 100 gigabytes of online disk storage. All software was written in C++, an object-oriented programming language. A library of C++ objects and functions (46,47) was used as the basis for the representation of fuzzy numbers in the software.
Because of the extremely large sizes of the multidimensional arrays containing cross-sectional population data and year-to-year transition data, specialized data structures and algorithms were developed to take advantage of characteristics of these data that allowed them to be stored in a more compact form (48). For crosssectional population data, a height-balanced binary search tree structure (49,50) was used. Within each tree node, both cell information and array subscript information were stored. Because the subscripts would occupy at least 9 bytes within each node of the tree if stored conventionally, a 4-byte compressed representation of the subscripts was stored within each node. For transition data, a simple linked-list structure was used in which each node contained both cell information and compressed subscript information. This approach resulted in a 96% reduction of storage space required for population data and a 99% reduction of storage space required for transition data (48). Figure 2 presents model projections of new cases of TB for the United States for 1980 through 2010 as well as actual data for white non-Hispanics, blacks, and Hispanics, the groupings used in TB case reporting in the United States. For these three groups, differences between actual counts and model predictions were generally <1,000 for blacks and Hispanics and <1,500 for white non-Hispanics. Table 3 presents actual and projected TB case rates per 100,000 for each 5 years from 1980 to 2010 with results presented for racial, ethnic, and geographic groups. Generally, the model performs best, within the 1980 to 1995 validation period, for large subgroups with relatively high case rates. For example, it projects TB case rates to within 12% of actual rates for the U.S. population at large. Its performance is not as good in subgroups having lower case rates (e.g., there is a 23% projection error for non-Hispanic whites in 1995). In addition, projection errors are relatively high for smaller subgroups, especially those in which case reporting and population reporting are subject to inaccuracy (e.g., projection errors are consistently above 20% for Hispanics and nearly as high for American Indians and Asian/Pacific Islanders). The model's best performance tends to be among the older population: the model predicts case rates to   within 1% accuracy for the U.S. population aged 65 and over for 1990 and 1995. This is of significance in that the highest TB case rates are among the elderly. Figure 3 presents model projections of the TB case rate quartiles, by state, for the year 2010. Note that the projected case rate for Wyoming spuriously falls into the highest quartile, as a result of its small population size. TB is projected to decrease in all parts of the United States and remain largely a problem of states receiving large numbers of immigrants.

Research
In Figure 4, the model projection for the United States for 1980 through 2010 is presented together with actual numbers of newly reported TB cases for 1980 through 1997. Also depicted is the sensitivity of the model to changes in key parameters, seen in thicker lines in Figure 1. Specifically, changes were allowed for the contagious parameters and for rates of transition from healthy to infected and infected to ill.

Research
Specifically, above and below the central projection are lines representing projections from simultaneous 10% increases and decreases in these key parameters for each year in the projection period. A simultaneous decrease of 10% in the parameters for each year of the projection period results in a decrease of 27% (from 13,458 to 9,861) in projected TB cases in 2010. A simultaneous increase of 10% yields an increase of 44% (from 13,458 to 19,381) in projected cases in 2010. Not shown in the figure, a simultaneous decrease of 5% in the same parameters yields a decrease of 15% (from 13,458 to 11,459) for projected cases in 2010, and a simultaneous increase of 5% yields an increase of 19% (from 13,458 to 16,063) in 2010.

Conclusions
The studies of Waaler and colleagues, who carefully noted and credited other early workers in this field, mark the beginning of a modern approach to modeling and the thoughtful consideration of transition probabilities in modeling (51,52). ReVelle, Lynn, and Feldman, using many of the assumptions of Waaler and his colleagues, projected declining TB case rates in developing countries as a result of BCG vaccination (53). Azuma modeled TB incidence in Japan (54). Estimates of AIDS prevalence were used by Styblo (55), Schulzer and colleagues (56), and Heymann (57) to project TB incidence in sub-Saharan Africa. Modeling was used by Murray, Styblo, and Rouillon to examine the economic impact of TB in developing countries (58). Blower and colleagues used a differential equation-based model to examine the decline of TB epidemics and emphasize the importance of effective case treatment in TB control (59,60). Murray and colleagues used differential equation-based models to compare the global impact of different control strategies (61,62). Brewer and colleagues used simulation techniques to compare TB control policies (63).
Mathematical modeling of TB in the United States probably began in 1939 with the early work of Frost, who used age cohort analyses of secular trends to predict the future decline of TB (64). Ferebee developed a model for TB in the United States based on data from 1963 and used it to emphasize the potentially important contribution of chemoprophylaxis to TB control (65). The model we present is substantially more complex than Ferebee's and differs from her early work by disaggregating the American population. Given the marked heterogeneity of the population of the United States and the corresponding diversity of TB case rates in various American population groups, our approach offers major advantages over aggregate modeling both in improved accuracy of estimation of input data and in output that can be disaggregated to relate to segments of the population with very different TB control problems.
A highly disaggregated model incorporating population dynamics is appropriate. When applied to small groups, population dynamics are aptly described by a linear model, particularly when natural constraints are stated as totals. The model we constructed is Markovian in that it is independent of its history prior to its last known state. Our computer implementation of the theoretical model is clearly Markovian because its operation parallels the nature of a Markov chain, in the sense that the initial state was the state of our system in 1980, and the state in each subsequent year was generated from the previous year's state and 1-year transition rates. Since no information other than the previous year's state and 1-year transition rates are available to the computer software to generate the state for a year, the Markovian nature is established.
Our model for the intermediate years predicted (on the basis of 1980 baseline data) that TB would increase late in the ensuing decade and decline again. This prediction reasonably approximated the subsequent events. While the model does include changes in TB incidence related to the HIV epidemic and to immigration, it does not specifically incorporate changes in TB control measures, whose impacts are only addressed indirectly through net changes of transition probabilities in the model. The latter could be addressed by a submodel for appropriate population subgroups, and could affect projections.
The projections presented here suggest that the number of new cases and incidence rates of TB will now continue to decrease to a level approximately half the present level nationally. Measuring the extent to which increased control efforts contributed to the return of the TB epidemic to its formerly declining phase will require more detailed analysis.
The handling of geographic location necessarily relies on politically defined entities which determine how data are collected. Since disease patterns do not necessarily follow these Research boundaries, the model is not uniformly accurate by geographic location. This may be seen in Table  3 where the TB case rate in 1995 is overestimated for Los Angeles and underestimated for Miami. However, the very fine age structure imposed has resulted in projections by age group within 20% accuracy for age 25 and older ( Table 3). Refinement of the geographic characterization could result in corresponding increases in accuracy.
The model projections are smoother over time than the actual counts, as should be expected because random fluctuations are not built into the model (Figures 2,3). In Figure 4 the time series of actual cases is moving toward the lower 10% bound. This may be a temporary fluctuation (as in the mid-1980s) but could also indicate changes in transitions in the real-world disease process.
Errors of a projection model may arise from shortcomings of model structure, from inaccuracies in initial conditions, or from errors in the values of key model parameters. The structure of the Markov chain model is mechanistic, driven by highly detailed population dynamics. Each part of the mechanistic process may prompt suggestions for model refinements and discussions of variability; notably, occurrence of infection (66,67), time from infection to disease (68,69), and involvement of HIV (70) and immigrant status (71). The greatest potential for model inaccuracy due to initial conditions arises from estimates of the 1980 pool of prevalent infections. Errors thus introduced become less consequential with passing time, as persons infected before 1980 die. So, model projections in the latter part of the projection period are least affected by errors in the numbers of those infected before 1980. Model parameters not only have the obvious role of governing transitions for various population subgroups, but also, as reflected by their time dependency, are used to depict exogenous phenomena such as the HIV epidemic, implementation of TB control efforts, and changes in immigration patterns. Geographically focused changes of TB control efforts implemented after 1993 related to directly observed therapy could result in overestimation in projections in those areas.
Although the model does not allow mixing of groups from one location to another, this limitation is in large measure adjusted in the calculations of transition probabilities since these are based on actual data that implicitly reflect the movements of healthy, infected, and ill groups.
The other type of mixing not accounted for in the model occurs in a location among different raceethnicity subgroups. The most serious projection errors would then occur when a subgroup of uninfected persons lives in the same location as a subgroup of persons with active TB, for then any cases arising in the former subgroup would not be predicted by our model. This type of situation is likely to arise only in locations with very few active TB cases. Otherwise, the projection errors arise only from the extent to which the crossspread of infection fails to cancel out (i.e., the net extent to which a subgroup infects others, over and above the number of the latter being infected by the former).
Although HIV status is addressed in this model, adding structure that more precisely reflects changes in transition rates of HIVinfected persons might also improve model projections. The model predicts that an increasing portion of cases will occur in Hispanics, and this prediction is supported by the most recent data, which show that the rate of decline in the number of cases in Hispanics is slower than for white non-Hispanics and for black non-Hispanics.
The foreign born account for a large fraction of the new cases seen in the United States (71). The model assumptions regarding the future contribution of the foreign born to the U.S. population are from the U.S. Bureau of the Census projections; therefore, this component of the population may be thought of as forming part of the backdrop against which the model projections are made. Departure from the patterns of immigration projected by the Census Bureau could cause two kinds of errors in model projections. A surge of immigration from countries with high rates of TB would soon be followed by increases in TB cases and case rates in the United States that would not be projected by the model. A different effect would be seen as a consequence of unexpected large increases in immigrants. Regardless of the TB status of the countries of origin, an increase in the number of immigrants would be reflected first in an artificial lowering of case rates because of the increase in denominators, followed by a longerterm slow increase in numbers of cases. The model could not recognize these changes within its current structure. These potential errors may be avoided as additional structure better reflecting immigrant status becomes incorporated into the