Recency-Weighted Statistical Modeling Approach to Attribute Illnesses Caused by 4 Pathogens to Food Sources Using Outbreak Data, United States

Foodborne illness source attribution is foundational to a risk-based food safety system. We describe a method for attributing US foodborne illnesses caused by nontyphoidal Salmonella enterica, Escherichia coli O157, Listeria monocytogenes, and Campylobacter to 17 food categories using statistical modeling of outbreak data. This method adjusts for epidemiologic factors associated with outbreak size, down-weights older outbreaks, and estimates credibility intervals. On the basis of 952 reported outbreaks and 32,802 illnesses during 1998–2012, we attribute 77% of foodborne Salmonella illnesses to 7 food categories (seeded vegetables, eggs, chicken, other produce, pork, beef, and fruits), 82% of E. coli O157 illnesses to beef and vegetable row crops, 81% of L. monocytogenes illnesses to fruits and dairy, and 74% of Campylobacter illnesses to dairy and chicken. However, because Campylobacter outbreaks probably overrepresent dairy as a source of nonoutbreak campylobacteriosis, we caution against using these Campylobacter attribution estimates without further adjustment.

Of the resulting 2,655 outbreaks caused by one of the 4 priority pathogens as the single etiology, we excluded 38% (n = 1,014) because investigators did not identify an implicated food and 26% (n = 689) because the implicated food(s) could not be assigned to a single food category. Implicated foods could not be assigned to a single food category because the identified food was complex (composed of ingredients belonging to more than one food category) (n = 448); or foods from more than one food category were implicated or suspected (i.e., multiple foods) (n = 142); or the food was too vaguely described to be assigned to any category (e.g., "buffet," "appetizer") (n = 50); or the food was too vaguely described to be assigned to the specific food categories used in the analysis (e.g., could only be assigned to "Produce" or "Meat-Poultry") (n = 49).
We focus on single-pathogen single-food category outbreaks because the appropriate categorization of both pathogens and foods is known. A method is in development for assigning to multiple food categories those outbreaks due to complex foods for which the implicated ingredient was unknown. The previously published approach could not be applied to our data series without substantial revisions (3). This approach used "recipes" developed using internet searches and these recipes would need to be updated to reflect current online recipes and to incorporate changes to food categories (4).
Thus, our final dataset for analysis included outbreaks caused by a single pathogen that could be assigned to one of 17 specific food categories; these were 952 (36%) of the 2,655 single-etiology outbreaks. The pathogen with the most outbreaks in the resulting data was Salmonella (n = 597), the predominant serotype of which was Enteritidis (n = 184) (Appendix Table 1).
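The exclusion counts above can be tallied as a quick arithmetic check. The following is a minimal sketch that uses only the counts reported in the text; the variable and function names are illustrative and do not come from the analysis code.

```python
# Counts reported in the text for single-etiology outbreaks.
single_etiology = 2655
no_food_identified = 1014

# Reasons an implicated food could not be assigned to a single category.
not_single_category = {
    "complex food": 448,
    "multiple foods": 142,
    "too vague for any category": 50,
    "too vague for analysis categories": 49,
}

excluded_category = sum(not_single_category.values())
analyzed = single_etiology - no_food_identified - excluded_category


def pct(n, total=single_etiology):
    """Percentage of single-etiology outbreaks, rounded to a whole number."""
    return round(100 * n / total)


print(excluded_category, analyzed)
print(pct(no_food_identified), pct(excluded_category), pct(analyzed))
```

The four reasons sum to the 689 outbreaks excluded for lack of a single food category, and the remainder matches the 952 outbreaks retained for analysis.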
As part of preliminary analyses, we assessed the quality of information on etiology status.
In FDOSS, an outbreak must have at least 2 ill persons (2). For Salmonella, E. coli O157, and Campylobacter, outbreaks with "confirmed" etiology are defined as those in which the outbreak strain was isolated from at least 2 patients or from epidemiologically implicated food; "confirmed" outbreaks of L. monocytogenes infections must have 1 person with the outbreak strain isolated from a normally sterile site (5). (Cases of listeriosis can also be diagnosed based on symptoms and culture of products of conception, which are not sterile.) The etiology of an outbreak not meeting these conditions is considered to be "suspected." Of the 2,732 outbreaks associated with the 4 priority pathogens, 90% (2,462) were coded as having confirmed etiology.
We found that 12% of outbreaks coded as having confirmed etiology did not have sufficient data to fulfill the confirmed etiology definition, but also found that over 95% of outbreaks coded as having suspected etiology had at least one laboratory-confirmed illness. Outbreaks occurring early in the study period were more likely to have insufficient data to confirm an etiology.
We decided to include outbreaks with either confirmed or suspected etiology status in the analysis so as not to lose information associated with those outbreaks, following the decision made in Painter et al. (3). We also conducted a sensitivity analysis on this decision, as described elsewhere in this appendix.
NORS includes 3 variables related to outbreak size: the number of lab-confirmed primary cases (ConfirmedPrimary), the number of additional illnesses that were not laboratory confirmed (ProbablePrimary), and the total of both confirmed and probable illnesses (EstimatedPrimary). In our attribution estimates, we use the estimated total illnesses as our measure of outbreak size.

Statistical Model Development
This section provides additional details about the models used to estimate the food sources of illnesses. Specifically, it describes development of pathogen-specific statistical models of outbreak size, and the approach used to weight recent outbreaks more heavily than older outbreaks.

Analysis of variance models
Log-transforming outbreak size resulted in relatively normally distributed outbreak illness numbers that could be modeled using straightforward analysis of variance (ANOVA) modeling techniques. We explored several modeling approaches, including analysis of covariance (ANCOVA), generalized linear models, and least absolute shrinkage and selection operator (LASSO) models, among others. We decided to use ANOVA based on structural simplicity and interpretability, and because our data were not sufficient to credibly describe the complexity of interactions between the epidemiologic characteristics of reported outbreaks.
We developed pathogen-specific models because we did not want to smooth over differences in outbreak size by pathogen, as this variation likely results from epidemiologic factors, not random variation. We found outbreaks caused by different Salmonella serotypes varied in foods implicated and other epidemiologic factors. In particular, serotype Enteritidis outbreaks had some distinct patterns. Thus, we decided to model serotype Enteritidis separately from all the other serotypes for estimating outbreak size; when we calculate attribution percentages in a subsequent stage, we do so after summing the 2 sets of model estimates.
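As an illustration of summing the 2 sets of Salmonella model estimates before computing attribution percentages, the sketch below combines per-category illness totals from two models and then normalizes. The function name, the input dictionaries, and all illness values are hypothetical; the sketch ignores recency weighting and credibility intervals.

```python
def attribution_percentages(*estimate_sets):
    """Sum model-estimated illnesses by food category across models,
    then express each category as a percentage of the combined total."""
    totals = {}
    for estimates in estimate_sets:
        for food, illnesses in estimates.items():
            totals[food] = totals.get(food, 0.0) + illnesses
    grand_total = sum(totals.values())
    return {food: 100.0 * value / grand_total for food, value in totals.items()}


# Hypothetical model-estimated illnesses (not values from the study).
enteritidis = {"eggs": 600.0, "chicken": 200.0}
other_serotypes = {"eggs": 100.0, "chicken": 300.0, "seeded vegetables": 800.0}

salmonella_pct = attribution_percentages(enteritidis, other_serotypes)
print(salmonella_pct)
```

Summing before normalizing means each serotype group contributes in proportion to its estimated illnesses, rather than each model's percentages being averaged with equal weight.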
Based on preliminary modeling analysis and considerations of epidemiologic importance, we included 3 variables as the predictors of outbreak size in all 5 pathogen-specific models: food category, the type of location at which the food was prepared, and whether outbreak exposures occurred in a single state or in multiple states. Each outbreak was assigned to 1 of 17 food categories, as described previously. The food preparation location variable used in the model included 5 categories, based on 24 individual location types identified in outbreak reports, as shown in Appendix Table 2. We reduced the number of categories to 5 to address the relatively sparse data across most locations other than restaurant or private home. A dichotomous variable was used to indicate whether exposures occurred in multiple states or a single state. We desired a model that was portable in that it could be similarly described across the 4 etiologies included in the study and expandable to additional pathogens. Summary measures for model fit are shown in Appendix Table 3, including traditional lack-of-fit, R-squared, overall model significance, and significance of each predictor. Appendix Table 3 also shows, in the last 3 columns, variance explained via random forest decomposition using identical predictors.
Appendix Figure 3 compares the number of reported illnesses with the number of model-estimated illnesses and shows that, as expected, our ANOVA models reduce variation in outbreak size and the influence of very large outbreaks.
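The following is a minimal sketch of a main-effects ANOVA on log-transformed outbreak size with the 3 predictors described above, fit by ordinary least squares on dummy-coded categories. It is not the analysis code: the function names and the toy data are hypothetical, and back-transforming with a plain exponential is an assumption, because the text does not state how estimates were returned to the illness scale.

```python
import numpy as np


def dummy_code(labels):
    """Indicator columns for a categorical variable; the first sorted level is the reference."""
    levels = sorted(set(labels))
    return np.array([[1.0 if lab == lev else 0.0 for lev in levels[1:]] for lab in labels])


def fit_log_size_anova(sizes, food, location, multistate):
    """Fit log(size) ~ food + location + multistate and return fitted sizes."""
    y = np.log(np.asarray(sizes, dtype=float))
    X = np.column_stack([
        np.ones(len(y)),                      # intercept
        dummy_code(food),                     # food category
        dummy_code(location),                 # food preparation location
        np.asarray(multistate, dtype=float),  # 1 = multistate, 0 = single state
    ])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.exp(X @ beta)  # assumed back-transformation to the illness scale


# Hypothetical outbreaks (not data from the study).
sizes = [5, 12, 300, 8, 40, 22, 150, 6]
food = ["eggs", "eggs", "chicken", "chicken", "beef", "beef", "eggs", "chicken"]
location = ["restaurant", "home", "restaurant", "home",
            "restaurant", "home", "other", "other"]
multistate = [0, 0, 1, 0, 0, 0, 1, 0]

estimated = fit_log_size_anova(sizes, food, location, multistate)
print(np.round(estimated, 1))
```

Because least-squares fitted values cannot vary more than the observations on the log scale, the estimated sizes are pulled toward the group patterns, which is the mechanism by which very large outbreaks lose influence.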

Recency weighting
The decision to down-weight older data was made because recent outbreaks are likely to be more representative of current foodborne illness attribution than older outbreaks. Changes in attributable risk may result from changes over time in food consumption patterns, food production and processing practices, food safety activities, regulatory interventions, and other factors. This decision is supported by characteristics of the underlying data. Appendix Figure 4 presents a heat map with the number of outbreaks by pathogen and food category over time.
White cells indicate no outbreaks due to that pathogen-food category pair in that year, with color from pale orange to red indicating between 1 and 25 outbreaks in that year. Appendix Figure 4 illustrates the variability in data sparseness across many pathogen-food categories.
We examined the impacts of excluding older data by estimating attribution for three 5-year periods. Based on this and other analyses, we decided that outbreaks older than 5 years should be included in estimates of attribution but down-weighted to increase the relative influence of more recent outbreaks on attribution estimates.
As described in the article, we determined that the most appropriate approach would be to use an exponential decay function to define the recency-weighting multiplier w for an outbreak in year y as a function of decay parameter a. We evaluated various options for the decay parameter a and the resulting weighting factor by year, as shown in Appendix Figure 6. Our preference was for more than half of the information in our estimates to come from the most recent 5-year period, and for a small amount (around 5%) to come from data older than 10 years. Because the distribution of outbreak illnesses is not constant over time or by pathogen (as shown in Appendix Figure 4), we selected a decay parameter that best met our preferences for all pathogens. As shown in Appendix Table 4, with a decay parameter value of 0.7142 (5/7), 67% of the total down-weighted model-estimated outbreak illnesses used in the attribution calculation were from outbreaks that occurred during the most recent 5-year period (2008–2012), with ≈28% from the middle 5-year period, and 5% from the oldest 5-year period.
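The sketch below illustrates how a decay parameter of 5/7 distributes weight across the three 5-year periods. The exact weighting function is not reproduced in this appendix, so the form used here is an assumption consistent with the description: full weight (w = 1) for outbreaks in 2008–2012, and w = a^(2008 − y) for an outbreak in an earlier year y. The function names and the equal number of illnesses per year are also hypothetical.

```python
def recency_weight(year, a=5 / 7, last_year=2012, full_weight_years=5):
    """Assumed form: weight 1 in the most recent years, exponential decay before that."""
    first_full_year = last_year - full_weight_years + 1  # 2008 under the defaults
    if year >= first_full_year:
        return 1.0
    return a ** (first_full_year - year)


def share_by_period(illnesses_by_year, periods, a=5 / 7):
    """Share of total down-weighted illnesses that falls in each period."""
    weighted = {
        name: sum(illnesses_by_year[y] * recency_weight(y, a) for y in years)
        for name, years in periods.items()
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


periods = {
    "1998-2002": range(1998, 2003),
    "2003-2007": range(2003, 2008),
    "2008-2012": range(2008, 2013),
}

# Hypothetical: the same number of model-estimated illnesses in every year.
uniform_illnesses = {year: 100.0 for year in range(1998, 2013)}

shares = share_by_period(uniform_illnesses, periods)
print({name: round(value, 3) for name, value in shares.items()})
```

Under these assumptions the shares come out to roughly 67%, 27%, and 5% for the most recent, middle, and oldest periods. The values in Appendix Table 4 reflect the actual, non-uniform distribution of illnesses over time, so they need not match this idealized case exactly.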

Sensitivity Analyses
This appendix describes sensitivity analyses conducted to assess the robustness of our attribution estimates. We compare our model-based estimates to those derived using approaches from prior studies and explore sensitivities to modeling decisions and underlying data.

Sensitivity to ANOVA Model Specifications
As noted in the text and other appendices, we conducted exploratory analyses to determine which predictors should be included in the pathogen-specific ANOVA models. The final 3-predictor model specifications were based both on epidemiologic reasoning and our findings that these variables were statistically significant predictors of outbreak size.
We conducted sensitivity analyses around the final model specification by estimating attribution percentages using 3 alternative ANOVA models: one without the dichotomous multistate variable, one without the categorical preparation location variable, and one without either.
The results (Appendix Figure 8) show that attribution estimates from the alternative specifications differ little from those of the baseline model specification, indicating that our estimates are robust to these model specification decisions.

Sensitivity to Etiology Status
As described previously, we included outbreaks with "suspected" etiology in addition to those with laboratory-confirmed isolates from patients or food in the analysis. Those without confirmed etiology comprise ≈9% of the outbreaks used in the analysis, though of these, most had at least one laboratory-confirmed illness. We conducted a sensitivity analysis around this decision. Appendix Figure 9 presents our baseline attribution estimates and 90% credibility intervals alongside estimates based on data excluding the 83 outbreaks with suspected etiology.
Appendix Figure 9 shows that for all but a few pathogen-food category pairs, the differences in point estimates are minimal, though credibility intervals are wider when outbreaks of suspected etiology are excluded.

Sensitivity to Influential Outbreaks
We conducted a series of analyses to identify which outbreaks are most influential on our attribution estimates and to assess model sensitivity to these outbreaks. This was done in part to ascertain the extent to which our estimates were sensitive to very large outbreaks, though because our estimates are based on a 3-parameter statistical model of log-transformed outbreak size, with recency-weighting, we needed a systematic approach to identify influential outbreaks.
The first step was to define an influence metric for each outbreak based on the aggregate difference in attribution estimates when that outbreak was excluded from the analysis. That is, for each of 952 outbreaks, we estimated attribution percentages without that single outbreak. We defined an "influence metric" as the sum of mean differences squared across all pathogen-food category pairs between the baseline estimate and the estimate without that outbreak; the attribution percentages change only for the pathogen for which an outbreak was excluded. We then calculated the overall "influence rank" for each outbreak based on the rank order of the "influence metric." Appendix Figure 10 presents, for each pathogen, the calculated influence metric for each outbreak, in descending order. These plots show that most outbreaks have influence metrics at or very close to zero, but a small number do have measurable influence metrics. Appendix Figure 10 shows that the 10 outbreaks most influential on attribution estimates were caused by L. monocytogenes and Campylobacter. Appendix Table 5 provides details for the 5 outbreaks most influential on attribution estimates for each pathogen. Although the plots in Appendix Figure 10 show that some outbreaks have large influence metric values, the actual impact of these individual outbreaks on attribution estimates is minimal.
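The leave-one-out logic behind the influence metric can be sketched as follows for a single pathogen. This is a simplified illustration, not the procedure used in the analysis: attribution here is a plain share of weighted illnesses by food category, whereas the actual estimates come from the full model and would be re-estimated for each exclusion. The function names and the outbreak data are hypothetical.

```python
from collections import defaultdict


def attribution(outbreaks):
    """Percentage of illnesses per food category for one pathogen."""
    totals = defaultdict(float)
    for food, illnesses in outbreaks:
        totals[food] += illnesses
    grand_total = sum(totals.values())
    return {food: 100.0 * value / grand_total for food, value in totals.items()}


def influence_metrics(outbreaks):
    """Sum of squared differences in attribution percentages when each outbreak is dropped."""
    baseline = attribution(outbreaks)
    metrics = []
    for i in range(len(outbreaks)):
        without_i = attribution(outbreaks[:i] + outbreaks[i + 1:])
        metrics.append(sum((baseline[food] - without_i.get(food, 0.0)) ** 2
                           for food in baseline))
    return metrics


def influence_ranks(metrics):
    """Rank 1 is the outbreak with the largest influence metric."""
    order = sorted(range(len(metrics)), key=lambda i: metrics[i], reverse=True)
    ranks = [0] * len(metrics)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Hypothetical (food category, weighted illnesses) pairs for one pathogen.
outbreaks = [("fruits", 140.0), ("dairy", 10.0), ("dairy", 12.0),
             ("fruits", 8.0), ("deli meats", 9.0)]

metrics = influence_metrics(outbreaks)
ranks = influence_ranks(metrics)
print([round(m, 1) for m in metrics], ranks)
```

In this toy example the single large outbreak dominates the metric, mirroring the pattern in which most outbreaks have metrics near zero and a few stand out.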
Appendix Figure 11 presents estimates for scenarios in which each of the 5 outbreaks most influential on attribution estimates (from Appendix Table 5) for each pathogen was excluded one at a time. These scenarios are shown alongside the baseline attribution percentages.
Appendix Figure 11 shows that for all but the single most influential outbreak (L. monocytogenes in cantaloupe), the exclusion of any single outbreak results in negligible differences in attribution estimates, and no differences in the rank order of food categories. We therefore concluded that our model is robust to all but the most extreme outliers, and that only our estimates for L. monocytogenes are sensitive to the impact of individual outbreaks.
*Overall influence rank is based on the rank order of outbreaks when sorted by the influence metric, shown in the last column. The influence metric is defined as the sum of mean differences squared across all pathogen-food category pairs between the baseline estimate and an attribution estimate with that outbreak excluded.