Geogenomic Segregation and Temporal Trends of Human Pathogenic Escherichia coli O157:H7, Washington, USA, 2005–2014

The often-noted and persistent increased incidence of Escherichia coli O157:H7 infections in rural areas is not well understood. We used a cohort of E. coli O157:H7 cases reported in Washington, USA, during 2005–2014, along with phylogenomic characterization of the infecting isolates, to identify geographic segregation of and temporal trends in specific phylogenetic lineages of E. coli O157:H7. Kernel estimation and generalized additive models demonstrated that pathogen lineages were spatially segregated during the period of analysis and identified a focus of segregation spanning multiple, predominantly rural, counties for each of the main clinical lineages, Ib, IIa, and IIb. These results suggest the existence of local reservoirs from which humans are infected. We also noted a secular increase in the proportion of lineage IIa and IIb isolates. Spatial segregation by phylogenetic lineage offers the potential to identify local reservoirs and intervene to prevent continued transmission.

2 of 570 isolates (0.4%) showed aberrant lineage assignment. With this, we felt that the assumption that isolates of the same PFGE subtype would be in the same lineage held adequately well to use the SNP-typing results to assign lineage to non-SNP-typed isolates. We were able to assign lineage to 328 additional isolates by using this approach.

Spatial Segregation by Diggle's Kernel Estimation Method
Diggle's kernel estimation provides smoothed estimates of spatial segregation that take into account multiple neighbors of each case. It provides an overall test of spatial segregation and identifies statistically significant regions in the lineage-specific probability surfaces. Diggle's method assumes an underlying Poisson point process for each phylogenetic lineage. The degree of smoothing is dependent on the choice of a bandwidth. A cross-validated log-likelihood function can be used to calculate the bandwidth (4). We tested bandwidths between 0.02 and 1 degrees at 0.0098-degree increments to identify and then select for analysis the bandwidth (0.6472 degrees) associated with the greatest cross-validated log-likelihood. Using the selected bandwidth, we determined the lineage-specific probabilities based on the surrounding cases for each case location and plotted the lineage-specific probability surfaces on individual maps. We then calculated a test statistic for spatial segregation by summing the square of the difference between the kernel regression-estimated lineage-specific probability at a given location and the overall probability that a case isolate belongs to that lineage over all lineages and all case locations. To determine statistical significance, we performed 999 Monte Carlo replications with lineage randomly relabeled at each case location, maintaining the observed number of cases of each lineage. The proportion of test statistics greater than that observed from the data was the pvalue. The analysis was conducted in R (5) using the spatialkernel package (6).
The bandwidth selected for the main analysis was used for all lineages within a given analysis. To identify the sensitivity of the kernel estimation results to the bandwidth of 0.6472 degrees that was selected, alternate bandwidths were tested: 0.02, 0.1, 0.2, 0.4, and 0.9. All yielded p = 0.001 for the overall test for spatial segregation. The segregation maps for individual lineages grew predictably smoother as the bandwidth was increased and identified statistically significant areas of segregation consistent with the primary result from a bandwidth of 0.6472.
Temporal variation in segregation was tested across 3 intervals: 2005-2007, 2008-2010, and 2011-2014. The slightly longer last interval is not expected to affect the validity of the results. However, because of the greater number of cases in this interval, greater precision is expected. We calculated a new bandwidth for each new analysis and subset of the data using the cross-validated log-likelihood function. For the overall test of variation of spatial segregation across time intervals using the kernel regression method, we chose a bandwidth of 0.8236 degrees. The bandwidths chosen for each of the individual intervals were 1.0000 for 2005-2007, 0.7256 for 2008-2010, and 0.9314 for 2011-2014. Not unexpectedly, given the high degree of smoothing in the first and last periods, only the middle period had detectable overall spatial segregation (p = 0.001). However, all periods displayed some statistically significant spatial segregation for individual lineages (Technical Appendix Video). A bandwidth of 0.4 was also tested for each of the intervals, resulting in statistically significant tests for overall spatial segregation in each interval (2005-2007 p = 0.037, 2008-2010 p = 0.001, 2011-2014 p = 0.014).

Multinomial Generalized Additive Model
The multinomial GAM provides a smoothed risk surface relative to Ib, the most common lineage. Unlike the direct measures of spatial segregation, the GAM captures spatial trends without selecting a specific distance or number of neighbors across which to smooth. It does this through a flexible spline function. The GAM also supports adjustment for covariates, providing some assurance that the associations observed are not due to factors such as the distribution of cases by age. The multinomial analysis entailed logistic-type equations for each of the 3 lineage comparisons. Results of the GAM multinomial models must be interpreted conditional on having a reported E. coli O157:H7 illness. As such, odds ratios presented estimate risk proportional to that in the most common lineage, Ib.
We tested multiple aspects of the GAM specification. Latitude and longitude were specified individually and jointly to allow interaction. The basis dimension of the penalized regression smoother was altered to improve the effective degrees of freedom. Age and sex covariates were removed, and the form of the spline smoother was altered. Lineage IIa was used as the comparison lineage. These sensitivity analyses are summarized in Technical Appendix Table 2. None of the model perturbations meaningfully changed the primary model results. In the set of GAMs incorporating year, a trivariate smooth of latitude, longitude, and year was also tested and found to be statistically significant for lineages IIa and IIb (Technical Appendix Table   2).

Spatial Segregation by Dixon's Nearest-Neighbor Method
Another measure of spatial segregation, Dixon's nearest-neighbor method, considers only the closest neighbor of each case. It conducts no smoothing and can be expected to be sensitive to clustered outbreaks. This method does not indicate areas in which spatial segregation exists but does provide an overall test of spatial segregation, as well as for segregation of individual lineages and pairwise segregation tests. We created a 4  4 contingency table of nearest-neighbor counts for each lineage group. A  2 test with 12 degrees of freedom was used to test overall spatial segregation, and segregation was tested for each individual lineage group (Technical Appendix Table 3). We calculated Dixon's segregation index for each nearest-neighbor combination (e.g., from Ib to IIa; Technical Appendix Table 4). Dixon's pairwise segregation index is defined as: where i and j in this analysis are phylogenetic lineages (7). A positive value of S indicates association, and a negative value indicates segregation. We calculated Z-scores for each combination by comparing the observed nearest-neighbor count in each cell to the expected count. We calculated a p-value based on the Z-scores assuming an asymptotic normal distribution. We used the Dixon R package for this analysis (8).
We used Dixon  2 tests for segregation to indicate statistically significant segregation overall (p < 0.001) and for lineages Ib (p = 0.046), IIa (p = 0.002), and IIb (p < 0.001), but not for the group of clinically rare lineages (Technical Appendix Table 3). This is consistent with the findings of the kernel estimation method, which found statistically significant overall spatial segregation and identified areas of segregation for lineages Ib, IIa, and IIb. Dixon's method also tests associations between individual lineages. Pairwise nearest-neighbor comparisons showed statistically significant positive association from each of lineages Ib, IIa, and IIb to itself.
Segregation was observed from Ib to IIa, IIa to the rare lineages, IIb to all other lineages, and the rare lineages to Ib (Technical Appendix Table 4).
We examined spatial segregation with Dixon's method for the 3 intervals analyzed with the kernel estimation method. Spatial segregation was found to be statistically significant with p < 0.001 during all 3 periods, contrasting with Diggle's method, which identified statistically

Multinomial Spatial Scan Statistics
We used multinomial spatial scan statistics (9) in SaTScan (10) to identify clusters within which the distribution of lineages differed significantly from the distribution of lineages outside the cluster. The spatial scan statistics are designed to identify clusters of disease. In the multinomial framework used here, the clusters reflect areas within which the distribution of cases by lineage is skewed compared with the area outside the cluster. These are similar to the areas of segregation identified by the kernel regression method. However, the scan statistics look at the distribution of all 4 lineages simultaneously and not individually, thus allowing detection of clusters in which multiple lineages may be out of proportion. Like the multinomial GAM models, the multinomial spatial scan statistics must be interpreted conditionally on having a reported E. coli O157:H7 illness.
For the primary spatial scan statistic model, we used a maximum cluster size of 20% of cases. Statistical significance of the clusters was determined based on Monte Carlo replications under the null. Relative risks presented estimate risk of one's infection being from the given lineage inside the cluster compared with the risk outside that cluster.
We identified 3 statistically significant clusters in which the distribution of cases by phylogenetic lineage varied from the distribution in the rest of the state (Technical Appendix Figure 2). The first cluster (p = 0.001) contained 203 cases, was centered in the southwest region of the state, and was characterized by a higher proportion of lineage IIb cases than observed elsewhere in the state (relative risk [RR] 2.59). The second cluster (p = 0.001), encompassing the sparsely populated northern reaches of the state, contained 185 cases and had somewhat more Ib (RR 1.37) and rare lineage (RR 1.88) cases and fewer IIb cases (RR 0.29). The final significant cluster (p = 0.006) contained 79 cases in the south-central region of the state; lineage IIa was more common than elsewhere in the state (RR 1.70), IIb was uncommon (RR 0.13), and cases due to rare lineages were nearly absent (RR 0). The first cluster, dominated by IIb, and third cluster, dominated by IIa, recapitulate the results of the kernel estimation maps and, for IIb, the GAM-generated risk surface. The second cluster, dominated by lineage Ib, is larger and centered somewhat further east than the area of segregation identified for Ib by the kernel estimation method, though still similar.
Altering the parameters of the analysis to allow lower or higher percentages of the cases to be included in clusters did not meaningfully affect the position of the clusters identified. We tested allowing clusters up to 50% of cases and 10% of cases. From the former, the main IIbdominant and Ib/rare-dominant clusters were identified, but the IIa-dominated cluster was not.
Limiting clusters to 10% of cases, all 3 clusters identified in the primary analysis were identified but with smaller numbers of included cases.
We detected variant clusters using multinomial spatiotemporal scan statistics, using year as the time scale and allowing up to 50% of the study period in a cluster, as well as purely spatial clusters. We identified 3 statistically significant clusters (Technical Appendix Figure 3). The first This cluster included part of Seattle, Washington's largest urban area, and areas immediately south and east.

Secondary Cases
To separate the effect of person-to-person transmission from other potential environmental factors that may result in segregation, we conducted sensitivity analyses after excluding known secondary cases. To be excluded, the most likely source of the infection had to have been identified during the public health investigation as person-to-person, or the notes had to indicate that another individual in the household or childcare situation had previously received such a diagnosis. Based on these criteria, 82 secondary cases were excluded. No meaningful changes in the results were observed. The overall test of spatial segregation was statistically significant using the kernel estimation method (p = 0.002) and the nearest-neighbor method (p < 0.001). The latitude/longitude smooth of lineage IIb from the multinomial GAM is statistically significantly different from that of lineage Ib (p < 0.001). However, the cluster identified in the southwest region of the state, dominated by lineage IIb, through multinomial spatial scan statistics moved somewhat northward and decreased in size without the secondary cases.

Reporting Bias
We assessed potential reporting bias by county. Reporting of patients who have tested positive is considered near 100% (11), but testing intensity may vary by provider. E. coli O157:H7 is most often detected by fecal specimen culture, a test that also detects Campylobacter, Salmonella, and Shigella. If providers in an area have heightened awareness of E. coli O157:H7 and are more likely to test for it than in other areas, we would expect that detection of these other pathogens would also be higher. There is overlap in the epidemiology of E. coli O157:H7, Campylobacter, and Salmonella, so some correlation is expected. However, risk factors for Shigella are generally different (12). If there were reporting bias, we would expect this to have the greatest impact on the observed incidence of milder E. coli O157:H7 strains.
Case counts by county for 2005-2014 for campylobacteriosis, salmonellosis, and shigellosis were obtained from the Washington State Communicable Disease Reports for 2009 and 2014 (each contained 5 years of data) (13,14). We calculated incidence rates using county populations as reported in 2010 U.S. Census TIGER/Line Shapefiles (15). Using the GISTools (16) package in R, we mapped the incidence quintile of each of the 4 pathogens at the county level for the study period to assess the potential for reporting bias (Technical Appendix Figure   4). Two counties, Yakima and Grant, appear in the uppermost quintile of incidence for each of the 4 diseases. However, incidence of rare lineage E. coli O157:H7 in this region is remarkably low (main article Figure 1; Technical Appendix Figure 2). Infections caused by these bacteria are generally milder (main article Table) and would be the type whose numbers would be exaggerated in the presence of heightened testing. Thus, it is unlikely that reporting bias is responsible for the observed results.

Data
Genomic data, with limited metadata, on all isolates used in the study are provided in Technical Appendix 2 (https://wwwnc.cdc.gov/EID/article/23/1/17-0851-Techapp2.xlsx). These include genomic data on all 1,160 E. coli O157:H7 isolates from reported, culture-confirmed cases in Washington state, 2005-2014. Phylogenetic lineage was determined directly using the 48-plex SNP assay developed by Jung et al. (17) or was inferred from a typed isolate with the same PFGE profile. Shiga toxin bacteriophage insertion typing and typing for clade according to the method used by Manning et al. (2) were conducted on only a subset of isolates. NT, not typed; PFGE, pulsed field gel electrophoresis; SBI, Shiga toxin bacteriophage insertion typing; SDM, Shannon Manning clade/genotype. *All analyses are multinomial logistic regression, using lineage Ib as the reference group, adjusted for age, sex, and year. The statewide analysis was conducted using a generalized additive model to additionally adjust for latitude and longitude using a thin plate spline bivariate smooth. Statistically significant results are shown in bold text. "Rare lineage" includes 12 different clinically rare lineages. CI, confidence interval; DNC, did not converge; Inf, infinity; NA, not applicable; OR, odds ratio; Ref, reference †Odds ratios of 0 are reported where 0 cases of the lineage under analysis existed in the category. Odds ratios of infinity are reported where 0 cases of the reference lineage (Ib) existed in the category. Confidence intervals were not estimated for these ORs, indicated by (0, Inf).
‡ Analyses marked NA could not be performed or were considered unreliable because of sparse data in these categories. Not all models converged because of sparse data in some categories. § p < 0.05 ¶ p < 0.01