Genomic characterization of an Escherichia coli O157:H7 strain linked to leafy greens–associated outbreaks dates its emergence to late 2015. One clade has notable accessory genomic content and a previously described mutation putatively associated with increased arsenic tolerance. This strain is a reoccurring, emerging, or persistent strain causing illness over an extended period.
Escherichia coli O157:H7 is estimated to cause ≈63,000 domestically acquired foodborne illnesses and 20 deaths in the United States each year (1). E. coli O157:H7 infections are typically associated with abdominal cramps, bloody diarrhea, and vomiting; however, a rare but serious condition called hemolytic uremic syndrome can develop, resulting in anemia and acute renal failure (2). Healthy cattle serve as the main reservoir for E. coli O157:H7, and contaminated food, water, and environmental sources, as well as contact with animals, have been the source of outbreaks of E. coli O157:H7 infections (3,4). More recently, contaminated leafy greens have been recognized as a major source of E. coli O157:H7 illnesses and outbreaks. In foodborne illness attribution estimates for 2020 based on outbreak data, 58.1% of E. coli O157:H7 illnesses were attributed to vegetable row crops, a category that includes leafy greens (https://www.cdc.gov/foodsafety/ifsac/annual-reports.html). During 2009–2018, a total of 32 confirmed or suspected outbreaks of E. coli O157:H7 infections linked to contaminated leafy greens occurred in the United States and Canada (5).
A large E. coli outbreak in late 2019, hereafter referred to as outbreak A, caused 167 cases in 27 states, including 85 hospitalizations, and was associated with the consumption of romaine lettuce from Salinas Valley, California, USA (https://www.cdc.gov/ecoli/2019/o157h7-11-19/index.html). We characterized isolates from outbreak A and highly related isolates by using a variety of molecular methods.
A query of the PulseNet database revealed 356 isolates related to the outbreak strain that had <15 core-genome multilocus sequence typing (MLST; cgMLST) allele differences (Table 1; Appendix 1 Table 1) (6). Of those, 302 isolates corresponded to human cases associated with 6 outbreaks spanning 3 years; dates of isolation ranged from September 27, 2016, to January 3, 2020. An additional 54 isolates were either clinical isolates not associated with a recognized outbreak (n = 14) or from environmental (n = 20), food (n = 8), or animal (n = 12) samples. Seven-gene MLST and Manning clade typing revealed all isolates were sequence type (ST) 11 and belonged to Manning clade 2 (Appendix 2). In silico PCR of the Shiga toxin (stx) genes revealed that all but 2 isolates contained stx2a, whereas 2 remaining isolates had no detectable stx genes. We generated a closed-reference genome, 2019C-3201 (Strain: PNUSAE020169; BioSample: SAMN10432148), using PacBio Sequel technology (https://www.pacb.com) and assembled with Flye version 2.6 (7). The sequence data assembled into a single complete chromosomal contig and 3 plasmids (Table 2).
We selected a subset of 245 isolates for further genomic analysis to more evenly sample across outbreaks and to reduce computational demands. Isolates were characterized by core genome MLST implemented in BioNumerics 7.6 (6) and by high-quality single-nucleotide polymorphism (SNP; hqSNP) analysis in Lyve-SET version 1.1.4f (9), with the chromosomal sequence of 2019C-3201 as the reference and the Lyve-SET presets for E. coli. Overall, hqSNP was more discriminatory, differentiating isolates by a median of 10 pairwise hqSNPs (range 0–39 SNPs), whereas cgMLST differentiated isolates by a median of 2 allele differences (range 0–8 alleles) (Table 1). This finding was foreseeable because hqSNP does not depend on a predefined scheme; therefore, intergenic SNPs between loci, multiple SNP differences within a given locus, and SNPs in loci not included in the cgMLST schema can all contribute to the differences detected (9).
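The median and range of pairwise differences reported above can be illustrated with a minimal Python sketch. The isolate names and allele numbers are hypothetical toy values, not study data, and the rule of skipping loci with a missing call is an assumption for illustration rather than a description of how BioNumerics or Lyve-SET handles missing data.

```python
from itertools import combinations
from statistics import median


def allele_differences(profile_a, profile_b):
    """Count loci where both isolates have a called allele and the calls differ."""
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None and a != b
    )


def pairwise_summary(profiles):
    """Return (median, min, max) of differences across all isolate pairs."""
    distances = [
        allele_differences(profiles[x], profiles[y])
        for x, y in combinations(sorted(profiles), 2)
    ]
    return median(distances), min(distances), max(distances)


# Toy allele profiles (hypothetical isolates and allele numbers).
# None marks a locus with no allele call; such loci are skipped (assumption).
toy_profiles = {
    "isolate_1": [1, 4, 2, 7, 3],
    "isolate_2": [1, 4, 2, 7, 3],
    "isolate_3": [1, 5, 2, 7, None],
    "isolate_4": [2, 5, 2, 8, 3],
}

summary = pairwise_summary(toy_profiles)
print(summary)  # prints (1.5, 0, 3)
```

The same summary applies to hqSNP distances if each profile is replaced by the bases called at each variable site.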
Time-tree analysis using BEAST version 2.6.3 (10) revealed the divergence of this strain into 2 clades that last shared a common ancestor around late 2015 (median December 19, 2015; 95% highest posterior density interval December 7, 2014–July 10, 2016) (Figure 1). After outbreak D in 2016, sequences corresponding to a given outbreak belonged to 1 of 2 clades; outbreaks B2 and C were associated with clade 1, and outbreaks A, B1, and B3 were associated with clade 2. Of note, outbreak A was traced to romaine lettuce from Salinas Valley, whereas traceback and sampling in outbreak B2 linked some illnesses to romaine lettuce from Santa Maria, California (https://www.fda.gov/food/outbreaks-foodborne-illness/investigation-summary-factors-potentially-contributing-contamination-romaine-lettuce-implicated-fall; https://www.fda.gov/food/outbreaks-foodborne-illness/outbreak-investigation-e-coli-romaine-salinas-california-november-2019). Lettuce from Salinas was not considered a source of any illnesses in outbreak B2. Environmental sampling in Santa Maria in 2019 yielded isolates clustering closely with outbreak B2 in the time tree.
We analyzed the closed reference sequence of 2019C-3201 using Prokka version 1.8 to enable SNP annotation (11). We examined output from Lyve-SET to determine the SNPs differentiating the 2 clades in our phylogenetic analysis (Appendix 1 Table 2). This work confirms a previous study reporting a nonsense mutation in the arsR gene, an arsenical resistance operon repressor (12). All clade 1 isolates in this study possess a G→A mutation resulting in a premature stop codon. This mutation could decrease the activity of this repressor and lead to constitutive expression of this operon. Agricultural soils and water sources can contain increased arsenic levels because of natural processes, industrial sources, or agricultural uses of arsenic, such as application of arsenic-containing herbicides, pesticides, or animal drugs (13). This mutation could provide an ecologic advantage in environments containing high levels of arsenic. This finding underscores the potential need to routinely screen enteric bacterial strains for heavy metal resistance determinants, as well as to consider heavy metal levels in soil as part of traceback investigations.
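How a single G→A substitution can introduce a premature stop codon can be shown with a short Python sketch. The coding fragment below is hypothetical and is not the arsR sequence, and the affected codon (TGG to TAG) is an assumed example, because the specific codon is not stated here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def substitute(dna, position, new_base):
    """Return a copy of dna with the base at a 0-based position replaced."""
    return dna[:position] + new_base + dna[position + 1:]


# Hypothetical 18-bp coding fragment (illustration only, not arsR).
wild_type = "ATGAAATGGCTGGAACGT"
# G→A at 0-based position 7 changes codon 3 from TGG (Trp) to TAG (stop).
mutant = substitute(wild_type, 7, "A")

print(translate(wild_type))  # prints MKWLER
print(translate(mutant))     # prints MK
```

In this toy case the truncated product loses everything downstream of the substitution, which is the kind of change that could reduce the activity of a repressor.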
We further characterized isolates through assembly and annotation using Shovill-SPAdes version 1.0.9 and Prokka version 1.14.5 (11) and subsequent analysis in Roary version 3.11.2 (14) and Scoary version 1.6.16 (15) to identify differences in the pangenome among isolates. We compared differentially distributed genes with the reference genome using BLASTn (https://blast.ncbi.nlm.nih.gov/Blast.cgi) to identify feature location (chromosome/plasmid). Roary/Scoary analysis revealed a subset of clade 1 isolates with additional genomic content. A total of 156 genomic features had >90% sensitivity and >90% specificity to this subset of clade 1. Of those, 87 (56%) are on plasmid p2019C-3201_1, and 69 (44%) are on p2019C-3201_2 (Figure 2; Appendix 1 Tables 3, 4). Prokka-annotated features associated with p2019C-3201_1 (Figure 2; Appendix 1 Table 3) were predominantly genes encoding hypothetical proteins with unknown functions and common plasmid-associated genes. Annotated features associated with p2019C-3201_2 (Figure 2; Appendix 1 Table 4) were predominantly associated with conjugation and span a large portion of that plasmid. Additional work is necessary to characterize the role of these plasmids in clade 1. When visualizing the distribution of these clade 1–specific features alongside the maximum-clade credibility tree (Figure 1; Appendix 1 Table 5), it appears those features were acquired after clade 1 and clade 2 diverged. Given the geographic distribution of isolates, these features might be a result of adaptation to a particular niche or environment.
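The sensitivity and specificity filter applied to pangenome features can be sketched as follows. This is a minimal illustration with hypothetical isolates and genes; the definitions used (sensitivity as the share of subset isolates carrying the gene, specificity as the share of other isolates lacking it) are the conventional ones, and that they match the exact calculation in the analysis software is an assumption.

```python
def sensitivity_specificity(gene_present, in_subset):
    """Return (sensitivity %, specificity %) of gene presence as a marker
    of subset membership. Assumes both groups contain at least one isolate."""
    tp = fp = tn = fn = 0
    for isolate, member in in_subset.items():
        has_gene = gene_present.get(isolate, False)
        if member and has_gene:
            tp += 1
        elif member:
            fn += 1
        elif has_gene:
            fp += 1
        else:
            tn += 1
    return 100 * tp / (tp + fn), 100 * tn / (tn + fp)


# Hypothetical data: 20 isolates, the first 10 belong to the subset of interest.
in_subset = {f"iso{i}": i < 10 for i in range(20)}
genes = {
    # Present in every subset isolate and absent elsewhere.
    "gene_a": {f"iso{i}": i < 10 for i in range(20)},
    # Present in 8 of 10 subset isolates and in 2 of 10 other isolates.
    "gene_b": {f"iso{i}": i < 8 or i in (10, 11) for i in range(20)},
}

# Keep features exceeding 90% on both measures.
selected = [
    name
    for name, presence in genes.items()
    if all(value > 90 for value in sensitivity_specificity(presence, in_subset))
]
print(selected)  # prints ['gene_a']
```

Requiring both measures to exceed the threshold retains only features that are nearly universal within the subset and nearly absent outside it.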
In summary, a specific strain of E. coli O157:H7 associated with leafy greens has been the source of ongoing enteric illness since late 2016. This strain is estimated to have emerged in late 2015 and consists of 2 clades with different geographic distributions, 1 of which has notable genomic features. After this analysis, an additional outbreak associated with this strain was detected in late 2020 in which a reported 40 infections occurred in 19 states; 20 persons were hospitalized, and 4 developed hemolytic uremic syndrome (https://www.cdc.gov/ecoli/2020/o157h7-10-20b/index.html). Since that outbreak, no further outbreaks have been detected, and only a single clinical isolate associated with this strain has been identified by PulseNet. The Centers for Disease Control and Prevention has classified this strain as a reoccurring, emerging, or persistent (REP) strain (https://www.cdc.gov/ncezid/dfwed/outbreak-response/rep-strains.html) with the designation REPEXH02. REP strains represent a new paradigm in enteric molecular surveillance, distinct from discrete outbreaks where numerous cases occur in a relatively short time frame. Detailed genomic characterization of additional REP strains, using the types of approaches outlined in this study, is necessary to elucidate factors contributing to their emergence and persistence in specific environments.
Dr. Chen is a bioinformatician in the Division of Foodborne, Waterborne, and Environmental Diseases, National Center for Emerging and Zoonotic Infectious Diseases, CDC. Her research interests are focused on understanding the evolution of bacterial foodborne pathogens of public health concern. Mr. Patel was an epidemiologist in the Division of Foodborne, Waterborne, and Environmental Diseases, CDC, leading the response to multistate foodborne and zoonotic E. coli and Salmonella outbreaks within the Outbreak Response and Prevention Branch, as well as focusing on the response to REP strains of various pathogens. He now leads a team focused on building disease surveillance and outbreak response technology platforms in the private sector.
We thank state and local health departments for sequencing of E. coli O157:H7 associated with these outbreaks. The authors also thank Matthew Wise for helpful discussions and feedback.
This work was made possible by support from the Advanced Molecular Detection initiative at the Centers for Disease Control and Prevention and is covered by activities approved by the Centers for Disease Control and Prevention Internal Review Board (approval no. 7172).
Suggested citation for this article: Chen JC, Patel K, Smith PA, Vidyaprakash E, Snyder C, Tagg KA, et al. Reoccurring Escherichia coli O157:H7 strain linked to leafy greens–associated outbreaks, 2016–2019. Emerg Infect Dis. 2023 Sep [date cited]. https://doi.org/10.3201/eid2909.230069
Outbreak | No. sequences | Timeframe | No. sequences, subset | Median subset allele differences (min–max) | Median hqSNP subset differences (min–max) | Outbreak source | Growing region |
---|---|---|---|---|---|---|---|
D | 20 | Sep 27–Dec 5, 2016 | 20 | 2 (0–5) | 3 (0–17) | Unknown | NA |
C | 23 | Nov 10–Dec 14, 2017 | 23 | 0 (0–3) | 1 (0–5) | Leafy greens | Likely SW USA, Mexico |
B3 | 7 | Jul 31–Aug 15, 2018 | 7 | 0 (0–3) | 1 (0–10) | Unknown | NA |
B2 | 71 | Oct 8–Dec 7, 2018 | 69 | 1 (0–4) | 2 (0–10) | Romaine lettuce | Santa Maria, CA |
B1 | 19 | Nov 1–Dec 18, 2018 | 18 | 2 (0–5) | 4 (0–10) | Leafy greens | NA |
A | 179 | Sep 27, 2019–Jan 3, 2020 | 84 | 1 (0–5) | 2 (0–12) | Romaine lettuce | Salinas Valley, CA |
Nonhuman samples collected in Santa Maria | 23 | Nov 14, 2019 | 12 | 0 (0–1) | 1 (0–5) | NA | Santa Maria, CA |
Not associated with known outbreak | 14 | Oct 18, 2016–Aug 4, 2019 | 12 | 1 (0–3) | 8 (0–18) | NA | NA |
All | 356 | Sep 27, 2016–Jan 3, 2020 | 245 | 2 (0–8) | 10 (0–39) | NA | NA |
*hqSNP, high-quality single nucleotide polymorphism; NA, not applicable.
Contig name | GenBank accession no. | Length, bp | Sequence coverage | Replicon | PTU (8) |
---|---|---|---|---|---|
2019C-3201 chromosome | CP090856 | 5,488,442 | 866 | None | NA |
p2019C-3201_1 | CP090857 | 87,920 | 732 | IncI1-I(gamma) | PTU-I1 |
p2019C-3201_2 | CP090858 | 61,933 | 560 | IncFII(pHN7A8), IncFII(pSFO) | PTU-FE |
p2019C-3201_3 | CP090859 | 92,724 | 623 | IncFIB, IncFII | PTU-E5 |
*PTU, plasmid taxonomic unit.