Genomic Analysis of Salmonella enterica Serovar Typhimurium DT160 Associated with a 14-Year Outbreak, New Zealand, 1998–2012

During 1998–2012, an extended outbreak of Salmonella enterica serovar Typhimurium definitive type 160 (DT160) affected >3,000 humans and killed wild birds in New Zealand. However, the relationship between DT160 within these 2 host groups and the origin of the outbreak are unknown. Whole-genome sequencing was used to compare 109 Salmonella Typhimurium DT160 isolates from sources throughout New Zealand. We provide evidence that DT160 was introduced into New Zealand around 1997 and rapidly propagated throughout the country, becoming more genetically diverse over time. The genetic heterogeneity was evenly distributed across multiple predicted functional protein groups, and we found no evidence of host group differentiation between isolates collected from human, poultry, bovid, and wild bird sources, indicating ongoing transmission between these host groups. Our findings demonstrate how a comparative genomic approach can be used to gain insight into outbreaks, disease transmission, and the evolution of a multihost pathogen after a probable point-source introduction.


SNP Comparison
SNPs (single nucleotide polymorphisms) are single base pairs that differ between isolates. Two software programs were used to identify SNPs shared by the 109 DT160 isolates: Snippy (https://github.com/tseeman/snippy) and kSNP3 (1). Snippy was used to align reads from each isolate to a reference genome, in this case S. enterica serovar Typhimurium strain 14028s (NC_016856), and then to compare the alignment results and identify single base pairs that were found in all isolates but differed in sequence (core SNPs). kSNP was used to identify kmers of a fixed length that differed in one nucleotide between de novo-assembled genomes and NC_016856. kSNP identified 731 SNPs shared by the 109 DT160 isolates, while Snippy identified 771 SNPs (Technical Appendix Figure 2). 709 SNPs were identified by both methods, leaving 22 kSNP-unique and 62 Snippy-unique SNPs. The kSNP-unique SNPs mostly consisted of SNPs found on reads that did not align to the reference genome, while the Snippy-unique Page 2 of 22 SNPs mostly consisted of SNPs that were in close vicinity, unable to be picked up by kSNP as kmers of a fixed length would differ in more than one nucleotide. By using both methods a larger number of SNPs were identified than if a single method alone was used.
773 out of the 793 core SNPs shared by the 109 DT160 isolates were also located on the reference genome, NC_016856. The order of these SNPs on the reference genome identified several small clades associated with close clusters of SNPs (Technical Appendix Figure 3).
However, most of the SNPs in these clusters were synonymous and unlikely to result from selection pressures. The order of these SNPs also identified the non-synonymous SNPs responsible for the formation of two distinct DT160 clades and the proteins they were located within: glycogen debranching enzyme (A), 2-dehydro-3-deoxyphosphooctonate aldolase (B), a YggT family protein (C), galactose-1-epimerase (D), uvrABC system protein B (E) and acrylyl-coA reductase (F). Many of these proteins are involved in carbohydrate metabolism, suggesting that the two DT160 clades may have distinct carbohydrate metabolism phenotypes.

Global DT160 Strains
Petrovska et al. (2) previously published the genomes of two DT160 isolates: ERS015626 that was isolated from a horse in 1998 and ERS015627 that was isolated from a bird in 1997. The raw reads from these isolates were downloaded from the European Nucleotide Archive (http://www.ebi.ac.uk/ena) and assembled de novo. kSNP and Snippy identified 1,521 core SNPs in total shared by these two isolates and the 109 New Zealand DT160 isolates analyzed. The average pairwise SNP distance between the two UK DT160 isolates and the New Zealand isolates was 0.0287, compared to an average pairwise distance between NZ isolates of 0.0151 The two DT160 isolates from the United Kingdom were genetically distinct from each other and from the 109 New Zealand DT160 isolates (Technical Appendix Figure 4). To our knowledge these were the only DT160 isolates published to date.

Protein Coding Gene Analysis
The 109 DT160 isolates shared 684 protein differences. Primer-E v6 (3) was used to predict the Euclidian distance matrix based on the presence of these protein differences.
Two isolates were excluded from protein analyses as they lacked a large number of genes and were skewing the multi-dimensional scaling, functional plots and PermDisp calculations (Technical Appendix Figure 6). These outliers shared similar epidemiologic information: collected from human sources from 2004-2006. However, they were missing different sets of genes.
Multidimensional scaling helps visualize the amount of similarity or dissimilarity between data points. In multi-dimensional scaling, the centroid is the central point for a group of data points. PERMANOVA found that the centroids were indistinguishable between isolates collected from different sources or time periods (Table), as these isolates appeared to radiate out from a point source (Technical Appendix Figure 7).
The distance from the centroid to each isolate (z-value) is a measure of dispersion and equivalent to the accumulation of protein differences. The z-values were calculated using PermDisp (4) and were modeled using a regression model. The residuals for this model lacked normality (Technical Appendix Figure 8). To normalize the residuals, the z-values could have been transformed. However, with such a low p-value for the date of collection, this would not have changed the conclusions and would have made interpretation more difficult.
The 684 protein differences shared by the DT160 isolates were associated with a large number of functions. For each COG functional group, the proportion of proteins that differed in sequence was calculated (Technical Appendix Figure 9). Fisher exact test provided evidence that these proportions differed (p = 0.0002). However, there was little variation in the proportions However, the total number of protein differences within each functional group was smaller than the total number of samples (Technical Appendix Figure 11). Therefore, a regression model could not be used to model the effect of source and date of collection on the number of differences in each functional group, as a large number of isolates would have the same z-value.

Discrete Phylogeographic Model
The discrete phylogeographic model was designed to use phenotypic or molecular data to predict the ancestral migration of organisms from distinct geographies (5). However, the model has been applied to outbreaks to predict transmission between distinct host groups that share the same geography (6). Twenty-two datasets were formed from the 109 DT160 isolates and the 793 core SNPs they share, to determine if the discrete phylogeographic model was appropriate for investigating this outbreak. The real dataset consisted of the 109 isolates split into those from animal sources (n = 74) and those from human sources (n = 35) (real dataset). Ten datasets were formed by randomly assigning the 109 isolates as animal or human, while keeping the total number of animal and human isolates the same (datasets A-J). Eleven datasets were formed by randomly assigning one of the isolates as human, while assigning the rest as animal, before progressively assigning random isolates as human, until a range of data was formed with different numbers of human and animal isolates. Each dataset was exported into BEAUti to create an .xml file for BEAST 1.8.3 (7). For simplicity's sake, each dataset was given a separate Hasegawa Kishino Yano (HYK) substitution model (8) and strict molecular clock. The GMRF Bayesian skyride model (9) was used to allow for variation in the effective population size of each model and the discrete phylogeographic model (5) was used to predict the time spent in the animal and human host groups (Markov rewards) over the course of the outbreak, and the number of transmission between these host groups (Markov jumps). Each .xml file was run in BEAST for 10 million steps.

Page 5 of 22
The discrete phylogeographic model predicted that DT160 spend most of the time in the animal host group, and that there was a larger amount of transmission from the animal to the human host group than the reciprocal. However, the same result was obtained when the isolates were randomly assigned as human or animal, but the sample proportions were kept the same (Technical Appendix Figures 12 and 13). In addition, the proportion of samples assigned as human had a significant effect on the Markov rewards and jumps (Technical Appendix Figures   14 and 15). This indicates that the results obtained from the discrete phylogeographic model are the result of an uneven sample size and not true migration events.
The proportion of samples that are human and Markov rewards share a step-like or sigmoid association (Technical Appendix Figure 14). This is due to the deep DT160 branches that are predominantly one source until the proportion of samples that are human meets a threshold (30%-40% of samples are human), where they suddenly all switch (Technical Appendix Figure 16). However, the relationship between the proportion of samples that are human and Markov jumps is more complex (Technical Appendix Figure 15).  *Df, degrees of freedom; SS, sum of squares; MSS, mean sum of squares; Pseudo-F, F-value from the data; P(perm), proportion of permuted datasets whose F-value exceeds Pseudo-F; Unique perms, number of unique permutations. †Coefficient interaction.