Monitoring SARS-CoV-2 circulation and diversity through community wastewater sequencing

The current SARS-CoV-2 pandemic has rapidly become a major global health problem for which public health surveillance is crucial to monitor virus spread. Given the presence of viral RNA in feces in around 40% of infected persons, wastewater-based epidemiology has been proposed as an addition to disease-based surveillance to assess the spread of the virus at the community level. Here we have explored the possibility of using next-generation sequencing (NGS) of sewage samples to evaluate the diversity of SARS-CoV-2 at the community level from routine wastewater testing, and compared these results with the virus diversity in patients from the Netherlands and Belgium. Phylogenetic analysis revealed the presence of viruses belonging to the most prevalent clades (19A, 20A and 20B) in both countries. Clades 19B and 20C were not identified, while they were present in clinical samples during the same period. Low frequency variant (LFV) analysis showed that some known LFVs can be associated with particular clusters within a clade, different to those of their consensus sequences, suggesting the presence of at least 2 clades within a single sewage sample. Additionally, combining genome consensus and LFV analyses we found a total of 57 unique mutations in the SARS-CoV-2 genome which have not been described before. In conclusion, this work illustrates how NGS analysis of wastewater can be used to approximate the diversity of SARS-CoV-2 viruses circulating in a community.

INTRODUCTION particular city or county, where the titers in sewage seem to correlate with the number of reported cases in the population, suggesting a potential role for sewage surveillance as an early warning tool 18,[20][21][22] . Therefore, sewage testing is currently considered globally as an adjunct to patient-based surveillance, and has promise as an early warning indicator of increasing virus circulation. Enhanced surveillance is a key pillar of the current containment strategy aiming to control the spread of SARS-CoV-2 and includes frequent testing of people with mild symptoms, investigation of clusters of infection to identify possible common exposures, and monitoring of hospital and ICU admissions. Whole genome sequencing of SARS-CoV-2 directly from clinical samples has been developed as an additional tool, to provide information on diversity of circulating strains as a basis for cluster identification.
Particularly in areas with minimal circulation, sequencing of viruses from patients can help to identify a possible source, provided that sufficient background sequencing has been done.
So far, little work is done trying to correlate the SARS-CoV-2 diversity in sewage and patients 23,24 . Here we aimed to evaluate the potential of next generation sequencing (NGS) of SARS-CoV-2, from RT-PCR positive wastewater samples, to assess if they reflect the diversity of SARS-CoV-2 circulating within the population of the Netherlands and Belgium.

Correlation between RT-qPCR and the percentage of genome recovered by NGS
Previously, sewage samples were collected from different locations in The Netherlands and Belgium to investigate the levels of SARS-CoV-2 in sewage using RT-qPCR 18 . To further investigate the genetic diversity of SARS-CoV-2 a total of 55 wastewater samples obtained from 13 different locations in the Netherlands (48 samples) and 7 different locations in Belgium (7 samples) with Ct values of <36 were selected for whole genome sequencing using Nanopore sequencing. The samples covered a time span of 70 days (from March 25 th to June 3 rd 2020). Two samples (Franeker-92719 and AmsterdamWest-92852) were sequenced by Nanopore twice, while 24 samples were also sequenced by Illumina (Table 1).
Four primers/probe sets targeting the N (N1-N3) 25 and the E genes 26 were used to evaluate the presence and concentration of SARS-CoV-2 in sewage samples as described previously 18 . All samples and their Ct values are shown in Table 1. The percentage of the genome covered by the assembly of Nanopore reads (>10X coverage per site) ranged from 0 to 99.2%. We observed an inverse sigmoidal correlation between the percentage of the genome assembled from Nanopore sequencing reads and the Ct values of both the N2 and the E primers/probe sets (Fig. 1). The inflection point (Ct value at which half of the genome can be obtained) for N1, N2, N3 and E primers/probe sets were Ct values of 34.6, 33.8, 33.2 . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint and 32.5, respectively. No correlation was observed between Ct values and the percentage of the genome assembled from Illumina sequencing data ( Supplementary Fig. S1). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint In order to associate specific mutations to particular a clade or cluster, all consensus sequences, including partial sequences, were compared to the Wuhan-Hu-1 reference sequence. A total of 145 single nucleotide polymorphisms (SNPs) compared to the Wuhan-Hu-1 reference sequence were detected in our dataset (supplementary Table S1). From these, 24 SNPs were present in more than one sequence. The maximum number of mutations in individual samples compared to the Wuhan-Hu-1 reference genome were 11 for hCoV-19/env/Netherlands/Amersfoort-92503-N/2020, hCoV-19/env/Netherlands/Delft-92965-N/2020 and hCoV-19/env/Netherlands/Schiphol-94335-N/2020 (supplementary Table   Fig  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint S1). The presence of clade-defining mutations in the consensus sequence suggests the dominance of a certain clade within a sample, but assessing their presence can also be used to check for virus mixtures in a sample. Nextstrain has defined each clade by the presence of at least two linked mutations (https://nextstrain.org/). 19A is the root clade and contains the Wuhan-Hu-1 reference sequence. Both 19B and 20A emerged from 19A, where two and three linked mutations define these major clades, respectively: T28144C and C8782T for 19B; and C3037T, C14408T and A23403G for 20A. Clades 20B and 20C emerged from 20A, where the trinucleotide substitution GGG > AAC at positions 28881-28883 defines 20B, and the linked mutations C1059T and G25563T define 20C. Nucleotide substitution A23403G, a signature mutation of clades 20A, 20B and 20C, and that generates the D614G amino acid substitution in the S glycoprotein, was present in 83.6% (51/61) of the samples that were sequenced at this region (supplementary Table S1). The  Table S1). One of the two mutations defining clades 20C and 19B (C1059T and T28144C) were found in 2 and 3 consensus sequences, respectively. The regions containing the other clade-defining mutations (C25563T for 20C; and C8782T for 19B) were not sequenced at high enough coverage to determine a consensus sequence in these samples. The hCoV-19/env/Netherlands/Amersfoort-92503-N/2020 sequence contained a mix of the clade-defining mutations C1059T, T28144C and GGG28881-28883AAC, that define clades 20C, 19B and 20B, respectively. This indicates that the obtained consensus sequence is a combination of several different viruses and does not represent a single strain.
In addition to the clade-defining mutations, we detected 49 and 63 SNPs in sewage samples that were not present in either the Dutch (1544 sequences) or Belgian (888 sequences) GISAID datasets, respectively, but were present in the Global dataset (55074 sequences, as 8 th of July 2020), although with < 1% prevalence (supplementary Table S2). Moreover, we detected a total of 51 novel mutations present in sewage consensus sequences that were not previously reported (supplementary Table S2), of which 48 were supported by coverage above the set thresholds to be considered as high quality (coverage >30x for Nanopore; and coverage >5X and Phred score >30 for Illumina). Additionally, it is noteworthy to mention that some samples presented discrepancies between the consensus sequences obtained by Nanopore and/or Illumina sequencing. For example, sample AmsterdamWest-92852 was sequenced 3 times (twice with Nanopore and once with Illumina), in which 4 positions with discrepancies were found between the consensus sequences (Supplementary Table S1). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint These differences were not due to an assembly error, since the alignment of the reads were manually checked and corrected for each discrepancy in every sequencing run. These differences were explained by the presence of two different nucleotides in the reads covering a particular position with varying percentages between sequencing runs, resulting in consensus sequences that could differ between each sequencing run. These results were likely caused by both or either the presence of multiple strains and low viral RNA titers in the samples, leading to an amplification bias during library preparation.

SARS-CoV-2
Given that sewage samples are likely to contain a mixture of SARS-CoV-2 strains, we decided to perform a variant analysis with Illumina data to determine whether LFVs were confidently present in a sample. Using a coverage > 50X, Phred score > 30 and a frequency threshold of > 10% as settings, we found a total of 21 positions with at least one sample containing major and minor variants (Table 2) Table 2).The other variants were present at similar frequencies in the Dutch, Belgian and Global datasets ( Table 2).

Presence of specific LFVs may be associated to the presence of viruses belonging to a particular cluster
In addition to consensus sequence, LFV analysis is of importance to be able to identify potential local outbreaks from sewage. To try to associate the presence of a minor variant to . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint    is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint sequences belonging to unique clusters, we mapped the 4 most highly prevalent LFVs onto both Dutch-Belgian and global subsample phylogenetic trees (Fig 3). This analysis indicated that for 3 variants (1440A, 11109T and 24862G) there were clear associations between the presence of the mutation and their clustering on the phylogenies (Fig 3). However, when one of these 3 variants was present as an LFV in a sewage sample the consensus sequence (blue arrows in Fig. 3) of this sample did not group with the cluster of clinical samples that contains the variant (magenta lines in Fig. 3). For example, 24862G variant in sample Tilburg-94339 was present in two unique clusters within clade 20A, while its consensus sequence (hCoV-19/env/Netherlands/Tilburg-94339-I/2020) was clustered within clade 20B (Fig 3 and Supplementary Figs. S2 and S3), suggesting the presence of both clades in this sample. Although mutation 11083T was most prevalent in clade 19A, it was also present scattered along the trees, indicating poor association with a particular clade.

DISCUSSION
Given the high chemical and biological complexity of wastewater samples, virus concentration and RNA extraction methods are crucial parts of the process to reach enough  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint viral RNA yield for sequencing 29 . In this study, we showed that our method was capable of obtaining complete or near complete genomes from wastewater samples with Ct values of at least 5 or 6 Cts below the limit of detection (LoD) (Ct < 39) and partial genomes for samples with higher Ct values. Therefore, only samples with enough viral RNA can be used to effectively analyze SARS-CoV-2 diversity in sewage samples. In order to increase the percentage of genome covered of the samples, we used a threshold of 10X coverage per position to generate the consensus sequences from Nanopore reads. Based on previous analysis of viral sequencing data, the error rate with this threshold is less than 0.03% 30 Table S2 have a coverage of at least 30X, which produces an error rate of 1 in 585,000 nucleotides sequenced 30 .
The use of sewage as a tool to understand the epidemiology and diversity of SARS-CoV-2 at a community level offers many advantages over human sampling. Sewage samples are relatively easy to collect, because no invasive sampling is required, there is no sampling bias towards sequences from moderate and severe cases, there are limited ethical issues, and potentially few samples are required to give a picture of the temporal changes of viral infections in the community 28,29 . Nevertheless, comprehensive comparisons with clinical surveillance and other epidemiological approaches are required to determine the extent and limits of using sewage as a surveillance/early warning tool. Furthermore, before the broad use of sewage samples to characterize viral diversity within a population, some obstacles need to be overcome, such as: low viral titers that complicate the retrieval of complete genomes and the distinction of multiple strains within a sample. Here we have used two of the most common NGS technologies (Nanopore and Illumina) to study the diversity of SARS-CoV-2 found in sewage samples, from the Netherlands and Belgium, and compared these results with the virus diversity found in sequenced clinical samples. In order to evaluate this diversity in a comprehensive fashion, we have used Nextstrain clade classification system because it is based on signature mutations to assign a sequence to a clade (https://nextstrain.org/) 5 , facilitating the association of SNPs or LFV to a particular clade, especially for genome sequences with <75% coverage.
Sewage samples can contain a mixture of SARS-CoV-2 viruses reflecting the multiple viruses circulating within a community. They may also partially reflect the presence of SARS-CoV-2 from animal origin, as SARS-CoV-2 has now been detected in domestic and livestock animals such rabbits, minks, cats and ferrets [31][32][33][34][35] . The analysis of a consensus sequence genome from a wastewater sample may identify the predominant virus strain present in a . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint population, which might be suitable for locations where only 1 or few introductions of closely related viruses have occurred, as it seems to be the case for 2 previous studies in Italy and USA 23,24 . Nonetheless, the consensus genome approach cannot reflect the diversity of the viruses circulating in a population with a high degree of viral diversity. Moreover, in some cases samples containing several diverging strains at significant levels might lead to retrieve artificial consensus genomes that do not represent an existing virus, which seems to be the case for the hCoV-19/env/Netherlands/Amersfoort-92503-N/2020 sequence, where signature mutations of 3 different clades are present at the consensus level. In this study, we could not detect a genome belonging to the least prevalent clades (20C and 19B), despite the circulation of these viruses in the human population in both countries during the same period of time (Figs 2a and b). Although, it is necessary to mention that mutations associated with clades 19B and 20C were found in 2 and 3 samples, respectively (Supplementary Table   S1). However, these consensus sequences were either too short or had a mixture of signature mutations that did not allow to confirm whether they belong to clades 20C and 19B by the Nextclade tool. Another reason to explain why we did not find consensus sequences belonging to these clades is the limited number of locations represented on the phylogeny by our sewage sample dataset compared with that of the clinical samples, especially for Belgium (only 2 sequences from sewage).
In depth NGS analysis could help to unravel the diversity of viruses within a complex sample such as wastewater, particularly unbiased sequencing of the sewage virome can give a good picture of the general viral diversity contained in a sample 36 . Nevertheless, the detection of variants of a particular virus in a single sample can be difficult due to the relative low number of reads obtained for each virus type. Targeted amplification of a genome region of the virus taxa of interest is potentially more sensitive and cheaper to perform. Recently, examples of this have been described for enteroviruses, human mastadenoviruses and norovirus 16,37,38 .
In general, for each virus a specific small fragment (< 400 bp) of the genome is amplified and deep-sequenced, then sequencing reads can be aligned and assigned to a particular genotype or serotype, identifying and determining the prevalence of several virus variants within a single wastewater sample 16,37,38 . As the diversity of SARS-CoV-2 is still limited 39 , this approach would not be as useful for this virus because no single small piece of the genome can reliably differentiate between clades or lineages. However, we tried to overcome this issue by using a variant analysis of sewage samples. We showed that some LFVs can be linked to particular clusters or clades within the trees (Fig 3), without the need of a complete genome. Although, in order to confidently determine the presence of a particular clade/cluster within a sample, at least 2 or 3 LFVs associated with such clade/cluster should is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint be present at significant levels. Furthermore, variant analysis can also be used to monitor the prevalence of biological interesting mutations in a population. One of the most interesting is the D614G (or A23403G) mutation in the S glycoprotein, that has been shown to increase infectivity in vitro by stabilizing the S1/S2 interaction 40 , and has been associated with higher transmission and mortality rates, although the latter is under debate 41,42 . Unfortunately, the region containing the D614G mutation was not sequenced at high enough coverage to perform a variant analysis in most of the tested samples, and we were not able to find any sample with a mixture of both variants (614D and 614G).
The combination of whole-genome sequencing of clinical samples with epidemiological data has shown to be important for public health decision-making 43  Wastewater can also be used to monitor novel mutations. Our consensus and LFV analyses revealed a total of 57 mutations that were not present in the global database. From these, 51 were found at the consensus level, 8 as LFVs and 2 were common to both analyses. It is possible that these novel mutations were not previously detected due to several reasons: 1) Genetic drift eliminated these variants before they could be further spread and detected; 2) These viruses cause only asymptomatic or mild disease that made them to be less likely to be detected through clinical sampling; 3) They are associated with reduced transmission/replication and did not became fixed in the population; 4) They originate from an unknown animal hosts; 5) They are associated with enhanced enteric shedding/replication; 6) They are associated with intra-host defective genomes. The latter has previously been suggested for the detection of LFVs that generate stop codons in ORF1ab and S genes 44  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint polymerase mistake during the initial PCR cycles of the library preparation can be amplified and identified as a variant. Phenotypical studies could help to determine the likelihood and biological relevance of some of these novel mutations.
In conclusion, this work illustrates how NGS analysis of wastewater can be used as a tool to approximate the diversity of SARS-CoV-2 viruses circulating in a community. Sequencing of wastewater samples could be a powerful tool to complement clinical surveillance or be used as a standalone procedure in settings where wide clinical sequencing is not feasible.
Additionally, in-depth NGS analysis of wastewater samples can help in assessing changes in virus diversity to determine emergence of epidemiologically or clinically relevant mutations, aiding public health decision making. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint

Sample preparation
A total of 55 wastewater (WW) specimens were included in this study. RNA from 7 of these WW samples (all from March 25 th ) were collected, processed and extracted previously by KWR 18 . The other 48 WW specimens were collected as 24 h flow-dependent composite samples and processed as previously described 18

Next generation sequencing (NGS) of SARS-CoV-2 genomes by Nanopore and
Illumina SARS-CoV-2 specific multiplex PCR for Nanopore sequencing was performed as described by Oude Munnink, et al. (2020) 43 . In short, primers for 89 overlapping amplicons spanning the entire genome were used in 2 PCR pools. The amplicon length was set to 500bp with 75bp overlap between the different amplicons. The used concentrations and primer sequences have been described previously 43 . Libraries were generated using the Oxford Nanopore's native barcode kits (Catalog numbers: EXP-NBD104, EXP-NBD114 and SQK-LSK109) and sequenced on a R9.4 flow cell multiplexing up to 24 samples per sequence run.
Illumina sequencing was performed as described by Richard, et al. (2020) 45 . Amplicons were generated by the SARS-CoV-2 specific multiplex PCR described above for the whole . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Nanopore sequence data analysis
The resulting raw sequence data were processed as previously described by Oude Munnink et al. 2020 43 . Briefly, an automated snakemake script 46 was used to demultiplex fastq raw reads using Porechop (https://github.com/rrwick/Porechop), trim primers using Cutadapt 47 and perform a reference-based alignment using minimap2 to GISAID sequence EPI_ISL_412973. The run was monitored using RAMPART (https://articnetwork.github.io/rampart/). The consensus genome was extracted and positions with a coverage < 10X or <30X were replaced with an "N". Mutations in the genome were confirmed by manually checking the alignment in Ugene 48 and homopolymeric regions were manually resolved consulting reference genomes. Based on previous validation studies 30 , mutations with a cut-off of 30X coverage were considered as high quality, whereas mutations with less than 30X coverage were marked as low quality (Supplementary Table   S2).

Illumina Data Filtering, Genome assembly and Variant calling
All the processing, reference-based alignment and variant analysis of the Illumina generated data was performed using a customized workflow on the Galaxy EU server (https://usegalaxy.eu/) 49 . First, raw sequencing reads were filtered using Fastp 50 to remove adaptor contamination, ambiguous bases (N), low quality reads (Phred score <30) and fragments below the length of 50 nt. For mapping purposes, reads were aligned against the GISAID sequence EPI_ISL_412973 using the default penalty settings of the BWA-MEM 51 .
Reads were re-aligned using the leftalign utility from FreeBayes package 52 . All reads with mapping scores of less than 30 were discarded. Both consensus sequences and variants were generated using iVar 53 . Final consensus sequences (frequency >50%) were . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint constructed using all mapped sequence reads that covered each site at least 5 times and had a minimum quality Phred score of 30. For detection of low-frequency variants (LFV), parameters were set as follows: a minimum coverage of 50X, Phred score >30 and a Minimum frequency threshold of 10%. Manual inspection of the aligned reads was also performed to confirm or dismiss the variant calling in Ugene 48 . Variant positions are given with respect to the Wuhan-Hu-1 strain (Genbank accession number: MN908947) 1 .

Phylogenetic analysis
Two reference datasets were used to perform the phylogenetic analysis. The first dataset included all Dutch and Belgian full-length SARS-CoV-2 genomes (1544 and 888 sequences, respectively) from GISAID database (https://www.gisaid.org/) publicly available up to the 8 th of July 2020. The second dataset is a subsample of all SARS-CoV-2 sequences available in GISAID (https://www.gisaid.org) covering the global diversity of SARS-CoV-2 genomes up to the 1 st of June 2020. This global 'backbone' dataset contains 2552 subsampled high-quality sequences (full length, with Ns <5%) to include one unique genome per country/state per week. For the maximum-likelihood (ML) trees, only sequences in this study with >75% genome coverage were included in the analysis. Our sequences were aligned with both datasets using MAFFT (https://mafft.cbrc.jp/alignment/server/). The alignment was manually checked for discrepancies and the ends were trimmed, after which IQ-TREE 54 was used to perform a ML phylogenetic analysis under the GTR + F + R3 model for the Global subsample and the GTR + F + R2 model for the Dutch-Belgian dataset as the best predicted models using the ultrafast bootstrap option with 1,000 replicates. The phylogenetic trees were visualized using Figtree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/). Clades were assigned by using the Nextclade tool within Nextstrain (https://clades.nextstrain.org/).  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020.  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2020. . https://doi.org/10.1101/2020.09.21.20198838 doi: medRxiv preprint