Sentinel Surveillance System Implementation and Evaluation for SARS-CoV-2 Genomic Data, Washington, USA, 2020–2021

Genomic data provide useful information for public health practice, particularly when combined with epidemiologic data. However, sampling bias is a concern because inferences from nonrandom data can be misleading. In March 2021, the Washington State Department of Health, USA, partnered with submitting and sequencing laboratories to establish sentinel surveillance for SARS-CoV-2 genomic data. We analyzed available genomic and epidemiologic data during presentinel and sentinel periods to assess representativeness and timeliness of availability. Genomic data during the presentinel period were largely unrepresentative of all COVID-19 cases. Data available during the sentinel period improved representativeness for age, death from COVID-19, outbreak association, long-term care facility–affiliated status, and geographic coverage; timeliness of data availability and captured viral diversity also improved. Hospitalized cases were underrepresented, indicating a need to increase inpatient sampling. Our analysis emphasizes the need to understand and quantify sampling bias in phylogenetic studies and continue evaluation and improvement of public health surveillance systems.

Ongoing global circulation of SARS-CoV-2 and repeated emergence of new variants indicate the need for robust genomic surveillance to inform public health responses (4). In Washington, USA, surveillance of SARS-CoV-2 is passive and, therefore, focused on cases of COVID-19 in persons seeking testing. In addition, methods for conducting next-generation sequencing introduce limitations on sampling; specimens must contain adequate quantities of viral RNA for sequencing efforts to be successful. Therefore, persons who had mild illness, delayed testing, reinfection, or other characteristics that might lower viral loads are less likely to be represented in sequencing data. Knowing those limitations, the Washington State Department of Health sought to establish a genomic sentinel surveillance system for SARS-CoV-2 in March 2021.
Before sentinel surveillance was initiated, large amounts of genomic data were produced by academic and clinical laboratories in Washington and shared publicly via the GISAID EpiCoV database (5-7). Studies using those data to rapidly produce critical viral transmission and evolution information were published early during the pandemic; however, the populations captured in those data remain unknown (8-12). Sampling bias or systematic differences in sample characteristics between COVID-19 cases with sequenced specimens and total COVID-19 cases is a concern. Using large datasets from a limited number of geographically sparse institutions might produce inaccurate phylogenetic representations of virus distribution and migration within the population (13,14). Specifically, discrete trait analysis is a type of phylogeographic analysis that treats lineage migration between locations as if the location was a discrete trait; models relying on this analysis type assume that sample sizes across subpopulations are proportional to their relative size and random sampling occurs (15). If 1 population is oversampled, large biases are expected in model output (15). This concern extends beyond state or country borders because representative sampling is often assumed for contextual data, which provides the backdrop upon which phylogenetic inference is based.
We describe implementing a sentinel surveillance system that enables pairing of genomic and epidemiologic data. In addition, we assessed representativeness and timeliness of genomic data availability before and after system implementation. By performing this evaluation, we provide information regarding populations of sampled cases and limitations on inference affecting genomic data use. To support planning efforts to obtain more equitable and representative sampling, we identified subpopulations that might be systematically excluded from sequencing surveillance. More broadly, we raise awareness regarding sampling bias in convenience-based genomic surveillance systems and support development of robust genomic surveillance systems in additional jurisdictions.

Sentinel Surveillance System Design
In March 2021, the Washington State Department of Health partnered with multiple laboratories to establish a sentinel surveillance program to monitor genomic epidemiology of SARS-CoV-2 within the state. Partner laboratories were selected to maximize geographic coverage and specimen numbers. The initial proportion of randomly selected positive specimens submitted for sequencing was designed to balance geographic coverage regionally and match available sequencing capacity; statewide case coverage varied from 8% to 25% during the study period (16). In addition to the Washington State Public Health Laboratories, the 6 sentinel laboratories are Atlas Genomics, Confluence Health/Central Washington Hospital, Interpath Laboratories, Incyte Diagnostics Spokane, Northwest Laboratories, and University of Washington Virology Division. PCR cycle threshold (Ct) is capped at 30 for this surveillance system. The surveillance program is supplemented by a national surveillance effort supported by the Centers for Disease Control and Prevention (CDC), which includes multiple commercial laboratories sequencing randomly selected specimens (2). Methods for next-generation sequencing vary across laboratories, but >90% of sequences are generated by using an Illumina platform (https://www.illumina.com); assembly methods also vary.

Study Population Evaluation
We included all confirmed COVID-19 cases (SARS-CoV-2 RNA detected by molecular amplification) reported in the Washington Disease Reporting System from January 21, 2020, through December 31, 2021. Using laboratory accession numbers or patient demographics, we linked those cases to sequences uploaded to the GISAID EpiCoV database (5-7) from January 21, 2020, through January 31, 2022, that indicated the state of Washington in the geographic tag. We classified cases as presentinel surveillance if specimens were sequenced before March 1, 2021. We classified cases as sentinel surveillance if specimens were sequenced on or after March 1, 2021, and submitted through the Washington State Department of Health sentinel surveillance program, or if the sequencing laboratory indicated that specimens were randomly selected. Specimens specifically selected for targeted sequencing as part of outbreak investigations because of travel history, known vaccine breakthrough status, or spike gene target failures were not considered sentinel surveillance if sampled outside the random selection process. Washington State and University of Washington Institutional Review Boards determined this project to be a surveillance activity and exempt from review.
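The classification rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the boolean flags, and the "nonsentinel" label for targeted specimens are hypothetical and are not taken from the study's actual code.

```python
from datetime import date

SENTINEL_START = date(2021, 3, 1)  # cutoff date stated in the text


def classify_case(sequenced_on, via_sentinel_program, randomly_selected, targeted):
    """Label one sequenced case (hypothetical helper, not the study's code).

    sequenced_on: date the specimen was sequenced
    via_sentinel_program: submitted through the state sentinel program
    randomly_selected: sequencing laboratory reported random selection
    targeted: chosen for targeted sequencing (outbreak, travel history,
              vaccine breakthrough, spike gene target failure)
    """
    if sequenced_on < SENTINEL_START:
        return "presentinel"
    # Targeted specimens sampled outside random selection are not sentinel.
    if targeted and not randomly_selected:
        return "nonsentinel"  # assumed label for cases in neither group
    if via_sentinel_program or randomly_selected:
        return "sentinel"
    return "nonsentinel"


# Example: a randomly selected specimen sequenced in April 2021
label = classify_case(date(2021, 4, 15), True, True, False)
```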

Data Analysis
We assessed representativeness of data before and after implementing sentinel surveillance by comparing COVID-19 cases with sequenced specimens to all COVID-19 cases during the same period according to sex, age, race, ethnicity, language, long-term care facility (LTCF) association, occupation, county of residence, outbreak association, travel history, hospitalization, or death. All epidemiologic data analyses were performed using R version 4.0.3 (17). We compared categorical data by using Pearson χ² test or the formula Σ(|E-O|)/E, where E was expected and O observed counts. Expected counts were calculated by standardization to overall reported cases during the same period. We visualized geographic comparisons by mapping standardized ratios of observed versus expected cases at the county level. We graphed the percentage of cases with sequenced specimens by county and month to visualize spatiotemporal sampling. We evaluated areas with high presentinel sequencing coverage and high or low sentinel sequencing coverage to determine representativeness because data from those areas enabled robust phylogeographic studies.
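As an illustration of the comparison described above, the following Python sketch computes expected counts by standardization, the Σ(|E-O|)/E measure, the Pearson χ² statistic, and observed/expected ratios. It assumes the sum is taken per category as Σ|E_i - O_i|/E_i and that every category contains at least 1 reported case; the function names and the example counts are invented for illustration and are not study data.

```python
def expected_counts(all_case_counts, n_sequenced):
    """Scale the distribution of all cases to the number of sequenced cases."""
    total = sum(all_case_counts.values())
    return {k: n_sequenced * v / total for k, v in all_case_counts.items()}


def abs_deviation_score(observed, expected):
    """Sum over categories of |E - O| / E (assumed reading of the formula)."""
    return sum(abs(expected[k] - observed.get(k, 0)) / expected[k] for k in expected)


def pearson_chi2(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E."""
    return sum((observed.get(k, 0) - expected[k]) ** 2 / expected[k] for k in expected)


def standardized_ratios(observed, expected):
    """Observed/expected ratio per category (for example, per county)."""
    return {k: observed.get(k, 0) / expected[k] for k in expected}


# Invented example: 1,000 cases overall, 100 of them sequenced
all_cases = {"<65 y": 900, ">=65 y": 100}
sequenced = {"<65 y": 70, ">=65 y": 30}

exp = expected_counts(all_cases, sum(sequenced.values()))
score = abs_deviation_score(sequenced, exp)   # 20/90 + 20/10
chi2 = pearson_chi2(sequenced, exp)           # 400/90 + 400/10
ratios = standardized_ratios(sequenced, exp)  # >=65 y is 3-fold overrepresented
```

Because deviations are squared in the χ² statistic, the single large discrepancy contributes 40 of about 44.4, whereas it contributes 2.0 of about 2.2 to the absolute-deviation score.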
To determine variability of genomic data, we constructed phylogenetic trees for 4 scenarios using the Nextstrain (18) pipeline for SARS-CoV-2. The scenarios were presentinel surveillance with high coverage, low representativeness; presentinel surveillance with high coverage, high representativeness; sentinel surveillance with high coverage, high representativeness; and sentinel surveillance with low coverage, low representativeness. We performed rarefaction analysis to examine how sampling affected the diversity of sequences captured in each of those 4 scenarios. For each value from 1 to n, where n is the total number of available sequences for a location/timeframe of interest, we generated 10 subsampled datasets (sampling without replacement). We counted and plotted the number of unique haplotypes as a function of the number of sampled sequences.
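The rarefaction procedure described above could be sketched as follows. This is a minimal Python illustration, not the authors' implementation: haplotypes are assumed to be represented as hashable labels (for example, sequence strings), the function name is hypothetical, and averaging the 10 replicate counts per sample size is an assumption made here for plotting.

```python
import random
from statistics import mean


def rarefaction_curve(haplotypes, replicates=10, seed=0):
    """Return [(k, mean unique haplotypes)] for k = 1..n sampled sequences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(haplotypes)
    curve = []
    for k in range(1, n + 1):
        # Draw k sequences without replacement and count distinct haplotypes
        counts = [len(set(rng.sample(haplotypes, k))) for _ in range(replicates)]
        curve.append((k, mean(counts)))
    return curve


# Invented example: 8 sequences carrying 3 distinct haplotypes
toy_haplotypes = ["A", "A", "A", "B", "B", "C", "A", "B"]
curve = rarefaction_curve(toy_haplotypes)
```

A curve that is still rising at k = n suggests that additional sampling would capture more haplotypes.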
We assessed timeliness of data by comparing the interval between initial specimen collection and genomic data upload to the GISAID database. We assessed median timeliness by month and compared categorical data uploaded within <14 days, 14-27 days, and >28 days after specimen collection.
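A minimal Python sketch of the timeliness calculation follows. It rests on 2 assumptions not stated in the text: months are assigned by specimen collection date, and an interval of exactly 28 days is placed in the longest category. The function names and example dates are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def lag_days(collected, uploaded):
    """Days between specimen collection and upload to GISAID."""
    return (uploaded - collected).days


def lag_category(days):
    """Bin an interval; placing exactly 28 days in the last bin is an assumption."""
    if days < 14:
        return "<14 days"
    if days < 28:
        return "14-27 days"
    return ">=28 days"


def median_lag_by_month(records):
    """records: iterable of (collection_date, upload_date) pairs."""
    by_month = defaultdict(list)
    for collected, uploaded in records:
        # Assumed grouping: month of specimen collection
        by_month[(collected.year, collected.month)].append(lag_days(collected, uploaded))
    return {month: median(lags) for month, lags in sorted(by_month.items())}


# Invented example records
records = [
    (date(2021, 8, 1), date(2021, 8, 27)),
    (date(2021, 8, 10), date(2021, 8, 20)),
    (date(2021, 12, 1), date(2021, 12, 16)),
]
medians = median_lag_by_month(records)
```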

Results
During the presentinel surveillance period, 10,653 (3.3%) COVID-19 cases had sequencing information available, compared with 56,106 (12.1%) cases sampled during sentinel surveillance. For all categorical comparisons using Pearson χ² tests, we observed statistically significant differences between presentinel and sentinel cases that had sequencing data. To avoid having a single large discrepancy dominate the representativeness measurement, we used the formula Σ(|E-O|)/E instead of Pearson χ² test to directly compare representativeness between populations (Table).
Both presentinel and sentinel cases with sequencing data were generally representative of all COVID-19 cases for sex at birth. During the presentinel surveillance period, older age groups and hospitalized persons with sequenced specimens were overrepresented. Persons who died of COVID-19 were overrepresented by ≈3-fold among presentinel cases with sequencing data compared with cases that had no sequencing data. Sentinel surveillance implementation resolved overrepresentation of decedents, but persons with COVID-19 who were hospitalized or >65 years of age were underrepresented.
Early during the pandemic, specimens from known outbreak-associated COVID-19 cases were more commonly sequenced, likely reflecting preferential sample selection of those cases for studies. Similarly, sequencing of specimens from LTCF-associated COVID-19 cases was enriched by 2.5-fold. Sentinel surveillance implementation decreased but did not completely resolve enrichment of outbreak-associated cases, whereas LTCF-associated case enrichment was substantially resolved.
Presentinel COVID-19 cases with sequenced specimens had more complete symptom information when compared with all COVID-19 cases. Both presentinel and sentinel cases with sequenced specimens had symptom information reported more frequently compared with all cases.
Persons self-reporting as a racial or ethnic minority were generally overrepresented among presentinel COVID-19 cases with sequenced specimens; race/ethnicity data were less likely to be missing among those cases than among total COVID-19 cases. After sentinel surveillance implementation, persons reporting Hispanic ethnicity or Spanish language preference were overrepresented among COVID-19 cases with sequenced specimens. Differences in missing race data were resolved after sentinel surveillance implementation.
Industry information was missing for most cases. According to the available industry information, agriculture, forestry, fishing and hunting, and healthcare and social assistance were overrepresented among cases with sequenced specimens. Industry information was missing for >90% of cases during the sentinel surveillance period; therefore, industry representation was not assessed in this study.
More persons with sequenced specimens during the presentinel period traveled outside the United States than expected, indicating likely enrichment for international travelers. Travel information was missing for >95% of cases during the sentinel surveillance period; therefore, traveler representation was not assessed in this study. Reinfection data were captured starting on September 1, 2021; therefore, case-level data were not available for most of the study period. From September through December 2021, reinfection cases were underrepresented in the sequencing data, which might reflect a higher average Ct in this population.
Before sentinel surveillance implementation, geographic sequencing coverage was variable and focused on western Washington (Figure 1); King, San Juan, Pacific, and Yakima Counties had high coverage. Some areas of the state had little or no data available. After sentinel surveillance implementation, geographic coverage equalized regionally across the state; variable coverage because of sentinel laboratory service areas occurred as expected (Figure 1).
We investigated representativeness further in areas with high presentinel sequencing coverage and high case numbers (Appendix Figure 1, https://wwwnc.cdc.gov/EID/article/29/2/22-1482-App1.pdf). During March-June 2020, Yakima County had 19%-30% sequencing coverage for all COVID-19 cases; high-quality genomic data were available for 1,696 cases. High coverage was partially driven by sequencing specimens from LTCF-associated cases. A total of 25% of cases with sequenced specimens were affiliated with LTCFs, compared with 11% of all COVID-19 cases during that period. Persons with sequenced specimens were more commonly >65 years of age and less commonly of Hispanic descent or with Spanish language preference.
We performed phylogenetic analysis of all sequenced specimens from Yakima County cases with COVID-19 onset dates during March-June 2020 (Appendix Figure 2, panel A). Sequencing coverage was also high in Yakima County in February 2021. Sequencing coverage was 26% across all COVID-19 cases, and high-quality genomic data were available for 271 cases. During this period, we observed smaller differences between cases with sequenced specimens and all cases for ethnicity and outbreak-association; otherwise, cases with sequenced specimens were largely representative of all cases during this time. We performed phylogenetic analysis of Yakima cases during February 2021 (Appendix Figure 2, panel B). The most common lineage identified was 21C (Pango lineage B.1.427/429 or Epsilon), representing 33% of sequences, then 20G (Pango lineage B.1.2) at 29%, 20A at 13%, 20B at 9%, and 20C at 15%. In Washington, 30% of sequences in GISAID were Epsilon in February 2021.
After sentinel surveillance implementation, variability in geographic coverage was diminished regionally but persisted at the county level. We investigated counties with high and low sentinel sequencing coverage to determine effects of variable sentinel specimen sampling. We specifically compared Whatcom County, a county with high coverage from a sentinel laboratory, and Clark County, a county with low coverage. During the sentinel surveillance period, cases with sequenced specimens from Whatcom County were representative of all COVID-19 cases from the county for age, sex, race, death from COVID-19, and LTCF-association. Persons hospitalized for COVID-19 were underrepresented among sentinel surveillance cases, reflecting statewide findings. Outbreak-associated cases and symptomatic persons were slightly overrepresented among sentinel surveillance cases. We performed phylogenetic analysis of cases from Whatcom County during the sentinel surveillance period (Appendix Figure 2, panel C) and showed a transition from clade 20I (Alpha) to 21A/21I/21J (Delta) dominance, similar to what was observed in Washington overall.
Clark County had very low sequencing coverage over the sentinel surveillance period, ranging from 0.8% of cases in April 2021 to 4.9% of cases in June 2021. Persons <45 years of age and outbreak-associated cases were overrepresented among cases with sequenced specimens, and hospitalized persons were underrepresented. We performed phylogenetic analysis of cases from Clark County during the sentinel surveillance period (Appendix Figure 2, panel D). Despite limited coverage, we observed a variant profile similar to that of Whatcom County and Washington overall. We performed rarefaction analysis and found sentinel sampling from Clark and Whatcom counties displayed higher viral diversity than Yakima County at 2 presentinel timepoints (Figure 2). Additional sampling will be required in all scenarios to fully capture circulating viral diversity.
Timeliness of available genomic data in the GISAID database varied over the study period (Figure 3). During the presentinel period, median timeliness ranged from 23 days in February to 98 days in October of 2020; >50% of sequences were uploaded to GISAID >28 days after specimen collection for most months. During the sentinel period, median timeliness was 26 days in August and 15 days in December of 2021; most sequences were uploaded to GISAID <28 days after specimen collection in all months after sentinel surveillance implementation.

Discussion
After a sentinel surveillance system for sequencing SARS-CoV-2 specimens was implemented in Washington, the available data were more epidemiologically and genomically representative of all COVID-19 cases and timelier than data before sentinel surveillance began. Specifically, representativeness of age, death from COVID-19, outbreak-association status, LTCF-affiliated status, and geographic coverage improved; increased viral diversity was also noted. Before sentinel surveillance began, we were unable to identify a county or period with representative sampling, except for Yakima County during February 2021. After implementation, representativeness improved across multiple areas. Increased representativeness is a critical achievement because genomic data are routinely available to public health leaders and decisionmakers; ensuring equitable sampling coverage has substantial implications for response planning and interventions. Measuring effects of genomic surveillance on public health responses in Washington was not included in this study; however, methods for measuring and evaluating effectiveness should be explored.
Overrepresentation of older persons in presentinel genomic data was partly driven by selection of LTCF-associated COVID-19 cases and COVID-19 cases resulting in hospitalization or death. After sentinel surveillance began, the decrease in representation of persons >65 years of age improved overall representativeness but actually resulted in undersampling this age group, possibly indicating poor sequencing coverage by facilities where this population seeks care. Indeed, the sentinel surveillance system underrepresents hospitalized cases; further consideration is needed to improve data capture of both inpatient and outpatient COVID-19 cases. Before sentinel surveillance, outbreak-associated and symptomatic COVID-19 cases were oversampled. After implementation, overrepresentation of those cases decreased but was not resolved. At least 3 possible explanations exist for those findings: specimens from symptomatic SARS-CoV-2-infected persons are more likely to be sequenced because of higher average viral loads, which improves sequencing success; asymptomatic persons might be detected through screening programs not associated with sentinel laboratories; and outbreak-associated specimens might be sent to sentinel laboratories to ensure sequencing for investigative purposes. Random sampling among specimens received at sentinel laboratories could, thereby, still lead to biased samples.
Minority race and ethnicity were more commonly reported among presentinel cases with sequenced specimens; data were also more complete among those cases. Whether true overrepresentation occurred or race data were differentially missing among all cases is unclear. After sentinel surveillance implementation, persons reporting Hispanic ethnicity and Spanish language preference were overrepresented compared with overall cases statewide, which likely reflects the catchment areas of sentinel laboratories. Geographic coverage variability was identified during both presentinel and sentinel surveillance periods. Presentinel coverage focused on western Washington, where laboratories were connected to sequencing capacity. Sentinel surveillance enabled access to sequencing for additional laboratories and ensured greater equitable regional coverage, although variability at the county and subcounty levels remains. Variable coverage and representativeness at the substatewide level should be considered when using genomic data for specific analyses. Increasing geographic coverage will require additional sentinel laboratories that contribute specimens from areas of low coverage.
Other epidemiologic information was of interest in assessing representativeness, including industry and occupation, travel history, and reinfection status. However, data for those variables were incomplete, limiting their usefulness. As public health systems pivot away from capturing data through individual case interviews, datasets available for assessing sampling of specimens for sequencing should be considered. The full potential of genomic epidemiologic surveillance for improving public health requires pairing epidemiologic metadata with genomic data.
Viral diversity has been and continues to be dynamic over the course of the COVID-19 pandemic. Measuring true viral diversity requires random or complete sampling. Actual circulating viral diversity likely differed across locations and timepoints included in our study; if circulating diversity generally increased over time, our conclusions would be biased toward assumption of improved capture because of surveillance.
Other states and countries have used various practices to select SARS-CoV-2 specimens for sequencing. Methods that rely on convenience samples, such as our presentinel system, likely have sampling biases that affect phylogenetic inference. In those settings, weighting cases for inclusion in estimates by using selection probabilities might help to correct bias. Alternatively, approaches to correct for nonrepresentative sampling during analysis, such as inverse probability weighting, should be considered. Even after a sentinel surveillance system is put in place, some biases remain, such as undersampling of hospitalized cases, that should be corrected by diversifying sources of specimens. Ongoing evaluation and improvement of systems is necessary, especially in the context of performing epidemiologic studies. Many epidemiologic studies of COVID-19 have availability of genomic data as an inclusion criterion; if sampling biases are not clarified, biased conclusions might be drawn. Codevelopment of genomic epidemiology programs alongside bioinformatics programs is needed in public health departments because epidemiologic and phylogenetic analyses are best performed after sampling methods and data limitations are considered.
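To illustrate how inverse probability weighting might correct an estimate drawn from a nonrepresentative sample, the Python sketch below weights each sequenced case by the reciprocal of its assumed probability of being sequenced and computes a self-normalized weighted proportion. This analysis was not performed in the study; the strata, selection probabilities, counts, and function name are all invented for illustration, and in practice the selection probabilities would need to be estimated.

```python
def ipw_proportion(samples, selection_prob):
    """Weighted proportion of sampled cases that carry a trait.

    samples: list of (stratum, has_trait) pairs for sequenced cases
    selection_prob: stratum -> assumed probability that a case was sequenced
    """
    numerator = denominator = 0.0
    for stratum, has_trait in samples:
        weight = 1.0 / selection_prob[stratum]  # inverse probability weight
        numerator += weight * has_trait
        denominator += weight
    return numerator / denominator


# Invented example: LTCF cases are sequenced 5 times as often as community cases
selection_prob = {"ltcf": 0.25, "community": 0.05}
samples = (
    [("ltcf", 1)] * 10 + [("ltcf", 0)] * 40
    + [("community", 1)] * 25 + [("community", 0)] * 25
)

naive = sum(has_trait for _, has_trait in samples) / len(samples)  # 35/100
weighted = ipw_proportion(samples, selection_prob)                 # 540/1200
```

In this invented example the oversampled stratum has a lower trait frequency, so the naive proportion (0.35) understates the weighted estimate (about 0.45).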
Although representativeness and timeliness were the focus of this study, other features should be considered in the design of surveillance systems, such as simplicity, flexibility, sensitivity, and stability (4). Sentinel surveillance systems are complicated and require ongoing coordination with laboratory partners; stability requires public health resources. Alternative systems to enable representativeness and timeliness while increasing simplicity and stability could include requirements for specimen submission, such as those commonly used for foodborne pathogens and other notifiable conditions. Sensitivity is essential for the surveillance system goals of rare variant detection and timely surveillance of circulating virus variants. Right-size sampling, such as that performed for influenza surveillance, should be considered (19; S. Wohl et al., unpub. data, https://www.medrxiv.org/content/10.1101/2021.12.30.21268453v1).
Even after careful consideration of surveillance system design for pathogen sequencing and pairing with epidemiologic data, limitations remain because of specimen requirements for sequencing. Studies using surveillance sequencing data should report the following limitations: application of laboratory-based diagnostic testing might depend on many factors that are difficult to assess and increasingly complex because of availability of improved at-home testing, and, among positive test results, those with a low PCR Ct are more likely to be sequenced. Therefore, representativeness of sequencing data is inherently limited.
Assessment of representativeness during presentinel and sentinel surveillance is limited in the causal inferences that can be drawn. Other concurrent factors might have affected representativeness and timeliness during this study period. For example, CDC surveillance efforts were also increased during this timeframe; samples sequenced under CDC surveillance were coded as sentinel and were analyzed as part of the sentinel surveillance system in Washington.
In conclusion, implementing a sentinel surveillance system for sequencing SARS-CoV-2 specimens was associated with improved genomic and epidemiologic representativeness and timeliness of available sequence data in Washington. Ongoing evaluation and improvements will be necessary to ensure representative capture of inpatient settings. As public health leaders discuss changes to COVID-19 surveillance systems nationally, datasets required to assess representativeness of sampling for sequencing should be considered. Cross-jurisdictional sampling bias is a concern when validating phylogeographic methods applications; attention to sampling will improve the usefulness of those datasets for public health practice.

Figure 1. Geographic extent of sequencing data available for COVID-19 cases in study of sentinel surveillance system implementation and evaluation for SARS-CoV-2 genomic data, Washington, USA, 2020-2021. A) Presentinel surveillance (specimens sequenced before March 1, 2021). B) Sentinel surveillance (specimens sequenced on or after March 1, 2021, through the sentinel surveillance program). Standardized ratios (observed/expected counts) of cases with sequenced specimens are indicated by county. No sequence data were available for 3 counties during the presentinel period.

Figure 2. Rarefaction analysis of virus haplotype diversity in Yakima, Clark, and Whatcom Counties in study of sentinel surveillance system implementation and evaluation for SARS-CoV-2 genomic data, Washington, USA, 2020-2021. Presentinel COVID-19 cases (sequenced before March 1, 2021) with sequenced specimens from Yakima County (2 timepoints) were compared with sentinel COVID-19 cases (sequenced on or after March 1, 2021, through the sentinel surveillance program) with sequenced specimens in Clark and Whatcom Counties. Haplotype count indicates virus diversity.

Table. Comparison of demographic characteristics between COVID-19 cases with sequenced specimens and all confirmed COVID-19 cases in study of presentinel and sentinel surveillance system implementation and evaluation for SARS-CoV-2 genomic data, Washington, USA, 2020-2021*
*Values are no. or no. (%). We included all confirmed COVID-19 cases (SARS-CoV-2 RNA detected by molecular amplification) reported among Washington residents from January 21, 2020, through December 31, 2021, in the Washington Disease Reporting System. E, expected counts; LTCF, long-term care facility; NA, not applicable; O, observed counts.
†Cases were classified as presentinel if specimens were sequenced before March 1, 2021.
‡Cases were classified as sentinel if specimens were sequenced on or after March 1, 2021, through the sentinel surveillance program.
§Formula used to directly compare representativeness between populations.
¶Counts <10 are censored.