Genomic Epidemiology of Severe Acute Respiratory Syndrome Coronavirus 2, Colombia

Coronavirus disease (COVID-19) in Colombia was first diagnosed in a traveler arriving from Italy on February 26, 2020. However, limited data are available on the origins and number of introductions of COVID-19 into the country. We sequenced the causative agent of COVID-19, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), from 43 clinical samples we collected, along with another 79 genome sequences available from Colombia. We investigated the emergence and importation routes for SARS-CoV-2 into Colombia by using epidemiologic, historical air travel, and phylogenetic observations. Our study provides evidence of multiple introductions, mostly from Europe, and documents >12 lineages. Phylogenetic findings validate the lineage diversity, support multiple importation events, and demonstrate the evolutionary relationship of epidemiologically linked transmission chains. Our results reconstruct the early evolutionary history of SARS-CoV-2 in Colombia and highlight the advantages of genome sequencing to complement COVID-19 outbreak investigations.

Coronavirus disease  in Colombia was first diagnosed in a traveler arriving from Italy on February 26, 2020. However, limited data are available on the origins and number of introductions of COVID-19 into the country. We sequenced the causative agent of COVID-19, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), from 43 clinical samples we collected, along with another 79 genome sequences available from Colombia. We investigated the emergence and importation routes for SARS-CoV-2 into Colombia by using epidemiologic, historical air travel, and phylogenetic observations. Our study provides evidence of multiple introductions, mostly from Europe, and documents >12 lineages. Phylogenetic findings validate the lineage diversity, support multiple importation events, and demonstrate the evolutionary relationship of epidemiologically linked transmission chains. Our results reconstruct the early evolutionary history of SARS-CoV-2 in Colombia and highlight the advantages of genome sequencing to complement COVID-19 outbreak investigations. public databases, mainly GISAID (https://www. gisaid.org). SARS-CoV-2 is an RNA virus with an estimated substitution rate of 0.8-1.1 × 10 -3 substitutions/site/year (S. Duchene et al., unpub data, https://www.biorxiv.org/content/10.1101/2020.05. 04.077735v1; M. Worobey et al., unpub. data, https:// www.biorxiv.org/content/10.1101/2020.05.21.10932 2v1), which means it rapidly evolves as it is transmitted. The availability of SARS-CoV-2 genomes enabled us to detect a rapidly generating variation, demonstrating that genomic epidemiology is a powerful approach for characterizing the outbreak (14). Genomic epidemiology relies on phylogenetic analysis and has enabled researchers across the world to detect SARS-CoV-2 emergence in humans, reveal the importation and local transmission chains not detected by travel history and traditional contact-tracing strategies, and trace the geographic spread and prevalence of strains bearing specific mutations of epidemiologic relevance (15)(16)(17)

Sample Collection and Preparation
Colombia is made up of 32 departments, which are groups of municipalities, and a capital district. INS received nasopharyngeal swabs samples from patients with clinical signs and symptoms of SARS-CoV-2 from departments across the country as part of the virological surveillance of COVID-19. INS performed quantitative reverse transcription PCR to diagnose suspected COVID-19 cases by using a method recommended and transferred by the Pan American Health Organization and World Health Organization (18). Because of scarce resources, we selected a total of 43 samples for genome sequencing that represented >1 of the earliest documented samples in each affected department or samples linked to transmission chains (Appendix 1

Genomic Library Preparation and Sequencing
Library preparation and sequencing were performed following the ARTIC network (https://artic.network) real-time molecular epidemiology for outbreak response protocol and by using both nanopore and next-generation sequencing technologies (19). We processed 10 samples by using the MinION sequencer (Oxford Nanopore Technologies, https:// nanoporetech.com). We processed the remaining 33 samples by using the Nextera XT DNA library prep kit (Illumina, https://www.illumina.com) and performed sequencing by using the MiSeq Reagent Kit Version 2 and MiSeq sequencer (Illumina).

Genomic Sequence Assembly
We performed base calling on nanopore reads by using Guppy version 3.2.2 (Oxford Nanopore Technologies) and then demultiplexed and trimmed reads by using Porechop version 0.3.2_pre (20). We aligned processed reads against a SARS-CoV-2 reference genome (GenBank reference no. NC_045512.2) by using Burrows-Wheeler Aligner's Smith-Waterman Alignment (21). We performed base calling for single-nucleotide variants with a depth of >200× and then generated polished consensus by using Nanopolish version 0.13.2 (22). MiSeq reads were demultiplexed and we used fastp (23) to perform quality control using a Q-score threshold of 30. Processed reads were aligned against the SARS-CoV-2 reference genome, we performed bas calling for single nucleotide variants with a depth of >100× and generated consensus genomes by using Burrows-Wheeler Aligner's Smith-Waterman Alignment version 0.7.17 (21) and BBMap (24).

Phylogenetic Analysis of SARS-CoV-2 in Colombia
Sequence data covered the 20 affected departments and the capital district of Colombia. We collected 43 SARS-CoV-2 genome sequences from this study and 79 other sequences from Colombia deposited in GISAID. We combined the 122 sequences from Colombia with 1,461 representative genome sequences from South America-focused subsampling available from NextStrain (https://nextstrain.org) (25) as of May 20, 2020 (Appendix 1 Table 2) plus reference MN908947.3 from the GenBank nucleotide database (accesssion no. NC_045512). Across departments, a median of 1.5 sequences (mean 3.9; range 1-45) were available per department. We classified the full genomic dataset into lineages by using Phylogenetic Assignment of Named Global Outbreak LINeages (PANGOLIN) and aligned these with 10 iterative refinements by using MAFFT (26)(27)(28). We removed all alignment positions flagged as problematic for phylogenetic inference, including highly homoplasic positions and 3′ and 5′ ends (29). We performed maximum-likelihood phylogenetic reconstruction on the curated alignment and a Hasegawa-Kishino-Yano plus gamma distribution 4 substitution model by using IQ-TREE (30,31). We estimated branch support by using an SH-like approximate likelihood ratio test (SH-aLRT) and considered >0.75 a high SH-aLRT (32). We removed 6 sequences from Colombia from further analysis because they had an inconsistent temporal signal in a clock analysis in TreeTime (33). We inferred time-scaled trees and rooted these with least-squares criteria and the evolutionary rate of >1.1 × 10 -3 substitutions/site/year estimated by S. Duchene et al. (unpub data, https://www.biorxiv. org/content/10.1101/2020.05.04.077735v1) by using TreeTime (33) and least-squares dating (34).
We considered geographic locations of sequence data, aggregated by continent except for Colombia, as discrete states, used these data for migration inference, and modeled transitions as a time reversible process by using TreeTime (33). We interpreted the number of state transitions into Colombia as a proxy for the minimum number of introductions.
In sensitivity analysis and to measure the effect of the SARS-CoV-2 uneven genomic representativeness across the world, we implemented 2 downsampling strategy datasets in which, based on location, the sequences were randomly resampled 100 times and the phylogenetic and migration inference was replicated. The downsampling strategies were as follows: retaining several sequences per region, when possible, equal to the number of sequences available for Colombia; or retaining 50 sequences per region and the total number of sequences from Colombia, which was the most even sampling per region for the South America-focused subsample.

Potential Routes of SARS-CoV-2 Importation into Colombia
We inferred the relative proportion of expected SARS-CoV-2 importations per country by considering COVID-19 incidence per number of international air passengers arriving in Colombia and the available flight travel. We obtained the number of international flights and number of passengers arriving during January 1-March 9, 2020 from the Special Administrative Unit of Civil Aeronautics of Colombia (Aerocivil, http://www.aerocivil.gov.co). The air travel data consists of direct flights from 14 countries to 7 main cities. We calculated COVID-19 incidence for each of the 14 countries with direct flights to Colombia by using the number of confirmed cases reported by the World Health Organization as of March 17, 2020, the date when travel restrictions started in Colombia (35)

Ethics Statement
According to the national law 9/1979, decrees 786/1990 and 2323/2006, the Instituto Nacional de Salud is the reference lab and health authority of the national network of laboratories and in cases of public health emergency or those in which scientific research for public health purposes as required, the Instituto Nacional de Salud may use the biological material for research purposes, without informed consent, which includes the anonymous disclosure of results. The information used for this study comes from secondary sources of data that were previously anonymized and do not represent a risk to the community.

Epidemiologic Investigation of SARS-CoV-2 Introductions, Contact-Tracing, and Community Transmission
In Colombia, preventive isolation and monitoring for passengers arriving from China, Italy, France, and Spain started on March 10, 2020.  (Figure 1, panel A).
Most ( The number of symptomatic imported cases steadily increased and peaked on March 14, when local cases were on the rise, but before border closures and the international air travel ban. Our estimate is based on the average incubation time of COVID-19 (44) and symptom onset but is 4.8 days earlier than the actual peak on March 18 ( Figure 1, panel B). Initial introductions were predominantly linked to Europe; however, both Europe and the Americas were prominent geographic sources of infections during the onset of the epidemic. The introductions after the peak mainly occurred from countries in South America.

SARS-CoV-2 Diversity
To elucidate the dynamics of SARS-CoV-2 spread into Colombia, we combined the 43 whole-genome sequences obtained in our study with sequences from Colombia deposited in GISAID, which provided a set of 122 complete genomes. Sequences from Colombia were classified into 12 sublineages: A.1.2, A.2, A.5, B,  B.1, B.1.1, B.1.3, B.1.5, B.1.8, B.1.11, B.2,  On average, we identified 1 lineage per department. For instance, the number of documented lineages was highly correlated with the availability of samples (Pearson product-moment correlation coefficient [PPMCC] = 0.72; p<0.001) and uncorrelated with the number of local cases (PPMCC = 0.35; p = 0.049). We noted 5 different lineages in the departments of Valle del Cauca and Antioquia and 3 different lineages in Cundinamarca; these departments have the most populated capitals and we had more samples from them ( Figure 2, panel B). We observed a moderate positive correlation between the number of lineages documented in a department and the number of imported cases (PPMCC = 0.51; p = 0.002).

Molecular Evolution of SARS-CoV-2 in Colombia
We identified 133 single-nucleotide variants (NVs) by using the full genome sequences from Colombia and the reference sequence (GenBank accession no. NC_045512.2). Most NVs (131; 98.5%) fell into the coding region, and 1 NV was identified at each noncoding end. Among NVs in coding sites, 71 (54.2%) led to nonsynonymous substitutions. Most NVs (92/133) were unique to a sequence. Among the shared NVs, 38/41 were associated with a specific lineage (Appendix 1 Tables 3, 4). These observations suggest that the substitutions are not laboratory-specific and most likely the outcome of in situ evolution, shared ancestry, or both (Appendix 2).
Most patients from Colombia for whom genomic sequences were available were symptomatic (n = 90); 59.6% had cough and fever and the others had >1 symptom; 10 died, 70% of whom had underlying conditions (Appendix 1 Table 1). However, given the limited number of sequences available, we could not reliably investigate any genomic determinant of clinical outcome.

Evolutionary Relationships between Local and Global SARS-CoV-2 Isolates
The time-stamped phylogeny of 122 isolates from Colombia and 1,462 representative global SARS-CoV-2 isolates showed that the estimated time to the most recent common ancestor for the sampled sequence data is December 7, 2019 (range October 25-December 26, 2019) (Figure 3, panel A). Asia was the inferred ancestral state at the root. Both these observations are in line with the known epidemiology of the pandemic. A root-to-tip regression of genetic distance against sampling time evidenced consistent temporal signal in the sequence data (Figure 3, panel B). The isolates from Colombia were interspersed among the isolates from other countries, suggesting multiple introductions ( Figure  3, panels A, C). However, considerable phylogenetic uncertainty appears along the tree and the fine-grained relationships of the isolates from Colombia could not be resolved with confidence (Appendix 2 Figure 1).
Phylogenetic uncertainty and uneven sampling made the quantification of the number of introductions 2858 Emerging Infectious Diseases • www.cdc.gov/eid • Vol. 26, No. 12, December 2020 During January-March 2020, a total of 7 cities in Colombia received 1,593,211 international passengers from 14 countries. Bogotá was the most concentrated city for flights, receiving around 77% of passengers; other cities included Medellín with 11%, Cartagena with 6%, and Cali with 4% of passengers. In total, 35% of international passengers started their journeys in the United States, 17% in Mexico, and 12% in Chile. However, we estimate 87% of all imported CO-VID-19 cases in Colombia came from Europe, 9.5% from North America, and 3.4% from South America. When stratified by country, the primary source of importation was Spain, which had 71.4% of imported cases; the United States had 8.4%, Germany had 8%, and France had 3.4% (Appendix 2 Figure 2). Our data show most (65.2%) COVID-19 cases were among travelers arriving in Bogotá; 20% were among those arriving in Medellín, and 9% among those arriving in Cali. We estimate that the Spain-Bogotá route carried 42% of the total imported cases.
Since the first COVID-19 case was identified in Colombia on February 26, 2020, contact-tracing efforts had been put in place. We obtained multiple sequences from 7 distinct early epidemiologically linked transmission chains (Appendix 1 Table 1) and mapped these data into the phylogeny (Figure 3, panel C). All but 1 set of sequences did not group, but it appeared very close in the tree. These data underscore the potential utility of genomic epidemiology to link persons with incomplete information, such as cases that are disconnected due to intermediate asymptomatic carriers, and complement outbreak transmission investigations.
Our study has some limitations. First, the geographic sources of infection relied on persons selfreporting symptom onset and travel histories, which are subject to inaccuracies. Second, we used air travel data from likely destinations in Colombia, but other locations also might have fueled COVID-19 emergence and dissemination in the country; flight travel data was not available for dates after March 9, 2020. Third, the number of sequences sampled represented a tiny fraction of the documented number of imported cases into Colombia. The sample was selected as a countrywide representation, given limited resources for genome sequencing; thus, the introduced viral diversity also might have been underestimated. Another limitation is the inherent uncertainty stemming from global unsystematic sampling. Therefore, the inferences about the number of introductions and the corresponding geographic sources should be interpreted with caution. We attempted to overcome this by undertaking sensitivity analyses and contrasting the results with the available epidemiologic data and our estimates from travel data. However, more sequence data from Colombia and undersampled countries, together with information of sampling representativeness per country, are needed to account for sampling uncertainty in a more statistically rigorous manner.

Discussion
We describe the complete genome sequences of SARS-CoV-2 from 43 clinical samples, results of an epidemiologic investigation of imported cases, and the phylogenetic findings of 122 genome sequences from Colombia that characterize the epidemic onset of CO-VID-19 in the country. Our study provides evidence that several independent COVID-19 introductions occurred in Colombia and documents >12 SARS-CoV-2 lineages. Most of the notified introductions to countries in Latin America occurred from Europe, an observation that was supported by phylogenetic and air travel data (48; C. Salazar et al., unpub data, https:// www.biorxiv.org/content/10.1101/2020.05.09.0862 23v1). Although the sequence data do not represent the actual number of epidemiologically linked transmission chains, our phylogenetic findings validated the linkage for epidemiologically linked transmission chains with available sequence data. Our results further underscore the advantages of genome sequencing to complement COVID-19 outbreak investigations and support the need for a more comprehensive country-wide study of the epidemiology and spread of SARS-CoV-2 in Colombia.