A novel multi SNP based method for the identification of subspecies and associated lineages and sub-lineages of the Mycobacterium tuberculosis complex by whole genome sequencing

The clinical phenotype of zoonotic tuberculosis, its contribution to the global burden of disease and prevalence are poorly understood and probably underestimated. This is partly because currently available laboratory and in silico tools have not been calibrated to accurately identify all subspecies of the Mycobacterium tuberculosis complex (Mtbc). We here present the first such tool, SNPs to Identify TB (‘SNP-IT’). Applying SNP-IT to a collection of clinical genomes from a UK reference laboratory, we demonstrate an unexpectedly high number of M. orygis isolates. These are seen at a similar rate to M. bovis which attracts much health protection resource and yet M. orygis cases have not been previously described in the UK. From an international perspective it is possible that M. orygis is an underestimated zoonosis. As whole genome sequencing is increasingly integrated into the clinical setting, accurate subspecies identification with SNP-IT will allow the clinical phenotype, host range and transmission mechanisms of subspecies of the Mtbc to be studied in greater detail.

isolates. These are seen at a similar rate to M. bovis which attracts much health protection resource and yet M. orygis cases have not been previously described in the UK. From an international perspective it is possible that M. orygis is an underestimated zoonosis. As whole genome sequencing is increasingly integrated into the clinical setting, accurate subspecies identification with SNP-IT will allow the clinical phenotype, host range and transmission mechanisms of subspecies of the Mtbc to be studied in greater detail. The host range of the various subspecies differs; for instance M. bovis is more frequently observed in animals than humans, in whom it is traditionally thought of as a zoonosis secondary to unpasteurized dairy consumption 3  The so-called 'animal' subspecies often grow poorly in culture, making drug-susceptibility testing difficult, and are not all well differentiated by commonly available PCR-based assays 11 . Their incidence and clinical significance is likely to be underestimated as a result. Routine whole-genome sequencing (WGS) of mycobacteria is now either underway or planned in an increasing number of settings 12 , providing the opportunity to better identify these subspecies and thereby improve our understanding of their clinical phenotype.
There is a notable geographic association with some subspecies of Mtbc . M. canettii for example is found almost exclusively in the Republic of Djibouti and the West African lineages ( Mtb lineages 5 and 6) are strongly associated with West Africa 13 . Even within lineages there are some sub-lineages with a global distribution and others that are more restricted 14 . The reasons for this are not fully known but may include an interplay between host and/or pathogen genetics as well as historical geographical distribution in environmental/animal reservoirs.
From a clinical perspective there are conflicting reports regarding effects of M. tuberculosis lineage on clinical phenotype 15,16 . Genotypes vary in their association with multidrug and extensively drug resistant TB. A clear example of this is the Beijing genotype (later re-named lineage 2). In Asia, Europe and South Africa this sub-grouping has been significantly associated with transmission of resistant forms of TB 17 . In addition there is an apparent association between the lineage 2/Beijing strain and increased pathogenicity and virulence whilst M. bovis subspecies bovis and M. canettii are known to be naturally resistant to pyrazinamide 18,19 . It is therefore important to identify these as the causative organism at an early stage and adjust treatment accordingly. Other phenotypic differences between lineages include in vitro growth rates, hydrophobicity, clinical outcomes, transmissibility and host inflammatory response 20 .
In addition to differences in natural susceptibility to antituberculosis drugs, the identification of the subspecies of a M. tuberculosis complex isolate has important implications for contact investigations and source case finding. M. bovis BCG can cause complications usually, but not always, in immunosuppressed patients following BCG vaccination or be isolated as a result of bladder carcinoma treatment and should not be misdiagnosed as M. tuberculosis 21,22 . M. microti is found in wild cats and rodents and causes human infection which is usually in association with rodent contact 23 . M. pinnipedii is a bacterium causing tuberculosis in seals which is 4 sometimes transmitted to humans during outbreaks in zoos or wildlife parks 24 . M. orygis is mostly isolated from gazelle species but in the past few years has also been seen in humans although how they contract this bacterium is still unclear 6 . The other animal subspecies are rarely encountered in wild animals. The specific clinical phenotype of animal strains is largely unknown at present. In order to facilitate studies to investigate this it is first necessary to develop a methodology to allow accurate identification.
Efforts to differentiate members of the Mtbc and study the phylogeny of the complex have thus far included analysis of large genomic deletions 25 , variable number tandem repeats (VNTR), spacer oligonucleotide typing, multilocus sequence analysis (MLST), and more recently single nucleotide based phylogenies 26 . Numerous tools now exist that make in silico predictions of lineages within the complex from WGS data, using a variety of approaches including the detection of single nucleotide polymorphisms (SNPs) from both unassembled and mapped genomes, comparison of de Brujin graphs, and minHash based comparisons 27-30 . None of these tools have yet been calibrated to reliably differentiate between all subspecies.
Whole genome sequencing has allowed greater understanding of the microbiological diversity within the complex. The diversity of clinical phenotypes associated with infection by animal lineages is largely unknown, partly because the identification of these organisms is currently difficult. Accurate identification of the causative subspecies in all cases would allow the burden of disease associated with animal lineages to be characterised and diversity in clinical phenotypes to be fully appreciated and subsequently better managed. A higher level of knowledge on the spread and host range of the subspecies would also provide a better basis on which to study the history of the evolutionary development of the complex as a whole. Here we use WGS data to identify SNPs that define each subspecies, lineage and sub-lineage within the Mtbc and use this SNP catalogue (implemented as 'SNPs to Identify TB', or 'SNP-IT') to characterize strains from a large collection of clinical isolates.
SNP-IT is freely available at github.com/samlipworth/SNP-IT

Calibration set
First, a set of isolates was defined from which to identify SNPs associated with subspecies, These were identified using a combination of spoligotyping, RFLP, HAIN Genotype MTBC, and VNTR in accordance with current and previous standard practice.

Bioinformatics
Parallel bioinformatics approaches were taken to assess applicability across pipelines. As such, reads from Illumina platforms were independently mapped to two different versions of the H37Rv reference genome. Reads were mapped to NC000962.3 with Breseq v0.28.1, using a minimum allele frequency of 80% and minimum coverage of 5, for SNP calls. Separately, reads were mapped again to NC000962.2, for which Snippy v3.1 was used with default settings (minimum coverage 10, minimum allele frequency 90%) 32,33 . We extracted all SNPs shared exclusively by members of each subspecies, lineage and sub-lineage. As the lineage-defining bases for lineage 4 as a whole do not vary from the reference genome, as it is itself lineage 4, we identified these positions by mapping a core SNP alignment to a maximum likelihood tree using Mesquite version 3.30 34 . These nucleotide loci were added to the catalogue of phylogeny-determining SNPs.
All newly sequenced genomes are available on NCBI under project accession number PRJNA418900 .

Validation set
To validate the algorithm, an independent collection of genomes (N = 516) was compiled using clinical isolates sequenced by Public Health England (PHE) Birmingham, identified as Mtbc , and not included in the calibration set. These were augmented with data from the public nucleotide archives to increase the representation of 'animal' subspecies (N = 102). To maximise inclusion of animal strains from the public archives, especially since we expected their identification to be problematic, we used the new Coloured Bloom Graph (CBG) software 35 .
Using CBG, we searched a snapshot of the Sequence Read Archive (to Dec 2016, N = 7 455,632) using our new set of reference kmers for Mykrobe predictor (see below). If CBG and SNP-IT agreed on the subspecies identity of the isolate, we included it in the validation set.
FASTA files (i.e. of the whole genome) were compared to the catalogue of phylogeny-determining SNPs to make predictions for PHE isolates, whereas for isolates downloaded from the nucleotide archive, only limited variant calling format (vcf) files were created using Snippy (i.e. only SNPs with respect to the reference genome were included to increase computational efficiency). To ensure genomic loci defining lineage 4 were included, a mutated reference genome was used to create these limited vcf files. SNPs in the query sample were compared to reference libraries of lineage specific SNPs for each clade. Query genomes were assigned to particular subspecies or lineage if at least 10% of lineage or subspecies specific SNPs were detected in the strain in question. All predictions were assessed against the ML phylogeny. For M. mungi , only one genome could be located in the public sequence libraries and so we could not validate this subspecies.

Clinical isolates
To assess the caseload across the different members of the Mtbc seen by the Public Health laboratory in Birmingham, UK, we applied the algorithm to 3,128 Mtbc genome sequences from consecutively obtained clinical isolates. H37Rv is routinely sequenced by the laboratory on WGS plates; these isolates were not removed and their identification served as an internal control.

Comparison to existing tools
Results were compared to those from existing software tools (KvarQ, Mykrobe Predictor and TB-Profiler). In order to allow our new data to be integrated with published SNP libraries 36 and for practical reasons when modifying existing tools, we created a minimal SNP dataset. We filtered our larger SNP catalogue for synonymous SNPs which occurred in coding regions (as annotated by SnpEff version 4.3 37 ) and selected one representative SNP for each subspecies, lineage and sub-lineage at random. The existing software packages were then modified to include reference SNPs (or kmers for Mykrobe predictor) for those subspecies, lineages or sub-lineages which they initially failed to identify.

Calibration
In total, 13,544 SNPs were identified as being predictive of taxonomic and phylogenetic groups of interest (median of 265 SNPs per group, inter quartile range 345).

Validation
All predictions made by SNP-IT across all of the subspecies, lineages and sub-lineages were consistent with the ML phylogeny (Table 1).
Within the validation set we included 18 genomes from lineage 4 for which no further sub-lineage could be assigned by SNP-IT and which we therefore defined by the presence of 9 conserved bases relative to the reference. Their position on a ML tree revealed that these were phylogenetically distinct from other currently named sub-lineages of lineage 4. We therefore  (7) were not further classifiable from lineage 4 by Coll nomenclature. We subsequently updated our SNP catalogue for lineage 4 to include these unnamed sub-lineages using the same methods as we described for the original training set. 368 samples (all belonging to lineage 4.9) were found to be H37Rv when we unblinded ourselves to their corresponding lab records.

Phylogenetic SNPs in drug resistance associated genes
Using a previously published list of drug resistance associated genes for M. tuberculosis 39 , we searched all animal subspecies for phylogenetic SNPs in drug associated genes (supplementary table 3). All animal subspecies contain phylogenetic SNPs in these genomic regions, but on the basis of our data we were unable to determine whether any of these mutations are linked to lineage specific resistance as we did not have the corresponding phenotypic drug susceptibility testing data.

Updating existing software
A recent review found that KvarQ and TB-Profiler had a 100% accuracy for lineage prediction, compared to 93.4% for Mykrobe predictor 40 . Given each of these systems uses the same principle of lineage defining SNPs to predict lineage, the differential performance is likely to stem from the SNP catalogues deployed. We therefore modified the SNP catalogues of KvarQ.
Mykrobe predictor and TB-Profiler using our minimum (single) SNP catalogue and re-ran the software on our collection of clinical isolates. After modifying the KvarQ, TB-Profiler and Mykrobe predictor databases to our minimal SNP catalogue, all systems identified all of the subspecies, lineages and sub-lineages with 100% accuracy. SNP-IT was unable to identify 10 samples because less than 10% of type-specific SNPs were present in these strains. This was because our pipeline made no call at the lineage informative sites due to the presence of a minor allele, most likely due to contamination or a mixture of two strains in the sample. TB-Profiler and Mykrobe predictor were however both able to identify two of these isolates as mixed Beijing strain/ M. orygis using our new minimal SNP dataset.

Discussion
SNP typing is a powerful method for discriminating between subspecies of the M. tuberculosis complex, which are often not discernible by conventional laboratory methods such as the Reverse Line Blot Methodology. The SNP databases of Coll, Stucki and Comas are currently used as the knowledge base for KvarQ, TB-Profiler and Mykrobe predictor 36,41,42 . None of these however provide adequate resolution for the animal lineages. In contrast, SNP-IT is able to assign lineage to all samples tested with 100% accuracy. Implementing this fine-resolution algorithm into a routine diagnostic work flow would be a major step towards understanding the epidemiology and pathogenicity of the less common members of the Mtbc . It would furthermore provide an important basis for studies on the development and the phylogeny of the complex.
The global burden of zoonotic tuberculosis, largely caused by Mycobacterium bovis , is thought to be both underestimated and increasing 43 . In the UK approximately 1% of tuberculosis cases are caused by M. bovis , mostly due to re-activation in older individuals from previous exposure to unpasteurised milk products or travel/migration, although human to human transmission clusters have been identified 44 . This number is likely to be higher in endemic areas where humans have closer contact to animals, however accurate assessment of prevalence is made difficult by a lack of diagnostic tools and surveillance in these areas 45  In order to demonstrate that there is no material difference between existing platforms other than the SNP databases which they use, we adjusted their reference databases and re-ran the programmes on a collection of clinical isolates. We here demonstrated that all three systems tested (KvarQ, Mykrobe predictor and TB-Profiler) are identical in performance when given the same SNP reference database. There are differences between the packages including the user interface, method applied (eg mapped vs. unmapped calling), programming architecture, ability to call mixed samples and speed which ultimately come down to user preference and end application. We demonstrate here however that the clinically meaningful differences highlighted in a recent review are easily ameliorated by improving the underlying dataset 40 .
In all of the 'animal' lineages there are phylogenetic SNPs in drug resistance associated genes.
Where these are not associated with drug resistance, they can be helpfully annotated as such by diagnostic algorithms and set aside for purposes of predicting susceptibility. An unavoidable weakness of any SNP-based approach is its vulnerability to null-calls due to minor alleles at lineage-informative positions, or a lack of coverage. SNP-IT uses the entire library of lineage defining SNPs such that this is not an issue unless it occurs at the majority of lineage informative positions.
In conclusion, in this study we present SNP-IT, which is the first SNP typing system to identify all subspecies, lineages and sub-lineages with 100% accuracy within the Mtbc . When the knowledge base behind SNP-IT is applied to three existing software packages, each performs equally well. By demonstrating the practical application of this knowledge base to a routinely collected clinical dataset, we reveal an unexpectedly high prevalence of M. orygis , providing an opportunity to further explore the clinical phenotype of this infection, and that of other animal lineages, in the future. As more healthcare systems begin to routinely use whole genome sequencing, there is an opportunity to accurately diagnose the causative subspecies of TB in all cases. This will allow previously unrecognised zoonosis and reverse zoonosis to be identified and control interventions to be implemented in the interests of 'One Health'.   identified as belonging to one of the named sub-lineages, 368/880 were H37Rv which is commonly re-sequenced by the lab. .