Genomic analysis of hypervirulent Klebsiella pneumoniae reveals potential genetic markers for differentiation from classical strains

Dataset characteristics

We analyzed 79 hvKp isolates, defined as isolates obtained from patient liver samples. These were collected from China (n=39), Singapore (n=26), USA (n=8), Brazil (n=2) and one sample each from Ecuador, Guadeloupe, South Korea and Viet Nam (Table 1). Among the 36 sequence types (STs) present in the 79 hvKp samples, ST23 was the most frequent (n=27), followed by ST86 (n=9) and ST258 (n=4). All other STs had two or fewer samples. The 79 hvKp isolates were compared to a large dataset of Kp isolates, which consisted of two groups: (i) 520 Kp assemblies with locations and collection dates similar to those of the liver isolates, representing the broader genetic landscape of the bacterium; (ii) 126 Kp isolates from three hospitals in Thailand11, used to assess whether our analytical approach was robust, in particular to overfitting when downsampling the data. Overall, the resulting comparison dataset (n=646) contained samples from 302 different STs, of which ST15 (n=29), ST147 (n=29), ST11 (n=25) and ST23 (n=17) were the most common.

Table 1 Characteristics of study samples.

Association analysis of liver invasive phenotype

We identified single nucleotide variants (SNVs) in the core genome (5.4 Mbp; 318,458 SNVs, with a minor allele frequency (MAF) threshold of 3 Kp isolates). We used a genome-wide association study (GWAS) strategy to identify SNVs associated with the liver-invasive phenotype, adjusting for population structure (Fig. 1A). None of the SNV associations reached our strict level of statistical significance (P < 10^-10). A similar genome-wide analysis was performed on the presence or absence of accessory loci (n=15,852), determined from a robust assembly of contigs. While the frequencies of accessory genes in representative and liver isolates are largely correlated (rho = 0.79), the overrepresentation of ST23 (34%) among liver isolates leads to nonlinearity (Fig. 2A), and the correlation improves when ST23 liver isolates are removed (Fig. 2B) (rho = 0.89). Clustering of isolates based on the accessory genome demonstrates that accessory gene content is related to ST and not geography, with ST23 forming a tight cluster (Fig. S1). We performed the GWAS taking this clustering into account and found 29 putative genes associated with a higher risk of the liver-invasive phenotype, including the known hypervirulence loci iro (odds ratio (OR): 29.8) and iuc (OR: 14.1), three other genes related to metal transport, a c-type lysozyme inhibitor (OR: 14.5) and 8 loci that could not be annotated (P < 10^-10; Fig. 1B; Table 2). These accessory loci are of lower frequency in representative samples than in liver isolates, independent of the inclusion of ST23 (Fig. 2). Of the 79 liver isolates, 15 (19.0%) had none of these 29 putative accessory genes associated with the liver-invasive phenotype.
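To make the odds ratios reported above concrete, the following is a minimal sketch of how an unadjusted odds ratio and a 95% Wald confidence interval could be computed from a 2x2 table of gene presence in liver versus representative isolates. It is not the GWAS model used in the study, which additionally accounts for population structure; the function name `odds_ratio` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def odds_ratio(case_with, case_without, ctrl_with, ctrl_without):
    """Unadjusted odds ratio and 95% Wald CI for a 2x2 gene presence table.

    Adds 0.5 to every cell (Haldane-Anscombe correction) only when at
    least one cell is zero, so that the ratio stays finite.
    """
    cells = [case_with, case_without, ctrl_with, ctrl_without]
    if any(c == 0 for c in cells):
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) for the Wald interval
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, lo, hi


# Hypothetical counts (NOT taken from the study): a gene present in
# 40 of 79 liver isolates and in 30 of 646 representative isolates.
or_, lo, hi = odds_ratio(40, 39, 30, 616)
print(f"OR = {or_:.1f} (95% CI {lo:.1f}-{hi:.1f})")
```

An unadjusted estimate of this kind can be inflated when one lineage dominates the cases, as ST23 does here, which is why the association analysis corrects for population structure.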

Figure 1

Association analysis of liver versus non-liver isolates for individual genome-wide SNVs (n=318,458) in the core genome (A) and accessory genes (n=15,852) (B), taking into account the structure of the population. Each point represents the result for a single SNV or gene, and P < 10^-10 is the significance threshold.

Figure 2

Frequency of accessory genes in all liver isolates (A) (n=79) and non-ST23 liver isolates (B) (n=52) compared to the representative dataset (n=646). The iro and iuc outliers are clearly visible. Each dot is a gene and the legend is consistent with Figure 1.

Table 2 Relative abundance of accessory genes associated with the liver-invasive phenotype identified in Figure 1B.

Association between the identified biomarkers and the rest of the accessory genome

After identifying 29 accessory genes, including iro and iuc, with potentially strong associations with the hvKp phenotype, we were interested in how they relate to each other, i.e. their co-occurrence. As summarized in a recent review12, plasmids such as pLVPK, pK2044 and pSGH10 are known carriers of genes associated with hypervirulence. Since the identified biomarkers do not occur at the same frequency, we hypothesized that they might be located on different parts of the hypervirulence plasmids. To test this hypothesis, we performed a cluster analysis of all accessory genes using UMAP, a dimensionality reduction method similar in purpose to principal component analysis (see “Materials and methods”) (Fig. 3). The 29 associated loci belonged to a cluster of 121 genes (92 additional genes) (Fig. 3A; Data S3). Focusing on this cluster, the iro and iuc loci are part of different gene sub-clusters (Fig. 3B), consistent with these loci occurring independently of each other and potentially being linked to different hypervirulence plasmids (Fig. S2).
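The idea of asking which genes tend to be found in the same isolates can be illustrated with a much simpler measure than the UMAP projection used in the study. The sketch below computes pairwise Jaccard similarity between genes from a presence/absence table; it is a stand-in for illustration only, and the gene "geneX", the isolate identifiers and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between the sets of isolates carrying two genes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cooccurrence(presence):
    """Pairwise Jaccard similarity for every pair of genes.

    `presence` maps a gene name to the set of isolates in which
    the gene is present.
    """
    return {
        (g1, g2): jaccard(presence[g1], presence[g2])
        for g1, g2 in combinations(sorted(presence), 2)
    }


# Toy presence/absence data (hypothetical, not from the study)
presence = {
    "iro": {"s1", "s2", "s3"},
    "iuc": {"s3", "s4", "s5", "s6"},
    "geneX": {"s1", "s2", "s3"},
}
for pair, score in cooccurrence(presence).items():
    print(pair, round(score, 2))
```

Genes carried on the same mobile element would be expected to score close to 1 with each other, whereas genes that occur largely independently, as reported for iro and iuc, would score low.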

Figure 3

Cluster analysis of accessory genes. (A) Two-dimensional UMAP projection of the gene presence/absence matrix; (B) Structure of the cluster of genes in (A) that contains iro and iuc. Liver-invasive phenotype genes (Table 2) are visible in (A) and in more detail in (B), for which the dimensionality reduction algorithm was rerun on the subset of genes from the cluster in (A). The axes are dimensionless. Each dot is an accessory gene.

Association between liver invasive phenotype and plasmid replicons

We assessed the prevalence of the identified plasmids. Using PlasmidFinder nomenclature, pLVPK, pK2044 and pSGH10 carry IncHI1B(pNDM-MAR) replicons. In pLVPK and pK2044, the replicon sequences are identical. However, based on visual examination of the sequences, the first 97 nt of the pSGH10 replicon are different, while the remaining 472 nt are identical to pLVPK and pK2044. In our dataset, 100 isolates had a pLVPK/pK2044-like replicon sequence (20/100; 20.0% liver isolates), while 39 isolates had a pSGH10-like replicon sequence (24/39; 61.5% liver isolates) (Table 3). We observed that pSGH10-like replicons occurred almost exclusively in ST23 isolates (37/39), while the pLVPK/pK2044-like replicon was much more widely distributed, with ST86 (11/100) being the most frequent. Another variant of IncHI1B(pNDM-MAR) was present in a single liver isolate from South Korea and differed from the above variants in the first 120 nt. Overall, the most frequent replicon family among liver isolates was IncHI1B(pNDM-MAR) (45/79), followed by IncFIB(K) (16/79).
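The comparison of replicon variants described above was done by visual examination. As a minimal sketch of how the same question could be asked programmatically, the function below reports where the last mismatch falls between two aligned, equal-length sequences, which gives the length of the divergent leading region. The function name and the toy sequences are hypothetical; real replicon sequences would first need to be aligned.

```python
def divergent_prefix_length(seq_a, seq_b):
    """Return the 1-based position of the last mismatch between two
    aligned, equal-length sequences (0 if they are identical).

    Everything after that position is identical in both sequences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    last = 0
    for i, (x, y) in enumerate(zip(seq_a, seq_b), start=1):
        if x != y:
            last = i
    return last


# Toy sequences (hypothetical): the first four bases differ in part,
# the remaining six are identical.
reference = "ACGTACGTAC"
variant = "TTGAACGTAC"
n_div = divergent_prefix_length(reference, variant)
print(f"divergent prefix: {n_div} nt, identical suffix: {len(reference) - n_div} nt")
```

Note that individual positions inside the divergent region may still match by chance, so the function reports the extent of the region rather than the number of differing bases.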

Table 3 Prevalence of IncHI1B(pNDM-MAR) plasmid replicons.

Liver isolates without identified biomarkers

Fifteen (19.0%) of the 79 liver Kp isolates lacked all 29 accessory genes associated with the liver-invasive phenotype; these included four ST258, two ST1165, and nine other sequence types. Assuming that the liver-invasive phenotype was not misclassified for these 15 samples, we investigated whether there were other genes in the accessory genome that differentiated this group from the representative set. By examining the differences in frequencies between the 15 isolates and the representative set, we found no plausible biomarkers (Figure S3A). We also repeated the core genome GWAS for these 15 samples, but again no SNV reached significance (all P > 10^-10). It is possible that a combination of accessory genes could predict the phenotype, and we used nine different machine learning approaches to assess whether such a complex gene relationship exists. The imbalance between the 15 hvKp and the 646 representative isolates can lead to poor classifier performance in machine learning models. We therefore analysed 100 balanced datasets, each comprising the 15 liver isolates and 15 randomly selected representative isolates. The resulting predictive accuracy of all approaches was no better than the 50% expected from random guessing (Figure S3B), suggesting that there are no strong predictors for these 19% of liver isolates in our dataset.
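The balanced resampling step described above can be sketched as follows. This is an illustration of the sampling scheme only, assuming that controls are drawn without replacement within each dataset; it does not reproduce the nine classifiers, and the function name, the seed and the isolate identifiers are hypothetical placeholders.

```python
import random


def balanced_datasets(cases, controls, n_datasets=100, seed=0):
    """Yield (cases, sampled_controls) pairs with equal group sizes.

    Every dataset keeps all cases and draws len(cases) controls
    without replacement from the full control set.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    for _ in range(n_datasets):
        yield list(cases), rng.sample(controls, len(cases))


# Placeholder isolate identifiers (hypothetical)
cases = [f"liver_{i}" for i in range(15)]
controls = [f"rep_{i}" for i in range(646)]

datasets = list(balanced_datasets(cases, controls))
print(len(datasets), len(datasets[0][0]), len(datasets[0][1]))
```

With two equally sized classes, a classifier that has learned nothing is expected to be right about half of the time, which is the baseline against which the reported accuracies are judged.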
