Prediction

This is the work flow of the gene cluster detection and scaffold pipeline.

Gene prediction

If only DNA data is provided, the genes are predicted using Prodigal. If the genes are already assigned, SeMPI can try to parse the genes and use them for further analysis.

Domain detection

The detection of relevant domains in the predicted proteins is performed using profile hidden Markov models (profile-HMMs) created with HMMER 3.0. The detected domains are shown in Table 1. Since the profile-HMMs are derived and modified from the MIBiG 2 database, the nomenclature is mainly based on the antiSMAHS definition as described in their documentation.

Table 1 HMM profiles used for domain detection. The F1-scores and thresholds are based on the detection performance of the profile-HMMs on the MIBiG 2.0 database.
Domain Full name F1-score HMM detection threshold (domT)
ACP Acyl carrier protein 0.96 13.9
AT Acyltransferase 1 47.4
A Adenylation domain 1 19.6
CAL Coenzyme A ligase 0.94 221.8
C Condensation domain 1 25.9
DH2 Dehydratase 0.98 33.2
DH Dehydratase 0.99 31.1
DHt Dehydratase 0.88 44.3
ER Enoylreductase 1 68.8
E Epimerization domain 1 60.4
KR Ketoreductase 1 20
KS Keto-synthase 1 72
PCP Peptide carrier protein 0.97 21.9
TD Reductive Thioesterase 1 42.1
TE Thioesterase 1 36.5
bACP β-branching acyl carrier protein 0.59 28.9
cMT C-Methyltransferase 1 104.7
nMT N-Methyltransferase 0.99 42.7
oMT O-Methyltransferase 0.99 70.6
tATd Trans-acyltransferase docking domain 0.98 50

Gene cluster curation

After the protein detection the genome is converted into a DataFrame pandas, which simplifies the subsequent operations. Additional, pandas allows to apply vectorized functions, which speeds up the prediction pipeline. A mysql dump of the DataFrames is provided with the final output.

The DataFrame is curated and modified in order to prepare the BGC module detection (see Example Fig. 8).

  1. Close genes (the threshold can be set in the options of the Upload form) which encode proteins on the same strand are combined into blocks. A detailed observation of the MIBiG annotated NRP and PK clusters showed, that for this condition the co-linearity principle applies for the very most cases.
  2. The domains are ordered based on the occurrence in the genome (id), but for module assignment it is more useful to order the domains based on the occurrence in the block, therefore an additional index is assigned based on the block order (Corr. id).
  3. In very rare cases the detected protein domains overlap. This can be the case especially for very short sequences (for example the ACP domain). If the domains overlap to more then 20% the domain with the lower bitscore (see HMMER) is removed.
  4. Sometimes the HMM algorithm detects two domains in succession instead of one domain, these domains are automatically joined to one domain.
prediction

Fig. 8 Curation steps: (a) Block generation, (b) Domain reordering, (c) Domain overlap removal, (d) Domain duplication joining.

Module detection

SeMPI v2 detects functional PKS and NRPS modules based on a in-depth analysis of known gene clusters stored in the MIBiG 2.0. The modules can be be observed either in the tabular representation (see Example Fig. 9) of the domains of a gene cluster or in the interactive cluster browser (see Example Fig. 10).

mods1

Fig. 9 Tabular domain representation of a gene cluster with modules highlighted in red.

mods2

Fig. 10 Cluster browser representation of a gene cluster with modules highlighted in red.

Assembly logic for disjointed gene clusters

If the domains are not organized in a single consecutive block an module ordering algorithm is applied. In most cases (<= 3 blocks) the assignment of starting modules to the beginning and terminal modules to the end of an gene cluster is sufficient for a correct module order. For gene clusters with more than 3 blocks the modules are ordered based on comparison of the gene cluster assembly with the set-up of more than 250 reference gene clusters where the module order is known.

Postsynthetic modifications

In order to detect postsynthetic modifications additional Pfam domains are detected in proximity of the gene cluster. The Pfam domains are correlated to specific postsynthetic modifications as shown in Table 2.

Table 2 Detected Pfam domains correlated to common postsynthetic modifications.
Postsynthetic modification Pfam ID Description Correlation
Glyco PF00908.16 dTDP-4-dehydrorhamnose 3,5-epimerase 0.67
Glyco PF01370.20 NAD dependent epimerase/dehydratase family 0.64
Glyco PF03559.13 NDP-hexose 2,3-dehydratase 0.61
Glyco PF00201.17 UDP-glucoronosyl and UDP-glucosyl transferase 0.59
Glyco PF03033.19 Glycosyltransferase family 28 N-terminal domain 0.56
Glyco PF01041.16 DegT/DnrJ/EryC1/StrS aminotransferase family 0.52
Glyco PF08421.10 Putative zinc binding domain 0.47
Glyco PF16363.4 GDP-mannose 4,6 dehydratase 0.44
Glyco PF04101.15 Glycosyltransferase family 28 C-terminal domain 0.28
Glyco PF01075.16 Glycosyltransferase family 9 (heptosyltransferase) 0.21
Glyco PF00728.21 Glycosyl hydrolase family 20, catalytic domain 0.18
Glyco PF02838.14 Glycosyl hydrolase family 20, domain 2 0.18
Glyco PF01915.21 Glycosyl hydrolase family 3 C-terminal domain 0.13
Glyco PF00933.20 Glycosyl hydrolase family 3 N terminal domain 0.13
Glyco PF14885.5 Hypothetical glycosyl hydrolase family 15 0.1
Cl PF04820.13 Tryptophan halogenase 0.66
Cl PF00999.20 Sodium/hydrogen exchanger family 0.51
Sphingo PF12680.6 SnoaL-like domain 0.61
Sphingo PF00890.23 FAD binding domain 0.54
SS PF07992.13 Pyridine nucleotide-disulphide oxidoreductase 0.46
NO2 PF01678.18 Diaminopimelate epimerase 0.67
6-Ring PF16197.4 Ketoacyl-synthetase C-terminal extension 0.53
6-Ring PF02801.21 Beta-ketoacyl synthase, C-terminal domain 0.52
6-Ring PF08990.10 Erythronolide synthase docking 0.51
6-Ring PF00743.18 Flavin-binding monooxygenase-like 0.46
6-Ring PF13377.5 Periplasmic binding protein-like domain 0.23
6-Ring PF13191.5 AAA ATPase domain 0.23
5-Ring PF12680.6 SnoaL-like domain 0.56
5-Ring PF00890.23 FAD binding domain 0.53
5-Ring PF01551.21 Peptidase family M23 0.46
5-Ring PF08990.10 Erythronolide synthase docking 0.45
5-Ring PF00486.27 Transcriptional regulatory protein, C terminal 0.26
5-Ring PF16197.4 Ketoacyl-synthetase C-terminal extension 0.26
5-Ring PF00109.25 Beta-ketoacyl synthase, N-terminal domain 0.25

Based on the detected domains the postsynthetic modifications are predicted using regression models. The predicted numbers of postsynthetic modifications are shown together with the predicted scaffold as shown in Fig. 11.

psm example

Fig. 11 Example of an PK scaffold prediction with indicated halogen (Cl) postsynthetic modification.

Scaffold generation

The scaffolds are generated by joining of each building block of a gene cluster. If the modules of a cluster cannot be joined multiple blocks are generated (see Example Fig. 12). The database screening queries all blocks in the target molecules (see Example Fig. 13). The example is taken from BGC0000317. The multiple blocks scaffoled scoring algorithm is further described in Combined score ranking.

mods1

Fig. 12 Two blocks predicted for one cluster.

mods2

Fig. 13 Both blocks are screened in each of the target compounds.