ecoli report of mobile genetic elements (MGEs)

About

Mobile genetic elements (MGEs) are genetic material that can move within a genome or be transferred from one species or replicon to another. Genes acquired this way can increase fitness by providing new or additional functions. MGEs can also decrease fitness by introducing disease-causing alleles or mutations. Two common classes are prophages and ICEs. Prophages are bacteriophages that have integrated into the bacterial chromosome or a plasmid; they are the latent form of a phage. ICEs (integrative and conjugative elements) are integrative mobile genetic elements that encode a conjugation machinery. They can confer selective advantages and can also encode resistance determinants and virulence factors.

In this context, this pipeline can automatically annotate some mobile genetic elements using publicly available resources such as:

  1. PHAST database;
    • PHAST (PHAge Search Tool) is a web server designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids.
    • Although it does not have a command line interface, it has a protein database of prophage genes that was added to this pipeline and is scanned via BLASTp.
  2. Phigaro;
    • Phigaro is a standalone command-line application that is able to detect prophage regions taking raw genome and metagenome assemblies as an input.
    • It also produces dynamic annotated “prophage genome maps” and marks possible transposon insertion spots inside prophages.
  3. PhiSpy;
    • PhiSpy identifies prophages in bacterial (and probably archaeal) genomes. Given an annotated genome, it uses several approaches to identify the most likely prophage regions.
  4. ICEberg database;
    • ICEberg 2.0 is an updated database of bacterial integrative and conjugative elements.
  5. Plasmidfinder;
    • Plasmidfinder is a tool for the in silico detection of plasmids.
  6. Platon;
    • Platon detects plasmid contigs within bacterial draft genomes from WGS short-read assemblies.
    • To do so, Platon analyzes the natural distribution biases of certain protein coding genes between chromosomes and plasmids.
  7. MOB Suite;
    • Software tools for clustering, reconstruction and typing of plasmids from draft assemblies.
    • In the pipeline, only the typer tool is used.
  8. IslandPath;
    • IslandPath-DIMOB is a standalone software to predict genomic islands in bacterial and archaeal genomes based on the presence of dinucleotide biases and mobility genes.
  9. digIS;
    • digIS is a command-line tool for detection of insertion sequences (IS) in prokaryotic genomes.
  10. Integron Finder.
    • Integron Finder is a command line tool to identify integrons in DNA sequences.

Prediction thresholds

All the predictions were filtered with user-defined thresholds for minimum identity and coverage:

  • Min. Identity (%): > 85
  • Min. Coverage (%): > 85

PHAST is a protein database scanned via BLASTp. ICEberg is a protein and nucleotide database that contains the full-length sequences of known ICEs as well as the sequences of many proteins commonly found inside these ICEs. Full-length ICEs are aligned to the genome via BLASTn, while the protein sequences are aligned to the predicted genes via BLASTp. Plasmidfinder is a nucleotide database scanned via BLASTn. The other tools use their own metrics.
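As a rough illustration of how identity and coverage thresholds like these could be applied to BLAST hits, the sketch below filters tabular output. It assumes the search was run with `-outfmt "6 qseqid sseqid pident length slen"` and approximates coverage as alignment length over subject length; the function name, the column layout and this coverage definition are assumptions for illustration and may differ from what the pipeline actually does.

```python
import csv
import io


def filter_hits(tsv_text, min_identity=85.0, min_coverage=85.0):
    """Keep hits whose identity and approximate subject coverage exceed the thresholds."""
    kept = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if not row:
            continue  # skip empty lines
        qseqid, sseqid, pident, length, slen = row[:5]
        identity = float(pident)
        # Assumed definition: alignment length relative to the database sequence length.
        coverage = 100.0 * int(length) / int(slen)
        if identity > min_identity and coverage > min_coverage:
            kept.append((qseqid, sseqid, identity, round(coverage, 1)))
    return kept


# Hypothetical hits: only the first one passes both thresholds.
example = (
    "gene_1\tphage_A\t97.5\t190\t200\n"
    "gene_2\tphage_B\t99.0\t50\t200\n"
    "gene_3\tphage_C\t70.0\t200\t200\n"
)
hits = filter_hits(example)
```

Because the alignment length includes gaps, this coverage value is only an approximation and can slightly exceed 100% for gapped alignments.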

Genomic Islands prediction

Genomic islands (GIs) were predicted with IslandPath. The predicted genomic islands are integrated into the JBrowse genome viewer so that users can interactively interrogate the results and check the genes found inside these islands. The resulting genome browser is provided in the jbrowse directory inside the main query output directory. This genome browser can be opened with the http-server command or the JBrowse Desktop software.

Additionally, these genomic islands were parsed in a very generic manner in order to provide a simple visualization of the annotation in these regions. The plots were rendered with the python package gff-toolbox and are available in the directory genomic_islands/plots under the main query output directory. An example of these plots is shown in Figure 1.

Figure 1: Example visualization of genomic island regions rendered with the gff-toolbox package.

As discussed, these images were rendered in a very generic manner just to show some visualization possibilities to the user. If desired, users can check the gff-toolbox package to produce more customized plots.

Plasmid detection

Plasmidfinder

Plasmidfinder is a tool for the in silico detection of plasmids. Its results are summarized in Table 1.

  • The complete results can be found in the directory plasmids/plasmidfinder under the main output directory.
Table 1: In silico detection of plasmids with Plasmidfinder

Platon

Platon detects plasmid contigs within bacterial draft genomes from WGS short-read assemblies. To do so, Platon analyzes the natural distribution biases of certain protein coding genes between chromosomes and plasmids. This analysis is complemented by comprehensive contig characterizations upon which several heuristics are applied. Its results are summarized in Table 2.

  • The complete results can be found in the directory plasmids/platon under the main output directory.
Table 2: In silico detection of plasmids with Platon

MOB suite (typer)

MOB-typer provides in silico predictions of the replicon family, relaxase type, mate-pair formation type and predicted transferability of the plasmid. Using a combination of biomarkers and MOB-cluster codes, it also provides an observed host range of the plasmid based on its replicon, relaxase and cluster assignment. This is combined with information mined from the literature to predict the taxonomic rank at which the plasmid is likely to be stably maintained. It does not provide source attribution predictions. Its results are summarized in Table 3.

  • The complete results can be found in the directory plasmids/mob_suite under the main output directory.
Table 3: In silico typing of plasmids with MOB suite

Prophage detection

All the prophage sequences and genes are available in the genome browser provided. It is worth taking note of the prophage genomic regions so that they can be explored more easily in the browser. The genome browser was created automatically (stored in a directory called jbrowse) and can be visualized with JBrowse Desktop or http-server.

Phigaro

Phigaro is a standalone command-line application that is able to detect prophage regions taking raw genome and metagenome assemblies as an input. It also produces dynamic annotated “prophage genome maps” and marks possible transposon insertion spots inside prophages. Its results can be nicely visualized in its own html report file stored in its output directory. The genomic regions predicted as putative prophage sequences are also summarized in Table 4.

  • Check it out at:
    • Dir: prophages/phigaro in the main output directory
    • HTML: _ANNOTATION/prophages/phigaro/ecoli_phigaro.html

Table 4: Putative prophage sequences annotated with phigaro software

PhiSpy

PhiSpy is a standalone tool that identifies prophages in bacterial (and probably archaeal) genomes. Given an annotated genome, it uses several approaches to identify the most likely prophage regions. The genomic regions predicted as putative prophage sequences are also summarized in Table 5.

  • Check the results at prophages/phispy in the main output directory

Table 5: Putative prophage sequences annotated with phispy software

PHAST database

All prophage genes from the PHAST database that had good alignments to the genes of the query genome are summarized in Table 6. The protein sequences of these genes were aligned against the gene sequences predicted by Prokka via BLASTp. They are all available in the genome browser provided. A good way to interrogate this annotation is to visualize the putative prophage regions predicted by Phigaro and PhiSpy and compare them with the prophage gene annotation obtained from the PHAST database.
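Outside the genome browser, the same comparison can be approximated with a simple coordinate overlap check. The sketch below is a minimal example under stated assumptions: the function name, the tuple layouts and the example coordinates are hypothetical, and the region and gene coordinates would have to be extracted by the user from the Phigaro/PhiSpy and PHAST result tables.

```python
def genes_in_regions(regions, genes):
    """Map each predicted prophage region to the hit genes that overlap it.

    regions: list of (contig, start, end) tuples
    genes:   list of (gene_id, contig, start, end) tuples
    """
    result = {}
    for contig, r_start, r_end in regions:
        overlapping = [
            gene_id
            for gene_id, g_contig, g_start, g_end in genes
            # Same contig, and the two closed intervals share at least one base.
            if g_contig == contig and g_start <= r_end and g_end >= r_start
        ]
        result[(contig, r_start, r_end)] = overlapping
    return result


# Hypothetical coordinates for illustration only.
regions = [("contig_1", 1000, 5000)]
genes = [
    ("gene_A", "contig_1", 1200, 1800),   # fully inside the region
    ("gene_B", "contig_1", 4900, 5300),   # partially overlapping
    ("gene_C", "contig_1", 7000, 7600),   # same contig, outside
    ("gene_D", "contig_2", 1200, 1800),   # different contig
]
overlaps = genes_in_regions(regions, genes)
```

Regions that contain several PHAST gene hits are better supported than regions predicted by a single tool with no matching genes.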

Unfortunately, the PHASTER database has no searchable interface to visualize its prophages. Therefore, this table has no links to external sources.

Table 6: Prophage genes annotated using PHAST database via BLASTp

ICEs detection

ICEberg database

Analysis of full-length ICEs

Full-length ICEs are available in the ICEberg database as nucleotide FASTA files, while the proteins found inside these ICEs are available as protein FASTA files. Since the ICEfinder script is not licensed for incorporation into the pipeline, the pipeline searches for the full-length ICEs instead. However, complete ICEs are rarely recovered in full in new genomes, so they are scanned without coverage or identity thresholds. The filtering and selection of these alignments is up to you. A total of 35 alignments were found in the query genome; see Table 7.

Users are advised to also use the ICEfinder tool to predict the putative genomic position of known ICEs, since we are not allowed to include this step in this pipeline.


Table 7: Alignment of full-length ICEs to the query genome via BLASTn

Analysis of ICE’s proteins

All query genes predicted by Prokka that have a match in the ICEberg database are shown in Table 8. The table summarizes each ICE id together with all of its genes that were found in the query genome. All of them are linked to the database for further investigation.

Take note: The fact that the genome possesses some proteins from ICEs does not necessarily mean that the ICE is present in the genome. Please check how many proteins the ICE of origin possesses in the ICEberg database list of ICEs, and then make inferences based on the alignments you see.
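One way to make this check more systematic is to compute, for each ICE, the fraction of its proteins that were matched in the query genome. The sketch below is only an illustration: the function name, the ICE and protein identifiers and the totals per ICE are hypothetical, and the real totals would have to be looked up in the ICEberg database.

```python
from collections import defaultdict


def ice_protein_fraction(hits, ice_sizes):
    """Return the fraction of each ICE's proteins that were matched in the query genome.

    hits:      iterable of (ice_id, ice_protein_id) pairs taken from the BLASTp results
    ice_sizes: dict mapping ice_id to the total number of proteins in that ICE
    """
    found = defaultdict(set)
    for ice_id, protein_id in hits:
        found[ice_id].add(protein_id)  # a set, so repeated hits count once
    return {
        ice_id: len(proteins) / ice_sizes[ice_id]
        for ice_id, proteins in found.items()
        if ice_id in ice_sizes  # skip ICEs whose total size is unknown
    }


# Hypothetical identifiers and sizes for illustration only.
hits = [("ICE_X", "p1"), ("ICE_X", "p2"), ("ICE_X", "p2"), ("ICE_Y", "p9")]
ice_sizes = {"ICE_X": 4, "ICE_Y": 50}
fractions = ice_protein_fraction(hits, ice_sizes)
```

An ICE for which only a small fraction of proteins is matched (ICE_Y in this example) is weak evidence that the element itself is present.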

Users are advised to also use the ICEfinder tool to predict the putative genomic position of known ICEs, since we are not allowed to include this step in this pipeline.


Table 8: ICE genes annotated from ICEberg database via BLASTp

Figure 2: The number of genes from known ICEs (from ICEberg, https://bioinfo-mml.sjtu.edu.cn/ICEberg2/index.php) found in the query genome

IS detection

Insertion sequences were predicted with digIS. The digIS search pipeline operates in the following steps:

  1. The whole input nucleic acid sequence is translated into amino acid sequences (all six frames).
  2. The translated sequences are searched using manually curated pHMMs.
  3. The seeds are filtered by domain e-value, and those that overlap or follow each other within a certain distance are merged.
  4. The seeds are extended according to sequence similarity with known IS elements in the ISFinder database.
  5. Extended seeds are filtered by noise cutoff score and length, and duplicated hits, corresponding to the same IS element, are removed.
  6. Remaining hits are classified based on sequence similarity and GenBank annotation (if available) to help assess their quality.
  7. Finally, the classified outputs are reported in the CSV and GFF3 formats.
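To make the merging in step 3 more concrete, the sketch below joins seed intervals that overlap or follow each other within a given distance. It is a generic interval-merging example, not digIS code: the function name, the `max_gap` value and the seed coordinates are hypothetical, and the distance digIS actually uses is not stated here.

```python
def merge_seeds(seeds, max_gap=0):
    """Merge (start, end) intervals that overlap or lie within max_gap of each other."""
    merged = []
    for start, end in sorted(seeds):
        # Distance from the end of the last merged interval to this seed's start;
        # zero or negative means the two intervals touch or overlap.
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical seed coordinates on a single strand of one contig.
seeds = [(100, 250), (240, 400), (450, 600), (2000, 2300)]
merged = merge_seeds(seeds, max_gap=100)
```

In this example the first three seeds collapse into one candidate region, while the distant fourth seed stays separate.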

The program was executed with the GenBank annotation.


Table 9: Insertion sequences predicted by digIS in GFF format.

Integron detection

No integrons were predicted with Integron Finder. This might have happened either because your genome really does not have integron sequences or because of misassemblies. You can always try to run the online version of the tool: https://integronfinder.readthedocs.io/en/latest/user_guide/webserver.html