This pipeline identifies circular plasmids in bacterial genome assemblies by aligning long sequencing reads to putative plasmids. When overlapping long reads confirm that a plasmid is circular, resistance genes are identified and additional parameters are calculated.
The pipeline includes the following steps:
- Gene prediction with Glimmer3
- Identification of antibiotic resistance genes with RGI against the CARD database
- Long read alignment against assembly
- Coverage analysis with Mosdepth
- GC Content and GC Skew
- Identification of reads that overlap the gap in the plasmid, indicating circular reads
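The GC content and GC skew step can be illustrated with a minimal sketch. It assumes non-overlapping windows and the common definition of GC skew, (G - C) / (G + C). The pipeline's own implementation may differ, and the function name `gc_windows` is hypothetical.

```python
def gc_windows(seq, window=50):
    """Yield (start, gc_content, gc_skew) for consecutive windows of seq.

    Assumption: windows do not overlap, and the last window may be shorter.
    """
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Fraction of G and C bases in this window.
        gc_content = (g + c) / len(chunk)
        # Skew is undefined when the window has no G or C; report 0.0 then.
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        yield start, gc_content, gc_skew
```

Calling `list(gc_windows(sequence, window=50))` on a contig sequence gives one tuple per window. A sign change in the skew along a circular replicon is often used as a hint for the replication origin or terminus.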
Requirements:
- Linux or Mac OS
- Java 8.x
- Docker or Singularity container application or Conda package manager
Installation:
- Install Nextflow
curl -s https://get.nextflow.io | bash
This creates the nextflow executable in the current directory.
- Download pipeline
You can either get the latest version by cloning this repository
git clone https://github.com/imgag/plasmIDent
or download one of the releases.
- Download dependencies
All the dependencies for this pipeline can be downloaded in a docker container.
docker pull caspargross/plasmident
Alternatively, the dependencies can be provided through Singularity or Conda.
The pipeline requires an input file with a sample id (string) and paths to the assembly file in .fasta format and the long reads in .fastq or .fastq.gz format. The paths can be either absolute or relative to the launch directory. In the default configuration (with Docker), symbolic links cannot be followed.
The file must be tab-separated with three columns:
sample_id assembly_fasta longread_fq
The file must not have a header line; it must start directly with the data. Here is an example file:
myid1 /path/to/assembly1.fasta /path/to/reads1.fastq.gz
myid2 /path/to/assembly2.fasta /path/to/reads2.fastq.gz
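Before launching a run, the input file can be checked against the rules above. The following is a minimal sketch, not part of the pipeline: the function name `read_sample_sheet` is hypothetical, and it only checks the column count and the read file extension stated above.

```python
import csv


def read_sample_sheet(path):
    """Return a list of (sample_id, assembly_fasta, longread_fq) tuples."""
    samples = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip completely empty lines
            if len(row) != 3:
                raise ValueError(
                    f"line {line_no}: expected 3 tab-separated columns, "
                    f"got {len(row)}"
                )
            sample_id, assembly, reads = (field.strip() for field in row)
            # Long reads must be .fastq or .fastq.gz according to the text.
            if not reads.endswith((".fastq", ".fastq.gz")):
                raise ValueError(f"line {line_no}: unexpected read file {reads!r}")
            samples.append((sample_id, assembly, reads))
    return samples
```

A header line such as `sample_id assembly_fasta longread_fq` would fail the extension check, which matches the requirement that the file starts directly with the data.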
The pipeline is started with the following command:
nextflow run plasmident --input read_locations.tsv
There are other run profiles for specific environments.
Run parameters:
- --outDir: Path of output folder
- --seqPadding: Number of bases added at contig edges to improve long read alignment [Default: 1000]
- --covWindow: Moving window size for coverage and GC content calculation [Default: 50]
- --max_cpu: Number of threads used per process [Default: 4]
- --max_memory: Maximum amount of memory available
- --targetCov: Large read files are subsampled to this target coverage to speed up the process [Default: 50]
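To make the effect of the target coverage concrete, here is a small sketch of how a subsampling fraction could be derived. It assumes coverage is estimated as total read bases divided by assembly length; how the pipeline actually subsamples is not described here, and the function name `subsample_fraction` is hypothetical.

```python
def subsample_fraction(total_read_bases, assembly_length, target_cov=50):
    """Return the fraction of reads to keep to reach roughly target_cov."""
    if assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    # Assumed coverage estimate: sequenced bases per assembled base.
    current_cov = total_read_bases / assembly_length
    if current_cov <= target_cov:
        return 1.0  # already at or below the target, keep every read
    return target_cov / current_cov
```

For example, 1 Gb of reads on a 5 Mb assembly corresponds to about 200x coverage, so roughly a quarter of the reads would be kept for a target of 50.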
The pipeline creates the following output folders:
- alignment: Contains the long read alignment (full genome)
- coverage: Long read coverage for the whole input genome (compressed BED file)
- gc: Windowed GC content (full genome)
- genes: Predicted gene locations
- plasmids: Nucleotide sequences for all confirmed plasmids in separate FASTA files
- resistances: GFF file with locations of identified resistance genes
Additional file:
- sampleID_summary.csv: Tabular text file with contig lengths, plasmid status, and identified antibiotic resistance genes.