This tool provides a pipeline that annotates input genome sequences, clusters them into UniRef90/UniRef50 gene families, and clusters the remaining unknown coding sequences. The output is a ready-to-use PanPhlAn pangenome: it contains the contigs of all genomes in a multi-FASTA file, precomputed bowtie2 indexes, and a pangenome tsv file mapping gene locations on contigs.
- Prokka runs over the provided genomes to annotate them
- Using the UniRef annotator and the UniRef DIAMOND databases, sequences are assigned UniRef90 and UniRef50 IDs
- The remaining sequences (not mapped by the UniRef annotator) are clustered together at the same thresholds (90% and 50% similarity). They are assigned UniRef90_UNK and UniRef50_UNK (unknown) IDs
- Finally, the PanPhlAn pangenome is generated: the contigs of all genomes are concatenated, the tsv mapping file is generated, and the bowtie2 indexes are built.
The following Python packages are needed:
- BioPython
- bcbio-gff
- gffutils
The following external tools should be installed (with the PATH variable properly configured):
- Prokka (https://github.com/tseemann/prokka)
- MMseqs2 (https://github.com/soedinglab/MMseqs2)
- DIAMOND (https://github.com/bbuchfink/diamond)
- Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
On top of that, the UniRef DIAMOND databases should be downloaded via the download_databases.py script.
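Before launching the pipeline, it can help to confirm that the prerequisites are visible from your environment. The snippet below is a minimal sketch and not part of the tool: the executable names (prokka, mmseqs, diamond, bowtie2, bowtie2-build) and the import names (Bio, BCBio, gffutils) are assumptions based on how these packages are usually installed, and the function names are hypothetical.

```python
import importlib.util
import shutil

# Assumed executable names for Prokka, MMseqs2, DIAMOND and Bowtie2.
REQUIRED_TOOLS = ["prokka", "mmseqs", "diamond", "bowtie2", "bowtie2-build"]
# Assumed import names for BioPython, bcbio-gff and gffutils.
REQUIRED_MODULES = ["Bio", "BCBio", "gffutils"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the executables that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level Python modules that cannot be imported."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for label, missing in (("tools", missing_tools()), ("modules", missing_modules())):
        if missing:
            print(f"Missing {label}: {', '.join(missing)}")
        else:
            print(f"All required {label} were found.")
```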
python panphlan_exporter.py --input [input_genomes_folder] \
--output [output_pangenome_folder] \
--db_path [path_to_UniRef_DIAMOND_databases]
- The --input [input_genomes_folder] should contain one FASTA file per genome. The script assumes that the file name is the genome name
- The --output [output_pangenome_folder] will be created if it does not exist
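Since the genome name is taken from the file name, it may be worth checking the input folder for name clashes before a long run. The following is a hedged sketch only: the accepted extensions and the function name genome_names are assumptions, and the real script may derive genome names differently.

```python
from pathlib import Path

# Accepted FASTA extensions are an assumption; adjust them to your files.
FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}


def genome_names(input_folder):
    """Map each genome name (file name without extension) to its FASTA path."""
    genomes = {}
    for path in sorted(Path(input_folder).iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            if path.stem in genomes:
                # Two files such as g1.fa and g1.fna would share one genome name.
                raise ValueError(f"Duplicate genome name: {path.stem}")
            genomes[path.stem] = path
    return genomes
```

Calling genome_names on the input folder returns a dictionary keyed by genome name, and raises an error if two files would map to the same name.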
Additional parameters can be provided:
- -t or --tmp specifies another directory for temporary files. Default is the output folder
- -c or --clade_name specifies a prefix for PanPhlAn output files. The best choice is the full species name (e.g. Escherichia_coli). Default is panplhan_clade
- -n or --nprocs specifies the number of threads to use.
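When the pipeline is launched from another Python script, the command line can be assembled from the documented options. This is a minimal sketch: build_command is a hypothetical helper, and it assumes panphlan_exporter.py sits in the current working directory.

```python
import sys


def build_command(input_dir, output_dir, db_path, tmp=None, clade_name=None, nprocs=None):
    """Assemble the argument list for panphlan_exporter.py."""
    cmd = [
        sys.executable, "panphlan_exporter.py",
        "--input", str(input_dir),
        "--output", str(output_dir),
        "--db_path", str(db_path),
    ]
    # Optional parameters are added only when a value is given,
    # so the script's own defaults apply otherwise.
    if tmp is not None:
        cmd += ["--tmp", str(tmp)]
    if clade_name is not None:
        cmd += ["--clade_name", str(clade_name)]
    if nprocs is not None:
        cmd += ["--nprocs", str(nprocs)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), for example with build_command("genomes", "pangenome", "uniref_db", clade_name="Escherichia_coli", nprocs=8).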
N.B.: If the output folder is already a PanPhlAn pangenome folder (containing the 8 or 9 files of a PanPhlAn pangenome: 1 fna, 1 pangenome tsv, 6 index files and 1 optional annotation file), the pangenome generated by the pipeline will extend the existing one.
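To find out in advance whether a run would extend an existing pangenome, you can inspect the output folder. The check below is a rough heuristic written for illustration, not the test the pipeline itself performs: the function name is hypothetical, and the file extensions (.fna, .tsv, and .bt2 for the bowtie2 indexes) are assumptions.

```python
from pathlib import Path


def looks_like_pangenome(folder):
    """Heuristic: exactly 1 .fna, exactly 6 .bt2 index files, at least 1 .tsv."""
    folder = Path(folder)
    if not folder.is_dir():
        return False
    names = [p.name for p in folder.iterdir() if p.is_file()]
    n_fna = sum(name.endswith(".fna") for name in names)
    n_bt2 = sum(name.endswith(".bt2") for name in names)
    n_tsv = sum(name.endswith(".tsv") for name in names)
    # The optional annotation file is not checked, since its name is not known here.
    return n_fna == 1 and n_bt2 == 6 and n_tsv >= 1
```

If this returns True for your output folder, expect the new genomes to be added to the pangenome already stored there, so choose an empty folder when you want a fresh one.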