sylph-utils: utility scripts and taxonomy for sylph
This repository contains scripts for incorporating taxonomic information into the output of sylph.
Warning
sylph-utils has been replaced by sylph-tax, which offers a cleaner interface. sylph-utils still works as of sylph v0.8 but will no longer be updated.
The following databases are currently supported (with pre-built sylph databases available here) and can be found in the prokaryote, eukaryote, and virus subfolders within this repository.
- GTDB-R220 (April 2024): prokaryote/gtdb_r220_metadata.tsv.gz
- GTDB-R214 (April 2023): prokaryote/gtdb_r214_metadata.tsv.gz
- OceanDNA: prokaryote/ocean_dna_metadata.tsv.gz
- Soil MAGs (SMAG) from Ma et al.: prokaryote/smag_metadata.tsv.gz
- RefSeq fungi representative genomes: eukaryote/fungi_refseq_2024-07-25_metadata.tsv.gz
- TARA eukaryotic SMAGs: eukaryote/tara_SMAGs_metadata.tsv.gz
- IMG/VR 4.1 high-confidence viral OTU genomes: virus/IMGVR_4.1_metadata.tsv.gz
Important
(2024-11-05): the fungi directory has been renamed to eukaryote. TARA eukaryote taxonomy is now available. See CHANGELOG.md for details.
- Python3
- Pandas
Run pip install pandas if pandas is not installed.
python sylph_to_taxprof.py -m database1_metadata.tsv.gz database2_metadata.tsv.gz -s sylph_output.tsv -o prefix_or_folder/
- -m: taxonomy metadata file(s). Multiple taxonomy metadata files can be given (they will be concatenated). Metadata files are in tsv.gz format and are present in this repository (e.g. prokaryote/gtdb_r220_metadata.tsv.gz).
- -s: the output from sylph. The databases used to run sylph must be the same as those given to the -m option.
- -o: prefix prepended to all of the output files; one output file is written for each sample in the sylph output.
- --annotate-virus-hosts (new in v0.2): annotates found viral genomes with host information metadata (currently only available for IMG/VR 4.1).
- Output suffix is .sylphmpa.
Use the metadata file corresponding to the database used. E.g. if you use the GTDB-R220 database for sylph, you must use the gtdb_r220_metadata.tsv.gz file.
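When sylph was run against more than one database, each matching metadata file has to be passed to -m. The following is a minimal sketch of assembling that command from Python, using only the flags listed above. The helper name build_taxprof_command and the output folder taxprof_out/ are hypothetical, and the example assumes sylph was run with both the GTDB-R220 and IMG/VR 4.1 databases.

```python
import sys


def build_taxprof_command(metadata_files, sylph_output, out_prefix,
                          annotate_virus_hosts=False,
                          script="sylph_to_taxprof.py"):
    """Assemble the argument list for sylph_to_taxprof.py (hypothetical helper)."""
    if not metadata_files:
        raise ValueError("at least one metadata file is required for -m")
    # One -m flag followed by every metadata file, then -s and -o.
    cmd = [sys.executable, script, "-m", *metadata_files,
           "-s", sylph_output, "-o", out_prefix]
    if annotate_virus_hosts:
        cmd.append("--annotate-virus-hosts")
    return cmd


# Assumption: sylph was run with both the GTDB-R220 and IMG/VR 4.1 databases.
cmd = build_taxprof_command(
    ["prokaryote/gtdb_r220_metadata.tsv.gz", "virus/IMGVR_4.1_metadata.tsv.gz"],
    "sylph_output.tsv",
    "taxprof_out/",  # hypothetical output folder
    annotate_virus_hosts=True,
)
print(" ".join(cmd[1:]))

# To actually run it (the script and input files must exist):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.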
See here for more information on:
- taxonomy metadata files definitions
- the output format
- how to create taxonomy metadata for customized genome databases
Tip
In Python/pandas, you can read the output with pd.read_csv('output.sylphmpa', sep='\t', comment='#').
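As a rough illustration of the tip above, the sketch below reads a small in-memory profile and keeps only phylum-level rows. The file content is made up for the example: the column layout (clade_name, relative_abundance, sequence_abundance), the leading "#" comment line, and the numeric values are assumptions, so check a real .sylphmpa file before relying on them.

```python
import io

import pandas as pd

# Hypothetical .sylphmpa content; column names and values are assumptions.
example = (
    "#SampleID\tsample2.fastq.gz\n"
    "clade_name\trelative_abundance\tsequence_abundance\n"
    "d__Archaea\t1.1\t0.9\n"
    "d__Archaea|p__Methanobacteriota\t0.0965\t0.08\n"
)

# Lines starting with '#' are skipped, so the header is the first remaining line.
profile = pd.read_csv(io.StringIO(example), sep="\t", comment="#")

# The deepest rank of each lineage is the last "|"-separated field.
profile["deepest"] = profile["clade_name"].str.split("|").str[-1]
# Rank prefix before "__", e.g. "d" for domain and "p" for phylum.
profile["rank"] = profile["deepest"].str.split("__").str[0]

phyla = profile[profile["rank"] == "p"]
print(phyla[["clade_name", "relative_abundance"]])
```

For a file on disk, replace io.StringIO(example) with the path to the .sylphmpa file.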
Merge multiple taxonomic profiles from sylph_to_taxprof.py into a TSV table
python merge_sylph_taxprof.py *.sylphmpa --column {ANI, relative_abundance, sequence_abundance} -o output_table.tsv
- *.sylphmpa: files from sylph_to_taxprof.py.
- --column: can be ANI, relative abundance, or sequence abundance (see paper for the difference between abundances).
- -o: output file in TSV format.
clade_name sample1.fastq.gz sample2.fastq.gz
d__Archaea 0.0 1.1
d__Archaea|p__Methanobacteriota 0.0 0.0965
...
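To make the shape of the merged table concrete, here is a minimal pandas sketch of the same kind of merge: one column per sample, one row per clade, and 0.0 where a clade is absent from a sample. It is not the implementation of merge_sylph_taxprof.py. The function merge_profiles is hypothetical, the d__Bacteria row is invented, and the sketch assumes each profile has a clade_name column with unique values.

```python
import pandas as pd


def merge_profiles(profiles, column):
    """Merge per-sample profiles into one clade-by-sample table.

    profiles: dict mapping sample name -> DataFrame with a 'clade_name'
              column and the requested value column (hypothetical layout).
    column:   which value column to keep, e.g. 'relative_abundance'.
    """
    series = {
        sample: df.set_index("clade_name")[column]
        for sample, df in profiles.items()
    }
    # Concatenating along columns outer-joins on the union of clade names.
    merged = pd.concat(series, axis=1)
    # Clades missing from a sample become 0.0.
    return merged.fillna(0.0).rename_axis("clade_name")


# Toy inputs; the d__Bacteria row is invented for illustration.
sample1 = pd.DataFrame({
    "clade_name": ["d__Bacteria"],
    "relative_abundance": [100.0],
})
sample2 = pd.DataFrame({
    "clade_name": ["d__Archaea", "d__Archaea|p__Methanobacteriota"],
    "relative_abundance": [1.1, 0.0965],
})

merged = merge_profiles(
    {"sample1.fastq.gz": sample1, "sample2.fastq.gz": sample2},
    column="relative_abundance",
)
# With no path given, to_csv returns the TSV text as a string.
print(merged.to_csv(sep="\t"))
```

Passing a file name to to_csv instead, for example merged.to_csv("output_table.tsv", sep="\t"), writes the table to disk.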