Sylph tool Wrapper #1518

Open
wants to merge 23 commits into
base: master
Changes from 18 commits
027c8be
transfered sylph to galaxytools due to size of test database metadata…
tcollins2011 Oct 8, 2024
d7fc0e7
Update tools/sylph/sylph.xml
tcollins2011 Oct 9, 2024
6146fb0
Update tools/sylph/.shed.yml
tcollins2011 Oct 9, 2024
f1e4551
Update tools/sylph/.shed.yml
tcollins2011 Oct 9, 2024
6b9d9cb
Update tools/sylph/macros.xml
tcollins2011 Oct 9, 2024
55d54f1
Update tools/sylph/macros.xml
tcollins2011 Oct 9, 2024
3519a0b
Update tools/sylph/.shed.yml
tcollins2011 Oct 9, 2024
8944a8d
replaced all double quotes in the command section with single quotes …
tcollins2011 Oct 9, 2024
63ca10f
Update tools/sylph/sylph.xml
tcollins2011 Oct 9, 2024
46e5d68
updated database sylmlink to better reflect the name
tcollins2011 Oct 9, 2024
477e325
changed ouput name
tcollins2011 Oct 9, 2024
25d9bac
changed database path names and sample file
tcollins2011 Oct 10, 2024
1513faf
python linting and tab spacing
tcollins2011 Oct 10, 2024
691a915
Merge branch 'master' into sylph
tcollins2011 Oct 10, 2024
9221bb8
fixing flake8 linting problems
tcollins2011 Oct 10, 2024
d86bcf0
force adding the extra test files and fixing a spacing issue in python
tcollins2011 Oct 10, 2024
3c35f6f
actually remembring to add the correct whitespace file to my commit
tcollins2011 Oct 10, 2024
5566c61
changed profile and query to be different tools and upated the macros…
tcollins2011 Oct 25, 2024
21de493
some of the comments
Dec 14, 2024
3fae5b1
lint fix
Dec 15, 2024
da2821f
Merge branch 'bgruening:master' into sylph
tcollins2011 Dec 16, 2024
2dfa011
fixed database tabs
tcollins2011 Dec 16, 2024
bdc5e02
added history database test
tcollins2011 Dec 16, 2024
13 changes: 13 additions & 0 deletions tools/sylph/.shed.yml
@@ -0,0 +1,13 @@
name: sylph
owner: bgruening
description: sylph - fast and precise species-level metagenomic profiling with ANIs
long_description: sylph is a program that performs ultrafast (1) ANI querying or (2) metagenomic profiling for metagenomic shotgun samples.
homepage_url: https://github.com/bluenote-1577/sylph
remote_repository_url: https://github.com/bgruening/galaxytools/tree/master/tools/sylph
categories:
- Metagenomics
type: unrestricted
auto_tool_repositories:
name_template: "{{ tool_id }}"
description_template: "{{ tool_name }} from the sylph suite"

10 changes: 10 additions & 0 deletions tools/sylph/README.md
@@ -0,0 +1,10 @@
For Galaxy admins and local runs:

The sylph databases have associated metadata files. Each metadata file MUST be paired with the database it was built for, otherwise the tool cannot produce correct output. The easiest places to download databases and metadata files are:
For databases: https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases
For metadata: https://github.com/bluenote-1577/sylph-utils

The tool assumes that the directory referenced by the data table is laid out as follows:
<name_of_organism>
- database.syldb
- database.tsv.gz
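
An admin may want to check a database directory before registering it in the data table. The snippet below is only an illustrative sketch and not part of the wrapper: the helper name `missing_database_files` is hypothetical, and the default file names are an assumption taken from the names the command section symlinks from the cached path (`database.syldb` and `database.tsv.gz`).

```python
from pathlib import Path


def missing_database_files(db_dir, db_name="database.syldb",
                           metadata_name="database.tsv.gz"):
    """Return the expected file names that are absent from db_dir.

    Hypothetical helper: the default names are assumptions based on the
    symlinks created in the tool's command section.
    """
    db_dir = Path(db_dir)
    # Keep the order (database first, metadata second) so the report is stable.
    return [name for name in (db_name, metadata_name)
            if not (db_dir / name).is_file()]
```

An empty list means both files are present; any other result names the files that still need to be downloaded and placed next to each other.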
148 changes: 148 additions & 0 deletions tools/sylph/macros.xml
@@ -0,0 +1,148 @@
<macros>
<token name="@TOOL_VERSION@">0.6.1</token>
<token name="@VERSION_SUFFIX@">0</token>
<token name="@LICENSE@">MIT</token>
<token name="@DB_SELECTOR@"><![CDATA[
#if $database_select.select == 'cached':
ln -s '$database_select.sylph_database.fields.path/database.syldb' 'database.syldb' &&
#else:
ln -s '$database_select.sylph_database' 'database.syldb' &&
#end if
]]></token>
<token name="@SINGLE_INPUT@"><![CDATA[
#if $sketch.input.ext == 'fastqsanger':
#set $ext = 'fastq'
#else if $sketch.input.ext == 'fastqsanger.gz':
#set $ext = 'fastq.gz'
#else:
#set $ext = str($sketch.input.ext)
#end if
#if $sketch.input.element_identifier.endswith('.fastq') or $sketch.input.element_identifier.endswith('.fastq.gz'):
#set $input = re.sub(r'\s+', '_', $sketch.input.element_identifier)
#else:
#set $input = re.sub(r'\s+', '_', $sketch.input.element_identifier + '.' + str($ext))
#end if
ln -s '$sketch.input' '$input' &&
]]></token>
<token name="@SINGLE_GROUP@"><![CDATA[
#set input = ''
#for $number, $current_file in enumerate($sketch.input):
#if $current_file.ext == 'fastqsanger':
#set $ext = 'fastq'
#else if $current_file.ext == 'fastqsanger.gz':
#set $ext = 'fastq.gz'
#else:
#set $ext = str($current_file.ext)
#end if
#if $current_file.element_identifier.endswith('.fastq') or $current_file.element_identifier.endswith('.fastq.gz'):
#set $current_input = re.sub(r'\s+', '_', $current_file.element_identifier)
#else:
#set $current_input = re.sub(r'\s+', '_', $current_file.element_identifier + '.' + str($ext))
#end if
ln -s '${current_file}' '$current_input' &&
#set input = str($input) + ' ' + str($current_input)
#end for
]]></token>
<token name="@PAIRED@"><![CDATA[
#if $sketch.input_1.ext == 'fastqsanger':
#set $ext_1 = 'fastq'
#else if $sketch.input_1.ext == 'fastqsanger.gz':
#set $ext_1 = 'fastq.gz'
#else:
#set $ext_1 = str($sketch.input_1.ext)
#end if

#if $sketch.input_2.ext == 'fastqsanger':
#set $ext_2 = 'fastq'
#else if $sketch.input_2.ext == 'fastqsanger.gz':
#set $ext_2 = 'fastq.gz'
#else:
#set $ext_2 = str($sketch.input_2.ext)
#end if

#if $sketch.input_1.element_identifier.endswith('.fastq') or $sketch.input_1.element_identifier.endswith('.fastq.gz'):
#set $read1 = re.sub(r'\s+', '_', $sketch.input_1.element_identifier)
#else:
#set $read1 = re.sub(r'\s+', '_', str($sketch.input_1.element_identifier) + '.' + str($ext_1))
#end if
#if $sketch.input_2.element_identifier.endswith('.fastq') or $sketch.input_2.element_identifier.endswith('.fastq.gz'):
#set $read2 = re.sub(r'\s+', '_', $sketch.input_2.element_identifier)
#else:
#set $read2 = re.sub(r'\s+', '_', str($sketch.input_2.element_identifier) + '.' + str($ext_2))
#end if
ln -s '$sketch.input_1' '$read1' &&
ln -s '$sketch.input_2' '$read2' &&
]]></token>
<token name="@PAIRED_GROUP@"><![CDATA[
#if $sketch.input.forward.ext == 'fastqsanger':
#set $ext_1 = 'fastq'
#else if $sketch.input.forward.ext == 'fastqsanger.gz':
#set $ext_1 = 'fastq.gz'
#else:
#set $ext_1 = str($sketch.input.forward.ext)
#end if

#if $sketch.input.reverse.ext == 'fastqsanger':
#set $ext_2 = 'fastq'
#else if $sketch.input.reverse.ext == 'fastqsanger.gz':
#set $ext_2 = 'fastq.gz'
#else:
#set $ext_2 = str($sketch.input.reverse.ext)
#end if

#set $read1 = re.sub(r'\s+', '_', str($sketch.input.element_identifier) + '.' + str($ext_1))
#set $read2 = re.sub(r'\s+', '_', str($sketch.input.element_identifier) + '_r2.' + str($ext_2))
ln -s '$sketch.input.forward' '$read1' &&
ln -s '$sketch.input.reverse' '$read2' &&
]]></token>
<xml name="requirements">
<requirements>
<requirement type="package" version="@TOOL_VERSION@">sylph</requirement>
<requirement type="package" version="3.11.9">python</requirement>
<requirement type="package" version="2.2.3">pandas</requirement>
</requirements>
</xml>
<xml name="description">
<description>fast and precise species-level metagenomic profiling with ANIs</description>
</xml>
<xml name="citation">
<citations>
<citation type="doi">10.1038/s41587-024-02412-y</citation>
</citations>
</xml>
<xml name="creator">
<creator>
<organization name="University of Toronto" url="https://github.com/bluenote-1577/sylph"/>
</creator>
</xml>
<xml name="xrefs">
<xrefs>
<xref type="bio.tools">sylph</xref>
</xrefs>
</xml>
<xml name="input_database">
<conditional name="database_select">
<param name="select" type="select" label="Choose the source for databases and metadata">
<option value="cached">Cached data</option>
<option value="history">History</option>
</param>
<when value="cached">
<param label="Select a sylph database" name="sylph_database" type="select">
<options from_data_table="sylph_databases">
<validator message="No Sylph databases are available" type="no_options" />
Review comment (Owner): add a filter here for version 1 or something like that, and then we include version 1 in the test file. Whenever the tool changes the DB layout, we increase this version.

Review comment (Owner): ping, you are not filtering the DB here according to the version

</options>
</param>
</when>
<when value="history">
<param label="Select a history dataset" name="sylph_database" type="data" format="binary" />
<param label="Metadata file for metaphlan and krona outputs" name="metadata" type="data" format="tabular.gz" optional="true" help="The metadata file MUST be directly associated with the input database. For more information, view the help text of the tool."/>
</when>
</conditional>
</xml>
<xml name="output_format">
<param label="Additional output formats" name="outputs" type="select" display="checkboxes" multiple="true" help="In addition to Sylph's tabular output, you may output a file converted to these formats">
<option value="metaphlan">Sylph's MetaPhlAn-like output</option>
<option value="krona">Krona compatible</option>
</param>
</xml>
</macros>
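
The renaming logic shared by the input tokens in macros.xml can be mirrored in plain Python, which may make the staged file names easier to reason about. This is a minimal sketch under stated assumptions, not code from the PR: the function names `staged_name` and `paired_collection_names` and the `EXT_MAP` constant are hypothetical, and the behaviour is inferred only from the Cheetah directives shown in the tokens.

```python
import re

# Galaxy datatype -> file extension used for the symlink, as in the tokens.
# Any other datatype is used unchanged as the extension.
EXT_MAP = {"fastqsanger": "fastq", "fastqsanger.gz": "fastq.gz"}


def staged_name(element_identifier, galaxy_ext):
    """Mirror @SINGLE_INPUT@: name of the symlink created for one dataset."""
    ext = EXT_MAP.get(galaxy_ext, galaxy_ext)
    if element_identifier.endswith((".fastq", ".fastq.gz")):
        # The identifier already carries a usable extension; keep it.
        name = element_identifier
    else:
        name = f"{element_identifier}.{ext}"
    # Runs of whitespace become a single underscore.
    return re.sub(r"\s+", "_", name)


def paired_collection_names(collection_identifier, ext_forward, ext_reverse):
    """Mirror @PAIRED_GROUP@: both names derive from the collection identifier."""
    fwd = EXT_MAP.get(ext_forward, ext_forward)
    rev = EXT_MAP.get(ext_reverse, ext_reverse)
    read1 = re.sub(r"\s+", "_", f"{collection_identifier}.{fwd}")
    read2 = re.sub(r"\s+", "_", f"{collection_identifier}_r2.{rev}")
    return read1, read2
```

Under these assumptions, the test input `test R1.fq` with datatype `fastq` would be staged as `test_R1.fq.fastq`, while `single_1.fastq.gz` keeps its name because it already ends in `.fastq.gz`.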
210 changes: 210 additions & 0 deletions tools/sylph/sylph_profile.xml
@@ -0,0 +1,210 @@
<tool id="sylph_profile" name="sylph profile" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="23.0">
<expand macro='description'/>
<macros>
<import>macros.xml</import>
</macros>
<expand macro='requirements'/>
<command detect_errors='exit_code'><![CDATA[
#import re
##SYMLINK SYLPH DB
@DB_SELECTOR@
##Single input
#if $sketch.type == 'single':
@SINGLE_INPUT@
##Single group
#else if $sketch.type == 'single_group':
@SINGLE_GROUP@
##Paired input
#else if $sketch.type == 'paired':
@PAIRED@
##Paired group
#else if $sketch.type == 'paired_group':
@PAIRED_GROUP@
#end if
##SKETCHING
sylph sketch
#if $sketch.type == 'single':
$input
#else if $sketch.type == 'single_group':
-r $input
#else if $sketch.type == 'paired':
-1 $read1
-2 $read2
#else if $sketch.type == 'paired_group':
-1 $read1
-2 $read2
#end if
-t \${GALAXY_SLOTS:-4}
-d sylph_sketches &&
##MAIN COMMAND
sylph profile
database.syldb
sylph_sketches/*.sylsp
#if $min_num_kmers:
--min-number-kmers ${min_num_kmers}
#end if
-t \${GALAXY_SLOTS:-4}
-o $output

#if $outputs:
#if $database_select.select == 'cached':
&& ln -s '$database_select.sylph_database.fields.path/database.tsv.gz' 'database.tsv.gz'
#else:
&& ln -s '$database_select.metadata' 'database.tsv.gz'
Review comment (Owner): what is database_select.metadata? Is this piece tested?

Reply (Contributor): Because of the way sylph works, it uses two separate pieces of data: the database itself and the metadata file associated with it. The metadata file MUST be the one associated with the database, otherwise it will not work. For that reason, we set up the database entry to point to a directory, with the intent that the directory holds both the database and its associated metadata file. If a user wants to select the database from the history instead, we added a second param to the history branch that lets the user optionally supply the metadata file, should they want to run that part of the tool.

#end if
&& python '$__tool_directory__/sylph_to_taxprof.py' -s $output -m database.tsv.gz -o metaphlan_
#if 'krona' in $outputs:
&& python '$__tool_directory__/sylphformatoutput.py' format_for_krona --metaphlan_output *.sylphmpa --krona_output krona.tsv
&& mkdir krona_out && mv *krona.tsv krona_out
#end if
&& mkdir metaphlan_out && mv *.sylphmpa metaphlan_out
#end if
]]></command>
<inputs>
<conditional name="sketch">
<param name="type" type="select" label="Select the type of reads used">
<option value="single">Individual single-end reads</option>
<option value="single_group">Group of single-end reads</option>
<option value="paired">One set of paired-end reads</option>
<option value="paired_group">Group of paired-end reads</option>
</param>
<!-- Restricting the tool inputs to fastq datatypes means the staged files only ever get a fastq or fastq.gz extension -->
<when value="single">
<param name="input" type="data" format="fastq,fastq.gz,fastqsanger,fastqsanger.gz" label="Single-end input reads"/>
</when>
<when value="single_group">
<param name="input" type="data" format="fastq,fastq.gz,fastqsanger,fastqsanger.gz" label="Single-end input reads" multiple="true"/>
</when>
<when value="paired">
<param name="input_1" type="data" format="fastq,fastq.gz,fastqsanger,fastqsanger.gz" label="Paired-end input reads 1"/>
<param name="input_2" type="data" format="fastq,fastq.gz,fastqsanger,fastqsanger.gz" label="Paired-end input reads 2"/>
</when>
<when value="paired_group">
<param name="input" type="data_collection" format="fastq,fastq.gz,fastqsanger,fastqsanger.gz" label="Paired-end input reads" collection_type="paired"/>
</when>
</conditional>
<param name="min_num_kmers" type="integer" min="1" value="50" label="Minimum number of k-mers for sylph to output a result" help="The minimum number of k-mers needed for sylph to output a result. This is (approximately) the contig length divided by -c. With default settings, --min-number-kmers 10 can work with contigs ~2500 bp. For smaller contigs, consider -c 100."/>
<expand macro="output_format"/>
<expand macro="input_database"/>
</inputs>
<outputs>
<data format="tabular" name="output" label="${tool.name} on ${on_string}"/>
<collection name="metaphlan_out" type="list" label="${tool.name} on ${on_string}: MetaPhlAn-style output">
<filter>outputs and 'metaphlan' in outputs</filter>
<discover_datasets pattern="__name_and_ext__" directory="metaphlan_out/" />
</collection>
<collection name="krona_out" type="list" label="${tool.name} on ${on_string}: Krona-useable output">
<filter>outputs and 'krona' in outputs</filter>
<discover_datasets pattern="__name_and_ext__" directory="krona_out/" />
</collection>
</outputs>
<tests>
<test expect_num_outputs="1">
<param name="sylph_database" value="sylph_db"/>
<conditional name="sketch">
<param name="type" value="single"/>
<param name="input" value="single_1.fastq.gz" ftype="fastq"/>
</conditional>
<output name="output" value="output_1.tabular"/>
</test>

<!--Test 2 Group of Single-end Inputs-->
<test expect_num_outputs="1">
<param name="sylph_database" value="sylph_db"/>
<conditional name="sketch">
<param name="type" value="single_group"/>
<param name="input" value="single_1.fastq.gz,single_2.fastq.gz" ftype="fastq"/>
</conditional>
<param name="min_num_kmers" value="49"/>
<output name="output" value="output_2.tabular" compare="sim_size"/>
</test>

<!-- Test 3 Paired-end reads -->
<test expect_num_outputs="1">
<param name="sylph_database" value="sylph_db"/>
<conditional name="sketch">
<param name="type" value="paired"/>
<param name="input_1" value="test R1.fq" ftype="fastq"/>
<param name="input_2" value="test R2.fq" ftype="fastqsanger"/>
</conditional>
<output name="output" value="output_3.tabular"/>
</test>

<!-- Test 4 Collection of Paired-end Reads -->
<test expect_num_outputs="1">
<param name="sylph_database" value="sylph_db"/>
<conditional name="sketch">
<param name="type" value="paired_group"/>
<param name="input">
<collection type="paired" name="test">
<element name="forward" ftype="fastq" value="test R1.fq"/>
<element name="reverse" ftype="fastq" value="test R2.fq"/>
</collection>
</param>
</conditional>
<output name="output" value="output_4.tabular"/>
</test>
<!-- Test 5 output format conversion -->
<test expect_num_outputs="3">
<param name="sylph_database" value="sylph_db"/>
Review comment (Owner): database_select = history is not tested, as far as I know

<conditional name="sketch">
<param name="type" value="single"/>
<param name="input" value="single_1.fastq.gz" ftype="fastq"/>
</conditional>
<param name="outputs" value="metaphlan,krona"/>
<!-- With test data, output will be empty for krona tool so only check against metaphlan converter, but keep num_outputs-->
<output_collection name="metaphlan_out" type="list">
<element name="metaphlan_single_1.fastq.gz" value="test.sylphmpa"/>
</output_collection>
</test>
</tests>
<help><![CDATA[
**What is sylph?**

Sylph is an extremely fast and memory-efficient program for profiling and searching metagenomic samples against databases. It is 10-100x faster than other popular software such as MetaPhlAn or Kraken, and it is more memory-efficient as well.

**What can sylph do?**

- Profile metagenomes: sylph can calculate the abundances of genomes in a sample using a reference database. This is the same type of output as Kraken or MetaPhlAn.
- Search genomes against metagenomes: sylph can check if a genome is contained in your sample (e.g. is this E. coli genome in my sample?).
- ANI querying: sylph can estimate the containment average nucleotide identity (ANI) of a reference genome to the genomes in your sample.
- Use custom reference databases: Eukaryotes, viruses, and any collections of fasta files are ok.
- Long-reads are usable: sylph is primarily optimized for short-reads, but it can utilize nanopore or PacBio reads with high precision.
- Calculate coverage: sylph can estimate the coverage (not just the abundance) of genomes in your database.

`See here for more information on what sylph can and cannot do <https://github.com/bluenote-1577/sylph/wiki/Introduction:-what-is-sylph-and-how-does-it-work%3F>`_.

----

**Output**

Sylph profile outputs a TSV (tab-separated values) file. Each row is one genome detected in the metagenome sample.

- Sample_file: the filename of the reads/sample.
- Genome_file: the filename of the detected genome.
- Taxonomic_abundance: normalized taxonomic abundance as a percentage. Coverage-normalized - same as MetaPhlAn abundance
- Sequence_abundance: normalized sequence abundance as a percentage. The "number of reads" assigned to each genome - same as Kraken abundance
- Adjusted_ANI: adjusted containment ANI estimate.

  - If coverage adjustment is possible (cov is < 3x cov): returns coverage-adjusted ANI
  - If coverage is too low/high: returns Naive_ANI (see below)

- Eff_cov/True_cov: an estimate of the effective, or if -u specified, the true coverage. Always a decimal number.
- ANI_5-95_percentile: [5%,95%] confidence intervals. Not always a decimal number.

  - If coverage adjustment is possible: float-float e.g. 98.52-99.55
  - If coverage is too low/high: NA-NA is given.

- Eff_lambda: estimate of the effective coverage parameter. Not always a decimal number.

  - If coverage adjustment is possible: lambda estimate is given
  - If coverage is too low/high: LOW or HIGH is output

- Lambda_5-95_percentile: [5%, 95%] confidence intervals for lambda. Same format rules as ANI_5-95_percentile.
- Median_cov: median k-mer multiplicity for k-mers with >= 1 multiplicity.
- Mean_cov_geq1: mean k-mer multiplicity for k-mers with >= 1 multiplicity.
- Containment_ind: int/int showing the containment index (number of k-mers found in sample divided by total k-mers), e.g. 959/1053.
- Naive_ANI: containment ANI without coverage adjustment.
- kmers_reassigned: the number of k-mers reassigned away from the genome.
- Contig_name: name of the first contig in the genome

Additional output files are available. The MetaPhlAn-style output is formatted similarly to the output of the `MetaPhlAn tool <https://toolshed.g2.bx.psu.edu/repos/iuc/metaphlan/metaphlan/4.1.1+galaxy3>`_.
This output is *NOT* compatible with Krona directly. For that, please select the Krona-style output option.


]]></help>
<expand macro="citation"/>
</tool>
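
The help text above notes that some output columns are not always plain numbers (`ANI_5-95_percentile` may be `NA-NA`, and `Containment_ind` is an `int/int` pair). For anyone post-processing the tabular output, the sketch below shows one way to read those fields with the standard library. It is an illustration only: the function names are hypothetical, it assumes the TSV starts with a header row using the column names listed in the help text, and the sample rows are synthetic values made up for the example, not real sylph results.

```python
import csv
import io


def parse_interval(value):
    """Parse 'low-high' into a float pair; return None for 'NA-NA'."""
    if value == "NA-NA":
        return None
    low, high = value.split("-")
    return float(low), float(high)


def parse_containment(value):
    """Parse 'found/total' k-mer counts into an int pair."""
    found, total = value.split("/")
    return int(found), int(total)


def read_profile(handle):
    """Read selected columns from a sylph profile TSV (header row assumed)."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "genome": row["Genome_file"],
            "ani_interval": parse_interval(row["ANI_5-95_percentile"]),
            "containment": parse_containment(row["Containment_ind"]),
        })
    return rows


# Synthetic two-row table, reduced to the columns used above.
SAMPLE = (
    "Sample_file\tGenome_file\tANI_5-95_percentile\tContainment_ind\n"
    "reads.fastq\tgenome_a.fa\t98.52-99.55\t959/1053\n"
    "reads.fastq\tgenome_b.fa\tNA-NA\t12/1000\n"
)

if __name__ == "__main__":
    for record in read_profile(io.StringIO(SAMPLE)):
        print(record)
```

Returning `None` for `NA-NA` keeps the "coverage too low/high" case explicit, so downstream code has to handle it rather than silently treating it as a number.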