MIRA v4.0 de novo assembler does not output a collection for collection input #3

Open
tshtatland opened this issue Sep 13, 2017 · 8 comments

Comments

@tshtatland

When used in a workflow, the MIRA v4.0 de novo assembler does not output a collection when the input is a collection, which is what I expected. I am attaching a workflow screenshot with red and green arrows that highlight the issue. I am using the latest version of the tool on a local Galaxy installation (v17.05):
MIRA v4.0 de novo assember Takes Sanger, Roche 454, Solexa/Illumina, Ion Torrent and PacBio reads (Galaxy Version 0.0.11)
Galaxy Tool Shed - https://toolshed.g2.bx.psu.edu/repository?repository_id=efe8c48b382cb9cc&changeset_revision=1713289d9908
I expected (perhaps incorrectly) that most Galaxy tools, such as the MIRA assembler, would be designed so that a collection input (N fastq/fasta files, each with millions of reads) produces a collection output (N fasta files, each with a small number of contigs), as shown in the screenshot. N is the number of biologically distinct samples/libraries. From my point of view, most Galaxy tools should not be "reducing". The reducing step should probably be done later by a simple reducing tool; otherwise the combined tool is not collection-friendly. I wonder if this naive view makes sense... Thank you!
[Screenshot: mira_collection_in_single_out]

@tshtatland tshtatland changed the title MIRA v4.0 de novo assember does not output a collection for collection input MIRA v4.0 de novo assembler does not output a collection for collection input Sep 13, 2017
@peterjc
Owner

peterjc commented Sep 14, 2017

MIRA will happily take multiple FASTQ inputs, e.g. an organism sequenced with multiple libraries or runs, so yes, the tool deliberately caters to mapping N input files to one assembly (i.e. "reduce" mode).

I'd hope the Galaxy GUI would allow you to deliberately choose to run N copies of MIRA instead (i.e. "map" mode), giving N assemblies.
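
To make the two modes concrete, here is a minimal sketch in plain Python, independent of Galaxy. The `run_mira` function is a hypothetical stand-in for one MIRA job (it only builds a label), and the sample file names are made up for illustration.

```python
from typing import List, Sequence


def run_mira(fastq_files: Sequence[str]) -> str:
    """Hypothetical stand-in for one MIRA job; returns a label for the assembly."""
    return "assembly(" + "+".join(fastq_files) + ")"


def reduce_mode(fastq_files: Sequence[str]) -> List[str]:
    """One job consumes all N inputs and yields a single assembly."""
    return [run_mira(fastq_files)]


def map_mode(fastq_files: Sequence[str]) -> List[str]:
    """N independent jobs, one per input, yield N assemblies."""
    return [run_mira([f]) for f in fastq_files]


# Made-up sample names, one file per biologically distinct sample.
samples = ["sampleA.fastq", "sampleB.fastq", "sampleC.fastq"]
print(reduce_mode(samples))  # a list holding one combined assembly
print(map_mode(samples))     # a list holding one assembly per sample
```

In "reduce" mode the output is a single dataset regardless of N, whereas in "map" mode the output has the same shape as the input, which is what a collection-in, collection-out workflow step needs.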

Paging @jmchilton as the Galaxy collections expert.

@tshtatland
Author

Thank you for the quick response! Let me also use this opportunity to thank you for writing and maintaining MIRA tools, we use them frequently in our Galaxy instance!
I asked about this issue yesterday on the Galaxy gitter channel: https://gitter.im/Galaxy-Training-Network/Lobby
Perhaps I do not understand something fundamental here, but many Galaxy tools, such as spades (also an assembler) and cat (concatenate inputs = M files into output = 1 file), are collection-friendly. An example workflow with spades (compared to MIRA, as is) is shown in the screenshot attached above. Below I am attaching a screenshot of a similar workflow, where I also added the "cat" tool for comparison, and a version of MIRA with hacked XML. Everything seems collection-friendly now. As you can see, multiple inputs go into 1 output, but one can still use the tool on a collection. This is the desired goal for me. But the hack is not production-ready. It is based on the suggestion from Bjoern:

Remove the multiple=True
and the for loop in the command section
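
For anyone wanting to try the first half of that suggestion, here is a minimal sketch in Python of stripping the `multiple` attribute from data parameters. The XML fragment, the tool id, the parameter name `reads`, and the `drop_multiple` helper are all made up for illustration; the real MIRA wrapper's parameters will differ, and the Cheetah `for` loop in the command section still has to be rewritten by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a Galaxy tool XML, NOT the real MIRA wrapper.
TOOL_XML = """<tool id="example_assembler" name="Example assembler">
  <inputs>
    <param name="reads" type="data" format="fastq" multiple="true" label="Reads"/>
  </inputs>
</tool>"""


def drop_multiple(xml_text: str) -> str:
    """Remove the 'multiple' attribute from every data <param> element."""
    root = ET.fromstring(xml_text)
    for param in root.iter("param"):
        if param.get("type") == "data":
            # pop() with a default is a no-op when the attribute is absent
            param.attrib.pop("multiple", None)
    return ET.tostring(root, encoding="unicode")


patched = drop_multiple(TOOL_XML)
print(patched)
```

With the attribute gone, the parameter accepts a single dataset per job, so Galaxy falls back to starting one job per collection element.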

[Screenshot: mira_collection_in_single_out_spades_cat]

@peterjc
Owner

peterjc commented Sep 14, 2017

The MIRA wrapper is collection aware (via multiple="true" on the input parameter which @bgruening mentioned). I would infer that the Spades wrapper is not collection aware, which would explain why you (only) get the default N inputs to N jobs behaviour from Galaxy (useful and practical for single input tools).

I've not had a chance to play with the interface to confirm how you'd run MIRA to get N jobs from N inputs, but I would expect the collections input control to allow that.

@bgruening

There are two different concepts here that are clashing I think.
@peterjc every tool is collection aware in the sense that, if you don't do any magic in your tool description, Galaxy will simply start X jobs for X datasets. This is what @tshtatland probably wants/expects.

By "collection aware" you are probably referring to multiple=True, which is true, but it means that a single job from this tool will consume X datasets at once. As a result, the current MIRA wrapper cannot iterate over a collection.

So both approaches are correct, but they imply a different UX. One solution would be to remove multiple=True from MIRA and rely on the fact that people can merge FASTA files before they start the assembly, if they really need to provide multiple FASTA files to it.
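
As a rough illustration of the "merge first" route, here is a minimal Python sketch that concatenates several FASTA files into one before assembly. The `merge_fasta` helper and the file names are hypothetical; inside Galaxy the same step would normally be done with a concatenation tool such as cat rather than a script. It counts records by the `>` header line, so it applies to FASTA only, not FASTQ.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def merge_fasta(inputs: Iterable[Path], output: Path) -> int:
    """Concatenate FASTA files into one file; return the number of records written."""
    records = 0
    with output.open("w") as out:
        for path in inputs:
            with path.open() as handle:
                for line in handle:
                    if line.startswith(">"):
                        records += 1
                    # Guard against a file that lacks a trailing newline,
                    # which would otherwise glue two records together.
                    out.write(line if line.endswith("\n") else line + "\n")
    return records


# Small self-contained demonstration with made-up sequences.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp, "a.fasta")
    a.write_text(">r1\nACGT\n")
    b = Path(tmp, "b.fasta")
    b.write_text(">r2\nTTGA")  # deliberately no trailing newline
    merged = Path(tmp, "merged.fasta")
    n_records = merge_fasta([a, b], merged)
    merged_text = merged.read_text()

print(n_records)
```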

@peterjc
Owner

peterjc commented Sep 15, 2017

I think this is a Galaxy UI limitation for tools with multiple="true", and welcome comment from @jmchilton - can Galaxy really not iterate over a collection in this situation?

I'd like to try this locally with Spades - @tshtatland which Spades wrapper are you using? Can you tell me the Tool Shed URL (as there are at least two different wrappers available)?

@jmchilton

jmchilton commented Sep 15, 2017

can Galaxy really not iterate over a collection in this situation

Correct - it cannot currently. There was an issue in Trello, but I cannot find one on GitHub for this, so I've created galaxyproject/galaxy#4623. I included workarounds you can add to MIRA if you want the tool to support both modes of operation. Certainly some Galaxy developers would discourage those workarounds, but I tend to be a bit more pragmatic, I think.

@peterjc
Owner

peterjc commented Sep 15, 2017

Thanks John - this is a very difficult set of concepts to convey to the user, so I can understand why it isn't in the current Galaxy UI.

The suggestions you've given for workarounds make sense, but they would, I think, break backward compatibility with the current versions of the MIRA wrapper. Given that, it would be nice for me to take more direct advantage of the paired collection infrastructure as part of any changes to the input handling.

@tshtatland
Author

This is the spades tool that corresponds to the screenshot above:
spades SPAdes genome assembler for regular and single-cell projects (Galaxy Version 1.0)
https://toolshed.g2.bx.psu.edu/repository?repository_id=6a122c80d3c9733e
https://toolshed.g2.bx.psu.edu/view/lionelguy/spades/21734680d921
