Allow Mapping (Batch-Mode) over multiple data parameters. #4623

jmchilton · 2017-09-15T12:06:26Z

For tool parameters with type="data" multiple="true", Galaxy provides an interface for reducing a list collection over these parameters. Many more options should be available. The easiest and most essential of these is mapping a list over these parameters (that is, running N jobs, each with a single input from the supplied list).
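To make the distinction concrete, here is a minimal sketch of the two behaviours in plain Python. It is not Galaxy code: `reduce_jobs` and `map_jobs` are hypothetical names, and a "job" is modelled simply as the list of datasets it would receive.

```python
from typing import List


def reduce_jobs(datasets: List[str]) -> List[List[str]]:
    """Reduce: a single job receives every dataset in the list."""
    return [list(datasets)]


def map_jobs(datasets: List[str]) -> List[List[str]]:
    """Map: one job per dataset, each job seeing a single input."""
    return [[dataset] for dataset in datasets]


if __name__ == "__main__":
    samples = ["a.txt", "b.txt", "c.txt"]
    print(len(reduce_jobs(samples)))  # one job for the whole list
    print(len(map_jobs(samples)))     # one job per element
```

Reducing is what the multiple-data parameter does with a list today; mapping is the missing behaviour this issue asks for.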

Once that is done there is still more work to do. You should be able to map multiple lists over these parameters (so supply two lists of size N and run N jobs, each with the matching two datasets from the two supplied lists). You should also be able to map over the outer lists of a nested collection and reduce the inner ones: if you have a list of samples where each list element is a list of replicates, and you have a concatenation tool, you should be able to concatenate the replicates (reduce the replicates) and build a list of merged samples (map the samples) from that tool. (I no longer think this is a good idea - see note below.)
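The two extra modes can be sketched the same way. This is an illustration only, with hypothetical function names and a nested collection modelled as a dict of lists; it says nothing about how Galaxy would implement either mode.

```python
from typing import Dict, List, Tuple


def map_paired(first: List[str], second: List[str]) -> List[Tuple[str, str]]:
    """Map two lists together: job i receives element i of each list."""
    if len(first) != len(second):
        raise ValueError("mapped lists must have the same length")
    return list(zip(first, second))


def map_outer_reduce_inner(samples: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """One job per outer element (sample); each job gets all inner elements (replicates)."""
    return [(sample_id, list(replicates)) for sample_id, replicates in samples.items()]


if __name__ == "__main__":
    print(map_paired(["f1", "f2"], ["r1", "r2"]))
    print(map_outer_reduce_inner({"s1": ["s1_rep1", "s1_rep2"], "s2": ["s2_rep1"]}))
```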

Each of the modes of operation described above can be worked around by modifying the tool itself, but this is definitely a hack; the GUI should have a common language and UI for describing these operations.

The hacks to work around these limitations include...

Allowing both mapping and reduction of simple lists can be accomplished by replacing the type="data" multiple="true" parameter with a conditional that has that same parameter as one path and a simple (non-multiple) data parameter as the second path. Modifying that conditional so that the second path isn't just a data parameter but a repeat parameter with a minimum repeat number of 1 enables the second use case above, mapping multiple lists. Adding another case with a type="data_collection" collection_type="list" parameter to the tool allows the "map over the outer list, reduce the inner list" operation described above. Wrapping that parameter in a repeat would allow you to do that with multiple list:lists. The last operation would also work for list:list:list if you wanted to map over the outer two list levels but reduce the inner one. If you instead wanted to reduce the inner two lists and map over the outer one, you could add yet another conditional case with a list:list input.


jmchilton · 2018-05-16T14:44:37Z

I've evolved on this issue. I actually don't think the tool form should supply advanced selection for reducing multiple layers of nesting - for instance, reducing the inner two lists of list:list:list and mapping over the outer list. The GUI is too complicated for that, tracking it in the backend is difficult, and it would complicate the APIs. We have a better approach now, one that is more explicit and easier to understand (though a bit more work): have the user flatten the inner two parts of the list with the Apply Rules tool or some other collection operation that we could potentially add.
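As a rough picture of what "flatten the inner two parts" means, the sketch below turns a list:list:list into a list:list whose inner level can then be reduced normally. It is not the Apply Rules implementation: the nested-dict representation, the `flatten_inner_two` name, and the way identifiers are joined with a separator are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical stand-in for a list:list:list collection:
# outer identifier -> middle identifier -> inner identifier -> dataset
Nested3 = Dict[str, Dict[str, Dict[str, str]]]


def flatten_inner_two(collection: Nested3, sep: str = "_") -> Dict[str, Dict[str, str]]:
    """Merge the two inner levels into one, keeping the outer level intact."""
    flattened: Dict[str, Dict[str, str]] = {}
    for outer_id, middle in collection.items():
        merged: Dict[str, str] = {}
        for middle_id, inner in middle.items():
            for inner_id, dataset in inner.items():
                # Join identifiers so the flattened elements stay distinguishable.
                merged[f"{middle_id}{sep}{inner_id}"] = dataset
        flattened[outer_id] = merged
    return flattened


if __name__ == "__main__":
    nested = {"patient1": {"day1": {"rep1": "d1.bam", "rep2": "d2.bam"}}}
    print(flatten_inner_two(nested))
```

After flattening, mapping over the outer list and reducing the (now single) inner list covers the case that would otherwise need special handling in the tool form.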

The thing we still definitely need though - is to be able to map a collection over a multi-input element - so right now there is one collection button for a multi-input element and that button reduces the collection (treats it as a set of datasets). There should instead be two buttons - one that does that and one that operates like the collection button on single inputs - and creates a job per dataset in the collection (and similar mapping semantics). There is no workaround for that and it is repeatedly requested.

@mvdbeek do you agree with this? I suspect it is a conclusion you reached quicker than me.

mvdbeek · 2018-05-16T14:56:02Z

That's a good summary, I agree completely.

> There should instead be two buttons - one that does that and one that operates like the collection button on single inputs - and creates a job per dataset in the collection (and similar mapping semantics). There is no workaround for that and it is repeatedly requested.

<3, that would avoid this terrible conditional that asks if you want to reduce or not.

We don't track workflow step inputs in any formal way in our model currently. This has resulted in some current hacks and prevents future enhancements. This commit splits WorkflowStepConnection into two models, WorkflowStepInput and WorkflowStepConnection - normalizing the previous table workflow_step_connection on input step and input name.

In terms of current hacks forced on us by confining all of tool state to a big JSON blob in the database - we have problems distinguishing keys and values when walking tool state. The more JSON blobs we store inside the giant tool state blob, the worse this problem gets. Take for instance checking for runtime parameters or the rules parameter values - these both use JSON blobs that aren't simple values, so it is hard to tell, looking at the tool state blob in the database or the workflow export, what is a key and what is a value. Tracking state as normalized inputs with default values and explicit attributes for runtime values should allow much more precise state definition and construction. This variant of the models would also potentially allow defining runtime values with non-tool default values (so default values defined for the workflow but still explicitly settable at runtime). The combinations of overriding defaults and defining runtime values were not representable before.

In terms of future enhancements, there is a lot we cannot track with the current models - such as map/reduce options for collection operations (galaxyproject#4623 (comment)). This should enable a lot of that.

Obviously there are a lot of attributes defined here that are not yet utilized, but I'm using most (all?) of them downstream in the CWL branch. I'd rather populate this table fully realized and fill in the implementation around it as work continues to stream in from the CWL branch - to keep things simple and avoid extra database migrations. But I understand if this feels like speculative complexity we want to avoid, despite the implementation being readily available for inspection downstream.
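The normalization described above can be pictured with a small sketch. Only the two model names come from the description; every field name, the flat row layout, and the `normalize` helper are hypothetical and should not be read as Galaxy's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional, Tuple


@dataclass
class WorkflowStepConnection:
    # Hypothetical fields: where the connected data comes from.
    output_step_id: int
    output_name: str


@dataclass
class WorkflowStepInput:
    # Hypothetical fields: one record per (step, input name) pair.
    step_id: int
    name: str
    default_value: Optional[Any] = None
    runtime_value: bool = False
    connections: List[WorkflowStepConnection] = field(default_factory=list)


# Assumed flat row: (input_step_id, input_name, output_step_id, output_name)
FlatRow = Tuple[int, str, int, str]


def normalize(rows: Iterable[FlatRow]) -> List[WorkflowStepInput]:
    """Group flat connection rows by input step and input name."""
    inputs: Dict[Tuple[int, str], WorkflowStepInput] = {}
    for in_step, in_name, out_step, out_name in rows:
        step_input = inputs.setdefault(
            (in_step, in_name), WorkflowStepInput(in_step, in_name)
        )
        step_input.connections.append(WorkflowStepConnection(out_step, out_name))
    return list(inputs.values())
```

With a record per input, per-input attributes such as a default value, a runtime flag, or a map/reduce choice have an obvious place to live, rather than being inferred from a JSON blob.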

jmchilton mentioned this issue Sep 15, 2017

MIRA v4.0 de novo assembler does not output a collection for collection input peterjc/galaxy_mira#3

Open

jmchilton mentioned this issue Sep 27, 2017

Tool Panel - Advanced Data Selection #4707

Open

10 tasks

jmchilton mentioned this issue Apr 11, 2018

MultiQC for list of pairs (of FastQC output) galaxyproject/tools-iuc#1658

Closed

jmchilton mentioned this issue Aug 3, 2018

Galaxy plotEnrichment wrapper (arguably) does not handle collections properly deeptools/deepTools#740

Open

jmchilton mentioned this issue Oct 10, 2018

Track workflow step input definitions in our model. #6850

Merged
