Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behaviour for glob process outputs #2425

Closed
cjw85 opened this issue Nov 1, 2021 · 11 comments
Closed

Inconsistent behaviour for glob process outputs #2425

cjw85 opened this issue Nov 1, 2021 · 11 comments

Comments

@cjw85
Copy link
Contributor

cjw85 commented Nov 1, 2021

Bug report

Expected behavior and actual behavior

According to, https://www.nextflow.io/docs/latest/process.html#multiple-output-files, a glob pattern can be used to emit a list of multiple items from a process into an output channel. What is not clear from the documentation is that if only a single file matches the glob, a length-1 list is not emitted but rather the plain value.

Returning different types from a function (process in this instance) is typically considered bad practice and burdens the caller with having to perform introspection. In the context of of Nextflow this means any operator on a channel first needs to check the returned type, possibly wrap the item thats assumed to be a list, and then safely perform operations such as mapping over the list.

Steps to reproduce the problem

nextflow.enable.dsl=2

process touchOne {
    input:
        val name
    output:
        path "item*"
    script:
        """
        echo "Hello $name"
        touch item1
        """
}

process touchTwo {
    input:
        val name
    output:
        path "item*"
    script:
        """
        echo "Hello $name"
        touch item1
        touch item2
        """
}

workflow {
    touchOne(1).view()
    touchTwo(2).view()
}

Program output

(Copy and paste here output produced by the failing execution. Please highlight it as a code block. Whenever possible upload the .nextflow.log file.)

Environment

  • Nextflow version: version 20.10.0 build 5430
  • Java version: OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~16.04-b08)
  • Operating system: Linux
  • Bash version: GNU bash, version 4.3.48(1)-release

Additional context

I can see there being an argument about making a breaking change to Nextflow by enforcing that all globs return a list. Despite Nextflow's eschewing of heavy pattern-matching of file artifacts, I suspect the globbing is obused a lot to emit single files such that a change to emitting lists will break much exisiting code.

However, I do think that at least optionally globbing outputs should be forceable to be lists. I suspect theres some corner cases around what a lenth-0 list might mean and how thats handled. Minimally the documentation needs updating to highlight that the returned type is not always a list.

@stale
Copy link

stale bot commented Mar 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 31, 2022
@stale stale bot closed this as completed Jun 12, 2022
@cjw85
Copy link
Contributor Author

cjw85 commented Jun 12, 2022

I'm not aware of any changes in consideration of this issue.

@ryankelley-lm
Copy link

I agree that this can make workflows brittle, can this issue be re-opened?

@abhi18av abhi18av reopened this Jun 21, 2022
@stale stale bot removed the stale label Jun 21, 2022
@bentsherman
Copy link
Member

I agree with your sentiment. It is poor form for a function to return a different type based on the outcome.

Perhaps we could add an option to path so that users can opt out if they want:

    output:
        path "item*", squeeze: false

The squeeze option should be true by default, but if users want a glob to always return a list then they can set the option to false.

Inspired by numpy.squeeze

@cjw85
Copy link
Contributor Author

cjw85 commented Jun 22, 2022

That might be a good compromise if you aren't willing to break backwards compatibility.

For my money I would break compatibility. I've not been in the Nextflow game long compared to many others, but I came across this bug the first time I tried to write a non-linear (but still fairly trivial) workflow with a process used to create dynamic inputs for downstream processing. This is the sort of dynamism that Nextflow espouses as a key benefit over say snakemake (where dynamic rules are historically rather clunky). It's almost a must in such a system to have well-typed functions, else we're endlessly boxing return values.

In trivial cases best practice is presumably not to use a glob but more tightly match a single output directly by name (either fixed or through a variable). In code review I would definitely question a:

output:
    path "item*"

as being lazy (or a potential error) and request a replacement to something more specific.

As soon as you are outside the realm of the trivial, it seems to me you will always want squeeze: false. The only use case I can immediately think of is some program outside of your control generating a (truly) random file name --- which is rather silly in itself, maybe timestamped files are a better example. I'd love to see examples of how these globs are used in practice and how many times they could/should be replaced with something more specific.

So for me a default of squeeze: false makes more sense, because what I perceive as a justifiable use case is to expect a multiplicity of values. This breaks backwards compatibility, but only in the sense that there are likely better ways to achieve what users were doing. Being able to set squeeze: true remains for edge cases like the curious examples above.

@bentsherman
Copy link
Member

The thing is that Nextflow already has a pretty aggressive update schedule, and it is known as such. I think people tolerate it because it's easy to switch Nextflow versions, but even so, we try not to push people herder than we already do. If we implement the squeeze option, we may very well change the default to false after a few releases, but there has to be a transition period where users are made aware of a feature and given the opportunity to use it before it becomes the default.

@bentsherman
Copy link
Member

Duplicate of #1236

@bentsherman bentsherman marked this as a duplicate of #1236 Jul 12, 2022
@pditommaso
Copy link
Member

pditommaso commented Jul 12, 2022

Well spotted. Here the plan was/is to extend the output (and input) declaration with the cardinality of the expected file to be captured e.g. cardinality: 1 for just one element , cardinality: 2 exactly two, cardinality: 1..* one or more, etc.

This would serve both to resolve the ambiguity of the result type and to validate the number of files expected.

update: the cardinality is a mere example, a better naming should be found

@stale
Copy link

stale bot commented Jan 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale

This comment was marked as off-topic.

@stale stale bot added the stale label Aug 12, 2023
@bentsherman bentsherman added pinned and removed stale labels Aug 13, 2023
@bentsherman
Copy link
Member

Fixed by 42504d3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants