DSL2 - emit tuples with optional values #2678

Open
rcannood opened this issue Feb 25, 2022 · 4 comments

rcannood commented Feb 25, 2022

Usage scenario

I'd like to be able to return a tuple with optional elements. For example, by defining the output as tuple val(id), path("output.txt"), path("output2.txt", optional: true), I'd like a process to be able to emit an event ["foo", path("output.txt"), null].

The process and downstream processes can take a while to run, so using a multi-channel output in combination with a groupTuple() (See Attempt 3) is very undesirable.
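
To make the requested event shape concrete, here is a minimal plain-Groovy sketch of how a downstream closure might consume such an event. It is only an illustration: the helper name describe is hypothetical, and plain strings stand in for the Path objects Nextflow would emit.

```groovy
// Hypothetical downstream handler (not part of Nextflow).
// Strings stand in for the Path objects a real process would emit.
def describe(List event) {
    // Destructure the three tuple elements; out2 may be null.
    def (id, out1, out2) = event
    out2 == null ? "${id}: ${out1} only".toString()
                 : "${id}: ${out1} and ${out2}".toString()
}

assert describe(["foo", "output.txt", "output2.txt"]) == "foo: output.txt and output2.txt"
assert describe(["bar", "output.txt", null]) == "bar: output.txt only"
```

The point is that a null in the third position keeps the tuple arity fixed, so downstream code can destructure every event the same way.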

Suggested implementation

Probably this would require:

Reproducible examples

I made several attempts at getting this to run with the current implementation of Nextflow. To summarise:

  • Attempt 1: optional path in non-optional tuple → errors
  • Attempt 2: optional tuple → tuple with missing file is not emitted
  • Attempt 3: multi-channel output followed by a groupTuple → introduces a bottleneck in workflows with long execution times
  • Attempt 4: a messy workaround solution to this problem

Attempt 1: optional path in tuple

Because of TupleOutParam.groovy#L103-L105, this optional value is overridden by the tuple's value for 'optional', namely false.

If I run the following code, Nextflow produces an error when output2.txt is missing.

Attempt 1 reprex
nextflow.enable.dsl=2

process test_process1 {
  input:
    tuple val(id)
  output:
    tuple val(id), path("output.txt"), path("output2.txt", optional: true)
  script:
    """
    echo $id > output.txt
    if [[ "$id" == "foo" ]]; then
      echo $id > output2.txt
    fi
    """
}

workflow {
  Channel.fromList( ["foo", "bar"] )
    | view { "input: ${it}" }
    | test_process1
    | view { "output: ${it}" }
}

$ NXF_VER=21.10.6 nextflow run test_outputs_opt1.nf
input: foo
input: bar
output: [foo, work/81/e866d5e329c9ac9980a0c9313d347b/output.txt, work/81/e866d5e329c9ac9980a0c9313d347b/output2.txt]
[8c/e39e04] NOTE: Missing output file(s) `output2.txt` expected by process `test_process1 (2)` -- Error is ignored

Attempt 2: make the whole tuple optional

By making the whole tuple optional, Nextflow doesn't produce an error anymore, but my whole tuple is removed, which is undesirable.

Attempt 2 reprex
nextflow.enable.dsl=2

process test_process1 {
  input:
    tuple val(id)
  output:
    tuple val(id), path("output.txt"), path("output2.txt") optional true
  script:
    """
    echo $id > output.txt
    if [[ "$id" == "foo" ]]; then
      echo $id > output2.txt
    fi
    """
}

workflow {
  Channel.fromList( ["foo", "bar"] )
    | view { "input: ${it}" }
    | test_process1
    | view { "output: ${it}" }
}

$ NXF_VER=21.10.6 nextflow run test_outputs_opt2.nf
input: foo
input: bar
output: [foo, work/95/0e07ee0b94834d4587509b152aa354/output.txt, /home/rcannoodwork/95/0e07ee0b94834d4587509b152aa354/output2.txt]

Attempt 3: multichannel output

This approach is what is proposed in #1980. However, having to use 'groupTuple()' to merge the multichannel output back into a single event is also undesirable, as now the whole Channel needs to be executed before any events can be emitted downstream. Note that setting size: 2 doesn't work in this case, since some tuples should have one element, others two.

Attempt 3 reprex
nextflow.enable.dsl=2

process test_process2 {
  input:
    tuple val(id)
  output:
    tuple val(id), val("output1"), path("output.txt")
    tuple val(id), val("output2"), path("output2.txt") optional true
  script:
    """
    echo $id > output.txt
    if [[ "$id" == "foo" ]]; then
      echo $id > output2.txt
    fi
    """
}

workflow {
  Channel.fromList( ["foo", "bar"] )
    | view { "input: ${it}" }
    | test_process2
    | mix
    | groupTuple(by: 0)
    | map{ [ it[0], [it[1], it[2]].transpose().collectEntries() ]}
    | view { "output: ${it}" }
}

$ NXF_VER=21.10.6 nextflow run test_outputs_opt3.nf
input: foo
input: bar
output: [bar, [output1:work/9c/97b3a2884f97594532a19923e6c748/output.txt]]
output: [foo, [output1:work/60/984231826c9a9cc2a1e1cf29e16fdb/output.txt, output2:work/60/984231826c9a9cc2a1e1cf29e16fdb/output2.txt]]
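
The map step in the workflow above can be tried in isolation. Below is a minimal plain-Groovy sketch, assuming each event coming out of groupTuple(by: 0) has the shape [id, [names...], [files...]]. The helper name toNamedOutputs is hypothetical, and strings stand in for Path objects.

```groovy
// Hypothetical stand-in for the map closure applied after groupTuple(by: 0).
// event = [id, [names...], [files...]]; strings replace real Path objects.
def toNamedOutputs(List event) {
    // transpose() pairs each name with its file; collectEntries() turns the pairs into a map.
    [event[0], [event[1], event[2]].transpose().collectEntries()]
}

def foo = toNamedOutputs(["foo", ["output1", "output2"], ["output.txt", "output2.txt"]])
def bar = toNamedOutputs(["bar", ["output1"], ["output.txt"]])

assert foo == ["foo", [output1: "output.txt", output2: "output2.txt"]]
assert bar == ["bar", [output1: "output.txt"]]
```

Note that the "bar" result simply has no output2 key, which is why a fixed size: 2 cannot be used for the grouping.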

Attempt 4: add junk to output

By adding a file known to exist (e.g. ".command.sh") to the output, I can force the Channel to always return a tuple. This works, but the code looks quite messy and I need to do postprocessing to remove the additional file.

Attempt 4 reprex
nextflow.enable.dsl=2

process test_process3 {
  input:
    tuple val(id)
  output:
    tuple val(id), path{[".command.sh", "output.txt"]}, path{[".command.sh", "output2.txt"]}
  script:
    """
    echo $id > output.txt
    if [[ "$id" == "foo" ]]; then
      echo $id > output2.txt
    fi
    """
}

workflow {
  Channel.fromList( ["foo", "bar"] )
    | view { "input: ${it}" }
    | test_process3
    | map { output ->
      // 'def' keeps these variables local to the closure
      def map = [["output1", "output2"], output.drop(1)].transpose()
      def map_without_dummy = map.collectEntries{ key, out ->
        if (out instanceof List && out.size() > 2) {
          [ key, out.drop(1) ]
        } else if (out instanceof List && out.size() == 2) {
          [ key, out[1] ]
        } else {
          [ key, null ]
        }
      }
      [ output[0], map_without_dummy ]
    }
    | view { "output: ${it}" }
}

$ NXF_VER=21.10.6 nextflow run test_outputs_opt4.nf
input: foo
input: bar
output: [foo, [output1:work/96/a51f95280ee3332f50b6b05a12596b/output.txt, output2:work/96/a51f95280ee3332f50b6b05a12596b/output2.txt]]
output: [bar, [output1:work/ec/87149bfea74975d37307d6a115c812/output.txt, output2:null]]
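
The post-processing closure from Attempt 4 can also be checked outside Nextflow. This is a minimal plain-Groovy sketch under a few assumptions: strings stand in for Path objects, the dummy file is always the first element of each list, and a missing optional file leaves only the single dummy value (not a list). The helper name stripDummy is hypothetical.

```groovy
// Hypothetical helper mirroring the map closure from Attempt 4.
// Assumes the dummy entry (".command.sh") always comes first in each list.
def stripDummy(List output, List keys = ["output1", "output2"]) {
    def cleaned = [keys, output.drop(1)].transpose().collectEntries { key, out ->
        if (out instanceof List && out.size() > 2) {
            [key, out.drop(1)]      // several real files: drop the dummy, keep the rest
        } else if (out instanceof List && out.size() == 2) {
            [key, out[1]]           // one real file: unwrap it
        } else {
            [key, null]             // only the dummy matched: the file is missing
        }
    }
    [output[0], cleaned]
}

def foo = stripDummy(["foo", [".command.sh", "output.txt"], [".command.sh", "output2.txt"]])
def bar = stripDummy(["bar", [".command.sh", "output.txt"], ".command.sh"])

assert foo == ["foo", [output1: "output.txt", output2: "output2.txt"]]
assert bar == ["bar", [output1: "output.txt", output2: null]]
```
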
@jorgeaguileraseqera jorgeaguileraseqera self-assigned this May 17, 2022
jorgeaguileraseqera added a commit that referenced this issue Oct 21, 2022
closes #2678

Signed-off-by: Jorge Aguilera <jorge.aguilera@seqera.io>
@stale

stale bot commented Jan 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@rcannood
Author

Out of curiosity, is this issue still being worked on? Is it already possible to have a nullable optional output?

Thanks! 🙇

@bentsherman
Member

Hi @rcannood , we've gone through a few sketches since you first submitted the issue. I think we ran into some tricky limitations under the hood that require more fundamental improvements to support nullable values properly.

You can see the current state of development here: #4553 . Even this PR is likely not how it will look in the end, but it can give you an idea of where we are heading. Basically, instead of trying to patch the nullable option into the current syntax, we are working on a broader "static type" syntax that should also cover nullable values.

Thanks for your patience in the meantime. It ended up being a deeper rabbit hole than we thought 😅

@rcannood
Author

rcannood commented Sep 24, 2024

Thanks for the update @bentsherman ! Looking forward to static types :)
