Enhance cwltoil to support SoftwareRequirements & BioContainers. #1943

Merged (1 commit) on Oct 31, 2017

jmchilton
Contributor

This enables the reproducibility stack described in this preprint (https://www.biorxiv.org/content/early/2017/10/11/200683) and presented at BOSC 2017 (http://jmchilton.github.io/writing/bosc2017slides/biocontainers.html) under Toil. Concretely, this enables the same options in cwltoil that were added to cwltool in common-workflow-language/cwltool#214, including --beta-conda-dependencies, --beta-dependency-resolvers-configuration, and --beta-use-biocontainers. The first two of these are documented in depth in cwltool's README (https://github.com/common-workflow-language/cwltool/#leveraging-softwarerequirements-beta).

Here I will quickly review a couple of the available options against test examples available in cwltool's tests directory using this branch of Toil.

git clone https://github.com/common-workflow-language/cwltool.git
cd cwltool

From here we can quickly demonstrate installation and resolution of CWL SoftwareRequirement hints via Conda, using the tests/seqtk_seq.cwl tool. This tool doesn't define an explicit DockerRequirement, but it does define the following SoftwareRequirement in its hints:

hints:
  SoftwareRequirement:
    packages:
    - package: seqtk
      version:
      - r93

We can try this tool out with cwltoil using the following command. By default the seqtk binary is probably not on our PATH, so we expect the tool to fail:

cwltoil tests/seqtk_seq.cwl tests/seqtk_seq_job.json

This should result in a tool execution failure. We can then instruct cwltoil to install the required package from Bioconda into an isolated environment and use it as needed by passing it the --beta-conda-dependencies flag as follows:

cwltoil --beta-conda-dependencies tests/seqtk_seq.cwl tests/seqtk_seq_job.json

The tool should now be successful.
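
The decision between the two invocations above can be sketched as a small shell check. This is only an illustration: `conda_flag_if_missing` is a hypothetical helper, not part of Toil or cwltool, and the final command is printed rather than executed so the sketch works without Toil installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Toil or cwltool): print the Conda flag
# only when the named binary cannot be found on PATH.
conda_flag_if_missing() {
    if command -v "$1" >/dev/null 2>&1; then
        # Binary is already available; no extra flag needed.
        printf ''
    else
        printf '%s\n' '--beta-conda-dependencies'
    fi
}

flag=$(conda_flag_if_missing seqtk)

# Print the command instead of running it; ${flag:+...} adds the flag
# (and a trailing space) only when it is non-empty.
echo "cwltoil ${flag:+$flag }tests/seqtk_seq.cwl tests/seqtk_seq_job.json"
```

On a machine without seqtk installed, the printed command matches the --beta-conda-dependencies invocation shown above.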

The Conda support can be endlessly tweaked, but the defaults target the best-practice Conda channels that work well for the Galaxy project. Additional SoftwareRequirement resolution options are available, including targeting Software Modules, lmod, Homebrew, and simple scripts called "Galaxy packages". All of these options can be specified and configured with a YAML file passed to cwltoil using the --beta-dependency-resolvers-configuration option instead of the simple shortcut --beta-conda-dependencies. The cwltool documentation walks through a few examples of adapting infrastructure to tools and tools to package managers. Reference documentation is available in galaxy-lib's documentation (http://galaxy-lib.readthedocs.io/en/latest/topics/dependency_resolution.html).
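
As a rough sketch of what such a YAML file might look like, the snippet below writes a two-resolver configuration and prints the matching cwltoil command. The resolver type names (`galaxy_packages`, `conda`) and the `base_path` key are recalled from the examples in cwltool's README and should be checked against the galaxy-lib reference documentation; the `./my_galaxy_packages` path and the file name are purely hypothetical.

```shell
#!/bin/sh
# Assumed layout: a YAML list of resolvers, each identified by a "type" key.
# Key names should be verified against the galaxy-lib documentation.
conf="${TMPDIR:-/tmp}/resolvers.yml"

cat > "$conf" <<'EOF'
# Hypothetical local package directory, listed before the Conda resolver.
- type: galaxy_packages
  base_path: ./my_galaxy_packages
- type: conda
EOF

# Print the command rather than running it, so this works without Toil installed.
echo "cwltoil --beta-dependency-resolvers-configuration $conf tests/seqtk_seq.cwl tests/seqtk_seq_job.json"
```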

In addition to options that allow configuring tool execution environments, containers themselves can be discovered and/or built from these software requirements. The Biocontainers project (https://github.com/BioContainers, previously Biodocker) contains a registry we use for this purpose. Every version of every Bioconda package has a corresponding best-practice (very lightweight, very small) Docker container on quay.io. There are currently over 3000 such containers.

Continuing with the example above, the new --beta-use-biocontainers flag instructs cwltoil to fetch the corresponding Biocontainers container from quay.io automatically, or to build one to use locally. Building is required, for instance, for "fat" tools that declare multiple software requirements.

cwltoil --beta-use-biocontainers tests/seqtk_seq.cwl tests/seqtk_seq_job.json

These containers contain the same binaries that the package would provide locally (outside of Docker). This technique therefore allows cross-platform reproducibility/remixability across cwltool, cwltoil, Galaxy, and the CLI - both inside and outside of containers.

@jmchilton
Contributor Author

I'll add a couple tests based on the examples above - just found cwlTest.py and it looks like a very clean interface for testing.

@jmchilton jmchilton force-pushed the cwldeps2 branch 6 times, most recently from 17e89fe to 339d7b4 Compare October 30, 2017 17:30
@jmchilton
Contributor Author

I finally got my two new tests to pass. It looks like the current failure is a transient one unrelated to these changes, right? (It didn't crop up on the other runs for this PR.) I don't have a retry button on your Jenkins server - do you want me to rebase with an arbitrary change to rerun the tests, or is it okay the way it is?

@ejacox
Contributor

ejacox commented Oct 30, 2017

Jenkins, test this please.

@ejacox
Contributor

ejacox commented Oct 30, 2017

Tests passed, but got this error:
Checking if your working copy is clean ...
You have are untracked files:
cwltool_deps/
make: *** [check_clean_working_copy] Error 1

From the end of:
http://jenkins.cgcloud.info/job/toil-pull-requests/3386/console

@jmchilton
Contributor Author

@ejacox I guess this is expected: the tests run from that directory, and the Conda packages are installed into the working directory by default (cwltool does the same thing). I've just added cwltool_deps/ to .gitignore for now. If there is something more "Toil-ish" to be done, let me know. I can add an override point for the tests while keeping the current default for the CLI, I can change the default to install things into $HOME, or I could try to run the CWL tests with a different CWD.
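
The "different CWD" option could look roughly like the following. It is a minimal sketch, not what this PR implements: `run_in_scratch` is a hypothetical wrapper, and it relies on the behaviour described above, namely that dependencies are installed relative to the working directory by default.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command from a fresh scratch directory so that
# anything written relative to the working directory (such as cwltool_deps/)
# stays out of the source checkout.
run_in_scratch() {
    scratch=$(mktemp -d) || return 1
    # The subshell keeps the caller's working directory unchanged.
    (cd "$scratch" && "$@")
}

repo=$(pwd)

# Demonstration with a harmless command: prints the scratch directory.
run_in_scratch pwd

# With Toil installed, the same wrapper could be used as, for example:
#   run_in_scratch cwltoil --beta-conda-dependencies \
#       "$repo/tests/seqtk_seq.cwl" "$repo/tests/seqtk_seq_job.json"
```

Absolute paths are used for the tool and job files because relative paths would no longer resolve from the scratch directory.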

@ejacox
Contributor

ejacox commented Oct 31, 2017

Thank you @jmchilton

@ejacox ejacox merged commit 1e1deed into DataBiosphere:master Oct 31, 2017
DailyDreaming pushed a commit that referenced this pull request Dec 14, 2017
Enhance cwltoil to support SoftwareRequirements & BioContainers.