Beta support for configurable dependency resolution & Biocontainers. #214

Merged
merged 23 commits into from
Jul 7, 2017
85158a3
Beta support for configurable dependency resolution & Biocontainers.
jmchilton Oct 18, 2016
ee83315
galaxy-lib as a setup.py plugin
jmchilton Jun 27, 2017
1e6ea02
Implement and add example for specifications in SoftwareRequirements.
jmchilton Jun 27, 2017
ef3e6f2
Fix --no-container, add example combining Conda + explicit docker opt…
jmchilton Jun 27, 2017
432ad4c
Fix whitespace deleted in previous rebase.
jmchilton Jun 27, 2017
e7059a2
Fixup seqtk seq example w/wrong name.
jmchilton Jun 27, 2017
913f617
Example using environment-like modules.
jmchilton Jun 27, 2017
6b929f1
Fix Python 3 incompatibility in PYTHON_RUN_SCRIPT.
jxtx Jun 27, 2017
79bae7d
Fix stderr/stdout handling in external bash script processing of jobs.
jmchilton Jun 27, 2017
b19f60d
Sorry for the XML - we didn't need to do that.
jmchilton Jun 27, 2017
461b10f
Demonstrate mapping of SoftwareRequirements to local resources.
jmchilton Jun 27, 2017
368f888
Updated typeshed defs for Galaxy-lib?
jmchilton Jun 27, 2017
02ecac6
E251, E241
mr-c Jun 29, 2017
a55aee1
E261
mr-c Jun 29, 2017
0823ce8
add default value for find_default_container
mr-c Jun 29, 2017
3ed134d
Update job.py
mr-c Jun 29, 2017
15a6ab4
Fix null handling in redone find_default_container usage.
jmchilton Jul 3, 2017
dac661e
Refine type signatures a bit for galaxy-lib.
jmchilton Jul 3, 2017
0b6f27e
Refactor utilities for dealing with SoftwareRequirements into own mod…
jmchilton Jul 3, 2017
d473e3a
Furter refine type descriptions for galaxy-lib.
jmchilton Jul 3, 2017
75c8d81
I suppose this is needed after 0b6f27e83064e80b6a2ce21f9a3d2be3f21cba9f.
jmchilton Jul 7, 2017
5522e88
Docs for resolving SoftwareRequirements.
jmchilton Jul 7, 2017
fd2ac01
Merge branch 'master' into galaxy_deps
mr-c Jul 7, 2017
206 changes: 206 additions & 0 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -139,6 +139,212 @@ The easiest way to use cwltool to run a tool or workflow from Python is to use a

# result["out"] == "foo"

Leveraging SoftwareRequirements (Beta)
--------------------------------------

CWL tools may be decorated with ``SoftwareRequirement`` hints that cwltool
may in turn use to resolve to packages in various package managers or
dependency management systems such as `Environment Modules
<http://modules.sourceforge.net/>`__.

Using ``SoftwareRequirement`` hints with cwltool requires an optional
dependency, so be sure to specify the ``deps`` extra when
installing cwltool. For instance::

  $ pip install 'cwltool[deps]'

Installing cwltool in this fashion enables several new command line options.
Member
Hmm.. should we hide them when the deps extra isn't modified? What happens if they are called and galaxy-lib is missing?

Contributor Author
I have clarified this behavior with #459 and improved it a bit.

The most general of these options is ``--beta-dependency-resolvers-configuration``.
This option allows one to specify a dependency resolvers configuration file.
This file may be specified as either XML or YAML and very simply describes various
plugins to enable to "resolve" ``SoftwareRequirement`` dependencies.

To discuss some of these plugins and how to configure them, first consider the
following ``hint`` definition for an example CWL tool.

.. code:: yaml

  SoftwareRequirement:
    packages:
    - package: seqtk
      version:
      - r93

Now imagine deploying cwltool on a cluster with Environment Modules installed
and that a ``seqtk`` module is available at version ``r93``. This means cluster
users likely won't have the ``seqtk`` binary on their ``PATH`` by default, but after
loading this module with the command ``modulecmd sh load seqtk/r93``, ``seqtk`` is
available on the ``PATH``. A simple dependency resolvers configuration file, called
``dependency-resolvers-conf.yml`` for instance, that would enable cwltool to source
the correct module environment before executing the above tool would simply be:

.. code:: yaml

  - type: module

The outer list indicates that one plugin is being enabled; the plugin parameters are
defined as a dictionary for this one list item. The plugin above has only one required
parameter, ``type``, which defines the plugin type. This parameter
is required for all plugins. The available plugins and the parameters
available for each are documented (incompletely) `here
<https://docs.galaxyproject.org/en/latest/admin/dependency_resolvers.html>`__.
Unfortunately, this documentation is in the context of Galaxy tool ``requirement`` s instead of CWL ``SoftwareRequirement`` s, but the concepts map fairly directly.
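
The structure described above (a list of plugin dictionaries, each with a required ``type`` key) can be checked in a few lines of Python. This is only a minimal sketch and not the galaxy-lib loader: the function name ``validate_resolvers_conf`` is hypothetical, and the configuration is written as a Python literal rather than parsed from YAML so that the example has no third-party dependencies.

```python
def validate_resolvers_conf(conf):
    """Return the plugin types listed in a parsed resolvers configuration.

    Hypothetical helper for illustration only; `conf` stands for the
    already-parsed YAML document, i.e. a list of dictionaries.
    """
    if not isinstance(conf, list):
        raise ValueError("configuration must be a list of plugin entries")
    types = []
    for index, entry in enumerate(conf):
        # Every plugin entry must be a dictionary carrying a 'type' key.
        if not isinstance(entry, dict) or "type" not in entry:
            raise ValueError("entry %d is missing the required 'type' key" % index)
        types.append(entry["type"])
    return types


# Python equivalent of the YAML document "- type: module".
module_conf = [{"type": "module"}]
# An entry with one additional plugin parameter.
packages_conf = [{"type": "galaxy_packages", "base_path": "./tests/test_deps_env"}]

print(validate_resolvers_conf(module_conf))
print(validate_resolvers_conf(packages_conf))
```

Any parameter other than ``type`` is passed over untouched here, since which parameters are valid depends on the individual plugin.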

cwltool is distributed with an example of such a seqtk tool and a corresponding sample
job. It could be executed from the cwltool root using a dependency resolvers
configuration file such as the above one using the command::

  cwltool --beta-dependency-resolvers-configuration /path/to/dependency-resolvers-conf.yml \
      tests/seqtk_seq.cwl \
      tests/seqtk_seq_job.json

This example demonstrates both that cwltool can leverage
existing software installations and also handle workflows with dependencies
on different versions of the same software and libraries. However, the above
example does require an existing module setup, so it is impossible to test this example
"out of the box" with cwltool. For a more isolated test that demonstrates all
the same concepts, the resolver plugin type ``galaxy_packages`` can be used.

"Galaxy packages" are a lighter weight alternative to Environment Modules that are
really just defined by a way to lay out directories into packages and versions
to find little scripts that are sourced to modify the environment. They have
been used for years in the Galaxy community to adapt Galaxy tools to cluster
environments but require neither knowledge of Galaxy nor any special tools to
set up. These should work just fine for CWL tools.

The cwltool source code repository's test directory is set up with a very simple
directory that defines a set of "Galaxy packages" (but really just defines one
package named ``random-lines``). The directory layout is simply::

  tests/test_deps_env/
    random-lines/
      1.0/
        env.sh

If the ``galaxy_packages`` plugin is enabled and pointed at the
``tests/test_deps_env`` directory in cwltool's root and a ``SoftwareRequirement``
such as the following is encountered:

.. code:: yaml

  hints:
    SoftwareRequirement:
      packages:
      - package: 'random-lines'
        version:
        - '1.0'

Then cwltool will simply find that ``env.sh`` file and source it before executing
the corresponding tool. That ``env.sh`` script is only responsible for modifying
the job's ``PATH`` to add the required binaries.
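
The lookup described above amounts to joining the base path, package name, and version, then checking for ``env.sh``. The following is a minimal sketch under that assumption; ``find_env_script`` is a hypothetical name, and the actual galaxy-lib resolver may handle additional cases that are not covered here.

```python
import os
import tempfile


def find_env_script(base_path, name, version):
    """Return the path to <base_path>/<name>/<version>/env.sh, or None.

    Hypothetical helper mirroring the directory layout shown above.
    """
    candidate = os.path.join(base_path, name, version, "env.sh")
    return candidate if os.path.isfile(candidate) else None


# Build a throwaway copy of the layout so the sketch runs anywhere.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "random-lines", "1.0"))
with open(os.path.join(base, "random-lines", "1.0", "env.sh"), "w") as handle:
    # Placeholder content; a real env.sh would extend PATH with real binaries.
    handle.write('export PATH="$PATH:/hypothetical/bin"\n')

found = find_env_script(base, "random-lines", "1.0")
missing = find_env_script(base, "random-lines", "2.0")
print(found)
print(missing)
```

A requirement for a version that is not laid out on disk simply yields no script, leaving the tool to run with the unmodified environment or to be handled by another resolver.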

This is a full example that works since resolving "Galaxy packages" has no
external requirements. Try it out by executing the following command from cwltool's
root directory::

  cwltool --beta-dependency-resolvers-configuration tests/test_deps_env_resolvers_conf.yml \
      tests/random_lines.cwl \
      tests/random_lines_job.json

The resolvers configuration file in the above example was simply:

.. code:: yaml

  - type: galaxy_packages
    base_path: ./tests/test_deps_env

It is possible that the ``SoftwareRequirement`` s in a given CWL tool will not
match the module names for a given cluster. Such requirements can be re-mapped
to specific deployed packages and/or versions using another file specified using
the resolver plugin parameter ``mapping_files``. We will
demonstrate this using ``galaxy_packages``, but the concepts apply equally well
to Environment Modules or Conda packages (described below), for instance.

Consider the resolvers configuration file
(``tests/test_deps_env_resolvers_conf_rewrite.yml``):

.. code:: yaml

  - type: galaxy_packages
    base_path: ./tests/test_deps_env
    mapping_files: ./tests/test_deps_mapping.yml

And the corresponding mapping configuration file (``tests/test_deps_mapping.yml``):

.. code:: yaml

  - from:
      name: randomLines
      version: 1.0.0-rc1
    to:
      name: random-lines
      version: '1.0'

This says that if cwltool encounters a requirement of ``randomLines`` at version
``1.0.0-rc1`` in a tool, it should rewrite the requirement for our specific plugin as
``random-lines`` at version ``1.0``. cwltool has such a test tool called ``random_lines_mapping.cwl``
that contains such a source ``SoftwareRequirement``. To try out this example with
mapping, execute the following command from the cwltool root directory::

  cwltool --beta-dependency-resolvers-configuration tests/test_deps_env_resolvers_conf_rewrite.yml \
      tests/random_lines_mapping.cwl \
      tests/random_lines_job.json
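
The rewrite that a mapping file describes can be sketched as a lookup over ``from``/``to`` rules. The code below is an illustration only: ``apply_mapping`` is a hypothetical name, it assumes a rule matches when both name and version are equal, and the matching logic in galaxy-lib may be more permissive than this.

```python
def apply_mapping(requirement, mapping_rules):
    """Rewrite a (name, version) requirement using the first matching rule.

    Hypothetical helper: each rule is a dict with "from" and "to" entries
    shaped like the mapping file above. Assumes exact matching on both
    name and version; unmatched requirements are returned unchanged.
    """
    name, version = requirement
    for rule in mapping_rules:
        source, target = rule["from"], rule["to"]
        if source["name"] == name and source["version"] == version:
            return (target["name"], target["version"])
    return requirement


# Python equivalent of tests/test_deps_mapping.yml as shown above.
rules = [{
    "from": {"name": "randomLines", "version": "1.0.0-rc1"},
    "to": {"name": "random-lines", "version": "1.0"},
}]

print(apply_mapping(("randomLines", "1.0.0-rc1"), rules))
print(apply_mapping(("seqtk", "r93"), rules))
```

Requirements without a matching rule pass through untouched, so a mapping file only needs entries for the packages whose names or versions differ on the local system.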

The previous examples demonstrated leveraging existing infrastructure to
provide requirements for CWL tools. If instead a real package manager is used,
cwltool has the opportunity to install requirements as needed. While initial
support for Homebrew/Linuxbrew plugins is available, the most developed such
plugin is for the `Conda <https://conda.io/docs/#>`__ package manager. Conda has the nice properties
of allowing multiple versions of a package to be installed simultaneously,
not requiring elevated permissions to install Conda itself or packages using
Conda, and being cross-platform. For these reasons, cwltool may run as a normal
user, install its own Conda environment, and manage multiple versions of Conda packages
on both Linux and Mac OS X.

The Conda plugin can be endlessly configured, but a sensible set of defaults
that has proven a powerful stack for dependency management within the Galaxy tool
development ecosystem can be enabled by simply passing cwltool the
``--beta-conda-dependencies`` flag.

With this we can use the seqtk example above without Docker and without
any externally managed services - cwltool should install everything it needs
and create an environment for the tool. Try it out with the following command::

  cwltool --beta-conda-dependencies tests/seqtk_seq.cwl tests/seqtk_seq_job.json

The CWL specification allows URIs to be attached to ``SoftwareRequirement`` s
that allow disambiguation of package names. If the mapping files described above
allow deployers to adapt tools to their infrastructure, this mechanism allows
tools to adapt their requirements to multiple package managers. To demonstrate
this within the context of the seqtk example, we can simply break the package name we
use and then specify a specific Conda package as follows:

.. code:: yaml

  hints:
    SoftwareRequirement:
      packages:
      - package: seqtk_seq
        version:
        - '1.2'
        specs:
        - https://anaconda.org/bioconda/seqtk
        - https://packages.debian.org/sid/seqtk

The example can be executed using the command::

  cwltool --beta-conda-dependencies tests/seqtk_seq_wrong_name.cwl tests/seqtk_seq_job.json
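
To illustrate how a ``specs`` URI can disambiguate a package name, the sketch below extracts a channel and package name from an ``anaconda.org`` URI. This is an assumption-laden illustration, not the galaxy-lib implementation: ``conda_target_from_specs`` is a hypothetical name, and the code assumes URIs of the form ``https://anaconda.org/<channel>/<package>``.

```python
from urllib.parse import urlparse


def conda_target_from_specs(specs):
    """Return (channel, package) from the first anaconda.org URI, else None.

    Hypothetical helper; assumes URIs shaped like
    https://anaconda.org/<channel>/<package>.
    """
    for spec in specs:
        parsed = urlparse(spec)
        # Drop empty segments produced by leading or trailing slashes.
        parts = [part for part in parsed.path.split("/") if part]
        if parsed.netloc == "anaconda.org" and len(parts) == 2:
            return (parts[0], parts[1])
    return None


specs = [
    "https://anaconda.org/bioconda/seqtk",
    "https://packages.debian.org/sid/seqtk",
]
print(conda_target_from_specs(specs))
```

With the hint above, the deliberately wrong package name ``seqtk_seq`` no longer matters to a Conda-aware resolver, because the URI identifies the intended package independently of the name.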

The plugin framework for managing resolution of these software requirements
is maintained as part of `galaxy-lib <https://github.com/galaxyproject/galaxy-lib>`__ - a small, portable subset of the Galaxy
project. More information on configuration and implementation can be found
at the following links:

- `Dependency Resolvers in Galaxy <https://docs.galaxyproject.org/en/latest/admin/dependency_resolvers.html>`__
- `Conda for [Galaxy] Tool Dependencies <https://docs.galaxyproject.org/en/latest/admin/conda_faq.html>`__
- `Mapping Files - Implementation <https://github.com/galaxyproject/galaxy/commit/495802d229967771df5b64a2f79b88a0eaf00edb>`__
- `Specifications - Implementation <https://github.com/galaxyproject/galaxy/commit/81d71d2e740ee07754785306e4448f8425f890bc>`__
- `Initial cwltool Integration Pull Request <https://github.com/common-workflow-language/cwltool/pull/214>`__

Cwltool control flow
--------------------
Expand Down
2 changes: 2 additions & 0 deletions cwltool/builder.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,8 @@ def __init__(self): # type: () -> None
# Will be default "no_listing" for CWL v1.1
self.loadListing = "deep_listing" # type: Union[None, str]

self.find_default_container = None # type: Callable[[], Text]

def bind_input(self, schema, datum, lead_pos=None, tail_pos=None):
# type: (Dict[Text, Any], Any, Union[int, List[int]], List[int]) -> List[Dict[Text, Any]]
if tail_pos is None:
Expand Down
10 changes: 10 additions & 0 deletions cwltool/draft2tool.py
Original file line number Diff line number Diff line change
Expand Up @@ -174,9 +174,19 @@ class CommandLineTool(Process):
    def __init__(self, toolpath_object, **kwargs):
        # type: (Dict[Text, Any], **Any) -> None
        super(CommandLineTool, self).__init__(toolpath_object, **kwargs)
        self.find_default_container = kwargs.get("find_default_container", None)

    def makeJobRunner(self, use_container=True): # type: (Optional[bool]) -> JobBase
        dockerReq, _ = self.get_requirement("DockerRequirement")
        if not dockerReq and use_container:
            default_container = self.find_default_container(self)
            if default_container:
                self.requirements.insert(0, {
                    "class": "DockerRequirement",
                    "dockerPull": default_container
                })
                dockerReq = self.requirements[0]

        if dockerReq and use_container:
            return DockerCommandLineJob()
        else:
Expand Down
17 changes: 11 additions & 6 deletions cwltool/job.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@

PYTHON_RUN_SCRIPT = """
import json
import os
import sys
import subprocess

Expand All @@ -41,6 +42,7 @@
commands = popen_description["commands"]
cwd = popen_description["cwd"]
env = popen_description["env"]
env["PATH"] = os.environ.get("PATH")
stdin_path = popen_description["stdin_path"]
stdout_path = popen_description["stdout_path"]
stderr_path = popen_description["stderr_path"]
Expand All @@ -67,7 +69,7 @@
if sp.stdin:
sp.stdin.close()
rcode = sp.wait()
if isinstance(stdin, file):
if stdin is not subprocess.PIPE:
stdin.close()
if stdout is not sys.stderr:
stdout.close()
Expand Down Expand Up @@ -145,7 +147,6 @@ def _setup(self): # type: () -> None
_logger.debug(u"[job %s] initial work dir %s", self.name,
json.dumps({p: self.generatemapper.mapper(p) for p in self.generatemapper.files()}, indent=4))


def _execute(self, runtime, env, rm_tmpdir=True, move_outputs="move"):
# type: (List[Text], MutableMapping[Text, Text], bool, Text) -> None

Expand Down Expand Up @@ -328,8 +329,12 @@ def run(self, pull_image=True, rm_container=True,
env = cast(MutableMapping[Text, Text], os.environ)
if docker_req and kwargs.get("use_container") is not False:
img_id = docker.get_from_requirements(docker_req, True, pull_image)
elif kwargs.get("default_container", None) is not None:
img_id = kwargs.get("default_container")
if img_id is None:
find_default_container = self.builder.find_default_container
default_container = find_default_container and find_default_container()
if default_container:
img_id = default_container
env = cast(MutableMapping[Text, Text], os.environ)

if docker_req and img_id is None and kwargs.get("use_container"):
raise Exception("Docker image not available")
Expand Down Expand Up @@ -482,8 +487,8 @@ def _job_popen(
["bash", job_script.encode("utf-8")],
shell=False,
cwd=job_dir,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
stdout=sys.stderr, # The nested script will output the paths to the correct files if they need
stderr=sys.stderr, # to be captured. Else just write everything to stderr (same as above).
stdin=subprocess.PIPE,
)
if sp.stdin:
Expand Down
45 changes: 37 additions & 8 deletions cwltool/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@

import pkg_resources # part of setuptools
import requests
import string

import ruamel.yaml as yaml
import schema_salad.validate as validate
Expand All @@ -31,9 +32,11 @@
relocateOutputs, scandeps, shortname, use_custom_schema,
use_standard_schema)
from .resolver import ga4gh_tool_registries, tool_resolver
from .software_requirements import DependenciesConfiguration, get_container_from_software_requirements
from .stdfsaccess import StdFsAccess
from .update import ALLUPDATES, UPDATES


_logger = logging.getLogger("cwltool")

defaultStreamHandler = logging.StreamHandler()
Expand Down Expand Up @@ -149,6 +152,15 @@ def arg_parser(): # type: () -> argparse.ArgumentParser
exgroup.add_argument("--quiet", action="store_true", help="Only print warnings and errors.")
exgroup.add_argument("--debug", action="store_true", help="Print even more logging")

# help="Dependency resolver configuration file describing how to adapt 'SoftwareRequirement' packages to current system."
parser.add_argument("--beta-dependency-resolvers-configuration", default=None, help=argparse.SUPPRESS)
Member
Where is this configuration file format documented?

Contributor Author
Here-ish https://docs.galaxyproject.org/en/latest/admin/dependency_resolvers.html#dependency-resolvers-in-galaxy but the Galaxy docs reference Galaxy requirements and don't mention the file can be YAML like in the examples I've included here or using URIs to disambiguate packages or the local mapping stuff that makes the environment modules work more usable. So I guess the real answer is scattered across a bunch of PRs. Where do you want the cwltool documentation? Should I add it to a new section of the README or create a new file?

Member
Ah, I thought Galaxy-lib had its own documentation site we could link to..

In the meantime, a new section in the README is fine.

# help="Default root directory used by dependency resolvers configuration."
parser.add_argument("--beta-dependencies-directory", default=None, help=argparse.SUPPRESS)
# help="Use biocontainers for tools without an explicitly annotated Docker container."
parser.add_argument("--beta-use-biocontainers", default=None, help=argparse.SUPPRESS, action="store_true")
# help="Short cut to use Conda to resolve 'SoftwareRequirement' packages."
parser.add_argument("--beta-conda-dependencies", default=None, help=argparse.SUPPRESS, action="store_true")

parser.add_argument("--tool-help", action="store_true", help="Print command line help for tool")

parser.add_argument("--relative-deps", choices=['primary', 'cwd'],
Expand Down Expand Up @@ -236,12 +248,6 @@ def output_callback(out, processStatus):
for req in jobReqs:
t.requirements.append(req)

if kwargs.get("default_container"):
t.requirements.insert(0, {
"class": "DockerRequirement",
"dockerPull": kwargs["default_container"]
})

jobiter = t.job(job_order_object,
output_callback,
**kwargs)
Expand Down Expand Up @@ -648,7 +654,8 @@ def main(argsl=None, # type: List[str]
'relax_path_checks': False,
'validate': False,
'enable_ga4gh_tool_registry': False,
'ga4gh_tool_registries': []
'ga4gh_tool_registries': [],
'find_default_container': None
}.iteritems():
if not hasattr(args, k):
setattr(args, k, v)
Expand Down Expand Up @@ -716,8 +723,20 @@ def main(argsl=None, # type: List[str]
stdout.write(json.dumps(processobj, indent=4))
return 0

conf_file = getattr(args, "beta_dependency_resolvers_configuration", None) # Text
use_conda_dependencies = getattr(args, "beta_conda_dependencies", None) # Text

make_tool_kwds = vars(args)

build_job_script = None # type: Callable[[Any, List[str]], Text]
if conf_file or use_conda_dependencies:
dependencies_configuration = DependenciesConfiguration(args) # type: DependenciesConfiguration
make_tool_kwds["build_job_script"] = dependencies_configuration.build_job_script

make_tool_kwds["find_default_container"] = functools.partial(find_default_container, args)

tool = make_tool(document_loader, avsc_names, metadata, uri,
makeTool, vars(args))
makeTool, make_tool_kwds)

if args.validate:
return 0
Expand Down Expand Up @@ -838,5 +857,15 @@ def locToPath(p):
_logger.addHandler(defaultStreamHandler)


def find_default_container(args, builder):
    default_container = None
    if args.default_container:
        default_container = args.default_container
    elif args.beta_use_biocontainers:
        default_container = get_container_from_software_requirements(args, builder)

    return default_container


if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))
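
The precedence implemented by ``find_default_container`` in the diff above (an explicit default container wins, otherwise fall back to a lookup based on ``SoftwareRequirement`` s when Biocontainers support is enabled) can be exercised in isolation. This is a sketch under stated assumptions: ``lookup_biocontainer`` is a hypothetical stub standing in for ``get_container_from_software_requirements``, and the image names are placeholders rather than real lookup results.

```python
import argparse
import functools


def lookup_biocontainer(args, builder):
    """Hypothetical stub for get_container_from_software_requirements."""
    return "quay.io/biocontainers/hypothetical:1.0"  # placeholder image name


def find_default_container(args, builder):
    # An explicitly requested default container takes precedence.
    if args.default_container:
        return args.default_container
    # Otherwise fall back to a lookup driven by the SoftwareRequirements.
    if args.beta_use_biocontainers:
        return lookup_biocontainer(args, builder)
    return None


explicit = argparse.Namespace(default_container="debian:8", beta_use_biocontainers=True)
fallback = argparse.Namespace(default_container=None, beta_use_biocontainers=True)
neither = argparse.Namespace(default_container=None, beta_use_biocontainers=None)

# functools.partial pre-binds args, leaving a callable that takes only the builder.
results = [functools.partial(find_default_container, ns)(None)
           for ns in (explicit, fallback, neither)]
print(results)
```

Binding ``args`` with ``functools.partial`` keeps the command line namespace out of the tool and job classes, which only need a callable that returns an image name or ``None``.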