Merge pull request #214 from jmchilton/galaxy_deps
Beta support for configurable dependency resolution & Biocontainers.
mr-c authored Jul 7, 2017
2 parents 6e01be1 + fd2ac01 commit 9900de0
Showing 150 changed files with 4,094 additions and 14 deletions.
206 changes: 206 additions & 0 deletions README.rst
@@ -139,6 +139,212 @@ The easiest way to use cwltool to run a tool or workflow from Python is to use a

# result["out"] == "foo"

Leveraging SoftwareRequirements (Beta)
--------------------------------------

CWL tools may be decorated with ``SoftwareRequirement`` hints that cwltool
may in turn use to resolve to packages in various package managers or
dependency management systems such as `Environment Modules
<http://modules.sourceforge.net/>`__.

Utilizing ``SoftwareRequirement`` hints with cwltool requires an optional
dependency; for this reason, be sure to specify the ``deps`` extra when
installing cwltool. For instance::

$ pip install 'cwltool[deps]'

Installing cwltool in this fashion enables several new command line options.
The most general of these is ``--beta-dependency-resolvers-configuration``,
which allows one to specify a dependency resolvers configuration file. This
file may be written in either XML or YAML and simply describes the plugins to
enable for "resolving" ``SoftwareRequirement`` dependencies.
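
For illustration, such a configuration file is just a YAML list of plugin
definitions, consulted in order. The following is a hypothetical sketch that
combines the two plugin types discussed below; the path shown is a
placeholder, not a verified configuration:

.. code:: yaml

  # Hypothetical example: each list item enables one resolver plugin,
  # and plugins are consulted in the order they appear.
  - type: galaxy_packages
    base_path: /opt/galaxy_packages
  - type: module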

To discuss some of these plugins and how to configure them, first consider the
following ``hint`` definition for an example CWL tool.

.. code:: yaml

  SoftwareRequirement:
    packages:
    - package: seqtk
      version:
      - r93

Now imagine deploying cwltool on a cluster with Software Modules installed
and a ``seqtk`` module available at version ``r93``. This means cluster
users likely won't have the ``seqtk`` binary on their ``PATH`` by default, but after
sourcing this module with the command ``modulecmd sh load seqtk/r93``, ``seqtk`` is
available on the ``PATH``. A simple dependency resolvers configuration file, called
``dependency-resolvers-conf.yml`` for instance, that would enable cwltool to source
the correct module environment before executing the above tool would simply be:

.. code:: yaml

  - type: module

The outer list indicates that one plugin is being enabled; its parameters are
defined as a dictionary for that single list item. The only required parameter
for the plugin above is ``type``, which defines the plugin type; this parameter
is required for all plugins. The available plugins and the parameters
available for each are documented (incompletely) `here
<https://docs.galaxyproject.org/en/latest/admin/dependency_resolvers.html>`__.
Unfortunately, this documentation is written in the context of Galaxy tool ``requirement`` s rather than CWL ``SoftwareRequirement`` s, but the concepts map fairly directly.
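
As a sketch of what such plugin configuration can look like, a module
resolver entry might carry optional parameters along the following lines.
The parameter names here are taken from the Galaxy documentation linked
above and should be verified there rather than treated as authoritative:

.. code:: yaml

  # Hypothetical, more fully parameterized module resolver entry; see the
  # Galaxy dependency resolver documentation for the supported options.
  - type: module
    modulecmd: /usr/bin/modulecmd   # path to the modulecmd executable
    modulepath: /opt/modulefiles    # where the module files are installed
    versionless: false              # require an exact version match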

cwltool is distributed with an example of such a seqtk tool and a corresponding
sample job. It can be executed from the cwltool root, using a dependency
resolvers configuration file such as the one above, with the command::

cwltool --beta-dependency-resolvers-configuration /path/to/dependency-resolvers-conf.yml \
tests/seqtk_seq.cwl \
tests/seqtk_seq_job.json

This example demonstrates that cwltool can both leverage
existing software installations and handle workflows with dependencies
on different versions of the same software and libraries. However, the above
example requires an existing module setup, so it cannot be tested
"out of the box" with cwltool. For a more isolated test that demonstrates all
the same concepts, the resolver plugin type ``galaxy_packages`` can be used.

"Galaxy packages" are a lighter weight alternative to Environment Modules that are
really just defined by a way to lay out directories into packages and versions
to find little scripts that are sourced to modify the environment. They have
been used for years in Galaxy community to adapt Galaxy tools to cluster
environments but require neither knowledge of Galaxy nor any special tools to
setup. These should work just fine for CWL tools.

The cwltool source code repository's test directory is set up with a very simple
directory that defines a set of "Galaxy packages" (really just one package,
named ``random-lines``). The directory layout is simply::

tests/test_deps_env/
random-lines/
1.0/
env.sh

If the ``galaxy_packages`` plugin is enabled and pointed at the
``tests/test_deps_env`` directory in cwltool's root, and a ``SoftwareRequirement``
such as the following is encountered:

.. code:: yaml

  hints:
    SoftwareRequirement:
      packages:
      - package: 'random-lines'
        version:
        - '1.0'

then cwltool will find that ``env.sh`` file and source it before executing
the corresponding tool. That ``env.sh`` script is only responsible for modifying
the job's ``PATH`` to add the required binaries.
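
For instance, a minimal ``env.sh`` for the layout above might do nothing more
than prepend the package's binaries to the ``PATH``. The following is a
hypothetical sketch, not the exact content of the file shipped in
``tests/test_deps_env``::

    # Hypothetical env.sh: put this package's binaries on the job's PATH.
    PATH="/path/to/random-lines/1.0/bin:$PATH"
    export PATH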

This is a full example that works since resolving "Galaxy packages" has no
external requirements. Try it out by executing the following command from cwltool's
root directory::

cwltool --beta-dependency-resolvers-configuration tests/test_deps_env_resolvers_conf.yml \
tests/random_lines.cwl \
tests/random_lines_job.json

The resolvers configuration file in the above example was simply:

.. code:: yaml

  - type: galaxy_packages
    base_path: ./tests/test_deps_env

It is possible that the ``SoftwareRequirement`` s in a given CWL tool will not
match the module names for a given cluster. Such requirements can be re-mapped
to specific deployed packages and/or versions using another file specified via
the resolver plugin parameter ``mapping_files``. We will
demonstrate this using ``galaxy_packages``, but the concepts apply equally well
to Environment Modules or Conda packages (described below), for instance.

Consider the resolvers configuration file
(``tests/test_deps_env_resolvers_conf_rewrite.yml``):

.. code:: yaml

  - type: galaxy_packages
    base_path: ./tests/test_deps_env
    mapping_files: ./tests/test_deps_mapping.yml

And the corresponding mapping configuration file (``tests/test_deps_mapping.yml``):

.. code:: yaml

  - from:
      name: randomLines
      version: 1.0.0-rc1
    to:
      name: random-lines
      version: '1.0'

This says that if cwltool encounters a requirement for ``randomLines`` at version
``1.0.0-rc1`` in a tool, it should rewrite it for our specific plugin as ``random-lines`` at
version ``1.0``. cwltool ships a test tool called ``random_lines_mapping.cwl``
that contains such a source ``SoftwareRequirement``. To try out this example with
mapping, execute the following command from the cwltool root directory::

cwltool --beta-dependency-resolvers-configuration tests/test_deps_env_resolvers_conf_rewrite.yml \
tests/random_lines_mapping.cwl \
tests/random_lines_job.json

The previous examples demonstrated leveraging existing infrastructure to
provide requirements for CWL tools. If instead a real package manager is used,
cwltool has the opportunity to install requirements as needed. While initial
support for Homebrew/Linuxbrew plugins is available, the most developed such
plugin is for the `Conda <https://conda.io/docs/#>`__ package manager. Conda has the nice properties
of allowing multiple versions of a package to be installed simultaneously,
not requiring elevated permissions to install Conda itself or packages using
Conda, and being cross-platform. For these reasons, cwltool may run as a normal
user, install its own Conda environment, and manage multiple versions of Conda packages
on both Linux and Mac OS X.

The Conda plugin can be endlessly configured, but a sensible set of defaults
that has proven a powerful stack for dependency management within the Galaxy tool
development ecosystem can be enabled by simply passing cwltool the
``--beta-conda-dependencies`` flag.
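
Roughly the same behaviour can also be requested through a resolvers
configuration file by listing a ``conda`` plugin explicitly. The entry below
is a hypothetical sketch; the parameter names follow the Galaxy/galaxy-lib
conventions documented at the links at the end of this section and should be
verified there:

.. code:: yaml

  # Hypothetical explicit conda resolver entry; --beta-conda-dependencies
  # enables a similar default configuration without needing a file.
  - type: conda
    auto_init: true       # install a private Conda if none is available
    auto_install: true    # install missing packages when a tool is executed
    ensure_channels: iuc,bioconda,conda-forge,defaults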

With the ``--beta-conda-dependencies`` flag we can use the seqtk example above
without Docker and without any externally managed services; cwltool should
install everything it needs and create an environment for the tool. Try it out
with the following command::

cwltool --beta-conda-dependencies tests/seqtk_seq.cwl tests/seqtk_seq_job.json

The CWL specification allows URIs to be attached to ``SoftwareRequirement`` s
that allow disambiguation of package names. If the mapping files described above
allow deployers to adapt tools to their infrastructure, this mechanism allows
tools to adapt their requirements to multiple package managers. To demonstrate
this within the context of the seqtk example, we can deliberately break the
package name we use and then specify a specific Conda package as follows:

.. code:: yaml

  hints:
    SoftwareRequirement:
      packages:
      - package: seqtk_seq
        version:
        - '1.2'
        specs:
        - https://anaconda.org/bioconda/seqtk
        - https://packages.debian.org/sid/seqtk

The example can be executed using the command::

cwltool --beta-conda-dependencies tests/seqtk_seq_wrong_name.cwl tests/seqtk_seq_job.json

The plugin framework for managing the resolution of these software requirements
is maintained as part of `galaxy-lib <https://github.com/galaxyproject/galaxy-lib>`__, a small, portable subset of the Galaxy
project. More information on configuration and implementation can be found
at the following links:

- `Dependency Resolvers in Galaxy <https://docs.galaxyproject.org/en/latest/admin/dependency_resolvers.html>`__
- `Conda for [Galaxy] Tool Dependencies <https://docs.galaxyproject.org/en/latest/admin/conda_faq.html>`__
- `Mapping Files - Implementation <https://github.com/galaxyproject/galaxy/commit/495802d229967771df5b64a2f79b88a0eaf00edb>`__
- `Specifications - Implementation <https://github.com/galaxyproject/galaxy/commit/81d71d2e740ee07754785306e4448f8425f890bc>`__
- `Initial cwltool Integration Pull Request <https://github.com/common-workflow-language/cwltool/pull/214>`__

Cwltool control flow
--------------------
2 changes: 2 additions & 0 deletions cwltool/builder.py
@@ -50,6 +50,8 @@ def __init__(self): # type: () -> None
# Will be default "no_listing" for CWL v1.1
self.loadListing = "deep_listing" # type: Union[None, str]

self.find_default_container = None # type: Callable[[], Text]

def bind_input(self, schema, datum, lead_pos=None, tail_pos=None):
# type: (Dict[Text, Any], Any, Union[int, List[int]], List[int]) -> List[Dict[Text, Any]]
if tail_pos is None:
10 changes: 10 additions & 0 deletions cwltool/draft2tool.py
@@ -174,9 +174,19 @@ class CommandLineTool(Process):
def __init__(self, toolpath_object, **kwargs):
# type: (Dict[Text, Any], **Any) -> None
super(CommandLineTool, self).__init__(toolpath_object, **kwargs)
self.find_default_container = kwargs.get("find_default_container", None)

def makeJobRunner(self, use_container=True): # type: (Optional[bool]) -> JobBase
dockerReq, _ = self.get_requirement("DockerRequirement")
if not dockerReq and use_container:
default_container = self.find_default_container(self)
if default_container:
self.requirements.insert(0, {
"class": "DockerRequirement",
"dockerPull": default_container
})
dockerReq = self.requirements[0]

if dockerReq and use_container:
return DockerCommandLineJob()
else:
17 changes: 11 additions & 6 deletions cwltool/job.py
@@ -33,6 +33,7 @@

PYTHON_RUN_SCRIPT = """
import json
import os
import sys
import subprocess
@@ -41,6 +42,7 @@
commands = popen_description["commands"]
cwd = popen_description["cwd"]
env = popen_description["env"]
env["PATH"] = os.environ.get("PATH")
stdin_path = popen_description["stdin_path"]
stdout_path = popen_description["stdout_path"]
stderr_path = popen_description["stderr_path"]
@@ -67,7 +69,7 @@
if sp.stdin:
sp.stdin.close()
rcode = sp.wait()
if isinstance(stdin, file):
if stdin is not subprocess.PIPE:
stdin.close()
if stdout is not sys.stderr:
stdout.close()
@@ -145,7 +147,6 @@ def _setup(self): # type: () -> None
_logger.debug(u"[job %s] initial work dir %s", self.name,
json.dumps({p: self.generatemapper.mapper(p) for p in self.generatemapper.files()}, indent=4))


def _execute(self, runtime, env, rm_tmpdir=True, move_outputs="move"):
# type: (List[Text], MutableMapping[Text, Text], bool, Text) -> None

@@ -328,8 +329,12 @@ def run(self, pull_image=True, rm_container=True,
env = cast(MutableMapping[Text, Text], os.environ)
if docker_req and kwargs.get("use_container") is not False:
img_id = docker.get_from_requirements(docker_req, True, pull_image)
elif kwargs.get("default_container", None) is not None:
img_id = kwargs.get("default_container")
if img_id is None:
find_default_container = self.builder.find_default_container
default_container = find_default_container and find_default_container()
if default_container:
img_id = default_container
env = cast(MutableMapping[Text, Text], os.environ)

if docker_req and img_id is None and kwargs.get("use_container"):
raise Exception("Docker image not available")
@@ -482,8 +487,8 @@ def _job_popen(
["bash", job_script.encode("utf-8")],
shell=False,
cwd=job_dir,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
stdout=sys.stderr, # The nested script will output the paths to the correct files if they need
stderr=sys.stderr, # to be captured. Else just write everything to stderr (same as above).
stdin=subprocess.PIPE,
)
if sp.stdin:
45 changes: 37 additions & 8 deletions cwltool/main.py
@@ -13,6 +13,7 @@

import pkg_resources # part of setuptools
import requests
import string

import ruamel.yaml as yaml
import schema_salad.validate as validate
@@ -31,9 +32,11 @@
relocateOutputs, scandeps, shortname, use_custom_schema,
use_standard_schema)
from .resolver import ga4gh_tool_registries, tool_resolver
from .software_requirements import DependenciesConfiguration, get_container_from_software_requirements
from .stdfsaccess import StdFsAccess
from .update import ALLUPDATES, UPDATES


_logger = logging.getLogger("cwltool")

defaultStreamHandler = logging.StreamHandler()
@@ -149,6 +152,15 @@ def arg_parser(): # type: () -> argparse.ArgumentParser
exgroup.add_argument("--quiet", action="store_true", help="Only print warnings and errors.")
exgroup.add_argument("--debug", action="store_true", help="Print even more logging")

# help="Dependency resolver configuration file describing how to adapt 'SoftwareRequirement' packages to current system."
parser.add_argument("--beta-dependency-resolvers-configuration", default=None, help=argparse.SUPPRESS)
# help="Defaut root directory used by dependency resolvers configuration."
parser.add_argument("--beta-dependencies-directory", default=None, help=argparse.SUPPRESS)
# help="Use biocontainers for tools without an explicitly annotated Docker container."
parser.add_argument("--beta-use-biocontainers", default=None, help=argparse.SUPPRESS, action="store_true")
# help="Short cut to use Conda to resolve 'SoftwareRequirement' packages."
parser.add_argument("--beta-conda-dependencies", default=None, help=argparse.SUPPRESS, action="store_true")

parser.add_argument("--tool-help", action="store_true", help="Print command line help for tool")

parser.add_argument("--relative-deps", choices=['primary', 'cwd'],
@@ -236,12 +248,6 @@ def output_callback(out, processStatus):
for req in jobReqs:
t.requirements.append(req)

if kwargs.get("default_container"):
t.requirements.insert(0, {
"class": "DockerRequirement",
"dockerPull": kwargs["default_container"]
})

jobiter = t.job(job_order_object,
output_callback,
**kwargs)
@@ -648,7 +654,8 @@ def main(argsl=None, # type: List[str]
'relax_path_checks': False,
'validate': False,
'enable_ga4gh_tool_registry': False,
'ga4gh_tool_registries': []
'ga4gh_tool_registries': [],
'find_default_container': None
}.iteritems():
if not hasattr(args, k):
setattr(args, k, v)
@@ -716,8 +723,20 @@ def main(argsl=None, # type: List[str]
stdout.write(json.dumps(processobj, indent=4))
return 0

conf_file = getattr(args, "beta_dependency_resolvers_configuration", None) # Text
use_conda_dependencies = getattr(args, "beta_conda_dependencies", None) # Text

make_tool_kwds = vars(args)

build_job_script = None # type: Callable[[Any, List[str]], Text]
if conf_file or use_conda_dependencies:
dependencies_configuration = DependenciesConfiguration(args) # type: DependenciesConfiguration
make_tool_kwds["build_job_script"] = dependencies_configuration.build_job_script

make_tool_kwds["find_default_container"] = functools.partial(find_default_container, args)

tool = make_tool(document_loader, avsc_names, metadata, uri,
makeTool, vars(args))
makeTool, make_tool_kwds)

if args.validate:
return 0
@@ -838,5 +857,15 @@ def locToPath(p):
_logger.addHandler(defaultStreamHandler)


def find_default_container(args, builder):
default_container = None
if args.default_container:
default_container = args.default_container
elif args.beta_use_biocontainers:
default_container = get_container_from_software_requirements(args, builder)

return default_container


if __name__ == "__main__":
sys.exit(main(sys.argv[1:]))