python: allow user defined workflow definition environment #3780

Open · oliver-sanders opened this issue Aug 18, 2020 · 7 comments
Labels: speculative blue-skies ideas

@oliver-sanders (Member) commented Aug 18, 2020

Note: This was something I was going to hold back for Cylc9; however, it is becoming apparent that it might be worth considering in the 8.x timeframe.

Note: This functionality would probably be best located in another repository; I'm putting it in Cylc Flow for discussion and we can move it if approved.

The Problem

It is now possible to import arbitrary Python modules in the flow.cylc file via Jinja2:

flow.cylc

#!Jinja2
{% import "numpy" as np %}
{# ... #}

Or, preferably, from within Python files loaded into Jinja2:

lib/python/process_parameters.py

import numpy as np
# ...

Which is great, provided that the modules you want are actually installed in the Cylc environment. If you maintain your own Cylc environment then you can maintain your own Conda recipes and build environments which cater to the needs of your workflows 🤮. This is terrible for portability, reproducibility and just general niceness.

Surely the dependencies should lie with the system just like in normal Python projects?

The Bigger Problem

At the moment this is an issue which hits users who are doing more advanced Jinja2 work, especially as Cylc7 is Python2.

When Cylc9 arrives and workflow definitions start being written in Python, the need to add to the Cylc environment will become much greater.

flow.py

"""My Cylc9 workflow."""

# we have at least one community loading data from Excel spreadsheets
# which define their workflows; not the nicest solution, but that's the
# accepted interchange format in their area
from excel import OpenExcel
data = OpenExcel('data.xls')

# loading data from netcdf has also cropped up
from netCDF4 import Dataset
rootgrp = Dataset("test.nc", "w", format="NETCDF4")

# many people use these troublesome twins for everything data
import numpy as np
import pandas as pd

from cylc.flow import Flow

# ...

The Solution?

We add a plugin to Cylc which allows the Cylc environment to be extended with a virtual environment, into which we install the required dependencies.

  • This will work irrespective of whether Cylc Flow was installed via pip or conda.
  • This virtual environment would be activated from within whatever environment Cylc is installed into.
  • This virtual environment would use the same Python interpreter as the "parent" environment.
  • This would effectively allow us to spin up lightweight virtual environments into which we install just the extra bits which the user requires.

When necessary, Cylc would re-invoke the cylc run command inside this virtual environment. This environment would not be available to jobs (see #3712); it's just for processing the workflow definition.
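
To make "extends the parent environment" concrete, here's a minimal sketch of the underlying mechanism using only the Python standard library; the .venv location is illustrative and in practice the plugin would delegate all of this to the chosen tool:

import subprocess
import venv
from pathlib import Path

env_dir = Path("my-workflow") / ".venv"  # hypothetical location

# create a venv which uses the same interpreter as the "parent" Cylc
# environment and can see its site-packages (so cylc.flow stays importable)
venv.create(env_dir, system_site_packages=True, with_pip=True)

# packages installed here extend, rather than replace, the parent environment
subprocess.run([str(env_dir / "bin" / "pip"), "install", "cowsay"], check=True)

# this interpreter now sees both the parent packages and the extras
subprocess.run(
    [str(env_dir / "bin" / "python"), "-c",
     "import cylc.flow, cowsay; print('both importable')"],
    check=True,
)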

pipenv (pip+virtualenv)

Whilst this plugin could be implemented in different ways, I think the most logical candidate might be pipenv, which is a system for spinning up environments with virtualenv and installing packages into them with pip.

Pipenv would take care of the creation and management of virtual environments for us, making this plugin pretty simple to write.

Here's a quick example of how pipenv could be used:

$ # activate your cylc8 environment (with pipenv installed in it)
$ conda activate cylc8

$ # configure pipenv to work with the "parent" environment
$ cd my-workflow/
$ pipenv --python $(command -v python3)
$ pipenv --site-packages

$ # you now have a virtualenv which "extends" the parent environment
$ pipenv run python3 -c 'from cylc.flow import __version__; print(__version__)'
8.0a3.dev

$ # install workflow definition dependencies using a pip-like interface
$ pipenv install cowsay
$ pipenv run python3 -c 'from cowsay import cow; cow("moo")'
  ___
< moo >
  ===
        \
         \
           ^__^                             
           (oo)\_______                   
           (__)\       )\/\             
               ||----w |           
               ||     ||  
               
$ # the lockfile shows what's installed
$ cat Pipfile.lock 
{
    "_meta": {
        "hash": {
            "sha256": "7f086388cf5c03c7072a870415dc29f71218d8aee191fe6507d33176521a4af8"
        },
        "pipfile-spec": 6,
        "requires": {
            "python_version": "3.7"
        },
        "sources": [
            {
                "name": "pypi",
                "url": "https://pypi.org/simple",
                "verify_ssl": true
            }
        ]
    },
    "default": {
        "cowsay": {
            "hashes": [
                "sha256:7ec3ec1bb085cbb788b0de1e762941b4469faf41c6cdbec08a7ac072a7d1d6eb",
                "sha256:debde99bae664bd91487613223c1cb291170d8703bf7d524c3a4877ad37b4dad"
            ],
            "index": "pypi",
            "version": "==2.0.3"
        }
    },
    "develop": {}
}

The Cylc Pipfile plugin would re-invoke any cylc (run|restart|get-config) command based on the presence of a Pipfile.
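
As a rough illustration, that re-invocation logic could look something like this (the function and its hook into Cylc are hypothetical, not an existing API; PIPENV_ACTIVE is the marker pipenv sets inside pipenv run):

import os
import sys
from pathlib import Path

def maybe_reinvoke_in_pipenv(workflow_dir):
    """Re-exec the current cylc command under `pipenv run`.

    Only applies if the workflow defines a Pipfile and we are not
    already inside the pipenv-managed environment.
    """
    pipfile = Path(workflow_dir) / "Pipfile"
    if not pipfile.exists() or os.environ.get("PIPENV_ACTIVE"):
        return  # nothing to do, or we have already re-invoked
    os.chdir(workflow_dir)  # pipenv locates the venv from the Pipfile's dir
    # replace this process with the same command run inside the venv
    # (pipenv creates and populates the venv on first use)
    os.execvp("pipenv", ["pipenv", "run"] + sys.argv)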

poetry

Poetry is another tool with a very similar feature set to Pipenv: it too can work with virtual environments, create lock files, etc. It also has a streamlined system for PyPI publishing and other niceties.

Conclusion

  • Pipenv vs Poetry (implementing one leaves a pattern for implementing the other).
  • This Cylc plugin would likely be quite small: it's pretty much just re-invoking Cylc commands behind pipenv, so the actual "work" involved here is not as great as the length of this discussion might make it seem.
  • This approach would allow workflow definition dependencies, and potentially other Cylc plugins, to be installed into virtual environments where IT policies permit it.
  • By developing it as a separate plugin we make it an installation choice, removing the security consequences from Cylc Flow itself.
  • Alternative implementations should be possible should pipenv not work out.
  • I think that Pipenv can be configured to install from a local PyPI cache (e.g. Artifactory) if desired.

Caveats

  1. Users would still be coupled to the same Python version as the "parent" Cylc environment.

    • With Cylc8 we will leap to Python 3.7, so this shouldn't be much of an issue; however, it is an irritation.
    • Now that we've broken the dependency on system-provided utilities, Cylc shouldn't fall behind on Python versions again.
  2. Python modules would get installed once for each workflow which uses them.

    • This is awfully similar to npm/yarn.
    • Pipenv would install stuff either when you add the dependency or the first time you run a new workflow.
    • I think this is an acceptable overhead, as Pipenv would only install the additional modules the user requested, but it might surprise people who aren't yet familiar with the increasingly isolated patterns package managers are using these days.
    • The virtual environments can be cleaned up manually with pipenv --rm, though this represents a management overhead.
  3. This would restrict users to installing stuff into Python virtual environments.

    • i.e. pip rather than conda.
    • I think this is OK as we wouldn't expect users to require system packages for use in suite definitions.

Questions

  • Does this seem sensible? Can anyone see any potential issues I've not thought of?
  • What other purposes might it be desirable to extend the Cylc environment for? Does this approach work for those purposes?
  • Is pip installation sufficient?
  • What implementation should we go for? (virtualenv, pipenv, heck, even conda is almost an option, just a very heavyweight one which would require duplicating the whole env.)
oliver-sanders added the question label Aug 18, 2020
oliver-sanders added this to the 8.x milestone Aug 18, 2020
oliver-sanders self-assigned this Aug 18, 2020
@kinow (Member) commented Aug 18, 2020

Does this seem sensible? Can anyone see any potential issues I've not thought of?

It sounds good to me, +1. Can't think of any potential issues right now.

What other purposes might it be desirable to extend the Cylc environment for? Does this approach work for those purposes?

Not sure if directly related to extending Cylc environment, but maybe run tasks or sub-workflows with containers?

With Nextflow you can run an entire workflow with Docker, while also using Conda.

nextflow run -with-docker [docker image]
Every time your script launches a process execution, Nextflow will run it into a Docker container created by using the specified image. In practice Nextflow will automatically wrap your processes and run them by executing the docker run command with the image you have provided.

nextflow.preview.dsl=2

process foo {
    conda 'numpy pandas matplotlib'

    output:
      path 'foo.txt'
    script:
      """
      your_command > foo.txt
      """
}

process bar {
    container = 'image_name'
    input:
      path x
    output:
      path 'bar.txt'
    script:
      """
      another_command $x > bar.txt
      """
}

workflow {
    data = channel.fromPath('/some/path/*.txt')
    foo()
    bar(data)
}

docker {
    enabled = true
}

The conda directive will create a new environment, install the packages, and run the process; the container directive specifies the container image to be used, which could have pip, conda, Alpine Linux + Python apk packages, etc.

Same with Airflow using the Docker operator

with DAG('docker_dag', default_args=default_args, schedule_interval="5 * * * *", catchup=False) as dag:
        t1 = BashOperator(
                task_id='print_current_date',
                bash_command='date'
        )

        t2 = DockerOperator(
                task_id='docker_command',
                image='centos:latest',
                api_version='auto',
                auto_remove=True,
                command="/bin/sleep 30",
                docker_url="unix://var/run/docker.sock",
                network_mode="bridge"
        )

        t3 = BashOperator(
                task_id='print_hello',
                bash_command='echo "hello world"'
        )

        t1 >> t2 >> t3

It also has PythonVirtualenvOperator, but it works differently: it creates a venv to execute whatever Python function you give it, then destroys the environment. Similar to a container, but using venv.

    generate_data_one = PythonVirtualenvOperator(
        task_id='generate_data_to_gcs_one',
        python_callable=data_generator, # a python function
        requirements=['google-cloud-storage==1.28.1',
                      'DateTime==4.3'
                      ],
        dag=dag,
        system_site_packages=False,
    )
# after the task runs, the venv is destroyed

Even snakemake supports a mix of Conda/containers

# snakemake --use-conda --use-singularity

container: "docker://continuumio/miniconda3:4.4.10"

rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    script:
        "scripts/plot-stuff.R"

Is pip installation sufficient?

I think it's probably the easiest to start with.

What implementation should we go for (virtualenv, pipenv, heck even conda is almost an option - just a very heavyweight one which would require duplicating the whole env)

Probably the easiest one, just to prove it works. Then either extend the plug-in to support others, or have separate plug-ins, if necessary.

Just my 0.02 cents 👍
Bruno

@hjoliver (Member) commented Aug 19, 2020

Does this seem sensible?

Yep

What other purposes might it be desirable to extend the Cylc environment for? Does this approach work for those purposes?

I guess the main intent here is to allow install and import of arbitrary Python packages during construction of the workflow definition (e.g. to automatically create a workflow based on the content of a netcdf data file, as you noted).

I guess another use, since the scheduler environment is also used to execute job submissions, xtriggers, and event handlers, would be to make those a lot more extensible and customizable. I guess that connects with @kinow's suggestion above.

But we need to keep containers in mind too, especially for job (and sub-workflow) execution.

Is pip installation sufficient?
What implementation should we go for (virtualenv, pipenv, heck even conda is almost an option - just a very heavyweight one which would require duplicating the whole env)

It seems reasonable to me to (at least initially) stick with Python and go as lightweight as possible.

@oliver-sanders (Member, Author) commented:

since the scheduler environment is also used to execute job submissions, xtriggers, and event handlers

Yep this is what I had in mind.

A simple but inevitable use case: pipenv install cylc-xtriggers!

Especially with #3497 it will almost certainly be necessary to add Python modules to the environment in order to build effective async xtrigger functions.
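
For context, a sketch of the kind of xtrigger that motivates this; requests is an assumed third-party dependency and the endpoint layout is hypothetical, but the (satisfied, results) tuple return is Cylc's xtrigger convention:

# lib/python/data_xtriggers.py -- hypothetical module
import requests

def data_ready(url, point):
    """Succeed once the data for this cycle point has been published."""
    response = requests.head(f"{url}/{point}")  # hypothetical endpoint layout
    if response.status_code == 200:
        # satisfied: pass the resolved URL through to dependent tasks
        return True, {"data_url": f"{url}/{point}"}
    return False, {}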

Not sure if directly related to extending Cylc environment, but maybe run tasks or sub-workflows with containers?

I think that the job execution environment is a different topic. Tasks should define their own environments which shouldn't interact with the "suite environment".

But +1 for docker support; this is something multiple people have asked us about, and for pretty obvious reasons. I think this is something else we could handle with a plugin which would be installed on the job hosts and run in the job environment itself. It's a somewhat tricky plugin to write since the job script is in bash (add another argument to the "rewrite it in Python" list), but once we've got that interface worked out it should be fairly simple to do something like this:

[runtime]
    [[foo]]
        script = command to run in docker container
        platform = my-platform
        [[[job]]]
            docker image = user/image-name

Need to do some more thinking on this though...

oliver-sanders removed their assignment Aug 19, 2020
@oliver-sanders (Member, Author) commented:

Note: #3712 is quite important for this one, as otherwise this "suite environment" would leak into the local job execution environment, providing an unreliable proxy for job dependency installation.

@TomekTrzeciak (Contributor) commented:

Would this solution work for deployment in production environments, which might be walled off from internet access for package installation?

@oliver-sanders (Member, Author) commented Sep 29, 2020

pipenv uses pip underneath, so any cache or local package repository you can configure pip to install from should work.

Options for offline installation with pip (the standard mechanisms, for reference):

  • Use pip download on a connected machine to pre-fetch packages, then pip install --no-index --find-links=<dir> on the walled-off one.
  • Point pip at an internal mirror with --index-url (or the PIP_INDEX_URL environment variable / pip.conf).

I think anything more than that would have to be considered beyond the scope of Cylc, though the development of this plugin would leave behind an interface permitting alternative implementations.

@oliver-sanders (Member, Author) commented Oct 6, 2020

Added Poetry to the issue above, as I've just learned that it can manage per-project virtual environments too; I thought it was just a packaging tool.

I've not tested the "environment extension" in Poetry, but it's likely achievable via a similar route.

Also note that both Pipenv and Poetry support storing the virtual environment inside the project itself (like yarn/npm), which is nice.
