python: allow user defined workflow definition environment #3780
Comments
It sounds good to me, +1. Can't think of any potential issues right now.
Not sure if this is directly related to extending the Cylc environment, but maybe run tasks or sub-workflows with containers? With Nextflow you can run an entire workflow with Docker, while also using Conda.
The same goes for Airflow, using the Docker operator.
It also has PythonVirtualenvOperator, but it works differently. It creates a venv to execute whatever Python function you give it, then destroys the environment. Similar to a container, but using venv.
    generate_data_one = PythonVirtualenvOperator(
        task_id='generate_data_to_gcs_one',
        python_callable=data_generator,  # a python function
        requirements=[
            'google-cloud-storage==1.28.1',
            'DateTime==4.3',
        ],
        dag=dag,
        system_site_packages=False,
    )
    # after the task runs, the venv is destroyed
Even snakemake supports a mix of Conda/containers.
I think it's probably the easiest to start with.
Probably the easiest one, just to prove it works. Then either extend the plug-in to support others, or have separate plug-ins, if necessary. Just my 0.02 cents 👍 |
Yep
I guess the main intent here is to allow the install and import of arbitrary Python packages during construction of the workflow definition (e.g. to automatically create a workflow based on the content of a netcdf data file, as you noted). I guess another use, since the scheduler environment is also used to execute job submissions, xtriggers, and event handlers, would be to make those a lot more extensible and customizable, which connects with @kinow's suggestion above. But we need to keep containers in mind too, especially for job (and sub-workflow) execution.
It seems reasonable to me to (at least initially) stick with Python and go as lightweight as possible. |
Yep, this is what I had in mind. A simple but inevitable use case. Especially with #3497, it will almost certainly be necessary to add Python modules to the environment in order to build effective async xtrigger functions.
I think that the job execution environment is a different topic. Tasks should define their own environments, which shouldn't interact with the "suite environment". But +1 for Docker support; this is something multiple people have asked us about, and for pretty obvious reasons. I think this is something else we could handle with a plugin which would be installed on the job hosts and run in the job environment itself. It's a somewhat tricky plugin to write since the job script is in bash (add another argument to the "rewrite it in Python" list), but once we've got that interface worked out it should be fairly simple to do something like this:
    [runtime]
        [[foo]]
            script = command to run in docker container
            platform = my-platform
            [[[job]]]
                docker image = user/image-name
Need to do some more thinking on this though... |
Note #3712 is quite important for this one, as otherwise this "suite environment" would leak into the local job execution environment, providing an unreliable proxy for job dependency installation. |
Would this solution work for deployment in production environments, which might be walled off from internet access for package installation? |
Options for offline installation with
I think anything more than that would have to be considered as beyond the scope of Cylc, though the development of this plugin would leave behind an interface permitting alternative implementations. |
Added Poetry to the issue above as I've just learned that it can manage per-project virtual environments too, and I thought it was just a packaging tool. I've not tested the "environment extension" in Poetry, but it's likely achievable via a similar route. Also note that both Pipenv and Poetry support storing the virtual environments inside the project itself (like |
The Problem
It is now possible to import arbitrary Python modules in the flow.conf file via Jinja2:
flow.cylc
Or, preferably, from within Python files loaded into Jinja2:
lib/python/process_parameters.py
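As a purely illustrative sketch, a module at a path like this could expose a helper for the workflow definition to call. The function name expand_parameters and its logic are hypothetical, and it uses only the standard library so that it runs anywhere. A real helper might instead import a third-party package (for example one that reads netCDF files), and that is where the dependency problem comes from.

```python
"""Hypothetical helper module; the name and logic are illustrative only."""
from itertools import product
from typing import Iterable, List


def expand_parameters(models: Iterable[str], resolutions: Iterable[str]) -> List[str]:
    """Return one task-name suffix for every model/resolution pair."""
    return [f"{model}_{res}" for model, res in product(models, resolutions)]


if __name__ == "__main__":
    # A workflow definition could loop over this list to generate tasks.
    print(expand_parameters(["a", "b"], ["n96", "n216"]))
```

Keeping logic like this in a plain Python file, rather than inline in the template, means it can be unit tested outside of Cylc.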
Which is great, provided that the modules you want are actually installed in the Cylc environment. If you maintain your own Cylc environment then you can maintain your own Conda recipes and build environments which cater to the needs of your workflows 🤮. This is terrible for portability, reproducibility and just general niceness.
Surely the dependencies should lie with the system just like in normal Python projects?
The Bigger Problem
At the moment this is an issue which hits users who are doing more advanced Jinja2 work, especially as Cylc7 is Python2.
When Cylc9 arrives and workflow definitions start being written in Python, the need to add to the Cylc environment will become much greater.
flow.py
The Solution?
We add a plugin to Cylc which allows the Cylc environment to be extended with a virtual environment. We install required dependencies into this environment.
[...] pip or conda.
When necessary, Cylc would re-invoke the cylc run command inside this virtual environment. This environment would not be available to jobs (circa #3712); it's just for processing the workflow definition.
pipenv (pip+virtualenv)
Whilst this plugin could be implemented in different ways, I think the most logical candidate might be pipenv, which is a system for spinning up environments with virtualenv and installing stuff with pip.
Pipenv would take care of the creation and management of virtual environments for us, making this plugin pretty simple to write.
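To make the re-invocation idea concrete, here is a minimal sketch of the decision such a plugin might make, not an actual Cylc implementation. It assumes that the subcommands of interest are run, restart and get-config, that the workflow directory is already known, and that a guard variable (CYLC_PIPENV_REINVOKED, a made-up name) is used to stop the re-invoked process from looping. The only pipenv features relied on are the pipenv run command and the PIPENV_PIPFILE environment variable.

```python
import os
import subprocess
from pathlib import Path
from typing import Dict, List, Optional, Sequence

# Subcommands that parse the workflow definition (assumed set).
REINVOKE_SUBCOMMANDS = {"run", "restart", "get-config"}

# Hypothetical guard variable, not a real Cylc setting: it stops the
# re-invoked process from re-invoking itself again.
GUARD_VAR = "CYLC_PIPENV_REINVOKED"


def plan_reinvocation(
    argv: Sequence[str],
    workflow_dir: Path,
    environ: Dict[str, str],
) -> Optional[List[str]]:
    """Return the command to run under pipenv, or None to carry on as normal."""
    if environ.get(GUARD_VAR) == "1":
        return None  # already running inside the pipenv-managed environment
    if len(argv) < 2 or argv[1] not in REINVOKE_SUBCOMMANDS:
        return None  # this subcommand does not parse the workflow definition
    if not (Path(workflow_dir) / "Pipfile").is_file():
        return None  # the workflow declares no extra dependencies
    return ["pipenv", "run", *argv]


def reinvoke(command: List[str], workflow_dir: Path) -> int:
    """Run the planned command and return its exit code."""
    env = dict(os.environ)
    env[GUARD_VAR] = "1"
    # PIPENV_PIPFILE tells pipenv which Pipfile to use regardless of the cwd.
    env["PIPENV_PIPFILE"] = str(Path(workflow_dir) / "Pipfile")
    return subprocess.run(command, env=env, check=False).returncode
```

Splitting the decision (plan_reinvocation) from the side effect (reinvoke) keeps the decision testable without pipenv installed, and leaves room for alternative back ends behind the same interface.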
Here's a quick example of how pipenv could be used:
The Cylc Pipfile plugin would re-invoke any cylc (run|restart|get-config) commands based on the presence of a Pipfile file.
poetry
Another tool with a very similar feature set to Pipenv; it too can work with virtual environments, create lock files, etc. It also has a streamlined system for PyPI publishing and other niceness.
Conclusion / Questions
[...] pipenv so the actual "work" involved here is not as great as the length of the discussion might make it seem.
[...] pipenv not work out.
Caveats
Users would still be coupled to the same Python version as the "parent" Cylc environment.
Python modules would get installed once for each workflow which uses them.
[...] npm/yarn.
[...] pipenv --rm, this represents a management overhead.
This would restrict users to installing stuff into Python virtual environments.
[...] pip rather than conda.
Questions
[...] pip installation sufficient?
[...] virtualenv, pipenv, heck even conda is almost an option - just a very heavyweight one which would require duplicating the whole env)