
How can I run my pipeline programmatically after I packaged my project? #370

Closed
f-istvan opened this issue May 14, 2020 · 10 comments

@f-istvan

I ran kedro package and distributed my Kedro project. Now, when I do pip install my_project, I get the package, but how can I run my pipeline from an external .py script? What should I import into my external Python file, and how do I run my pipeline?

Based on the generated kedro_cli.py I tried this (my_external_runner_script.py):

from pathlib import Path

from kedro.context import load_context
from kedro.runner import SequentialRunner


def run(
    tag=None,
    env=None,
    runner=SequentialRunner,
    node_names=None,
    to_nodes=None,
    from_nodes=None,
    from_inputs=None,
    load_version=None,  # None instead of mutable defaults ([] / {})
    pipeline=None,
    config=None,
    params=None,
):
    # Load the project context from the current working directory
    context = load_context(Path.cwd(), env=env, extra_params=params)
    context.run(
        tags=tag,
        runner=runner(),
        node_names=node_names,
        from_nodes=from_nodes,
        to_nodes=to_nodes,
        from_inputs=from_inputs,
        load_versions=load_version,
        pipeline_name=pipeline,
    )


if __name__ == "__main__":
    run()

The Path.cwd() is obviously wrong in this case, and I could not find any information about this in the documentation.

Could you please give me a hint how to do this?

Thank you,
Stefan

@mzjp2 (Contributor) commented May 14, 2020

@limdauto looks like one for you. I believe improvements and fixes for this are coming in the next release.

@limdauto (Contributor) commented May 14, 2020

Hi @f-istvan, since kedro package only packages Python code, you will need the following in your current working directory:

  • A .kedro.yml file with the content: context_path: <my_project>.run.ProjectContext
  • A conf/ directory with the configurations for your run.
  • Optionally a data/ directory if you use local data.

Then you can run your package with python -m <my_project>.run. No need for an external run script.

In the upcoming 0.16 release, we have removed the requirement for the .kedro.yml file and added more information on how to use this packaging mechanism.
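The required working directory can be sketched as a small scaffolding helper (a sketch for Kedro 0.15.x only; the project name my_project and the helper scaffold_run_dir are illustrations, not part of Kedro):

```python
# scaffold_run_dir.py -- hypothetical helper that creates the minimal layout
# Kedro 0.15.x expects in the working directory of a packaged project.
from pathlib import Path


def scaffold_run_dir(root, project_package="my_project"):
    root = Path(root)
    # .kedro.yml tells load_context where to find the ProjectContext class
    (root / ".kedro.yml").write_text(
        "context_path: {}.run.ProjectContext\n".format(project_package)
    )
    # conf/ holds catalog.yml, parameters.yml, etc.; data/ is optional
    for sub in ("conf/base", "conf/local", "data"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


# After scaffolding (and copying in your real conf/ files), run:
#   cd <root> && python -m my_project.run
```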

@f-istvan (Author)

Hi @limdauto,

Adding the files and directories you described fixed my issue: I can run my pipeline with python -m my_project.run. This is great! However, I still want to run the pipeline from a Python script in my environment; it is just more convenient than a bash script. I understand that I could use subprocess, for example, but something like this would be more useful to me (my_external_runner_script.py):

from my_project import my_pipeline_runner

# I can build my_params here dynamically with python

# and run the pipeline programmatically
my_pipeline_runner(my_params ...)

It's the CLI (python -m my_project.run) versus a script (python my_external_runner_script.py).
I hope this is a valid point and makes sense.
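One way to get close to this today could be to ship a small wrapper inside the package that anchors the project path explicitly instead of relying on Path.cwd() (a sketch only: the module name programmatic and the function run_pipeline are hypothetical, not part of Kedro's API; load_context and context.run are the real 0.15.x entry points used above):

```python
# my_project/programmatic.py -- hypothetical wrapper shipped with the package
from pathlib import Path


def run_pipeline(project_path=None, env=None, extra_params=None, pipeline_name=None):
    """Run the packaged project's pipeline from any external Python script.

    ``project_path`` must point at a directory containing conf/ (and, on
    Kedro 0.15.x, .kedro.yml); it defaults to the current working directory.
    """
    # Imported lazily so merely importing this wrapper does not require a
    # Kedro project to be discoverable at import time.
    from kedro.context import load_context

    project_path = Path(project_path) if project_path else Path.cwd()
    context = load_context(project_path, env=env, extra_params=extra_params)
    return context.run(pipeline_name=pipeline_name)


# Usage from my_external_runner_script.py (paths and params are illustrative):
#   from my_project.programmatic import run_pipeline
#   run_pipeline("/deployments/my_project", extra_params={"run_date": "2020-05-14"})
```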

Thank you!

@WaylonWalker (Contributor)

@limdauto Can you explain the reasoning behind keeping conf/ and .kedro.yml outside of the Python library? I am a bit confused about why they aren't inside, so that kedro package could produce standalone libraries.

@limdauto (Contributor)

@f-istvan Oh, this is a very interesting use case. I thought calling load_context(Path.cwd()) in the same directory as conf/ and .kedro.yml would give you a usable context to run your pipeline programmatically. What error do you see?

@WaylonWalker That's a good question. We touched on this in our newly updated documentation about packaging: https://kedro.readthedocs.io/en/latest/03_tutorial/05_package_a_project.html?highlight=package#package-your-project

Please note that this packaging method only contains the Python source code of your Kedro pipeline, not any of the conf/, data/ and logs/ directories. To successfully run the packaged project, you still need to be inside a directory that contains these sub-directories. This allows you to distribute the same source code but run it with different configuration, data and logging locations in different environments.

This mental model unfortunately breaks when it comes to .kedro.yml, so in 0.16 we removed the need for .kedro.yml to be manually created altogether. I hope this makes sense.

@limdauto limdauto self-assigned this May 14, 2020
@WaylonWalker (Contributor)

Would it be possible to package up the catalog? I have some folks who do a lot of hypothesis testing: they want fast access to data, have very little interest in building or running pipelines, and would like the catalog for certain projects to be easily portable. They would really like to be able to pip install amazing_pipeline and start running hypotheses against data loaded off the catalog.

Personally, I have a couple of pipelines that reuse the same sections of pipeline over and over. I would really like to make those their own package, import them into new projects, and append the pipeline and catalog to the new pipeline with no more than a small bulk change to the path or S3 bucket.

@limdauto (Contributor)

@WaylonWalker Yes, you can, but it's a bit more involved. In setup.py there is an option to add package data, which is essentially non-Python files, to your distribution: https://setuptools.readthedocs.io/en/latest/setuptools.html#including-data-files

The problem, however, is that your package data needs to be at the same level as your setup.py or below, while our conf/ is a layer above. So, as a workaround, if you want to bundle your conf/ with your source distribution, you could modify your setup.py to enable the include_package_data functionality and modify the kedro_cli.package command to:

  • Move everything under src/ to a /tmp location
  • Move your conf/ to that exact location

Then you should be able to bundle your configuration with your binary distribution. It's not what we would recommend as the default setup, but it should work.
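A sketch of what the modified setup.py could contain (include_package_data and package_data are real setuptools options from the link above; the project name my_project and the conf/base/*.yml glob paths are assumptions that depend on where your packaging step copied conf/):

```python
# setup.py -- sketch for bundling conf/ inside the distribution, assuming
# conf/ has already been moved under the package sources as described above.
from setuptools import find_packages

setup_kwargs = dict(
    name="my_project",           # hypothetical project name
    version="0.1.0",
    packages=find_packages(exclude=["tests"]),
    # Ship non-Python files that live inside the package directory
    # (they must also be listed in MANIFEST.in for source distributions).
    include_package_data=True,
    package_data={"my_project": ["conf/base/*.yml", "conf/local/*.yml"]},
)

# In the real setup.py this call runs at module level:
# from setuptools import setup
# setup(**setup_kwargs)
```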

@WaylonWalker (Contributor)

Thanks @limdauto, I got everything fully working from a pip-installed wheel. I moved conf/ and .kedro.yml into the library, included them in the manifest, and modified some path variables; it was much easier than I anticipated.

This is a big move for us. I do not see as many use cases for using different catalogs as I do for folks who want to pip install a project and get access to all of its data. For the most part, I think I can handle dev/prod more easily with a transformer than by walking everyone through getting the catalog working for hypothesis testing.

@limdauto (Contributor)

@f-istvan I'm going to close this, but please feel free to re-open it if you still have problems.

@mark-einhorn-1987

@f-istvan did you ever manage to find a solution to this, or did you revert to using something like subprocess? I have run into a similar sort of issue. Would be good to know. Cheers!
