
How can I run my pipeline programmatically after I packaged my project? #370

Closed
f-istvan opened this issue May 14, 2020 · 10 comments

@f-istvan

I ran kedro package and distributed my Kedro project. Now, when I do pip install my_project, I get the package, but how can I run my pipeline from an external .py script? What should I import into my external Python file, and how do I run my pipeline?

Based on the generated kedro_cli.py I tried this (my_external_runner_script.py):

from pathlib import Path

from kedro.context import load_context
from kedro.runner import SequentialRunner


def run(
    tag=None,
    env=None,
    runner=SequentialRunner,
    node_names=None,
    to_nodes=None,
    from_nodes=None,
    from_inputs=None,
    load_version=None,  # None instead of mutable defaults ([] / {})
    pipeline=None,
    config=None,
    params=None,
):
    # Load the project context from the current working directory
    context = load_context(Path.cwd(), env=env, extra_params=params)
    context.run(
        tags=tag,
        runner=runner(),
        node_names=node_names,
        from_nodes=from_nodes,
        to_nodes=to_nodes,
        from_inputs=from_inputs,
        load_versions=load_version,
        pipeline_name=pipeline,
    )


if __name__ == "__main__":
    run()

The Path.cwd() is obviously wrong in this case, and I could not find any information about this in the documentation.

Could you please give me a hint how to do this?

Thank you,
Stefan

@mzjp2 (Contributor) commented May 14, 2020

@limdauto looks like one for you. I believe improvements and fixes for this are coming in the next release.

@limdauto (Contributor) commented May 14, 2020

Hi @f-istvan, since kedro package only packages Python code, you will need the following in your current working directory:

  • A .kedro.yml file with the content: context_path: <my_project>.run.ProjectContext
  • A conf/ directory with the configurations for your run.
  • Optionally a data/ directory if you use local data.

Then you can run your package with python -m <my_project>.run. No need for an external run script.

In the upcoming 0.16 release, we have removed the requirement for the .kedro.yml file and added more information on how to use this packaging mechanism.
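The required working directory can be sketched as a small scaffolding helper (a sketch for Kedro 0.15.x only; the project name my_project and the helper scaffold_run_dir are illustrations, not part of Kedro):

```python
# scaffold_run_dir.py -- hypothetical helper that creates the minimal layout
# Kedro 0.15.x expects in the working directory of a packaged project.
from pathlib import Path


def scaffold_run_dir(root, project_package="my_project"):
    root = Path(root)
    # .kedro.yml tells load_context where to find the ProjectContext class
    (root / ".kedro.yml").write_text(
        "context_path: {}.run.ProjectContext\n".format(project_package)
    )
    # conf/ holds catalog.yml, parameters.yml, etc.; data/ is optional
    for sub in ("conf/base", "conf/local", "data"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


# After scaffolding (and copying in your real conf/ files), run:
#   cd <root> && python -m my_project.run
```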

@f-istvan (Author)

Hi @limdauto,

Adding the files and directories you described fixed my issue: I can run my pipeline with python -m my_project.run. This is great! However, I still want to run the pipeline from a Python script in my environment; it is just more convenient than a bash script. I understand that I could use subprocess, for example, but something like this would be more useful to me (my_external_runner_script.py):

from my_project import my_pipeline_runner

# I can build my_params here dynamically with python

# and run the pipeline programmatically
my_pipeline_runner(my_params ...)

It's the CLI (python -m my_project.run) versus a script (python my_external_runner_script.py).
I hope this is a valid point and makes sense.
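One way to get close to this today could be to ship a small wrapper inside the package that anchors the project path explicitly instead of relying on Path.cwd() (a sketch only: the module name programmatic and the function run_pipeline are hypothetical, not part of Kedro's API; load_context and context.run are the real 0.15.x entry points used above):

```python
# my_project/programmatic.py -- hypothetical wrapper shipped with the package
from pathlib import Path


def run_pipeline(project_path=None, env=None, extra_params=None, pipeline_name=None):
    """Run the packaged project's pipeline from any external Python script.

    ``project_path`` must point at a directory containing conf/ (and, on
    Kedro 0.15.x, .kedro.yml); it defaults to the current working directory.
    """
    # Imported lazily so merely importing this wrapper does not require a
    # Kedro project to be discoverable at import time.
    from kedro.context import load_context

    project_path = Path(project_path) if project_path else Path.cwd()
    context = load_context(project_path, env=env, extra_params=extra_params)
    return context.run(pipeline_name=pipeline_name)


# Usage from my_external_runner_script.py (paths and params are illustrative):
#   from my_project.programmatic import run_pipeline
#   run_pipeline("/deployments/my_project", extra_params={"run_date": "2020-05-14"})
```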

Thank you!

@WaylonWalker (Contributor)

@limdauto Can you explain the reasoning behind keeping conf/ and .kedro.yml outside of the Python library? I am a bit confused about why they aren't inside, so that kedro package could produce standalone libraries.

@limdauto (Contributor)

@f-istvan Oh, this is a very interesting use case. I thought calling load_context(Path.cwd()) in the same directory as conf/ and .kedro.yml would give you a usable context to run your pipeline programmatically. What error do you see?

@WaylonWalker That's a good question. We touched on this in our newly updated documentation about packaging: https://kedro.readthedocs.io/en/latest/03_tutorial/05_package_a_project.html?highlight=package#package-your-project

Please note that this packaging method only contains the Python source code of your Kedro pipeline, not any of the conf/, data/ and logs/ directories. To successfully run the packaged project, you still need to be inside a directory that contains these sub-directories. This allows you to distribute the same source code but run it with different configuration, data and logging locations in different environments.

This mental model unfortunately breaks when it comes to .kedro.yml, so in 0.16 we removed the need for .kedro.yml to be manually created altogether. I hope this makes sense.

@limdauto limdauto self-assigned this May 14, 2020
@WaylonWalker (Contributor)

Would it be possible to package up the catalog? I have some folks who do a lot of hypothesis testing: they want fast access to data, have very little interest in building or running pipelines, and would like the catalog for certain projects to be easily portable. They would really like to be able to pip install amazing_pipeline and start running hypotheses against data loaded off the catalog.

Personally, I have a couple of pipelines that reuse the same sections of pipeline over and over. I would really like to make those their own package, import them into new projects, and append the pipeline and catalog to the new pipeline with no more than a small bulk change to the path or S3 bucket.

@limdauto (Contributor)

@WaylonWalker Yes, you can, but it's a bit more involved. In setup.py there is an option to add package data, which is essentially non-Python files, to your distribution: https://setuptools.readthedocs.io/en/latest/setuptools.html#including-data-files

The problem, however, is that your package data needs to be at the same level as your setup.py or below, while our conf/ is a layer above. So, as a workaround, if you want to bundle your conf/ with your source distribution, you could modify your setup.py to enable the include_package_data functionality and modify the kedro_cli.package command to:

  • Move everything under src/ to a /tmp location
  • Move your conf/ to that exact location

Then you should be able to bundle your configuration with your binary distribution. It's not what we would recommend as the default setup, but it should work.
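A sketch of what the modified setup.py could contain (include_package_data and package_data are real setuptools options from the link above; the project name my_project and the conf/base/*.yml glob paths are assumptions that depend on where your packaging step copied conf/):

```python
# setup.py -- sketch for bundling conf/ inside the distribution, assuming
# conf/ has already been moved under the package sources as described above.
from setuptools import find_packages

setup_kwargs = dict(
    name="my_project",           # hypothetical project name
    version="0.1.0",
    packages=find_packages(exclude=["tests"]),
    # Ship non-Python files that live inside the package directory
    # (they must also be listed in MANIFEST.in for source distributions).
    include_package_data=True,
    package_data={"my_project": ["conf/base/*.yml", "conf/local/*.yml"]},
)

# In the real setup.py this call runs at module level:
# from setuptools import setup
# setup(**setup_kwargs)
```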

@WaylonWalker (Contributor)

Thanks @limdauto, I got everything fully working from a pip-installed wheel. I moved conf/ and .kedro.yml into the library, included them in the manifest, and modified some path variables; it was much easier than I anticipated.

This is a big move for us. I do not see as many use cases for using different catalogs as I do for folks who want to pip install a project and get access to all of its data. For the most part, I think I can handle dev/prod more easily with a transformer than by walking everyone through getting the catalog working for hypothesis testing.

@limdauto (Contributor)

@f-istvan I'm going to close this, but please feel free to re-open it if you still have problems.

@mark-einhorn-1987

@f-istvan did you ever manage to find a solution to this, or did you revert to using something like subprocess? I have run into a similar sort of issue. Would be good to know. Cheers!
