Improve Kedro run as a Package (2023) #3237

Closed
4 tasks done
noklam opened this issue Oct 27, 2023 · 9 comments


@noklam
Contributor

noklam commented Oct 27, 2023

Context

I sat down with @idanov today to recall what we remember about #1423, which was created by @antonymilne. Resolving it would make Kedro easier to run in more environments (particularly Databricks) and could simplify our Databricks documentation.

#1423 summarizes how Kedro run is currently supported

A user can run their kedro project in several ways. Note that the run command executed can be defined on the framework side or overridden in turn by a plugin or a project cli.py (done by _find_run_command).

  1. Kedro CLI: kedro run. This is the only route that doesn't go through the project __main__.py; instead it goes through kedro.framework.main, which builds the CLI tree and does something like _find_run_command
  2. CLI but not kedro CLI: python -m spaceflights. This hits the project __main__.py and will call main
  3. CLI but not kedro CLI: python src/spaceflights. This is just a more unusual way of doing 2
  4. Inside your own python script: from spaceflights.__main__ import main; main(), then run the script using python
  5. Inside IPython/Jupyter with no kedro ipython extension: same as 4 but you run it in the notebook
  6. Inside IPython/Jupyter with kedro ipython extension: same as 5 but you can also do session.run (which is what we have advertised as the way to do a kedro run in the past)

All of the above must be run from within the project root or kedro won't be able to find your conf. Options 2 onwards need you to have pip installed your project or to have src in your PYTHONPATH. Note that having the project pip installed could mean first running kedro package and then pip installing the resulting .whl file, or it could mean just running pip install ./src from your project root; it doesn't make a difference.
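The override order mentioned above (project cli.py first, then a plugin, then the framework default) can be sketched as follows. This is a simplified illustration and not Kedro's actual _find_run_command; the function name resolve_run_command and its parameters are hypothetical, and the real lookup imports modules and loads plugin entry points rather than receiving objects directly.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Optional


def resolve_run_command(
    project_cli: Optional[object],
    plugin_commands: Iterable[object],
    framework_run: Callable,
) -> Callable:
    """Return the `run` callable with the highest precedence (hypothetical helper)."""
    # 1. A project-level cli.py wins if it exists and defines `run`.
    if project_cli is not None and hasattr(project_cli, "run"):
        return project_cli.run
    # 2. Otherwise use the first plugin that exposes a `run` command.
    for plugin in plugin_commands:
        if hasattr(plugin, "run"):
            return plugin.run
    # 3. Fall back to the framework's default run command.
    return framework_run


# Stand-ins for a project cli.py, an installed plugin and the framework default.
def framework_run() -> str:
    return "framework"


plugin = SimpleNamespace(run=lambda: "plugin")
project = SimpleNamespace(run=lambda: "project")
```

With these stand-ins, resolve_run_command(project, [plugin], framework_run) returns the project's run, and dropping the project and the plugin in turn falls back to the plugin and then to the framework default.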

In summary, #1423 attempts to fix 3 things, which we can break down.

Added one more:

@noklam
Contributor Author

noklam commented Feb 28, 2024

So I see two different ways forward:

I. Improve the current approach with CLI entrypoints - Parent #3237

It consists of 3 sub-tasks:

II. Use KedroSession everywhere

The current entrypoint looks like this:
https://github.com/kedro-org/kedro-starters/blob/87715c48977bdfe1c64dd1d924a9f3e6c0933951/pandas-iris/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/__main__.py#L39C2-L43C25

def main(*args, **kwargs):
    package_name = Path(__file__).parent.name
    configure_project(package_name)
    run = _find_run_command(package_name)
    run(*args, **kwargs)

I want to highlight that the CLI approach is the only way to make cli.py work in packaged mode. #2384 suggests removing cli.py, but that would be a breaking change. Is this something we want to do?

If we give up cli.py, we can swap _find_run_command for KedroSession. This would easily avoid #2681, #2682 and #3051, but it almost means that there will be no alternatives for extending the Kedro CLI.

(Roughly like this)

def main(*args, **kwargs):
    package_name = Path(__file__).parent.name
    configure_project(package_name)
    session = KedroSession.create() # need to handle `env` and `extra_params`
    result = session.run(*args, **kwargs)

I remember that #2169 (comment) mentions he extended the Kedro CLI for this particular reason, though kedro-boot may be the latest approach? Cc @takikadiri

@astrojuanlu @merelcht I'd like to get some feedback about this.

Lastly, Databricks is not the only issue. Solving this will improve the integration of Kedro with downstream applications, though it is not a complete solution. The philosophy so far has focused on running the pipeline once from a wheel, but recent discussions suggest that users want to run pipelines repeatedly, so this won't help kedro-boot.

@astrojuanlu
Member

I'd like to explore the idea of using the KedroSession everywhere, because it moves us towards making the Session more usable.

In the end, users will want to define their own scripts, ways of launching the KedroSession, and so on.

but it almost means that there will be no alternatives for extending the Kedro CLI.

Could you explain this in a bit more detail?

@noklam
Contributor Author

noklam commented Mar 8, 2024

but it almost means that there will be no alternatives for extending the Kedro CLI.

Could you explain this in a bit more detail?

cli.py is a way to extend the Kedro CLI at the project level. For example, you can add new options in <python_package>/cli.py. The existing packaged Kedro project has a __main__ entrypoint, which loads this cli.py if it exists. KedroSession will omit this part completely.

Follow-up question:

I'd like to explore the idea of using the KedroSession everywhere, because it moves us towards making the Session more usable.

In the end, users will want to define their own scripts, ways of launching the KedroSession, and so on.

I see you mentioned we should do #2682, do you think we need to do both?

@astrojuanlu
Member

I see you mentioned we should do #2682, do you think we need to do both?

Oh, I commented on #2682 after I saw your comment on #3680. Is there any leftover?

To clarify, I was withdrawing my opposition in case we ever have to do that.

For this particular issue, I still think pursuing the KedroSession approach is better.

The existing packaged Kedro project has a __main__ entrypoint, which loads this cli.py if it exists. KedroSession will omit this part completely.

Yes, that's what I understood. Isn't it possible to extend the CLI developing a normal plugin, like kedro airflow and kedro docker do? Although now that I mention it, I have no idea how that interacts with python -m packaged_kedro.

@noklam
Contributor Author

noklam commented Mar 8, 2024

Yes, that's what I understood. Isn't it possible to extend the CLI developing a normal plugin, like kedro airflow and kedro docker do? Although now that I mention it, I have no idea how that interacts with python -m packaged_kedro.

Correct, with a plugin you can extend the CLI and add subcommands, but you cannot change existing Kedro CLI commands. cli.py in particular can change any kedro command, most commonly the run command. For example, you can add new arguments like kedro run --my-custom-arg <value>. This isn't possible with the plugin mechanism (and it would be very bad if it were).

@astrojuanlu
Member

Got it. Assuming something like this:

def main(*args, **kwargs):
    package_name = Path(__file__).parent.name
    configure_project(package_name)
    session = KedroSession.create() # need to handle `env` and `extra_params`
    result = session.run(*args, **kwargs)

(from your comment above)

Can't the users add any CLI arguments they want, with argparse, click, fire, tyro, or anything else?
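A minimal sketch of that idea, assuming a packaged project whose __main__.py parses its own arguments with argparse and forwards them to a KedroSession. The --my-custom-arg flag, the helper names build_parser and run_with_session, and the run_pipeline injection hook are hypothetical. Only configure_project, KedroSession.create(env=...) and session.run(pipeline_name=...) are Kedro calls, and packaged-mode concerns such as locating conf or passing extra_params are not handled here.

```python
import argparse
from pathlib import Path
from typing import Any, Callable, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Project-owned argument parser; every flag here is chosen by the project."""
    parser = argparse.ArgumentParser(description="Run a packaged Kedro project")
    parser.add_argument("--env", default=None, help="Kedro configuration environment")
    parser.add_argument("--pipeline", default=None, help="Name of the pipeline to run")
    parser.add_argument("--my-custom-arg", default=None, help="Project-specific option")
    return parser


def run_with_session(
    env: Optional[str], pipeline_name: Optional[str], my_custom_arg: Optional[str]
) -> Any:
    # Imported lazily so that argument parsing works even without Kedro installed.
    from kedro.framework.project import configure_project
    from kedro.framework.session import KedroSession

    configure_project(Path(__file__).parent.name)
    # my_custom_arg is not consumed by Kedro; the project decides what to do with it.
    with KedroSession.create(env=env) as session:
        return session.run(pipeline_name=pipeline_name)


def main(
    argv: Optional[Sequence[str]] = None,
    run_pipeline: Callable[..., Any] = run_with_session,
) -> Any:
    args = build_parser().parse_args(argv)
    return run_pipeline(
        env=args.env,
        pipeline_name=args.pipeline,
        my_custom_arg=args.my_custom_arg,
    )
```

Keeping the parser separate from the function that opens the session means the same main can be called from a script or a notebook with an explicit argument list, and the session call can be replaced in tests.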

@merelcht
Member

I think a redesign of the KedroSession is long overdue. However, I do think it goes beyond the issue addressed here of running Kedro as a package, and I'd like to explore what other workflows the session is or could be involved in. For example, there's #2169 (also mentioned above), we've talked about the role of the session in interactive workflows (notebooks), and of course there's the role of the session in experiment tracking. I guess any meaningful changes to the session are also likely to be breaking, and it would be a shame to postpone improving packaged Kedro until then.

Is it possible to do:

in a non-breaking way first and then tackle the KedroSession as a separate larger piece?

@takikadiri

takikadiri commented Mar 12, 2024

Enabling Kedro projects/packages to run programmatically was one of my main focuses while developing kedro-boot. I enumerated these three entry points for running Kedro:

  • Running kedro project with a CLI --> Kedro CLI run command,
  • Running kedro package with a CLI --> __main__ module that configures the project and reuses the run command
  • Running kedro package/project programmatically or as part of an interactive workflow --> This is missing. I don't think that the __main__.py should cover this. This could perhaps be solved by run_package/run_project functions provided by the framework rather than the template. The run_package/run_project functions could reuse the underlying function of the Kedro run command.

These three entry points could reuse the same run function (the one used by the Kedro run command). This is something I tried to achieve with kedro-boot while developing boot_project and boot_package. I faced some problems with click when trying to decouple the click command from its underlying function. I managed to solve them by giving up the explicit function signature and accepting just **kwargs, which are passed in by the click command or by whatever user interface reuses the function. Here is an example of such decoupling:

def run_function(**kwargs):
    # some args preprocessing (elided here); it is assumed to define
    # kedro_args, tuple_tags, tuple_node_names and runner from kwargs
    # Session creation & running
    with KedroSession.create(
        env=kedro_args.get("env", ""),
        extra_params=kedro_args.get("params", ""),
        conf_source=kedro_args.get("conf_source", ""),
    ) as session:
        return session.run(
            tags=tuple_tags,
            runner=runner,
            node_names=tuple_node_names,
            from_nodes=kedro_args.get("from_nodes", ""),
            to_nodes=kedro_args.get("to_nodes", ""),
            from_inputs=kedro_args.get("from_inputs", ""),
            to_outputs=kedro_args.get("to_outputs", ""),
            load_versions=kedro_args.get("load_versions", {}),
            pipeline_name=kedro_args.get("pipeline", ""),
            namespace=kedro_args.get("namespace", ""),
        )

@click.command(name="run", short_help="")
def run(**kwargs) -> Any:
    return run_function(**kwargs)

run_params = [
    click.option("--pipeline", type=str, help=""),
    click.option("--env", type=str, help=""),
    .
    .
]

for param in run_params:
    run = param(run)
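The "some args preprocessing" step is elided in the snippet above. As a rough sketch only, it could be a pure function that normalises raw keyword arguments, whether they come from click or from a programmatic caller, into the tuples that session.run receives. The helper names _as_tuple and preprocess_run_kwargs are hypothetical, the comma-separated input format is an assumption, and runner loading is left out because it depends on Kedro.

```python
from typing import Any, Dict, Iterable, Tuple, Union


def _as_tuple(value: Union[None, str, Iterable[str]]) -> Tuple[str, ...]:
    """Normalise None, a comma-separated string, or an iterable into a tuple."""
    if value is None or value == "":
        return ()
    if isinstance(value, str):
        return tuple(part.strip() for part in value.split(",") if part.strip())
    return tuple(value)


def preprocess_run_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Return a copy of kwargs plus tuple versions of `tags` and `node_names`."""
    kedro_args = dict(kwargs)
    # Assumption: tags and node names may arrive as "a,b" strings or as lists.
    kedro_args["tuple_tags"] = _as_tuple(kwargs.get("tags"))
    kedro_args["tuple_node_names"] = _as_tuple(kwargs.get("node_names"))
    return kedro_args
```

Because the function only depends on its keyword arguments, the click command, a __main__ module and a notebook can all call it in the same way before creating the session.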

@merelcht
Member

All subtasks are completed, so I'm closing this issue as complete as well! 🎉
