Packaged kedro pipeline does not work well on Databricks #1807
Comments
This is one of the things I have fixed in #1423. I really need to write it up so we can fix it properly (I have a better fix than that PR). Let me check the slack thread now anyway. |
Add some more comments to document what happened. In summary, there are 2 issues:
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

# Replace with the name of your packaged project
package_name = "<your_package_name>"
configure_project(package_name)

with KedroSession.create(package_name) as session:
    session.run()
|
A few more notes on this. The
Problem 2 is fixed by #1769:
|
This is an ongoing issue: https://www.linen.dev/s/kedro/t/13153013/exit-traceback-most-recent-call-last-file-databricks-python-#14d50244-c91a-4738-b0de-11b822eb181b
Looking at the traceback, this is the code the user was trying to run:
And filling the gaps, I think the
So, most likely the user is doing
This touches on #2384 and #2682. I think we should address this soon. |
In line with what I said in #2682 (comment),
|
This bug report is more than 1 year old. Can we get an updated view of what's happening here? We have 3 official workflows for Databricks:
On the other hand, in this page https://docs.kedro.org/en/stable/tutorial/package_a_project.html we're telling users to do this:
from <package_name>.__main__ import main
main(
    ["--pipeline", "__default__"]
)  # or simply main() if you don't want to provide any arguments
which in my opinion is conceptually broken, because we are abusing a CLI function to run it programmatically. @noklam called this above a "downgrade", but what's the problem with running the following?
with KedroSession.create(package_name) as session:
    session.run()
Should we update our docs instead? @idanov's #3191 seems to make that CLI function use click standalone mode conditionally, but in my opinion (1) it's a hack and (2) it doesn't really help us pinpoint why our users are running code that is broken instead of following the instructions from our docs. What am I missing?
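To make the "abusing a CLI function" point concrete, here is a minimal stdlib-only sketch. It does not use Kedro or click: `cli_main` is a hypothetical stand-in for a CLI entry point that, like a click command in standalone mode, finishes by calling `sys.exit()` even on success. `run_programmatically` is also a made-up helper name, shown only to illustrate what a caller has to do to get control back.

```python
import sys


def cli_main(args=None):
    """Hypothetical stand-in for a CLI entry point (not Kedro's actual main).

    Like a click command in standalone mode, it ends by calling sys.exit(),
    even when the run succeeded.
    """
    args = list(args or [])
    print(f"running with args: {args}")
    sys.exit(0)


def run_programmatically(args=None):
    """Call the CLI function from Python and convert SystemExit to a return code."""
    try:
        cli_main(args)
    except SystemExit as exc:
        # exc.code is None when sys.exit() is called without an argument;
        # treat that the same as a zero exit status.
        return 0 if exc.code is None else exc.code
    return 0


if __name__ == "__main__":
    exit_code = run_programmatically(["--pipeline", "__default__"])
    print(f"exit code: {exit_code}")
```

Without the `try`/`except`, the `SystemExit` propagates to whatever is hosting the call, which is why a function designed for the command line is awkward to reuse from a notebook or a job runner.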
|
@astrojuanlu The important thing here is the distinction between "Packaged project" and "Kedro Project". In a simplified world, , The benefit of having an entrypoint is that most deployment systems support this because it is just
with KedroSession.create(package_name) as session:
For this, I think this is a bad workaround as it shouldn't need a special entrypoint; #3191 should replace this. Honestly I don't hate I think we need to be pragmatic here: whether or not the CLI is the best way to do this is arguable, and it will require more discussion and design. Unless we plan to change this in the short term, I think we should at least move forward with #3191 and continue the CLI vs click wrapper discussion. |
Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.
I agree 💯 I understand then that the problem is that currently users are forgetting to do that step, and they expect the default entrypoint to just work. |
It is more generic. #3237 has more context, but #3191 goes beyond just Databricks: it also affects integrating Kedro with other downstream tasks (the CLI way doesn't work currently, so only KedroSession.run works). |
This has been my point all along: that this shouldn't be a "workaround", it should be our recommended way of doing it. I'm not blocking #3191 on this discussion, but I'm hoping we can have it some day. Reason being: running a Kedro project programmatically requires either abusing these CLI functions, or adding some magic ( |
In the past we had a few nice
It's probably worth documenting that for advanced users who would like to know how Kedro works internally and maybe retrieve the old |
|
I generated a reproducer that doesn't depend on Kedro. Source code: https://github.com/astrojuanlu/simple-click-entrypoint Job definition: Result: |
Reported this upstream https://github.com/MicrosoftDocs/azure-docs/issues/116964 |
Did a bit more debugging #3191 (review) the It's clear that passing |
The sorry state of this issue is as follows:
We need to keep brainstorming. I honestly think the best way forward would be to simplify the default |
Leave a comment here: #3237 (comment). This goes beyond Databricks so I want to keep the conversation on the broader topic. |
Useful bit of information I found in See:
(From https://stackoverflow.com/a/77904549) For whatever reason this doesn't work on VSCode notebooks though, they set a variable called Dropping this here in case it helps.
|
Nah, none of this works on Databricks 👎🏼 |
Originated from internal users https://quantumblack.slack.com/archives/C7JTFSFUY/p1661236540255789