Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parent] - Notebook/IPython Debug line magic %load_node feature discussion thread #3535

Closed
4 of 5 tasks
noklam opened this issue Jan 19, 2024 · 12 comments
Closed
4 of 5 tasks
Assignees

Comments

@noklam
Copy link
Contributor

noklam commented Jan 19, 2024

Description

This ticket is created to track and discuss the development of this feature.

What does this feature do?

In short, it tries to recreate the context of a pipeline error by eliminating manual steps as much as possible to provide better UX. It loads up the data, imports (and) function body in notebook cells, allowing user to quickly start debugging, plotting the data etc.

Image

Tracked Tickets:

Comments

I have collected some comments from a few users and these are some quotes categorised by themes.

IPython

But I agree, ipython version, since you're likely in your IDE, no need to bring in the function body. You can open the right file and dev there
HI. very nice! is it something that could work in kedro ipython also direclty ?

Support more platform

  • support platform:
    • Mainly notbeook & Lab
    • VSCode Notebook (not yet)
    • Databricks (not tested)

Two-way sync

Would be handy to debug, but you should do this in the IDE instead of a notebook. If you change the code, you still have to manually sync the changes to VCS
Look awesome
@Nok
. Would there be a path back from notebook into the project’s files (as it was with kedro jupyter convert )?

Recursive definition of function body

Also how would this work if your node function imports other functions which may import other functions? You only show where depth is 1?

Edge Case improvement

  • Handle imports that are not from top of the script, i.e. functions define in the same file as the node function which lives in locals().
  • Intermediate MemoryDataset
@noklam noklam converted this from a draft issue Jan 19, 2024
@datajoely
Copy link
Contributor

As stretch goal a runner like --runner=SequentialDebug where the last failed node spins up a notebook with the function body and data available to work out why it failed 🚀

@astrojuanlu
Copy link
Member

As stretch goal a runner like --runner=SequentialDebug where the last failed node spins up a notebook with the function body and data available to work out why it failed 🚀

This is actually a very cool idea - alternatively, instead of a notebook it could be a (pdb) session

@datajoely
Copy link
Contributor

ipdb!

@AhdraMeraliQB
Copy link
Contributor

AhdraMeraliQB commented Jan 31, 2024

Context for Tech Design - 31/1/24

Feature motivation - #1832

To summarise, debugging Kedro projects in notebooks is a painful process. One of the suggestions to make this easier for users included the proposal of a %load_node line magic, that would load a node from a user's Kedro project pipeline into notebook cells for easier debugging. The current MVP for this feature lives in #3510.

Steps to test the MVP

  1. Clone the PR an install your local Kedro version
  2. Navigate to an existing test project, or a directory where you wish to create one and follow kedro new. Make sure to select example code if not using a starter.
  3. Navigate into your project root
  4. Run kedro jupyter lab
  5. In the notebook, test the line magic with %load_node <node-name>

The current implementation

The line magic currently will load any imports in the source file of the node, the node's inputs (assumed to be declared in the catalog), and the body of the node function.

debug

This serves as an excellent starting point, the implementation successfully resolves:

  • Nested functions
  • One line functions
  • Messy/scattered import statements

However, it does not account/allow for:

  • References to objects out of node function scope but within the source-file
  • MemoryDataset inputs
  • kedro ipython and other platforms
  • Unnamed nodes (node name is not required when defining a node)

This brings us to the points of our discussion.

What makes a complete feature?

The focus of this discussion is to iron out the expectations for the first version of this feature, and possible future extensions. In this, there are four primary themes to consider. In this discussion, we must determine if a particular use case should be addressed now, later, much later, or never. Additionally, if something is to be done later, much later, or never, what will we do in the interim?

Consider: Extending the current implementation

Several edge cases are not handled by the line magic:

1. If a node input is a MemoryDataset

The current implementation assumes all inputs exist in the catalogue, and running the cell that loads the node inputs will result in an error if the input is not defined in the catalog. We have several options when addressing this:

a. Ignore it - expect users to understand the error.
Pro: No technical requirement. Con: Invites friction and confusion

b. Add a comment to the imports block explaining the above.
Pro: No technical requirement. Con: Requires users to edit their project catalog to continue debugging with the feature.

c. Resolve it with the line magic.
Pro: Frictionless for users. Con: Technically challenging - requires it's own investigation into how this could be implemented.

2. Node references an object defined within the source-file but outside of node scope

Including only imports and the node function means we don't have access to any helper functions defined inside the file but outside of the node function. Any references to these when the node function is loaded will result in an error.

a. Ignore it - expect users to understand the error.
Pro: No technical requirement. Con: Invites friction and confusion

b. Add from <node.source.module> import *
Pro: Frictionless for users. Con: Technically challenging - requires it's own investigation into how this could be implemented.

3. Return statements make unrunnable cells

When loading a node function, we omit the defining line. Including return statements then results in a syntax error when the cell is run.

image

a. Ignore it - expect users to understand the error.
Pro: No technical requirement. Con: Invites friction and confusion

b. Comment out the return statement
Pro: Simple Con: Introduces syntax errors if returns are within conditions, see:

x = yyy
if <some_flag):
  # return x
else:
  # return None

c. Comment out return statement and introduce print statement as replacement.
Pro: Simple to introduce - exists on #3568, eliminates errors. Con: Modifies node, user will have to fix this if copying the code back into their project files

Consider: Supporting Other Platforms

#3510 Only supports jupyter lab/notebook. #3536 introduces support for kedro ipython (Support across notebook/lab most IDE and platforms except databricks). What should we do about:

  1. Databricks notebooks
  2. VS Code

And if we choose not to support any platforms in the current iteration, how do we inform users? (Multiple selections allowed)

  1. Docs
  2. Release Notes
  3. Warn when trying to use the line magic in unsupported platform

Consider: Supporting the node-finding process by making changes to the default node name

As mentioned briefly above, node names are not required when defining a node, and are often omitted by users. However, the current implementation of the line magic searches for the specified by filtering the pipeline for matching node names, and so if the name is not defined, the %load_node will not be able to find the node.

We could address this by:

  1. Mentioning this in the ValueNotFoundError
    Pro: Trivial to add. Makes user aware of problem. Con: Requires user to modify project files to continue debugging in notebook

  2. Search all node.func.__name__ if node name is not found - see %load_node experiments #3568
    Pro: Solves problem. Easy to implement. Con: Ugly complexity. Behaviour diverges from kedro run --nodes/--to-nodes/--from-nodes

  3. Modify default node name - see Spike: proposal for better default node names #3575
    Pro: Solves issue across Kedro, not just for line magic. Con: technically breaking. Can we ship these changes together?

[Extra] Consider: The future of notebook debugging

There were some suggestions for the future of this feature, for completeness I am listing them here

  1. 2-way conversion (%load_node and %save_node ?) - do we bring back kedro jupyter convert? xref
  2. A SequentialDebug runner - xref
  3. Populating problematic functions when error results from function call - xref

References: main PR: #3510, original discussion: #1832, experiments: #3568, related: #3575, #3536

@astrojuanlu
Copy link
Member

astrojuanlu commented Jan 31, 2024

My 2 cents:

Extending the current implementation

2c. Bringing the full contents of the corresponding file, including the function definitions themselves
3d. Do not unpack the function body, and retain the function definition (see also point above)

Supporting Other Platforms

Frankly I'd ignore kedro ipython, kedro jupyter and shift the focus to "make this work where %load_ext kedro.ipython works". This doesn't work for pure REPLs, and also it's unclear how other non-Jupyter notebooks would behave (they're all mostly based on IPython, but then frontend code might differ significantly). But the point is that I wouldn't consider kedro ipython/jupyter a platform because of #2777

The future of notebook debugging

I would say let's not bring back kedro jupyter convert, it didn't work back then and there's no reason it will work now.

Having a %save_node is promising, but maybe we should broaden our perspective and have a %save_code that is more generic? Note that this is a big, unsolved problem in the Jupyter <-> IDE interaction as a whole.

@deepyaman
Copy link
Member

Extending the current implementation

  • Nit on 3c: should be a display() call?
  • @astrojuanlu's suggestion 3d also makes sense to me

Supporting Other Platforms

For a lot of these edge cases, as well as for other platforms, I think I'm OK with having some friction (and perhaps marking the feature as experimental for now). I think there's a risk of going too deep in this feature, without understanding how many people actually use it/how they use it.

@AhdraMeraliQB
Copy link
Contributor

AhdraMeraliQB commented Jan 31, 2024

Tech Design Summary

Link to Miro board

Now (part of #3510):

  • Create parent issue to track "later" actions and to be referenced in universal warning (below): %load_node line magic improvements #3580
  • Warn users when preparing node inputs that inputs need to be explicitly declared in the catalog
  • Handle return statements by commenting out and substituting print display() Include function definition and function call with node inputs - see comment below for why
  • Add a line when node not found about node name vs node function
  • Add a platform agnostic warning that this is an experimental feature and is only supports jupyter and ipython

Next:

Later

  • from node.source.file import * (for helper functions)
  • Support other platforms
  • Resolving memory datasets (can be ignored fully)
  • Update default node name - will have to go in next major release, needs to be unique. More discussion required.

@datajoely
Copy link
Contributor

Sad I couldn't join the design meeting but this looks great - thanks @noklam for initial innovation which now feels like a tangible reality and also @AhdraMeraliQB for making the decisions made easy to grok!

@astrojuanlu
Copy link
Member

One salient point I just realized about unpacking the function body and transforming return statements into print or display - we would be at risk of changing the logic of the code.

A silly example:

def foo():
    if True:
        return 1
    subprocess.check_output("rm -rf /")  # but `True is True` always right??

so foo() will absolutely always return 1.

However:

if True:
    display(1)
subprocess.check_output("rm -rf /")  # but `True is True` always right??

will nuke your filesystem.

Could we reconsider my proposal above

3d. Do not unpack the function body, and retain the function definition (see also point above)

? I think we never got to discuss it during the call (my fault, the conversation drifted exactly at that point and I forgot to bring it up)

@noklam noklam changed the title [Parent] - Jupyter Debug line magic thread [Parent] - Jupyter Debug line magic %load_node thread and bug reports Feb 1, 2024
@noklam noklam changed the title [Parent] - Jupyter Debug line magic %load_node thread and bug reports [Parent] - Jupyter Debug line magic %load_node feature discussion thread Feb 1, 2024
@noklam noklam changed the title [Parent] - Jupyter Debug line magic %load_node feature discussion thread [Parent] - Notebook/IPython Debug line magic %load_node feature discussion thread Feb 19, 2024
@astrojuanlu

This comment was marked as outdated.

@astrojuanlu

This comment was marked as off-topic.

@noklam
Copy link
Contributor Author

noklam commented Mar 5, 2024

Closing this as this is released in 0.19.3. Follow up improvement discussion will be document here: #3580

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

6 participants