Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when it get printed #1721

Closed
noklam opened this issue Jul 21, 2022 · 19 comments

Comments

@noklam
Copy link
Contributor

noklam commented Jul 21, 2022

Dataset discovery

Image

Running the variable in a notebook cell is the most common way to inspect a variable in Jupyter. Currently catalog give us an useless memory address, it would be much nicer if it prints out what's available, potentially just wrap it as catalog.list() for simplicity.

demo:

Image

Autocompletion

# Current Way - Autocompletion not possible
catalog.load("example_iris_data").head()

# Slightly more verbose way - but much easier to type with Tab-completion
catalog.datasets.example_iris_data.load().head()

Under the hood, auto-completion works with checking the dir(object) method

Possible solutions:

  1. Modify dir to shows datasets, so user can do catalog. + Tab, it will shows other method available in catalog though. We can probably makes datasets show up at the top of autocompletion too and others method at the bottom. I think it just has to be a list most likely. Of course we have to implement the __getattr__ so it actually looks at catalog.datasets too when we do catalog.dataset_name.

demo:
Image

  1. Implement a dict-like interface so when user do catalog[ + Tab, it will shows all the datasets.

Bonus

session and context also has useless __repr__, but they are not too useful in the interactive workflow so it has lower priority.

@antonymilne
Copy link
Contributor

LOVE this idea. I've added it to the interactive workflow milestone which I still need to populate a bit more since I've had related thoughts that I haven't had a chance to write down properly yet.

Just a few notes for now:

  • the correct mechanism to access dataset_name through catalog needs to be formalised anyway as per my comments in Improving the I/O transparency with kedro run #1691. Easy tab completion in the interactive workflow should definitely be a factor here. Note there's a complication that namespaced datasets have . in their name which might make access through getattr awkward. So we need to think quite carefully about what are the different options and the pros and cons here
  • one of the things we plan to add to the interactive workflow that would make this suggestion less important is the ability to open a notebook in the context of a node, e.g. %load_node node_name which would automatically populate Python variables containing all the input datasets pre-loaded. Tabbed completion of catalog would still be really cool, but not such high priority then
  • pipelines could probably also benefit from a better __repr__, but as you say catalog is definitely the most important
  • not important, but rich offers us __rich__ and __rich_repr__ which might be nice here to make things even snazzier. Ultimately I'd like all kedro objects (nodes, pipelines, datasets, etc.) to automatically have beautiful, clearly formatted str/repr, but that's some way off

@noklam
Copy link
Contributor Author

noklam commented Aug 2, 2022

the correct mechanism to access dataset_name through catalog needs to be formalised anyway as per my comments in #1691. Easy tab completion in the interactive workflow should definitely be a factor here. Note there's a complication that namespaced datasets have . in their name which might make access through getattr awkward. So we need to think quite carefully about what are the different options and the pros and cons here

I agree it becomes tricky once namespace is in the picture, in this case, I think a dict-like interface is more appropriate, similar to how we access parameters with nested .

one of the things we plan to add to the interactive workflow that would make this suggestion less important is the ability to open a notebook in the context of a node, e.g. %load_node node_name which would automatically populate Python variables containing all the input datasets pre-loaded. Tabbed completion of catalog would still be really cool, but not such high priority then

%load_node is an interesting idea. We have a similar discussion about this a few months ago, I think one of the challenges here is to make MemoryDataSet work, if it is written into catalog already then it would be straightforward. I used to rely on the notebook for debugging whenever I need to debug a large pipeline which I need to inspect the data. The reason I do that is because PyCharm Debug mode could get really slow when the data size is large, it should freeze even if you do print with a few lines of data. The alternative is to just run the pipeline up to the interested node, then I just copy and paste the code of the node and start debugging. The tricky part is to get the required data load properly.

It may be worth discuss what should session.run return, for interactive/debugging purpose, it may make sense to keep the intermediate memory dataset. (IRRC currently we return node's output they are not consumed by any other node.)

@antonymilne
Copy link
Contributor

I just came across this page which is well worth reading to get some more ideas:

  • there's ways to customise tab completion beyond just dir
  • there's special Jupyter repr methods that can render things more nicely than the default __repr__

@datajoely
Copy link
Contributor

you can also use __rich_repr__ now...

@noklam
Copy link
Contributor Author

noklam commented Aug 31, 2023

I am working on #2676 and I try to debug and it's hard. I want to re-purpose this issue so it's not Jupyter focus. If we want to make users use this as a component it need to have nicer public API and str. Right now it's hard to find out what is available without going through the source code.

In catalog.yml I have X_train@pyspark but when I try to print it I get this:

X_train__pyspark metadata: 
  None

@noklam noklam changed the title Improve Jupyter workflow with autocompletion and better __str__ __repr__ Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when it get printed Aug 31, 2023
@datajoely
Copy link
Contributor

Yeah the internal representation of namespaced datasets with a double underscore is really annoying - I hit this when doing an IDE prototype (#2821). It would be great if the catalog had a presentation-layer representation available in the public API

@noklam
Copy link
Contributor Author

noklam commented Aug 31, 2023

The clean up will go into this undefined milestone Redesign Catalog and Datasets. But before this happen we can still make the public interface nicer, maybe then deprecate or unify the rest later. Good shout about namespace I haven't try that.

@noklam noklam modified the milestones: Make it easier to use Kedro as a library, Using Kedro with existing projects Sep 1, 2023
@noklam
Copy link
Contributor Author

noklam commented Sep 4, 2023

@noklam
Copy link
Contributor Author

noklam commented Oct 23, 2023

One more, when worked with CST they often have a large catalog. A few things that I think is useful

  • catalog.list(<regex>) - not many people is aware that this is possible to help discover the name of the datasets.
  • catalog.datasets. (Tab) can use autocompletion (catalog.load doesn't)
    • More often I just do datasets = catalog.datasets, then I do datasets.my_dataset.load() which is much faster to type (Large catalog usually mean nested namespace and long entry). There is a separate problem of . get map to __ which isn't consistent and we need better design.

I was wondering could we make this better? Can we have catalog["xxx"] so it works like a dictionary and can use autocomplete easily.
Alternatively, if we prefer less API change, is it possible to make catalog.load("xxx") auto-completable?

Reference:

@noklam
Copy link
Contributor Author

noklam commented Oct 23, 2023

I did some more research and I couldn't find any auto-completion feature for catalog.load(" (Hit Tab). The autocompletion only works for catalog. (__dir__) or catalog[" (_ipython_completion or inherit the dict interface).

@noklam
Copy link
Contributor Author

noklam commented Oct 23, 2023

found a nicer solution to use _repr_pretty instead of __repr__. IMO overriding __repr__ isn't a big problem, but it's not consistent of what it should do.

I think the full feature need to be designed properly, but some of the non-breaking stuff like printing would be already useful and we can change it afterward. I would like to do this sooner than later.

@datajoely
Copy link
Contributor

You can also do a Rich specific repr https://rich.readthedocs.io/en/stable/pretty.html#typing

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Nov 4, 2023

Hello,
I've been playing around recently more "interactively" with Kedro, especially to introduce data analysts to kedro in notebooks and these features spontaneously arise from the conversation. As @noklam, I think this is already worth introducing a better __repr__ for Datasets and DataCatalog. Can I start a PR for this?

My two cents on the implementation details:

  • Autocomplete would be great, but the thread shows that this feature is much more complex to implement (particularly because of namespaces), so we should ship the pretty __repr__ first.
  • I am more in favor of implementing __repr__ rather than _repr_pretty or __rich_repr__ (at least at first) because I feel this can lead to inconsistent experience in a debugger / a script / a notebook, so I'd prefer adding complexity incrementally
  • I think it is fine to iterate frequently and change the repr "often" because it is not a breaking change and people do not rely on it for production code but rather for interactive use.
  • I think the _describe() is a good cancdidate to use in the __repr__ method for datasets which seems to be introduced specifically for this purpose. The drawback is that I think it is not always implemented (especially for user defined dataset) so we should have at least a fallback strategy. Maybe it is worth implementing this method as abstract in the AbstractDataset class in the future?
  • The DataCatalog.__repr__ may be simpler than just calling {dataset.__repr__ for datasets in catalog.datasets} because the print can be overcrowded, especially if there are a lot of datasets. On the other hand, just calling catalog.list() seems not informative enough. What do we want for the first dratf?
  • The "credentials" specificity is natively handled by pydantic, but if we don't want to introduce pydantic as dependency, we should handle it manually. It is possible to pass a callable with attrs so we might give it a shot. Notice that in the _describe() method most people don't show credentials, so it should be fine if we go with proxying this method to print the dataset.

@astrojuanlu
Copy link
Member

I agree with going ahead with __repr__ (especially in the face of #2928) and separating the autocomplete use case 👍🏽

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Nov 4, 2023

I've just started, and here is a fun fact : a__str__ method already exists in AbstractDataset, originally introduced in May 2019 🤯 , and it leverages the _describe method to achieve something very similar to xhat is described above. The current state of printing in a notebook is described hereafter :

from kedro.extras.datasets.pandas import CSVDataSet

ds = CSVDataSet(
    filepath=r"temp.csv",
    load_args={"sep": ";"},
)
ds
repr(ds)
print(ds)
str(ds)
Printing method Current result Result with a __repr__ implemented
ds <kedro.extras.datasets.pandas.csv_dataset.CSVDataSet at 0x1ab62e5cf70> result of __repr__
repr(ds) '<kedro.extras.datasets.pandas.csv_dataset.CSVDataSet object at 0x000001AB62E88C70>' result of __repr__
print(ds) CSVDataSet(filepath=temp.csv, protocol={'sep': ;}, save_args={'index': False}) same as current
str(ds) "CSVDataSet(filepath=temp.csv, protocol={'sep': ;}, save_args={'index': False})" same as current

This tutorial claims that:

The object representation is returned as a string by the Python method, repr(). This method is called when the object's repr() method is used. If feasible, the text returned should be a legitimate Python expression that may be used to recreate the object.

... which is what does our __str__ method. So it seems the current __str__ should be moved to __repr__ to make it more discoverable by users, and we can eventually discuss making a __str__ method later if necessary. WDYT?

@astrojuanlu
Copy link
Member

I confirm from the private repository (before open sourcing) that the __str__ method is there since forever basically.

Maybe we can just make __repr__ and __str__ do the same for now? I don't see the need to change __str__, as useless as it might be...

@astrojuanlu
Copy link
Member

Re-reading this:

# Current Way - Autocompletion not possible
catalog.load("example_iris_data").head()

# Slightly more verbose way - but much easier to type with Tab-completion
catalog.datasets.example_iris_data.load().head()

Would autocompletion of dataset names as strings be possible? See a similar thing for pandas DataFrame columns:

image

The problem with doing the dynamic properties is that some dataset names that are valid in YAML would become illegal in that way (same problem as with pandas columns) and also it would pollute the namespace of the DataCatalog (again, same problem)

@astrojuanlu
Copy link
Member

What is missing from this issue?

@merelcht
Copy link
Member

Closed because this was done in #3981

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

7 participants