Improving the I/O transparency with `kedro run` #1691

noklam · 2022-07-05T12:44:54Z

Introduction

With a highly parameterized configuration (Jinja, hydra or OmegaConf), it is not easy to troubleshoot data easily. Often, it is useful to get the full path so users can inspect the data manually. Currently, users need to hack into context and do yaml.dump to get this information.

i.e. s3://{base_path}//{special_parameter} -> Should be compile to s3://prefix/filename

Ultimately, the goal is to provide full transparency about the I/O within a kedro run, user should be able to get this information for logging or reproducing a particular experiment.

Background

Related Issues:

kedro compile - basically a feature to generate the compiled version of configuration at run-time, potentially also with catalog.dumps which is more suitable for Jupyter workflow. (It should log the full path in case relatively path is used)
Exposing load/save version in hooks or included these information in logs? #1580 - the run-time load_version isn't available to users with VersionedDataSet and it's something that we need to fix.
Enriching the logging message within kedro run - potentially with some DEBUG level message

Rollout strategy

There should be no breaking changes, 1 & 2 can be done in parallel. For 3 we can default with no changes and optionally expose more verbose logging.

The text was updated successfully, but these errors were encountered:

datajoely · 2022-07-05T13:11:42Z

I would like to say that dbt compile is a nice workflow that we could take inspiration from. They have two folders models and target this is essentially like src and dist for us in some ways. The target part is a neat way of showing what SQL will actually get passed to the database.

https://docs.getdbt.com/reference/commands/compile

antonymilne · 2022-07-06T20:22:20Z

The compiled filepath is actually already available in the data catalog, just it's quite hidden away: catalog._get_dataset("dataset_name")._filepath or catalog.datasets.dataset_name._filepath.

kedro compile: I understand the need for this and I think there's room for something here, but not sure that a new kedro compile command is the right way to handle it. While it's not breaking, adding a new CLI command is quite a big thing that's not so easily reversed. The correct solution here should become clearer as we solve the configuration question. e.g. OmegaConf.resolve does pretty much what we want here already, so if we use OmegaConf then the question would be how we want to expose that resolved config - maybe it would be more naturally done through a hook writing to a file rather than a separate command.
Agreed, but I think this and the heart of the issue here are just symptoms of a more general issue - see below.
This is actually already done with a DEBUG, just no one knows about it... It includes the filepath and the version information. We should maybe consider the idea I mentioned in Improve logging setup in Kedro #1461 about having a verbose flag that adapts logging level automatically and would make it more obvious that you can change verbosity of logs.

Fundamental issue

I think this and #1580 are actually just symptoms of a more fundamental underlying issue: the API and underlying workings of io.core and io.data_catalog are very confusing and should be rethought in general. These are very old components in kedro and maybe some of the decisions that were originally made about their design should be revised. I think there's also very likely to be old bits of code there that could now be removed or renamed (e.g. who would guess that something named add_feed_dict is used to add parameters to the catalog?). It feels like tech debt rather than intentional design currently.

I don't think they're massively wrong as it stands, but I think it would be a good exercise to go through them and work out exactly what functionality we should expose in the API and how we might like to rework them. e.g. in the case raised here there is quite a bit of confusion about how to get the filepath:

catalog.datasets is presumably the "official" route to get a dataset rather than _get_dataset, but catalog.datasets wouldn't allow a namespaced datasets to be accessed without doing getattr. There's some very subtle and non-obvious differences between datasets and _get_dataset, and then there's also catalog._data_sets (which I think might just be a historical leftover... but not sure).
it also seems at a glance that _filepath is only defined for versioned datasets (? seems weird)
to actually get the correct versioned filepath it's even harder - in our datasets we do get_filepath_str(self._get_load_path(), self._protocol) which is pretty obscure. Similar to Refactor load version logic #1654
Consider refactoring datasets, whether _exists, _release,_invalidate_cache belongs to the abstractclass or the implementation(Added by Nok)

So I think we should look holistically at the structures involved here and work out what the API should look like so there's one, clear way to access the things that people need to access. I actually don't think this is such a huge task. Then we can tell much more easily whether we need any new functionality in these structures (like a catalog.dumps) or whether it's just a case of making what we already have better organised, documented and clearer.

antonymilne · 2022-07-06T20:36:24Z

Curious what @deepyaman thinks of this. He may well have the honour of being the most familiar person playing around with io! So hopefully he can give a more informed evaluation of its current state, points of confusion and what could be improved. The above are just things I've observed, usually from trying to answer people questions about, e.g. "how do I find the filepath from the catalog", and then getting confused myself going through the code to try and work it out.

noklam · 2022-07-06T20:50:46Z

Generally agree, I don't think the implementation will be complicated, we just need to decide how we expose this feature to our user. The downside of using hook is it's generally hard to turn it off, you need to go into setting.py which is not a common user-facing file. You probably only need it once in a while for troubleshooting.
Agree
🤯 I have no idea about this

noklam added the Issue: Feature Request New feature or improvement to existing feature label Jul 5, 2022

noklam added this to Kedro Framework Jul 5, 2022

antonymilne mentioned this issue Jul 7, 2022

Xarray NetCDFDataSet kedro-org/kedro-plugins#165

Closed

antonymilne added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jul 13, 2022

merelcht added this to the Configuration overhaul milestone Jul 25, 2022

merelcht removed the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jul 27, 2022

antonymilne mentioned this issue Aug 2, 2022

Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when it get printed #1721

Closed

antonymilne mentioned this issue Aug 10, 2022

Re-design io.core and io.data_catalog #1778

Open

antonymilne mentioned this issue Sep 5, 2022

Refactor load version logic #1654

Open

noklam mentioned this issue Oct 6, 2022

Remove redundant resolve_load_version call #1911

Merged

5 tasks

merelcht added the Component: IO Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets label Oct 26, 2022

merelcht removed this from the Adding `OmegaConf` to configuration handling milestone Jan 20, 2023

merelcht added the Component: Configuration label Jan 20, 2023

kedro-org locked and limited conversation to collaborators Mar 27, 2024

merelcht converted this issue into discussion #3747 Mar 27, 2024

github-project-automation bot moved this to Done in Kedro Framework Mar 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This issue was moved to a discussion.

Improving the I/O transparency with `kedro run` #1691

Improving the I/O transparency with `kedro run` #1691

noklam commented Jul 5, 2022 •

edited

Loading

datajoely commented Jul 5, 2022 •

edited

Loading

antonymilne commented Jul 6, 2022 •

edited by noklam

Loading

antonymilne commented Jul 6, 2022

noklam commented Jul 6, 2022

This issue was moved to a discussion.

This issue was moved to a discussion.

Improving the I/O transparency with kedro run #1691

Improving the I/O transparency with kedro run #1691

Comments

noklam commented Jul 5, 2022 • edited Loading

Introduction

Background

Rollout strategy

datajoely commented Jul 5, 2022 • edited Loading

antonymilne commented Jul 6, 2022 • edited by noklam Loading

Fundamental issue

antonymilne commented Jul 6, 2022

noklam commented Jul 6, 2022

This issue was moved to a discussion.

Improving the I/O transparency with `kedro run` #1691

Improving the I/O transparency with `kedro run` #1691

noklam commented Jul 5, 2022 •

edited

Loading

datajoely commented Jul 5, 2022 •

edited

Loading

antonymilne commented Jul 6, 2022 •

edited by noklam

Loading