Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposing load/save version in hooks or included these information in logs? #1580

Open
noklam opened this issue May 30, 2022 · 7 comments · Fixed by #1911
Open

Exposing load/save version in hooks or included these information in logs? #1580

noklam opened this issue May 30, 2022 · 7 comments · Fixed by #1911
Labels
Component: Framework Issue/PR that addresses core framework functionality Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@noklam
Copy link
Contributor

noklam commented May 30, 2022

(Created by Nok, converted from Discord Discussion)

Desciption

A user want to have dataset load/save version logged, potentially like this

2022-05-30 10:01:09,039 - kedro.io.data_catalog - INFO - Loading data from `batch` (PartitionedDataSet) (version: 2022-05-15T05.24.31.017Z)
2022-05-30 10:01:18,467 - kedro.io.data_catalog - INFO - Saving data to `processed_batch` (ParquetDataSet)... (version: 2022-05-30T08.23.54.197Z)

This is not possible currently as kedro does not track this information, it should belongs to either

  1. Extra hook spec (after_dataset_loaded hook)
  2. Session store (experiment tracking)

How current load version is determined when version=None?

Currently, this load_version information is buried deep down in the framework, and it is determined only when a dataset is loaded at runtime.
The details of how "latest" version is in a method called resolve_load_version, which further calls _fetch_latest_load_version

Further Studies

For some reason, resolve_load_version is being called twice, L594 seems to be a leftover from historical refactoring, need further confirmation.

kedro/kedro/io/core.py

Lines 558 to 564 in b2e59fa

def _get_load_path(self) -> PurePosixPath:
if not self._version:
# When versioning is disabled, load from original filepath
return self._filepath
load_version = self.resolve_load_version()
return self._get_versioned_path(load_version) # type: ignore

self.resolve_load_version() # Make sure last load version is set

The refactor PR is here:
f03226e

@noklam noklam added Issue: Feature Request New feature or improvement to existing feature Component: Framework Issue/PR that addresses core framework functionality labels May 30, 2022
@antonymilne
Copy link
Contributor

antonymilne commented Jun 6, 2022

FYI I think it's already possible to achieve this since save_version and load_versions are available in after_catalog_created. So you can shift that information into the right after_dataset_loaded etc. hooks and recreate the load_version logic by doing something like load_version = load_versions.get("dataset_name", save_version).

That's obviously not a good solution though, and it would be nice to have a better way of doing it. Currently we expose dataset_name and data in the dataset hooks. If we want to add load/save_version then I wonder whether we should just expose the whole catalog.datasets[dataset_name]. I've seen people asking for filepath in these hooks before, so it might be best just to make the whole thing available rather than adding more and more arguments piecemeal. (Like the discussion about whether we should expose context or config_loader in a hook)

@noklam
Copy link
Contributor Author

noklam commented Jun 6, 2022

Is the load version information when load_version = None where it would just look up the latest version?

@antonymilne
Copy link
Contributor

antonymilne commented Jun 6, 2022

Yeah, exactly. From memory that's what load_version does? I could be wrong though...

Edit: I'm wrong. Looks like load_version actually does a glob to look through files and pick out the latest version. That makes more sense than what I said before. Forget what I said 😀

@noklam
Copy link
Contributor Author

noklam commented Jun 6, 2022

That's what I found out, that information wasn't expose to the hook at all, what we got is None only. So I think it is almost impossible to implement a solution currently.

@noklam
Copy link
Contributor Author

noklam commented Jun 6, 2022

Sorry I have written a more detailed issue originally but Github project actually convert my issue to a blank page so I missed this when I recreated the issue😅

I think this information would be quite valuable to experiment tracking, data versioning is one of the key for reproducible experiment. If this info is stored in session store we can eventually reproduce an experiment with the session_id and extract all the dataset that it used exactly.

@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jun 13, 2022
@merelcht
Copy link
Member

merelcht commented Jun 29, 2022

Notes from Technical Design session:

It was agreed that the code for loading the latest data/fetching the load version needs further refactoring. Inside kedro-viz we have replicated some of the logic to load the latest data for experiment tracking, which should be improved as well. To be done in:

After the refactoring is done, we should looking into whether we should expose the load/save version information and how to best do that.

@noklam
Copy link
Contributor Author

noklam commented Apr 3, 2023

This wasn't completed in #1911, closed by mistake

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Framework Issue/PR that addresses core framework functionality Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

3 participants