
Richer search for experiment tracking #1039

Closed
2 tasks
antoinebon opened this issue Aug 29, 2022 · 5 comments

Comments

@antoinebon

Description

In Kedro-Viz experiment tracking, I would like to be able to search my experiments based on the value of a variable that I am tracking, but it seems that the search currently only goes through the experiment metadata.

Context

In order to track how a particular dataset is generated, I generate a run hash — a hash of all the hyper-parameters that have an impact on the data-generating process — which I then track via Kedro-Viz experiment tracking. Since my datasets grow incrementally over time, I would like to be able to filter all the runs that have a particular hash, so that I can browse the history of those runs.
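The run-hash scheme described above could be sketched roughly as follows. The parameter names and hash length here are illustrative assumptions, not part of the original setup; the key point is that a canonical serialisation makes the hash independent of dict ordering.

```python
import hashlib
import json

def run_hash(params: dict) -> str:
    """Deterministically hash the hyper-parameters that affect data generation.

    json.dumps with sort_keys=True gives a canonical serialisation, so the
    same parameters always produce the same hash regardless of dict order.
    """
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Hypothetical hyper-parameters for a data-generating pipeline
params = {"window_days": 30, "min_samples": 100, "seed": 42}
# Key order does not matter:
assert run_hash(params) == run_hash({"seed": 42, "min_samples": 100, "window_days": 30})
```

The resulting short hash is what would be stored as a tracked value and later used as a search term.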

Checklist

  • Extend experiment tracking search functionality to support searching in tracked variables (via the web app interface)
  • Support matching on various data types (string, numeric, list, dict)
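A minimal sketch of what matching across those data types could look like. The matching rules here — substring for strings, exact value for numbers, recursive any-match for lists and dicts — are assumptions for illustration, not a spec of what Kedro-Viz would implement:

```python
def matches(tracked_value, query: str) -> bool:
    """Return True if a tracked value matches a free-text query.

    Strings match on case-insensitive substring, numbers on exact value,
    and lists/dicts match if any element (or dict value) matches recursively.
    """
    if isinstance(tracked_value, str):
        return query.lower() in tracked_value.lower()
    if isinstance(tracked_value, (int, float)):
        try:
            return float(query) == float(tracked_value)
        except ValueError:
            return False
    if isinstance(tracked_value, list):
        return any(matches(v, query) for v in tracked_value)
    if isinstance(tracked_value, dict):
        return any(matches(v, query) for v in tracked_value.values())
    return False
```

For example, `matches({"accuracy": 0.9, "tags": ["nightly"]}, "nightly")` would return True because the query hits an element of the nested list.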
@tynandebold tynandebold moved this to Inbox in Kedro-Viz Aug 30, 2022
@tynandebold tynandebold moved this from Inbox to Backlog in Kedro-Viz Sep 5, 2022
@tynandebold
Member

Thanks a lot for this request, @antoinebon! This is something we've been discussing within the team and hope to implement in the near future. We'll keep you updated here with any developments as we build this out.

@antonymilne
Contributor

Just to add to the above: this is absolutely a feature we should try to add, and it was always on the roadmap, but I think it's not going to be easy. In fact it's impossible until we've worked out what the role of the session store should be. All previous discussions on this assumed that we would use the SQLite database to help query by metric, but I know that @idanov is not a fan of that idea. He believes that the session store should be purely for metadata (e.g. author, run command, etc.) and should not store anything about metrics. It would be very useful to understand how search functionality would work in that scheme.

For reference so we have it in one place, this is where we have discussed this in the past: kedro-org/kedro#1070. The original issue https://github.com/quantumblacklabs/private-kedro/issues/1192 (not publicly accessible, sorry) has some additional comments (e.g. on SQLite) which are also useful for context. I definitely recommend skimming through the whole thing if you have access to it. Some relevant quotes below (all of which assume the SQLite model).

From @limdauto:

We only store the tracked dataset names in the session store. To get the tracked data itself, we can iterate through the list of tracked dataset names and load the data using catalog.load(tracked_dataset_name).
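A rough sketch of that scheme. Only `catalog.load(name)` is assumed from the real Kedro API; the stub below stands in for Kedro's `DataCatalog`, and the session-store contents are invented for illustration:

```python
class StubCatalog:
    """Minimal stand-in for Kedro's DataCatalog: resolves names to data."""

    def __init__(self, data: dict):
        self._data = data

    def load(self, name: str):
        return self._data[name]

def load_tracked_data(catalog, tracked_dataset_names: list[str]) -> dict:
    """Resolve tracked dataset names (as stored in the session store)
    to their actual values via catalog.load, one call per dataset."""
    return {name: catalog.load(name) for name in tracked_dataset_names}

# Hypothetical session-store contents: just the dataset names
catalog = StubCatalog({"metrics": {"accuracy": 0.92}, "params": {"seed": 42}})
tracked = load_tracked_data(catalog, ["metrics", "params"])
```

Note that every query pays one `catalog.load` per tracked dataset, which is exactly the cost concern raised further down in the thread.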

In the future, we will allow users to query by metrics. To that end, we need a metrics-friendly search index. At the very least, we need to set up an index in SQLite to do it: https://www.tutorialspoint.com/sqlite/sqlite_indexes.htm -- but there are other solutions, including an in-memory search index where we pay the cost up front when starting Viz, or we could even use a full-blown disk-based search index: https://whoosh.readthedocs.io/en/latest/index.html. There are pros & cons for each approach. I will write a separate design doc just for the metrics query. But that will be for a later iteration.
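For illustration, a minimal `sqlite3` sketch of such an index. The `run_metrics` schema (one row per run/metric pair) is hypothetical, not the actual Kedro-Viz session-store schema:

```python
import sqlite3

# Hypothetical schema: one row per (run, metric). An index on (name, value)
# lets queries like "accuracy > 0.8" use an index scan instead of scanning
# and deserialising every tracked dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE run_metrics (session_id TEXT, name TEXT, value REAL)")
conn.execute("CREATE INDEX idx_metric_name_value ON run_metrics (name, value)")
conn.executemany(
    "INSERT INTO run_metrics VALUES (?, ?, ?)",
    [("run-1", "accuracy", 0.92), ("run-2", "accuracy", 0.71), ("run-1", "f1", 0.85)],
)
rows = conn.execute(
    "SELECT session_id FROM run_metrics WHERE name = ? AND value > ?",
    ("accuracy", 0.8),
).fetchall()
# rows == [("run-1",)]
```

The trade-off discussed below still applies: the metric values are now duplicated between the database and the tracked datasets themselves.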

From @AntonyMilneQB:

The biggest problem with PAI was always the performance, which came from the limitation of mlflow's storage system that you mentioned. Do not underestimate how many kedro runs people do! Let's say you have a pipeline with 10 metrics, you're on a team of 10 people, each of whom runs the pipeline 10 times a day and logs to the same place. These are not unrealistic numbers (on Tiberius we did way more than this). Over the course of a month you'll have 3000 session_ids saved in the database, each of which contains 10 metrics (which could be one dataset or split across several).

Now let's say you want to find all the runs that have accuracy > 0.8. How would this perform? Presumably you need to catalog.load(dataset_name) every single dataset_name in the tracked_data table, even those that don't even contain accuracy. You'll be loading datasets that aren't even metrics. How long would that take for many thousands of json files? (Genuine question... I don't know)

I'm wondering whether it would be wise to speed up querying by including some other information in the tracked_data table, like the dataset type (which would allow you to load only tracking.MetricsDataSet datasets). Or we could just say that you can only ever query by numerical value and only need to include metrics datasets in the tracked_data table in the first place. Or should we just store the metric values directly in this table to avoid the catalog.load calls entirely? In that case you're duplicating information with the dataset, though, which isn't great.

I really don't have an idea of how performant the proposed scheme is, so maybe this is going to be a complete non-issue. I would just caution that people are going to end up with a lot of metrics stored over the course of a project, and we should have something that scales well to that.

From @limdauto:

In terms of technical performance, I'm still considering the pros and cons of whether to perform the search client-side or server-side. But I know for a fact we can do text search client-side up to thousands of rows easily. For millions of rows, you can employ an embedded in-memory search index like this one to help: https://github.com/techfort/LokiJS. I'm still debating, though.

@NeroOkwa
Contributor

Thanks a lot for this, @antoinebon. We are currently doing user testing for a new Kedro experiment tracking feature. Would you be interested in participating? If yes, kindly provide an email address.

@NeroOkwa
Contributor

NeroOkwa commented Nov 14, 2022

Context

This was a common feature request that came up in the experiment tracking user testing synthesis #1627

Users want to be able to filter values they are tracking for further investigation.

Supporting quotes

"But actually I haven't been able to find a way yet in Kedro-Viz to to filter on this value that I'm tracking”.

"Maybe some kind of way to filter runs by metric values, sort of like ones where the error rate was less than the 0.5 or something like that”.

@tynandebold
Member

We'll reopen this ticket if we look at this again downstream.

@tynandebold tynandebold closed this as not planned Jul 31, 2023
@github-project-automation github-project-automation bot moved this from Backlog to Done in Kedro-Viz Jul 31, 2023