Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MetricsDataSet doesn't work on S3 #1118

Closed
1 task done
tbeijloos opened this issue Oct 4, 2022 · 6 comments
Closed
1 task done

MetricsDataSet doesn't work on S3 #1118

tbeijloos opened this issue Oct 4, 2022 · 6 comments

Comments

@tbeijloos
Copy link

tbeijloos commented Oct 4, 2022

@noklam

Description

When using the MetricsDataSet with a filepath that refers to S3, kedro viz doesn't show any plots and cannot find the filepath. It seems like the MetricsDataSet doesn't support S3.

Context

When trying to use kedro viz for a MetricsDataset from S3, kedro viz didn't show any plots and neither did it show a filepath. When the dataset was saved to a local path, the plots show up.

Steps to Reproduce

  1. Save a MetricsDataset to S3
  2. Refer to this dataset in the Data Catalog as it being a MetricsDataSet with a filepath to S3
  3. Run kedro viz on this MetricsDataSet.

Expected Result

In kedro viz there should be a filepath for the dataset as well as plots showing.

Actual Result

there is no filepath, neither any plots.

-- FileNotFoundError: [WinError 3] The system cannot find the path specified: "project_name\\data\\09_tracking\\test_all_best_model.json"
Exception in ASGI application 

Your Environment

Include as many relevant details as possible about the environment you experienced the bug in:

  • Web browser system: Microsoft edge
  • kedro viz verson 5.1.1
  • kedro version 0.18.2
  • Python version used: 3.9

Checklist

  • Add a label to categorize this issue.
@noklam
Copy link
Contributor

noklam commented Oct 4, 2022

Notes:

  1. It only fails for tracking.MetricsDataSet
  2. only MetricsDataSet is using loaded_versioned_tracking_data

This is likely the problem since it works on the local filesystem but fails on s3, viz has it only logic to load the data instead of using the dataset's own _load method.

@staticmethod
def load_versioned_tracking_data(
filepath: str = None, num_versions: int = 10
) -> Optional[Dict[datetime, Any]]:
"""Load data for multiple versions of the metrics dataset
Args:
filepath: the path whether the dataset is located.
num_versions: the maximum number of past versions we want to load.
Returns:
A dictionary containing the version and the json data inside each version
"""
if not filepath:
return None # pragma: no cover
version_list = [
path
for path in sorted(Path(filepath).iterdir(), reverse=True)
if path.is_dir()
]
versions = {}
for version in version_list[:num_versions]:
try:
run_id = datetime.strptime(version.name, VERSION_FORMAT)
except ValueError:
logger.warning(
"""Expected run_id (timestamp) of format YYYY-MM-DDTHH:MM:SS.ffffff.
Skip when loading tracking data."""
)
continue
else:
path = version / Path(filepath).name
with open(path, encoding="utf8") as fs_file:
versions[run_id] = json_stdlib.load(fs_file)
return versions

@tynandebold tynandebold moved this to Inbox in Kedro-Viz Oct 4, 2022
@tynandebold tynandebold changed the title <MetricsDataSet doesn't work on S3> MetricsDataSet doesn't work on S3 Oct 4, 2022
@tynandebold tynandebold moved this from Inbox to Backlog in Kedro-Viz Oct 4, 2022
@antonymilne
Copy link
Contributor

Assuming this relates to the metadata side panel (and not the experiment tracking screen itself), this is a known issue: #1000. As explained there, I'm not sure how much time we should invest fixing it since it might well be removed pretty soon.

If we do want to fix it then that issue lists some other things that could be fixed at the same time also. For context, here's how I fixed the experiment tracking screen to work with s3:
#984 (comment)
38d7fb0

@noklam
Copy link
Contributor

noklam commented Oct 20, 2022

Coming back to this as I am working on something related to the version. It would be much simpler to use higher-level functions like dataset._get_load_path. It's a bit strange to me the 2 functions used two different logic to traverse the directory. It's unclear to me why this need to be re-implemented in kedro-viz, maybe I missed out some context?

@staticmethod
def load_latest_tracking_data(
dataset: Union["JSONDataSet", "MetricsDataSet"],
) -> Optional[Dict[str, float]]:
"""Load data for latest versions of the json dataset.
Below operation is also on kedro.io.core -> fetched_latest_load_version()
However it is a cached function and hence cannot be relied upon
Args:
dataset: the latest version of the metrics or json dataset
Returns:
A dictionary containing json data for the latest version
"""
pattern = str(dataset._get_versioned_path("*"))
version_paths = sorted(dataset._glob_function(pattern), reverse=True)
most_recent = next(
(path for path in version_paths if dataset._exists_function(path)), None
)
if not most_recent:
return None
with dataset._fs.open(most_recent, **dataset._fs_open_args_load) as fs_file:
return json.load(fs_file)
@staticmethod
def load_versioned_tracking_data(
filepath: str, num_versions: int = 10
) -> Optional[Dict[datetime, Any]]:
"""Load data for multiple versions of the metrics dataset
Args:
filepath: the path whether the dataset is located.
num_versions: the maximum number of past versions we want to load.
Returns:
A dictionary containing the version and the json data inside each version
"""
version_list = [
path
for path in sorted(Path(filepath).iterdir(), reverse=True)
if path.is_dir()
]
versions = {}
for version in version_list[:num_versions]:
try:
run_id = datetime.strptime(version.name, VERSION_FORMAT)
except ValueError:
logger.warning(
"""Expected run_id (timestamp) of format YYYY-MM-DDTHH:MM:SS.ffffff.
Skip when loading tracking data."""
)
continue
else:
path = version / Path(filepath).name
with open(path) as fs_file:
versions[run_id] = json.load(fs_file)
return versions

@tynandebold
Copy link
Member

I really can't say either way. Would it be hard for you to investigate why it's done this way in Viz and if it's easy to fix the issue there then do so?

@rashidakanchwala
Copy link
Contributor

We can close this issue. It's no longer valid as we have removed load_latest_tracking_dataset and load_versioned_tracking_dataset

@tynandebold

@github-project-automation github-project-automation bot moved this from Backlog to Done in Kedro-Viz Mar 20, 2023
@tynandebold
Copy link
Member

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

No branches or pull requests

5 participants