MetricsDataSet doesn't work on S3 #1118

tbeijloos · 2022-10-04T15:03:56Z

Description

When using the MetricsDataSet with a filepath that refers to S3, kedro viz doesn't show any plots and cannot find the filepath. It seems like the MetricsDataSet doesn't support S3.

Context

When trying to use kedro viz for a MetricsDataset from S3, kedro viz didn't show any plots and neither did it show a filepath. When the dataset was saved to a local path, the plots show up.

Steps to Reproduce

Save a MetricsDataset to S3
Refer to this dataset in the Data Catalog as it being a MetricsDataSet with a filepath to S3
Run kedro viz on this MetricsDataSet.

Expected Result

In kedro viz there should be a filepath for the dataset as well as plots showing.

Actual Result

there is no filepath, neither any plots.

-- FileNotFoundError: [WinError 3] The system cannot find the path specified: "project_name\\data\\09_tracking\\test_all_best_model.json"
Exception in ASGI application

Your Environment

Include as many relevant details as possible about the environment you experienced the bug in:

Web browser system: Microsoft edge
kedro viz verson 5.1.1
kedro version 0.18.2
Python version used: 3.9

Checklist

Add a label to categorize this issue.

The text was updated successfully, but these errors were encountered:

noklam · 2022-10-04T15:13:09Z

Notes:

It only fails for tracking.MetricsDataSet
only MetricsDataSet is using loaded_versioned_tracking_data

This is likely the problem since it works on the local filesystem but fails on s3, viz has it only logic to load the data instead of using the dataset's own _load method.

kedro-viz/package/kedro_viz/models/flowchart.py

Lines 615 to 648 in d670180

    
               @staticmethod 
        
               def load_versioned_tracking_data( 
        
                   filepath: str = None, num_versions: int = 10 
        
               ) -> Optional[Dict[datetime, Any]]: 
        
                   """Load data for multiple versions of the metrics dataset 
        
                   Args: 
        
                       filepath: the path whether the dataset is located. 
        
                       num_versions: the maximum number of past versions we want to load. 
        
                   Returns: 
        
                       A dictionary containing the version and the json data inside each version 
        
                   """ 
        
                   if not filepath: 
        
                       return None  # pragma: no cover 
        
                   version_list = [ 
        
                       path 
        
                       for path in sorted(Path(filepath).iterdir(), reverse=True) 
        
                       if path.is_dir() 
        
                   ] 
        
                   versions = {} 
        
                   for version in version_list[:num_versions]: 
        
                       try: 
        
                           run_id = datetime.strptime(version.name, VERSION_FORMAT) 
        
                       except ValueError: 
        
                           logger.warning( 
        
                               """Expected run_id (timestamp) of format YYYY-MM-DDTHH:MM:SS.ffffff. 
        
                               Skip when loading tracking data.""" 
        
                           ) 
        
                           continue 
        
                       else: 
        
                           path = version / Path(filepath).name 
        
                           with open(path, encoding="utf8") as fs_file: 
        
                               versions[run_id] = json_stdlib.load(fs_file) 
        
                   return versions

antonymilne · 2022-10-09T05:55:27Z

Assuming this relates to the metadata side panel (and not the experiment tracking screen itself), this is a known issue: #1000. As explained there, I'm not sure how much time we should invest fixing it since it might well be removed pretty soon.

If we do want to fix it then that issue lists some other things that could be fixed at the same time also. For context, here's how I fixed the experiment tracking screen to work with s3:
#984 (comment)
38d7fb0

noklam · 2022-10-20T10:48:10Z

Coming back to this as I am working on something related to the version. It would be much simpler to use higher-level functions like dataset._get_load_path. It's a bit strange to me the 2 functions used two different logic to traverse the directory. It's unclear to me why this need to be re-implemented in kedro-viz, maybe I missed out some context?

kedro-viz/package/kedro_viz/models/graph.py

Lines 639 to 691 in b2011d5

    
               @staticmethod 
        
               def load_latest_tracking_data( 
        
                   dataset: Union["JSONDataSet", "MetricsDataSet"], 
        
               ) -> Optional[Dict[str, float]]: 
        
                   """Load data for latest versions of the json dataset. 
        
                   Below operation is also on kedro.io.core -> fetched_latest_load_version() 
        
                   However it is a cached function and hence cannot be relied upon 
        
                   Args: 
        
                       dataset: the latest version of the metrics or json dataset 
        
                   Returns: 
        
                       A dictionary containing json data for the latest version 
        
                   """ 
        
                   pattern = str(dataset._get_versioned_path("*")) 
        
                   version_paths = sorted(dataset._glob_function(pattern), reverse=True) 
        
                   most_recent = next( 
        
                       (path for path in version_paths if dataset._exists_function(path)), None 
        
                   ) 
        
                   if not most_recent: 
        
                       return None 
        
                   with dataset._fs.open(most_recent, **dataset._fs_open_args_load) as fs_file: 
        
                       return json.load(fs_file) 
        
               @staticmethod 
        
               def load_versioned_tracking_data( 
        
                   filepath: str, num_versions: int = 10 
        
               ) -> Optional[Dict[datetime, Any]]: 
        
                   """Load data for multiple versions of the metrics dataset 
        
                   Args: 
        
                       filepath: the path whether the dataset is located. 
        
                       num_versions: the maximum number of past versions we want to load. 
        
                   Returns: 
        
                       A dictionary containing the version and the json data inside each version 
        
                   """ 
        
                   version_list = [ 
        
                       path 
        
                       for path in sorted(Path(filepath).iterdir(), reverse=True) 
        
                       if path.is_dir() 
        
                   ] 
        
                   versions = {} 
        
                   for version in version_list[:num_versions]: 
        
                       try: 
        
                           run_id = datetime.strptime(version.name, VERSION_FORMAT) 
        
                       except ValueError: 
        
                           logger.warning( 
        
                               """Expected run_id (timestamp) of format YYYY-MM-DDTHH:MM:SS.ffffff. 
        
                               Skip when loading tracking data.""" 
        
                           ) 
        
                           continue 
        
                       else: 
        
                           path = version / Path(filepath).name 
        
                           with open(path) as fs_file: 
        
                               versions[run_id] = json.load(fs_file) 
        
                   return versions

tynandebold · 2022-10-20T15:10:26Z

I really can't say either way. Would it be hard for you to investigate why it's done this way in Viz and if it's easy to fix the issue there then do so?

rashidakanchwala · 2023-03-20T14:10:52Z

We can close this issue. It's no longer valid as we have removed load_latest_tracking_dataset and load_versioned_tracking_dataset

@tynandebold

tynandebold · 2023-03-20T14:15:13Z

Thank you!

tbeijloos added the Issue: Bug Report label Oct 4, 2022

tynandebold added this to Kedro-Viz Oct 4, 2022

tynandebold moved this to Inbox in Kedro-Viz Oct 4, 2022

tynandebold changed the title ~~<MetricsDataSet doesn't work on S3>~~ MetricsDataSet doesn't work on S3 Oct 4, 2022

tynandebold added this to Kedro Framework Oct 4, 2022

tynandebold added the Technical Design label Oct 4, 2022

tynandebold moved this from Inbox to Backlog in Kedro-Viz Oct 4, 2022

tynandebold removed this from Kedro Framework Oct 4, 2022

tynandebold removed the Technical Design label Oct 4, 2022

rashidakanchwala closed this as completed Mar 20, 2023

github-project-automation bot moved this from Backlog to Done in Kedro-Viz Mar 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MetricsDataSet doesn't work on S3 #1118

MetricsDataSet doesn't work on S3 #1118

tbeijloos commented Oct 4, 2022 •

edited by tynandebold

Loading

noklam commented Oct 4, 2022

antonymilne commented Oct 9, 2022

noklam commented Oct 20, 2022

tynandebold commented Oct 20, 2022

rashidakanchwala commented Mar 20, 2023

tynandebold commented Mar 20, 2023

MetricsDataSet doesn't work on S3 #1118

MetricsDataSet doesn't work on S3 #1118

Comments

tbeijloos commented Oct 4, 2022 • edited by tynandebold Loading

Description

Context

Steps to Reproduce

Expected Result

Actual Result

Your Environment

Checklist

noklam commented Oct 4, 2022

antonymilne commented Oct 9, 2022

noklam commented Oct 20, 2022

tynandebold commented Oct 20, 2022

rashidakanchwala commented Mar 20, 2023

tynandebold commented Mar 20, 2023

tbeijloos commented Oct 4, 2022 •

edited by tynandebold

Loading