Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testing Path Fix #1853

Merged
merged 13 commits into from
Apr 15, 2024
Merged

Testing Path Fix #1853

merged 13 commits into from
Apr 15, 2024

Conversation

ravi-kumar-pilla
Copy link
Contributor

@ravi-kumar-pilla ravi-kumar-pilla commented Apr 8, 2024

Description

Testing windows build failure due to StatsHook on main branch.

   WARNING  Unable to get file size for the       hooks.py:142
                             dataset                                           
                             CSVDataset(filepath=C:/Users/circleci             
                             /AppData/Local/Temp/pytest-of-circlec             
                             i/pytest-0/test_get_file_size_dataset             
                             0_0/model_inputs.csv, load_args={},               
                             protocol=file, save_args={'index':                
                             False}): [WinError 3] The system                  
                             cannot find the path specified:                   
                             'C:Users/circleci/AppData/Local/Temp/             
                             pytest-of-circleci/pytest-0/test_get_             
                             file_size_dataset0_0/model_inputs.csv             
                             '

NOTE: While I tested on windows machine locally, it was working fine. However on CircleCI, filepath has C:/ , but the system is trying to find C: (missing the forward slash). After looking into plugins kedro-datasets further, some dataset._filepath returns PurePosixPath while some return string. The fix for #1797 should be made on the datasets side to be consistent.

Development notes

The pandas CSV, Excel, Feather, Generic, HDF, json, Parquet, xml has a code block -

 super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,
            glob_function=self._fs.glob,
        )

while others like deltatable, gbq, sql keep filepath as str. I am not completely aware of the reason for this difference. @SajidAlamQB please add if you have any information on this.

If this is not intentional, we should be fixing the issue on kedro-datasets.

  • As a platform-independent fix we can do as below in DataStatsHook if not on kedro-datasets
file_path = get_filepath_str(PurePosixPath(dataset._filepath), dataset._protocol)

NOTE: continue_config.yml changes are for testing windows build and they will be discarded before merging to main

QA notes

  • All tests should pass

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
@ravi-kumar-pilla ravi-kumar-pilla marked this pull request as ready for review April 9, 2024 02:33
@rashidakanchwala
Copy link
Contributor

also looping in @merelcht , does it make sense to fix this issue on kedro-datasets side? are we strict on return type validation for file_path? pls let us know.

@rashidakanchwala rashidakanchwala requested review from merelcht and removed request for merelcht April 9, 2024 13:17
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
@ravi-kumar-pilla
Copy link
Contributor Author

also looping in @merelcht , does it make sense to fix this issue on kedro-datasets side? are we strict on return type validation for file_path? pls let us know.

Had a discussion on slack and Nok mentioned that filepath is not a well defined API, means there may be entries of datasets which do not have filepath. Though I still think if there is filepath, the type should be consistent. From Viz perspective, we can have the conversion to be foolproof.

Thank you

@@ -135,7 +135,9 @@ def get_file_size(self, dataset: Any) -> Union[int, None]:
return None

try:
file_path = get_filepath_str(Path(dataset._filepath), dataset._protocol)
file_path = get_filepath_str(
PurePosixPath(dataset._filepath), dataset._protocol
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused, Path usually is the platform-agnostic way and works on both Windows and Linux. But PurePosixPath wouldn't work on Windows, so why are we changing this. Happy to get on a call to understand better.

…x/win_build

Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Copy link
Contributor

@rashidakanchwala rashidakanchwala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :) thanks Ravi -- great debugging

@ravi-kumar-pilla ravi-kumar-pilla merged commit ecf143c into main Apr 15, 2024
13 checks passed
@ravi-kumar-pilla ravi-kumar-pilla deleted the fix/win_build branch April 15, 2024 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants