-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix get_dataset_path() in fs_utils.py #663
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #663 +/- ##
==========================================
- Coverage 85.22% 85.19% -0.04%
==========================================
Files 85 85
Lines 4981 4984 +3
Branches 791 792 +1
==========================================
+ Hits 4245 4246 +1
- Misses 596 597 +1
- Partials 140 141 +1 ☔ View full report in Codecov by Sentry. |
Did you manage to track down what is the origin of this extra '/'? Which part of code added it? Asking this to make sure the same issue wouldn't pop up in some other scenarios with different code paths. Also, can you please add a unittest covering this case? |
The URL comes from user input. In get_filesystem_and_path_or_paths(), URLparse () is used to handle URLs. Under Windows, the file cannot be accessed via the file path "/D:/test/test.parquet ". Instead, it should be "D:/test/test.parquet". In my opinion, you can only test this code under Windows. |
@@ -33,6 +34,9 @@ def get_dataset_path(parsed_url): | |||
# s3/gs/gcs filesystem expects paths of the form `bucket/path` | |||
return parsed_url.netloc + parsed_url.path | |||
|
|||
if parsed_url.scheme.lower() in ['file'] and "Windows" in platform.system(): | |||
return parsed_url.path[1:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a chance that this fix would break for file urls on network location (as per this article). Consider the example of the following WIndows url: file://hostname/path/to/the%20file.txt
@@ -33,6 +34,9 @@ def get_dataset_path(parsed_url): | |||
# s3/gs/gcs filesystem expects paths of the form `bucket/path` | |||
return parsed_url.netloc + parsed_url.path | |||
|
|||
if parsed_url.scheme.lower() in ['file'] and "Windows" in platform.system(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if you could add a unit test verifying correctness of this function in petastorm/tests/test_fs_utils.py
. You would need to mock the value returned by platform.system()
and then I think the test logic should be straight forward.
If for some reason you can not do this, we'd be able to land this and I can help with the test later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking the PR with "Request Changes" due to incorrect behavior in the case of windows urls that contain network and not local paths.
version:
Windows 10
Python 3.7.0
petastorm 0.9.8
pyarrow 3.0.0
The following error occurred when I ran the above code under Windows.
The image above shows that there is an extra '/' at the beginning of the file path.
So I deleted the extra '/' .