Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added smb protocol with test #782

Closed
wants to merge 10 commits into from
Closed

Added smb protocol with test #782

wants to merge 10 commits into from

Conversation

lucasjamar
Copy link
Contributor

@lucasjamar lucasjamar commented Jun 7, 2021

This solves #765

Description

Added else for filesystems that contain "netloc" but are not cloud protocols.

Development notes

Added a simple test to ensure the username, password, host and port are returned in the path of smb files

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change and added my name to the list of supporting contributions in the RELEASE.md file
  • Added tests to cover my changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@lucasjamar lucasjamar changed the title Added host, port, user, password to path for smb protocol with test Added smb protocol with test Jun 7, 2021
@lucasjamar lucasjamar marked this pull request as ready for review June 7, 2021 08:40
@lucasjamar lucasjamar requested a review from idanov as a code owner June 7, 2021 08:40
@lucasjamar
Copy link
Contributor Author

lucasjamar commented Jun 7, 2021

It appears the unit tests are failing because they are timing out and the docs_linkcheck is failing because of some url on documentation about AWS that no longer exists.
Should I modify the url related to AWS?

Please let me know if you have any questions. thanks!

kedro/io/core.py Outdated Show resolved Hide resolved
Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me, thank you for your contribution @lucasjamar! I'm not very familiar with these protocols so will try and get some people who are more familiar to take a look.

Please could you explain to a dummy why the whole username:password@myhost:8080 get passed as a path here when it doesn't for s3 etc.?

@antonymilne
Copy link
Contributor

antonymilne commented Jun 7, 2021

It appears the unit tests are failing because they are timing out and the docs_linkcheck is failing because of some url on documentation about AWS that no longer exists.
Should I modify the url related to AWS?

We're fixing the broken link in a different PR already so don't worry about it - when it's merged to master you can update this PR and it should fix it. And the unit tests timing out is probably just a random error, so re-rerunning the job (which will happen if you do another commit) will probably fix it. So probably nothing to worry about here 🙂

lucasjamar and others added 3 commits June 7, 2021 13:39
Co-authored-by: Antony Milne <49395058+AntonyMilneQB@users.noreply.github.com>
@lucasjamar lucasjamar requested a review from yetudada as a code owner June 9, 2021 14:41
@lucasjamar
Copy link
Contributor Author

This looks great to me, thank you for your contribution @lucasjamar! I'm not very familiar with these protocols so will try and get some people who are more familiar to take a look.

Please could you explain to a dummy why the whole username:password@myhost:8080 get passed as a path here when it doesn't for s3 etc.?

To follow-up on your question @AntonyMilneQB , id be interested to know why the host and port are ignored for hdfs if someone knows the answer. I would have thought these are important pieces of information.

image

@limdauto
Copy link
Contributor

@lucasjamar @AntonyMilneQB @lorenabalan I think generally our strategy when parsing filepath and instantiate the underlying fsspec.filesystem is:

  • Parse path and protocol from filepath attribute of the dataset. The reason we have our own _parse_path function and not use the underlying fsspec.utils.infer_storage_options was due to how infer_storage_options was treating abl/adfs differently from s3/gcs: Potential bug in fsspec.utils.infer_storage_options fsspec/adlfs#45. So to ensure parsing consistency, we roll our own _parse_path. Since this deals with path and protocol only, everything else is discarded.
  • host & port are added to fs_args as @lucasjamar has noticed.
  • username & password should be added to credentials.

When we instantiate fsspec.filesystem, e.g. in a dataset, we will pass everything as kwargs to fsspec, so the net effect would be similar to passing a string containing everything to fsspec and allow fsspec to parse it with infer_storage_options.

What I'd suggest is maybe looking into why you need to specify a dummy url and fix that instead so you can write:

filepath: smb://my/file/path/
fs_args:
    host:
    port:
credentials: my_smb_dataset_credentials

@antonymilne
Copy link
Contributor

Thank you very much for this @limdauto! Super helpful 👍 Just for completeness, when I was looking to this earlier I also found the original commit where we moved from fsspec's infer_storage_options to our own implementation. Looking through infer_storage_options, it is indeed very similar to our _parse_path.

@lucasjamar
Copy link
Contributor Author

@lucasjamar @AntonyMilneQB @lorenabalan I think generally our strategy when parsing filepath and instantiate the underlying fsspec.filesystem is:

* Parse `path` and `protocol` from `filepath` attribute of the dataset. The reason we have our own `_parse_path` function and not use the underlying `fsspec.utils.infer_storage_options` was due to how `infer_storage_options` was treating `abl/adfs` differently from `s3/gcs`: [dask/adlfs#45](https://github.com/dask/adlfs/issues/45). So to ensure parsing consistency, we roll our own `_parse_path`. Since this deals with `path` and `protocol` only, everything else is discarded.

* `host` & `port` are added to `fs_args` as @lucasjamar has noticed.

* `username` & `password` should be added to `credentials`.

When we instantiate fsspec.filesystem, e.g. in a dataset, we will pass everything as kwargs to fsspec, so the net effect would be similar to passing a string containing everything to fsspec and allow fsspec to parse it with infer_storage_options.

What I'd suggest is maybe looking into why you need to specify a dummy url and fix that instead so you can write:

filepath: smb://my/file/path/
fs_args:
    host:
    port:
credentials: my_smb_dataset_credentials

@limdauto Thanks a lot for the helpful answer! As mentioned in an earlier comment, urlsplit interprets the first part of the path after :// as the network location even though it belongs to the path. For this reason, this information gets lost. Unfortunately, I dont see any alternative than the dummy url I mentioned or the implementation proposed. Or am I getting this wrong and there is a better solution? Thanks :)

@limdauto
Copy link
Contributor

limdauto commented Jun 10, 2021

@lucasjamar so I think in this case smb is similar to the the file protocol, in which case instead of the dummy URL, you can use triple forward slashes:

In [2]: _parse_filepath("smb:///test-path")
Out[2]: {'protocol': 'smb', 'path': '/test-path'}

This is consistent with how urllib treats file: protocol, i.e.

In [5]: urlsplit("file:///foo")
Out[5]: SplitResult(scheme='file', netloc='', path='/foo', query='', fragment='')

In [6]: urlsplit("file://foo")
Out[6]: SplitResult(scheme='file', netloc='foo', path='', query='', fragment='')

@antonymilne
Copy link
Contributor

Amazing, thanks very much @limdauto.

@lucasjamar I think this means we don't need this PR any more?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants