Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of URL quoting (for file:// URLs) #1168

Open
mih opened this issue Jan 27, 2023 · 7 comments
Open

Handling of URL quoting (for file:// URLs) #1168

mih opened this issue Jan 27, 2023 · 7 comments

Comments

@mih
Copy link

mih commented Jan 27, 2023

I have a local file at /tmp/URL--https&c%%zenodo.org%record%68331 that I am trying to access via FSSPEC. This fails due to a missing unquoting step, as far as I can tell.

Here is a small demo to show the essence of the issue:

>>> from pathlib import Path
>>> from fsspec.core import url_to_fs

# local file exists
>>> p = Path('/tmp/URL--https&c%%zenodo.org%record%68331')
>>> p.exists()
True

# file:// URL generation via pathlib
>>> p_url = p.as_uri()
# this quotes the URL, which is to be expected
>>> p_url
'file:///tmp/URL--https%26c%25%25zenodo.org%25record%2568331'

# fsspec does not unquote
>>> fs, urlpath = url_to_fs(p_url)
>>> urlpath
'/tmp/URL--https%26c%25%25zenodo.org%25record%2568331'
# the returned urlpath is not the original path
# (this may be OK still)
>>> Path(urlpath).exists()
False

# but fsspec also fails to access the target file
>>> fs.stat(urlpath)
-> FileNotFoundError: [Errno 2] No such file or directory: '/tmp/URL--https%26c%25%25zenodo.org%25record%2568331'

# if quoting is undone, either for the URL, or for the path, things start working
>>> from urllib.parse import unquote
>>> fs.stat(unquote(urlpath))
{'name': '/tmp/URL--https&c%%zenodo.org%record%68331',...}
# or the URL
>>> fs, urlpath = url_to_fs(unquote(p_url))
>>> fs.stat(urlpath)
{'name': '/tmp/URL--https&c%%zenodo.org%record%68331',...}

I was confused by the above behavior, as I expect the urlpath returned by url_to_fs() to match the needs of the simultaneously returned fs object. My expectation was that fsspec would assume all input URL to be adequately quoted, and do any necessary unquoting to match the nature of urlpath to the needs of fs internally.

Do you consider this behavior a bug?

@martindurant
Copy link
Member

What happens if you do url_to_fs(p) or url_to_fs(str(p))?

@mih
Copy link
Author

mih commented Jan 27, 2023

url_to_fs(p) -> TypeError

url_to_fs(str(p) yields the correct behavior, beside the input being no file:// URL anymore.

>>> fs, urlpath = url_to_fs(str(p))
>>> urlpath
'/tmp/URL--https&c%%zenodo.org%record%68331'
>>> fs.stat(urlpath)
{'name': '/tmp/URL--https&c%%zenodo.org%record%68331',...}

@martindurant
Copy link
Member

OK, so url_to_fs(str(p) is correct: it returns the path "in the selected filesystem", so no protocol needed. Pathlib escaping the path for the local filesystem is simply bizarre to me, this is not HTTP!
There is also a method fs.unstrip_protocol to add the protocol back in, if you desire; it does the opposite of fs._strip_protocol, which does handle Path objects (by calling str() on them!).

@mih
Copy link
Author

mih commented Jan 27, 2023

Can you please clarify what the generally recommended behavior would be? I have posted the simplest case of what I believed to be the key issue. However, I am having this issue in a use case where I am accessing specific members of archive files (that are either remote or local) via a chained "URL" that I pass to url_to_fs(). An example would be:

(tar|zip)://myfile.txt(|::blockcache|::filecache)::(s3|file|https)://....

This is all working great, expect for this special case of file:// URLs. It seems to me that using fs.unstrip_protocol would not be a simple fix, but I would actually have to inspect and mangle the URL chain to achieve a homogenous handling of all possible URLs of this pattern.

mih added a commit to mih/datalad-next that referenced this issue Jan 27, 2023
See fsspec/filesystem_spec#1168

Handle it by not generating `file://` URLs ourselves, but by passing
naked platform paths in. This is subject to a trial on windows still.
@martindurant
Copy link
Member

It seems to me that the issue is with Path.to_uri - don't use that, use str() instead. Then you don't get any quoting which you need to undo. Path is specific to local files anyway, so not much use in the context of fsspec (unlike universal_pathlib, which maybe does a good job of this).

@mih
Copy link
Author

mih commented Jan 30, 2023

I have now changed my code to stop using file:// URLs wherever possible (ie. where my code is also the source of the URL). For any externally provided file:// I have documented that users themselves need to anticipate special handling by FSSPEC. Thanks for the clarifications. This issue can be closed from my POV.

For the record I'd like to mention that I believe that handling file:// URLs as if they were paths with a file:// string prefix (if I understood your statement in #1168 (comment) correctly) to be a suboptimal choice. file is a regular URL scheme, with RFC8089 having a dedicated section on them. From my POV, pathlib is correct in doing what it does (considering platform encodings of filenames, and percent-quoting). https://www.rfc-editor.org/rfc/rfc8089.html#section-4 states:

When a file URI is produced that represents textual data consisting
of characters from the Unicode Standard coded character set
[UNICODE], the data SHOULD be encoded as octets according to the
UTF-8 character encoding scheme [STD63] before percent-encoding is
applied (as per Section 2.5 of [RFC3986]).

From that statement I would conclude that (also) percent-encoding is to be expected in any file URL.

@martindurant
Copy link
Member

OK, food for thought. I'd rather not do anything about it for now, especially since typical posix tools (like bash) don't use such paths. In some cases, fsspec has to handle a large number of paths, the expense of yet another encoding step is better avoided.

mih added a commit to mih/datalad-next that referenced this issue May 23, 2023
See fsspec/filesystem_spec#1168

Handle it by not generating `file://` URLs ourselves, but by passing
naked platform paths in. This is subject to a trial on windows still.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants