urllib.request.urlopen doesn't handle UNC paths produced by pathlib's as_uri() (but can handle UNC paths with additional slashes) #90812
Comments
I've found that open has difficulty with a resolved pathlib path. Example code:
import pathlib
path = "Z:\\test.py"
with open(path) as fp:
    print("Stock open: works")
    data = fp.read()
with open(pathlib.Path(path).resolve().as_uri()) as fp:
    print("Pathlib resolve open")
    data = fp.read()
Results in:
Z:\> python test.py
Stock open: works
Traceback (most recent call last):
File "Z:\test.py", line 12, in <module>
with open(pathlib.Path(path).resolve().as_uri()) as fp:
FileNotFoundError: [Errno 2] No such file or directory: "file://machine/share/test.py"
Interestingly, I've found that open("file:////machine/share/test.py") succeeds, but this isn't what pathlib's resolve() produces. It appears that file_open only supports local hosts, but will open UNC paths on Windows when given the additional slashes. This is quite confusing behaviour, and it's not clear why file://host/share/file won't work when file:////host/share/file does. I imagine this is a long-standing issue and a decision has already been reached on why file_open doesn't support such URIs, but I couldn't find the answer anywhere, just bpo-32442, which was resolved without clarifying the situation...
Why are you adding |
The API we provide accepts URIs, so whilst the example seems a little contrived, the code itself expects a URI and then calls open (making use of the ability to add open handlers).
As best I can tell, the file handler is defined in urllib/request.py as file_open. This appears to do some preprocessing to remove the file scheme (and explicitly throws an exception if there's a host that isn't localhost) before it gets to the C open(). I wondered why it doesn't check whether it is running on Windows and, if so, construct an appropriate path (I don't think the quadruple slash adheres to the URI RFC, but it seems to open correctly)?
My bad, sorry, I realized I was conflating open with urllib.request.urlopen. I believe the issue still exists though, sorry for the confusion. |
Here's the revised code sample:
import pathlib
import urllib.request
path = "Z:\\test.py"
print(f"Stock open: {pathlib.Path(path).as_uri()}")
with urllib.request.urlopen(pathlib.Path(path).as_uri()) as fp:
    data = fp.read()
print(f"Pathlib resolved open: {pathlib.Path(path).resolve().as_uri()}")
with urllib.request.urlopen(pathlib.Path(path).resolve().as_uri()) as fp:
    data = fp.read()
and here's the output:
Z:\> python test.py
Stock open: file:///Z:/test.py
Pathlib resolved open: file://host/share/test.py
Traceback (most recent call last):
File "C:\Program Files\Python310\lib\urllib\request.py", line 1505, in open_local_file
stats = os.stat(localfile)
FileNotFoundError: [WinError 2] The system cannot find the file specified: '\\share\\test.py'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Z:\test.py", line 14, in <module>
with urllib.request.urlopen(pathlib.Path(path).resolve().as_uri()) as fp:
File "C:\Program Files\Python310\lib\urllib\request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "C:\Program Files\Python310\lib\urllib\request.py", line 519, in open
response = self._open(req, data)
File "C:\Program Files\Python310\lib\urllib\request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "C:\Program Files\Python310\lib\urllib\request.py", line 496, in _call_chain
result = func(*args)
File "C:\Program Files\Python310\lib\urllib\request.py", line 1483, in file_open
return self.open_local_file(req)
File "C:\Program Files\Python310\lib\urllib\request.py", line 1522, in open_local_file
raise URLError(exp)
urllib.error.URLError: <urlopen error [WinError 2] The system cannot find the file specified: '\\share\\test.py'> |
urllib uses nturl2path under the hood. On my system it seems to return reasonable results for both two and four leading slashes:
>>> nturl2path.url2pathname('////host/share/test.py')
'\\\\host\\share\\test.py'
>>> nturl2path.url2pathname('//host/share/test.py')
'\\\\host\\share\\test.py'
(note that urllib strips the `file:` prefix before calling this function).
I can confirm that url2pathname works with either number of slashes, and that open_file appears to have had the file: prefix removed. However, even if the check in open_file were bypassed, it calls open_local_file, which then strips any host before calling url2pathname, meaning the host will never be included if only two slashes are used.
This is what seems to cause the issue when attempting to open file://server/host/file.ext on Windows, even though file:////server/host/file.ext opens just fine. The problem that I found, and which was reported in bpo-32442, is that pathlib only ever returns two slashes, which, despite being a valid and correctly formed URL, can't be opened by urllib.request.urlopen(). Since there doesn't seem to be an issue with opening these files (given that it works for file:////server...), and since nturl2path.url2pathname will produce the correct result, it feels as though open_file should have special code on Windows to allow servers to be accepted by the file handler (open_local_file should probably stay as is, so as not to change the API too much).
In FileHandler.file_open(), req.host is the host name, which is either None or an empty string for a local drive path such as, respectively, "file:/Z:/test.py" or "file:///Z:/test.py". The value of req.selector never starts with "//", for which file_open() checks, but rather a single slash, such as "/Z:/test.py" or "/share/test.py". This is a bug in file_open(). Due to this bug, it always calls self.open_local_file(req), even if req.host isn't local.
The distinction shouldn't matter in Windows, which supports UNC paths, but POSIX has to open a path on the local machine (possibly a mount point for a remote path, but that's irrelevant). In POSIX, if the local machine coincidentally has the req.selector path, then the os.stat() and open() calls will succeed with a bogus result.
For "file://host/share/test.py", req.selector is "/share/test.py". In Windows, url2pathname() converts this to r"\share\test.py", which is relative to the drive of the process current working directory. This is a bug in open_local_file() on Windows. For it to work correctly, req.host has to be joined back with req.selector as the UNC path "//host/share/test.py". Of course, this need not be a local file in Windows, so Windows should be exempted from the local file limitation in file_open().
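As a rough illustration of "joining req.host back with req.selector", here is a minimal sketch that works on plain strings, so it runs on any platform. The helper name file_url_to_windows_path is hypothetical and is not part of urllib. It only demonstrates the idea and makes no attempt to validate share names or handle ports.

```python
import re
from urllib.parse import unquote, urlsplit


def file_url_to_windows_path(url):
    """Hypothetical helper: map a file: URL to a Windows-style path string."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"not a file: URL: {url!r}")
    path = unquote(parts.path)
    host = parts.netloc
    if host and host.lower() != "localhost":
        # Two-slash form: rejoin the authority so the host is not lost.
        path = f"//{host}{path}"
    elif re.match(r"/[A-Za-z]:", path):
        # Local drive path such as /Z:/test.py: drop the leading slash.
        path = path[1:]
    return path.replace("/", "\\")


print(file_url_to_windows_path("file://host/share/test.py"))
print(file_url_to_windows_path("file:////host/share/test.py"))
print(file_url_to_windows_path("file:///Z:/test.py"))
```

With the host rejoined, the two-slash and four-slash spellings of the same UNC location map to the same path string.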
For r"\\host\share\test.py", the two slash conversion "file://host/share/test.py" is correct according to RFC 8089 "E.3.1. <file> URI with Authority" [1]. In this case, req.host is "host", and req.selector is "/share/test.py". The four slash version "file:////host/share/test.py" is a known variant for a converted UNC path, as noted in RFC 8089 "E.3.2. <file> URI with UNC Path". In this case, req.host is an empty string, and req.selector is "//host/share/test.py". There's another variant that uses 5 slashes for a UNC path, but urllib (or url2pathname) doesn't support it.
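The host and selector values described here can be checked without touching the file system, since urllib.request.Request only parses the URL. A small sketch (the parsed dictionary is just a local name for this example):

```python
from urllib.request import Request

urls = (
    "file:/Z:/test.py",             # no authority at all
    "file:///Z:/test.py",           # empty authority
    "file://host/share/test.py",    # two slashes: host in the authority
    "file:////host/share/test.py",  # four slashes: host inside the path
)

# Map each URL to the (host, selector) pair that file_open() would see.
parsed = {}
for url in urls:
    req = Request(url)
    parsed[url] = (req.host, req.selector)
    print(f"{url}: host={req.host!r} selector={req.selector!r}")
```

Only the two-slash form carries the server name in req.host. In the four-slash form the host is empty and the server name survives in the selector, which is why that variant reaches url2pathname() intact.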
Agree with the previous analysis. Just noting that:
>>> nturl2path.pathname2url('\\\\host\\share\\test.py')
'////host/share/test.py'
So four slashes are produced by the urllib code, whereas pathlib only produces two. According to Wikipedia, both the two- and four-slash variants are in active usage. As we can see within Python itself! :P
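The pathlib side of this comparison can be reproduced on any platform with PureWindowsPath, which applies Windows path rules without needing a Windows machine. This is only a sketch. The warning filter is an assumption, added because newer Python versions may emit a DeprecationWarning when as_uri() is called on a pure path.

```python
import warnings
from pathlib import PureWindowsPath

unc = PureWindowsPath(r"\\host\share\test.py")
local = PureWindowsPath("Z:/test.py")

with warnings.catch_warnings():
    # Assumption: silence a possible deprecation notice for pure-path as_uri().
    warnings.simplefilter("ignore", DeprecationWarning)
    unc_uri = unc.as_uri()
    local_uri = local.as_uri()

print(unc.drive)   # the UNC "drive" is the \\host\share prefix
print(unc_uri)     # two slashes, host in the authority
print(local_uri)   # three slashes, empty authority
```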
To correct myself, actually req.selector will start with "//" for a "file:////" URI, such as "file:////host/share/test.py". For this example, req.host is an empty string, so file_open() still ends up calling open_local_file(), which will open "//host/share/test.py". In Linux, "//host/share" is the same as "/host/share". In Cygwin and MSYS2 it's a UNC path. I guess this case should be allowed, even though the meaning of a "//" root isn't specifically defined in POSIX.
Unless I'm overlooking something, file_open() only has to check the value of req.host. In POSIX, it should require opening a 'local' path, i.e. if req.host isn't None, empty, or a local host, raise URLError.
In Windows, my tests show that the shell API special cases "localhost" (case insensitive) in "file:" URIs. For example, the following are all equivalent: "file:/C:/Temp", "file:///C:/Temp", and "file://localhost/C:/Temp". The shell API does not special case the real local host name or any of its IP addresses, such as 127.0.0.1. They're all handled as UNC paths.
Here's what I've experimented with thus far, which passes the existing urllib tests in Linux and Windows:
class FileHandler(BaseHandler):
    def file_open(self, req):
        if not self._is_local_path(req):
            if sys.platform == 'win32':
                path = url2pathname(f'//{req.host}{req.selector}')
            else:
                raise URLError("In POSIX, the file:// scheme is only "
                               "supported for local file paths.")
        else:
            path = url2pathname(req.selector)
        return self._common_open_file(req, path)
    def _is_local_path(self, req):
        if req.host:
            host, port = _splitport(req.host)
            if port:
                raise URLError(f"the host cannot have a port: {req.host}")
            if host.lower() != 'localhost':
                # In Windows, all other host names are UNC.
                if sys.platform == 'win32':
                    return False
                # In POSIX, support all names for the local host.
                if _safe_gethostbyname(host) not in self.get_names():
                    return False
        return True
    # names for the localhost
    names = None
    def get_names(self):
        if FileHandler.names is None:
            try:
                FileHandler.names = tuple(
                    socket.gethostbyname_ex('localhost')[2] +
                    socket.gethostbyname_ex(socket.gethostname())[2])
            except socket.gaierror:
                FileHandler.names = (socket.gethostbyname('localhost'),)
        return FileHandler.names
    def open_local_file(self, req):
        if not self._is_local_path(req):
            raise URLError('file not on local host')
        return self._common_open_file(req, url2pathname(req.selector))
    def _common_open_file(self, req, path):
        import email.utils
        import mimetypes
        host = req.host
        filename = req.selector
        try:
            if host:
                origurl = f'file://{host}{filename}'
            else:
                origurl = f'file://{filename}'
            stats = os.stat(path)
            size = stats.st_size
            modified = email.utils.formatdate(stats.st_mtime, usegmt=True)
            mtype = mimetypes.guess_type(filename)[0] or 'text/plain'
            headers = email.message_from_string(
                f'Content-type: {mtype}\n'
                f'Content-length: {size}\n'
                f'Last-modified: {modified}\n')
            return addinfourl(open(path, 'rb'), headers, origurl)
        except OSError as exp:
            raise URLError(exp)
Unfortunately nturl2path.url2pathname() parses some UNC paths incorrectly. For example, the following path should be an invalid UNC path, since "C:" is an invalid name, but instead it gets converted into an unrelated local path.
>>> nturl2path.url2pathname('//host/C:/Temp/spam.txt')
'C:\\Temp\\spam.txt'
This goof depends on finding ":" or "|" in the path. It's arguably worse if the last component has a named data stream (allowed by RFC 8089):
>>> nturl2path.url2pathname('//host/share/spam.txt:eggs')
'T:\\eggs'
Drive "T:" is from "t:" in "t:eggs", due to simplistic path parsing.
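For contrast, ntpath.splitdrive() splits a UNC path by position (server, then share) and does not go looking for ":" later in the string. The sketch below is not a fix for url2pathname(), and it does not reject "C:" as a share name either. It only shows that the two example paths can be split without inventing a drive letter. The splits dictionary is a local name for this example.

```python
import ntpath  # Windows path rules, importable on any platform

examples = (
    "//host/C:/Temp/spam.txt",
    "//host/share/spam.txt:eggs",
)

# Record how each path splits into (UNC prefix, remainder).
splits = {}
for path in examples:
    drive, rest = ntpath.splitdrive(path)
    splits[path] = (drive, rest)
    print(f"{path}: drive={drive!r} rest={rest!r}")
```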
Hullo, just noting that I've made a proposal to unify Python's |
@eryksun Thank you very much for your most helpful notes and code in your earlier comment. I've opened a new PR that aims to solve this issue and others: #126148 |