bpo-37609 - Support "UNC" and "GLOBAL" junctions in ntpath.splitdrive(). #31702

@barneygale barneygale commented Mar 6, 2022

Implementation by @eryksun. They note:

The old implementation would split "//server/" as ('//server/', ''). Since there's no share, this should not count as a drive. The new implementation splits it as ('', '//server/'). Similarly it splits '//?/UNC/server/' as ('', '//?/UNC/server/').

https://bugs.python.org/issue37609

…e()`.

Co-authored-by: Eryk Sun <eryksun@gmail.com>
Comment on lines +146 to +147
The UNC root must be exactly two separators. Other separators may be
repeated.

Thank you for picking this up, but, fair warning, it's going to need more work. If you're up for that, great. I'd love to get your opinions on what needs to be done.

Steve Dower introduced the idea of leveraging PathCchSkipRoot() in Windows, and I like that idea, despite some reservations on my part. So I think splitdrive() should try to conform with PathCchSkipRoot(). As such, repetition of separators between the domain and junctions (e.g. UNC server and share) should be parsed as empty values for those components. Similar behavior would be extended to "\\?\UNC" paths.

At the time I wrote this, I was thinking of supporting whatever paths work in practice, but on closer scrutiny even GetFullPathNameW() handles the initial slashes for the server and share components without normalizing repeated slashes. For example:

>>> n = GetFullPathNameW('//server//file', len(buf), buf, byref(filepart))
>>> print(buf.value)
\\server\file
>>> print(filepart.value)
file

>>> n = GetFullPathNameW('////file', len(buf), buf, byref(filepart))
>>> print(buf.value)
\\\file
>>> print(filepart.value)
file
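
The session above omits its setup. A minimal ctypes sketch of how such a call could be prepared follows, assuming a Windows host. The wrapper name `full_path_name` and the default buffer size are illustrative choices, not part of the original session.

```python
import ctypes
import sys


def full_path_name(path, size=260):
    """Return (full_path, file_part) as reported by GetFullPathNameW.

    Hypothetical wrapper for illustration; Windows only.
    """
    if sys.platform != "win32":
        raise OSError("GetFullPathNameW is only available on Windows")
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetFullPathNameW.argtypes = (
        wintypes.LPCWSTR,                  # lpFileName
        wintypes.DWORD,                    # nBufferLength
        wintypes.LPWSTR,                   # lpBuffer
        ctypes.POINTER(wintypes.LPWSTR),   # lpFilePart
    )
    kernel32.GetFullPathNameW.restype = wintypes.DWORD

    buf = ctypes.create_unicode_buffer(size)
    filepart = wintypes.LPWSTR()
    n = kernel32.GetFullPathNameW(path, len(buf), buf, ctypes.byref(filepart))
    if n == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    if n >= len(buf):
        raise OSError("buffer too small for the resulting path")
    return buf.value, filepart.value
```

On Windows, `full_path_name('//server//file')` would be expected to reproduce the first result in the session above.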

For normal UNC filepaths, if os.chdir() handles the path as a UNC path instead of as a rooted path on the current drive, then splitdrive() should split out the 'drive' component, even if it's malformed (e.g. empty server or share component). Extend this behavior to "\\?\UNC" paths, as PathCchSkipRoot() does, even though they're not valid for the current working directory.

UNC drive examples in the file namespace:

splitdrive('//server/share') == ('//server/share', '')
splitdrive('//server///share') == ('//server///share', '')

@eryksun eryksun Mar 6, 2022

This example should be changed to match PathCchSkipRoot(). For example:

>>> buf.value = r'\\server\\\dir'
>>> hr = PathCchSkipRoot(buf, byref(filepath))
>>> print(filepath.value)
\\dir

Corresponding splitdrive() result:

splitdrive('//server///dir') == ('//server/', '//dir')

The share component is an empty string. This example can be moved to a section that discusses malformed drives, particularly a section about paths that normalize as functional paths. In this case the path normalizes as if "dir" is a UNC share component.
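
To make the classification concrete, here is a rough pure-Python sketch of the strict parsing rule for file-namespace UNC paths: each component ends at the very next separator, so a repeated separator produces an empty component. The helper name `unc_components` is hypothetical, it handles str paths only, and it does not treat "\\?\UNC" device paths specially.

```python
def unc_components(p):
    """Split a file-namespace UNC path into (server, share, rest).

    Hypothetical helper for illustration; str paths only.
    """
    sep = "\\"
    normp = p.replace("/", sep)
    if normp[:2] != sep * 2:
        raise ValueError("not a UNC path: %r" % (p,))
    # The server runs from just after the root to the next separator.
    i = normp.find(sep, 2)
    if i == -1:
        return p[2:], "", ""
    # The share starts right after that single separator; repeated
    # separators are not skipped, so the share may be empty.
    j = normp.find(sep, i + 1)
    if j == -1:
        return p[2:i], p[i + 1:], ""
    return p[2:i], p[i + 1:j], p[j:]


print(unc_components("//server///dir"))
print(unc_components("//server/share/file"))
```

For '//server///dir' this yields an empty share and leaves '//dir' as the remainder, which corresponds to the ('//server/', '//dir') split proposed above.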

Comment on lines +168 to +174
insensitive. Any device junction is recognized as a UNC drive, with
two exceptions that require additional qualification: "GLOBAL" and "UNC".

Normally the device namespace includes the local device junctions of a
user, such as mapped and subst drives. The "GLOBAL" junction limits this
view to just global devices. It must be followed either by a device
junction or another "GLOBAL" junction.

@eryksun eryksun Mar 6, 2022

In light of the desire to natively use PathCchSkipRoot(), the splitdrive() implementation should not support "Global". PathCchSkipRoot() doesn't support this prefix, and it's uncommon in practice. It's mostly used by device drivers that need to ensure that they're accessing a global device when executing in an arbitrary thread context. For example, a device driver would use the NT path "\??\Global\SpamDevice" instead of "\??\SpamDevice".

Comment on lines +203 to +208
splitdrive('//') == ('', '//')
splitdrive('//server/') == ('', '//server/')
splitdrive('///server/share') == ('', '///server/share')

splitdrive('//?/UNC/') == ('', '//?/UNC/')
splitdrive('//?/UNC/server/') == ('', '//?/UNC/server/')

As mentioned above, to align with PathCchSkipRoot(), these examples should have a drive, as follows:

 splitdrive('//') == ('//', '')
 splitdrive('//server/') == ('//server/', '')
 splitdrive('///share/file') == ('///share', '/file')
 splitdrive('//?/UNC/') == ('//?/UNC/', '')
 splitdrive('//?/UNC/server/') == ('//?/UNC/server/', '')

I changed some component names to reflect how they're classified. These examples can be included in a section about paths with malformed drives.

Comment on lines +212 to +287
    if isinstance(p, bytes):
        empty = b''
        colon = b':'
        sep = b'\\'
        altsep = b'/'
        device_domains = (b'?', b'.')
        global_name = b'GLOBAL'
        unc_name = b'UNC'
    else:
        empty = ''
        colon = ':'
        sep = '\\'
        altsep = '/'
        device_domains = ('?', '.')
        global_name = 'GLOBAL'
        unc_name = 'UNC'

    # Check for a DOS drive.
    if p[1:2] == colon:
        return p[:2], p[2:]

    # UNC drive for the file and device namespaces.
    # \\domain\junction\object
    # Separators may be repeated, except at the root.

    def _next():
        '''Get the next component, ignoring repeated separators.'''
        i0 = index
        while normp[i0:i0+1] == sep:
            i0 += 1
        if i0 >= len(p):
            return -1, len(p)
        i1 = normp.find(sep, i0)
        if i1 == -1:
            i1 = len(p)
        return i0, i1

    index = 0
    normp = p.replace(altsep, sep)
    # Consume the domain (server).
    i, index = _next()
    if i != 2:
        return empty, p
    domain = p[i:index]
    # Consume the junction (share).
    i, index = _next()
    if i == -1:
        return empty, p

    if domain not in device_domains:
        return p[:index], p[index:]

    # GLOBAL and UNC are special in the device domain.
    junction = p[i:index].upper()
    # GLOBAL can be repeated.
    while junction == global_name:
        i, index = _next()
        if i == -1:
            # GLOBAL must be a prefix.
            return empty, p
        junction = p[i:index].upper()

    if junction == unc_name:
        # Allow the "UNC" device with no remaining path.
        if index == len(p):
            return p, empty
        # Consume the meta-domain (server).
        i, index = _next()
        if i == -1:
            return empty, p
        # Consume the meta-junction (share).
        i, index = _next()
        if i == -1:
            return empty, p

    return p[:index], p[index:]

Here's a first attempt at implementing the simplified proposal:

def splitdrive(p):
    p = os.fspath(p)
    if isinstance(p, bytes):
        empty = b''
        sep = b'\\'
        altsep = b'/'
        colon = b':'
        device_domains = (b'?', b'.')
        unc_root = b'\\\\'
        unc_name = b'UNC'
    else:
        empty = ''
        sep = '\\'
        altsep = '/'
        colon = ':'
        device_domains = ('?', '.')
        unc_root = '\\\\'
        unc_name = 'UNC'

    # Handle a DOS drive path, rooted path, or relative path.
    #
    #   drive
    #   vvvvvvvvvvv
    #   ([A-Z] ":")? ("\"? name ("\"+ name)*)?
    #                ^^^^^^^^^^^^^^^^^^^^^^^^
    #                file path
    normp = p.replace(altsep, sep)
    if normp[:2] != unc_root:
        if p[1:2] == colon and p[:1].isalpha():
            return p[:2], p[2:]
        return empty, p

    # Handle a UNC drive path.
    #
    #   drive
    #   vvvvvvvvvvvvvvvvvvvvvvvvvv
    #   "\\" (domain ("\" junction ("\"+ name)*)?)?
    #                              ^^^^^^^^^^^
    #                              namespace path
    #   drive
    #   vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
    #   "\\" ("?"|".") "\UNC" ("\" server ("\" share ("\"+ name)*)?)?
    #                                                ^^^^^^^^^^^
    #                                                file path
    parts = []
    start = index = 1
    for _ in range(2):
        start = index + 1
        index = normp.find(sep, start)
        if index == -1:
            return p, empty
        parts.append(p[start:index])

    if parts[0] not in device_domains or parts[1].upper() != unc_name:
        return p[:index], p[index:]

    # "UNC" device path
    for i in range(2):
        start = index + 1
        index = normp.find(sep, start)
        if index == -1:
            return p, empty

    return p[:index], p[index:]

The above is only an initial attempt at implementing the proposed behavior. The final version should use a new ntpath.splitroot() function. On Windows, this would leverage nt._path_splitroot(), with extensions, and otherwise use a fallback implementation written in Python. splitdrive() needs a small adjustment on top of that, since the filesystem or namespace root slash should not be part of the drive in the splitdrive() result. Also, both splitroot() and splitdrive() should support all DOS device names as 'drives'. For example, "\\?\BootPartition" is another name for "\\?\C:" in Windows 10+. New DOS devices can be created in the context of the current user via DefineDosDeviceW(). They can target a volume device name or an arbitrary path on the volume (which is how subst.exe creates substitute drives).

>>> os.path.splitdrive('//?/BootPartition/Windows')
('//?/BootPartition', '/Windows')
>>> os.path.samefile('//?/BootPartition/Windows', 'C:/Windows')
True

@eryksun eryksun Mar 7, 2022

Here's an implementation that splits a path into its component parts: drive, root, and rest. This could be useful for direct consumption in pathlib. It's trivial to use this function to implement splitdrive().

def splitparts(p):
    p = os.fspath(p)
    if isinstance(p, bytes):
        sep = b'\\'
        altsep = b'/'
        colon = b':'
        extendedpath = b'?'
        uncroot = b'\\\\'
        uncname = b'UNC'
    else:
        sep = '\\'
        altsep = '/'
        colon = ':'
        extendedpath = '?'
        uncroot = '\\\\'
        uncname = 'UNC'
    normp = p.replace(altsep, sep)

    # Handle a DOS drive path, rooted path, or relative path.
    if normp[:2] != uncroot:
        if p[1:2] == colon and p[:1].isalpha():
            index = 2
        else:
            index = 0
        if normp[index:index+1] == sep:
            return p[:index], p[index:index+1], p[index+1:]
        return p[:index], p[:0], p[index:]

    # Handle a UNC drive path.
    index = normp.find(sep, 2)
    if index == -1:
        index = len(p)
    domain = p[2:index]
    index += 1
    i = normp.find(sep, index)
    if i == -1:
        i = len(p)
    junction = p[index:i]
    index = i
    if domain == extendedpath and junction.upper() == uncname:
        index = normp.find(sep, index + 1)
        if index != -1:
            index = normp.find(sep, index + 1)
        if index == -1:
            index = len(p)
    return p[:index], p[index:index+1], p[index+1:]

Fixing up the result from PathCchSkipRoot() is currently more trouble than it's worth. It only supports extended paths for drives, volume GUID names, and "UNC" paths. This distinction goes too far because there are reasons to need an extended path with arbitrary device names, either to get a literal path (i.e. forward slashes in a device path) or to use a long path if long DOS paths are disabled for the current process or the system. Plus mounted volumes can have any device name, such as "BootPartition", which is legitimate within the Windows API (not just the NT API) by extension of DefineDosDeviceW(). Also, the support for drives in extended paths and volume GUID names has a serious bug. It splits r'\\?\C:spam' as (r'\\?\C:', 'spam'). The OS will try to access a device named "C:spam", so splitting a DOS drive-letter drive out of the device name is wrong.
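
The point about r'\\?\C:spam' can be sketched as follows: in an extended path, the device name is the whole component after the prefix, up to the next separator. The helper name `extended_device_name` is hypothetical, it handles str paths only, and it normalizes forward slashes the same way the implementations in this thread do, which is an assumption for illustration.

```python
def extended_device_name(p):
    r"""Return the device component of a "\\?\" path, or None."""
    sep = "\\"
    normp = p.replace("/", sep)
    prefix = sep * 2 + "?" + sep
    if not normp.startswith(prefix):
        return None
    # The device name ends at the next separator, not at a colon.
    end = normp.find(sep, len(prefix))
    if end == -1:
        end = len(p)
    return p[len(prefix):end]


print(extended_device_name(r"\\?\C:spam"))
print(extended_device_name(r"\\?\C:\spam"))
```

Under this rule the first path names a device "C:spam", while only the second names the "C:" drive.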

Maybe using PathCchSkipRoot() will be worth it if these quirks can be worked around in the C implementation of nt._path_splitroot().

@barneygale barneygale marked this pull request as draft March 14, 2022 18:54
@barneygale

I'm trying a more targeted approach to avoid backwards compatibility problems. PR here: #91882. Closing this PR.
