Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pathlib.Path.glob does not follow symlinks #77609

Closed
BrianMSheldon mannequin opened this issue May 5, 2018 · 29 comments
Closed

pathlib.Path.glob does not follow symlinks #77609

BrianMSheldon mannequin opened this issue May 5, 2018 · 29 comments
Labels
OS-windows stdlib Python modules in the Lib dir topic-pathlib type-bug An unexpected behavior, bug, or error

Comments

@BrianMSheldon
Copy link
Mannequin

BrianMSheldon mannequin commented May 5, 2018

BPO 33428
Nosy @pfmoore, @pitrou, @tjguk, @jreese, @zware, @zooba, @emilyemorehouse, @BrianMSheldon

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2018-05-05.06:23:00.558>
labels = ['type-bug', 'library', 'OS-windows']
title = 'pathlib.Path.glob does not follow symlinks'
updated_at = <Date 2020-04-29.11:12:49.344>
user = 'https://github.com/BrianMSheldon'

bugs.python.org fields:

activity = <Date 2020-04-29.11:12:49.344>
actor = 'Danya.Alexeyevsky'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'Windows']
creation = <Date 2018-05-05.06:23:00.558>
creator = 'brianmsheldon'
dependencies = []
files = []
hgrepos = []
issue_num = 33428
keywords = []
message_count = 4.0
messages = ['316197', '316724', '316880', '367640']
nosy_count = 9.0
nosy_names = ['paul.moore', 'pitrou', 'tim.golden', 'jreese', 'Danya.Alexeyevsky', 'zach.ware', 'steve.dower', 'emilyemorehouse', 'brianmsheldon']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue33428'
versions = ['Python 3.6']

Linked PRs

@BrianMSheldon
Copy link
Mannequin Author

BrianMSheldon mannequin commented May 5, 2018

Given a pathlib.Path that contains symlinked subfolders, Path.glob (and .rglob) do not follow symlinks. This is not consistent with glob.glob which does.

For example given the following:
C:\Folder
C:\Folder\Subfolder -> D:\Subfolder
D:\Subfolder\File.txt

pathlib.Path('C:/Folder').glob('**/*') yields the following paths:
WindowsPath('C:/Folder/Subfolder')

glob.glob('C:/Folder/**/*') yields the following paths:
'C:/Folder\Subfolder'
'C:/Folder\Subfolder\File.txt'

Notice how the contents of Subfolder are present in the glob.glob results but not for Path.glob.

I would expect Path.glob to be consistent with glob.glob. This is not the only inconsistency (e.g. bpo-22276, bpo-31202) and perhaps Path.glob should be re-implemented using glob.glob.

@BrianMSheldon BrianMSheldon mannequin added stdlib Python modules in the Lib dir OS-windows type-bug An unexpected behavior, bug, or error labels May 5, 2018
@jreese
Copy link
Mannequin

jreese mannequin commented May 15, 2018

This looks like an issue specific to Windows? I can't replicate on Mac, and given Windows' method of implementing "symlinks" as junctions.

@BrianMSheldon
Copy link
Mannequin Author

BrianMSheldon mannequin commented May 17, 2018

Windows does not implement symlinks as junctions. Windows has hardlinks, symlinks and junctions which are all distinctly different in behaviour.

I don't doubt that this is a Windows-specific issue, although I have not tested other platforms. Path.glob and .rglob does work for junctions and hardlinks but glob.glob works consistently for all three.

@DanyaAlexeyevsky
Copy link
Mannequin

DanyaAlexeyevsky mannequin commented Apr 29, 2020

I can reproduce the bug with Linux and python 3.7.5:

Python 3.7.5 (default, Apr 19 2020, 20:18:17) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pathlib import Path
>>> Path('a/b').mkdir(parents=True)
>>> Path('c/d').mkdir(parents=True)
>>> Path('a/c').symlink_to('../c')
>>> Path('e').symlink_to('c')
>>> list(Path('.').rglob('*'))
[PosixPath('e'), PosixPath('c'), PosixPath('a'), PosixPath('c/d'), PosixPath('a/c'), PosixPath('a/b')]

Expected result:

[PosixPath('e'), PosixPath('e/d'), PosixPath('c'), PosixPath('a'), PosixPath('c/d'), PosixPath('a/c'), PosixPath('a/c/d'), PosixPath('a/b')]

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@dlukes
Copy link

dlukes commented Nov 24, 2022

Following symlinks was disabled on purpose as a fix for #70200:

if entry_is_dir and not entry.is_symlink():

To re-enable it, we'd have to come up with a different mitigation for the symlink loops problem.

Or possibly, if hidden behind a follow_symlinks arg which would be False by default, it might be enough to just document it as a limitation when follow_symlinks=True? I don't know.

@ghost
Copy link

ghost commented Jan 25, 2023

On Ubuntu 22 I have the same problem

@barneygale
Copy link
Contributor

Or possibly, if hidden behind a follow_symlinks arg which would be False by default, it might be enough to just document it as a limitation when follow_symlinks=True? I don't know.

glob() still follows symlinks when matching segments like foo/, */ and so forth; it only refuses to follow symlinks when it encounters a ** wildcard. So the current behaviour is roughly follow_symlinks=Sometimes.

I realised this because I've been trying to implement an iterative version of rglob(), building upon walk() and filtering paths with a compiled re.Pattern object. walk() has its own follow_symlinks argument that accepts True and False, but neither option aligns with the existing behaviour, so it seems like a dead end unless this "bug" is "fixed".

@barneygale
Copy link
Contributor

zsh has a *** wildcard that recurses into symlinks to directories, see Recursive Globbing here: https://linux.die.net/man/1/zshexpn

@barneygale
Copy link
Contributor

Having thought about this some more, I'd like to propose that we add follow_symlinks arguments to glob() and rglob(), where:

  • follow_symlinks=False treats symlinks as files (default)
  • follow_symlinks=True follows symlinks to directories

Note that the default behaviour would therefore change: a pattern like foo/bar or */bar would no longer match foo/bar if foo is a symlink. This is consistent with how **/bar works today.

With that in place, we could implement glob() iteratively atop walk(), which would address #89727 and substantially improve performance!

Thoughts?

@zooba
Copy link
Member

zooba commented Feb 6, 2023

Is it possible to follow symlinks when matching non-wildcard segments? I'm not familiar with the implementation, though if you're planning to put it atop walk() then the current implementation may not matter.

It might have to be a different mode, but if it preserves the current behaviour, it may be worth it. I can't decide whether it meets my expectations better or not, but it does seem we can avoid the harm of cycles in that case.

@barneygale
Copy link
Contributor

barneygale commented Feb 7, 2023

With the walk()-based implementation of glob() I have in mind, it wouldn't be possible to treat ** differently to other patterns w.r.t following symlinks.

The current glob() implementation performs unnecessary work when it's given a pattern that contains a non-terminal recursive wildcard, e.g. **/a.py. It first walks the entire directory tree when it encounters the ** wildcard and records all the paths. It then appends /a.py to all those paths and stat()s them. This second step is unnecessary! We've just walked the directory tree and established all the files/directories that exist, so we shouldn't need to check the FS again.

In a possible future implementation, glob() could walk a directory tree when it hits a ** wildcard (as before), and then simply filter the results through a regex. This means any tokens following ** in the pattern won't incur any stat()s or scandir()s. It also means we don't need to maintain a set of paths we've yielded:

cpython/Lib/pathlib.py

Lines 193 to 202 in f87f6e2

yielded = set()
try:
successor_select = self.successor._select_from
for starting_point in self._iterate_directories(parent_path, is_dir, scandir):
for p in successor_select(starting_point, is_dir, exists, scandir, normcase):
if p not in yielded:
yield p
yielded.add(p)
finally:
yielded.clear()

However, this only works if we treat symlinks the same no matter whether we've hit a ** token or some other kind of token. Otherwise the filter will include or exclude the wrong things.

@barneygale
Copy link
Contributor

it does seem we can avoid the harm of cycles in that case

Symlinks really complicate everything don't they? :)

Another option: swallow ELOOP errors, like glob.glob() does? It seems reasonable to me. We could add an on_error argument to glob() (we already have one in walk()) if users want to do something different.

@zooba
Copy link
Member

zooba commented Feb 8, 2023

Ah I see, yeah, there's not really any alternative for handling **, and there's no real advantage to only collecting directories.

Refusing to traverse on ELOOP and also doing our own cycle detection for longer cycles is probably required anyway, and then use follow_symlinks to turn it off entirely.

I feel like on_error is too low-level for a pathlib API. I can imagine wanting to filter to only symlinks (so I know what to follow up on), but we don't have any APIs for that today and I don't think this is the point to add them.

So given all that, could we default to follow_symlinks=True (without following errors or following the same link multiple times)?

@barneygale
Copy link
Contributor

Q: how do you suggest we detect cycles? Calling resolve() / os.path.realpath() on every symlink will be quite expensive I think? Might be wrong.

@zooba
Copy link
Member

zooba commented Feb 20, 2023

If you have a ton of directory symlinks, sure, but who does that? And I'm pretty sure that's the only generic option (we could try to be clever and remember e.g. the inode of the first file in each directory, but I doubt that's as portable). We only have to check on the way into the symlink - not for every file inside of it (unless it's another directory symlink).

Maybe we can remember devs+inodes of the symlinks themselves and skip the check if it's a non-zero1 value that we haven't seen before?

And now I'm thinking about this, how do we want to handle multiple non-cyclic symlinks into the same directory? If our search root has 100 symlinks to the same directory, none of them are a cycle, but we'll return the same files 100 times. That's probably the right thing to do.

Footnotes

  1. Zero is the usual value for "inodes are not supported", but if they're all going to return zero then we'll check them all anyway, so the non-zero check probably doesn't change anything.

@barneygale
Copy link
Contributor

barneygale commented Feb 20, 2023

I rather like glob.glob()'s behaviour, which is to follow (and emit) symlinks in a cycle until an OSError for "too many symlinks in path" is raised, silently consume that error, and continue the rest of the walk. pathlib.Path.walk() does the same thing, but makes it somewhat customizable with an on_error argument. My shell and the find executable seem to do this too. It's a bit "stupid", I suppose, but I worry we'll be endlessly chasing edge cases if we try to be too clever with symlinks.

@zooba
Copy link
Member

zooba commented Feb 20, 2023

Fair. I imagine enough stuff breaks down with symlink cycles that people try to avoid them anyway.

barneygale added a commit to barneygale/cpython that referenced this issue Mar 12, 2023
Add a keyword-only *follow_symlinks* parameter to `pathlib.Path.glob()` and
`rglob()`, defaulting to false. When set to true, symlinks to directories
are followed as if they were directories.

Previously these methods followed symlinks except when evaluating "`**`"
wildcards; on Windows they returned paths in filesystem casing except when
evaluating non-wildcard tokens. Both these problems are solved here. This
will allow us to address pythonGH-102613 and pythonGH-81079 in future commits.
@barneygale
Copy link
Contributor

PR available: #102616

@maxnoe
Copy link
Contributor

maxnoe commented Mar 27, 2023

If you have a ton of directory symlinks, sure, but who does that?

We do :)

Details don't matter really here, but we have a common directory for data, let's call it /data/base and there we have either directories directly for one specific kind of data preprocessing and symlinks to other directories for some other kind of preprocessing.

/data/base
   /2023-01-02/   # directory
   /2023-01-03 -> /data/preprocessed/2023-01-03  # symlink
/data/preprocessed/
  /2023-01-03

We'd very much like to do stuff like Path("/data/base").rglob("foo.bar"), but that doesn't work currently.

@zooba
Copy link
Member

zooba commented Mar 27, 2023

Great scenario, appreciate you showing us.

Will the support for follow_symlinks=True be a good fix for you in future? I suspect you may be dealing with a case where you'd prefer following symlinks to become the default.

@maxnoe
Copy link
Contributor

maxnoe commented Mar 27, 2023

Will the support for follow_symlinks=True be a good fix for you in future?

yes, having to specify an option is no issue.

I suspect you may be dealing with a case where you'd prefer following symlinks to become the default.

Not really, it's all under our control, so as long as there is the option to follow symlinks, at least we would be fine.

@barneygale
Copy link
Contributor

barneygale commented Mar 27, 2023

Suggestion on following symlinks:

Literal (e.g. foo/) Wildcard (e.g. foo*/) Recursive wildcard (**/)
glob(follow_symlinks=False) ✓ (follow)
glob(follow_symlinks=True)

We default to follow_symlinks=False, which preserves existing behaviour. Users could set follow_symlinks=True to follow symlinks in all circumstances.

We'd be left with the problem that glob(follow_symlinks=False) follows symlinks in some circumstances. But perhaps that's a smaller problem than the present inability to follow symlinks when expanding **?

barneygale added a commit that referenced this issue May 3, 2023
…gment case (GH-104116)

We now use `_WildcardSelector` to evaluate literal pattern segments, which
allows us to retrieve the real filesystem case.

This change is necessary in order to implement a *case_sensitive* argument
(see GH-81079) and a *follow_symlinks* argument (see GH-77609).
barneygale added a commit to barneygale/cpython that referenced this issue May 3, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 4, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 10, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 11, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 15, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 23, 2023
barneygale added a commit that referenced this issue May 29, 2023
…02616)

Add a keyword-only *follow_symlinks* parameter to `pathlib.Path.glob()` and`rglob()`.

When *follow_symlinks* is `None` (the default), these methods follow symlinks except when evaluating "`**`" wildcards. When set to true or false, symlinks are always or never followed, respectively.
@barneygale
Copy link
Contributor

I've added a keyword-only follow_symlinks argument, which will be available in Python 3.13. See 9700dd1 / #102616

Thank you everyone, and especially @zooba, for the help in getting this through. Resolving!

@gpshead
Copy link
Member

gpshead commented Mar 23, 2024

Lets change the new follow_symlinks=None to follow_symlinks=NotRecursive (or some other explicitly named sentinel value defined in the pathlib module).

People reading code that explicitly specifies follow_symlinks=None as would be needed if we ever decide to change the default would not be able to reason about what that does and why =None isn't the same behavior as =False or =0. Seeing a named constant makes it much more clear to those not intimately familiar with the API.

This is good even if we never decide to work towards a default change.

@gpshead gpshead reopened this Mar 23, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Mar 28, 2024
Replace tri-state `follow_symlinks` with boolean `recurse_symlinks`
argument. The new argument controls whether symlinks are followed when
expanding recursive `**` wildcards. The arguments correspond as follows:

    follow_symlinks  recurse_symlinks
    ===============  ================
    False            N/A
    None             False
    True             True
barneygale added a commit that referenced this issue Apr 5, 2024
…7311)

Replace tri-state `follow_symlinks` with boolean `recurse_symlinks` argument. The new argument controls whether symlinks are followed when expanding recursive `**` wildcards. The possible argument values correspond as follows:

    follow_symlinks  recurse_symlinks
    ===============  ================
    False            N/A
    None             False
    True             True

We therefore drop support for not following symlinks when expanding non-recursive pattern parts; it wasn't requested in the original issue, and it's a feature not found in any shells.

This makes the API a easier to grok by eliminating `None` as an option.

No news blurb as `follow_symlinks` was new in 3.13.
@barneygale
Copy link
Contributor

Re-resolving - the argument is now recurse_symlinks and accepts a boolean.

diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
python#117311)

Replace tri-state `follow_symlinks` with boolean `recurse_symlinks` argument. The new argument controls whether symlinks are followed when expanding recursive `**` wildcards. The possible argument values correspond as follows:

    follow_symlinks  recurse_symlinks
    ===============  ================
    False            N/A
    None             False
    True             True

We therefore drop support for not following symlinks when expanding non-recursive pattern parts; it wasn't requested in the original issue, and it's a feature not found in any shells.

This makes the API a easier to grok by eliminating `None` as an option.

No news blurb as `follow_symlinks` was new in 3.13.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OS-windows stdlib Python modules in the Lib dir topic-pathlib type-bug An unexpected behavior, bug, or error
Projects
Development

No branches or pull requests

6 participants