Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-77609: Add follow_symlinks argument to pathlib.Path.glob() #102616

Merged
merged 17 commits into from
May 29, 2023

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Mar 12, 2023

Add a keyword-only follow_symlinks parameter to pathlib.Path.glob() and rglob().

When follow_symlinks is None (the default), these methods follow symlinks except when evaluating "**" wildcards. When set to true or false, symlinks are always or never followed, respectively.

This allows us to address GH-102613 in future.

Add a keyword-only *follow_symlinks* parameter to `pathlib.Path.glob()` and
`rglob()`, defaulting to false. When set to true, symlinks to directories
are followed as if they were directories.

Previously these methods followed symlinks except when evaluating "`**`"
wildcards; on Windows they returned paths in filesystem casing except when
evaluating non-wildcard tokens. Both these problems are solved here. This
will allow us to address pythonGH-102613 and pythonGH-81079 in future commits.
@barneygale
Copy link
Contributor Author

barneygale commented Mar 13, 2023

Table showing whether symlinks to directories are followed:

Literal (e.g. foo/) Wildcard (e.g. foo*/) Recursive wildcard (**/)
glob() (previous behaviour)
glob()
glob(follow_symlinks=True)

Table showing whether filesystem case is returned:

Literal (e.g. foo/) Wildcard (e.g. foo*/) Recursive wildcard (**/)
glob() (previous behaviour)
glob()

@zooba
Copy link
Member

zooba commented Mar 14, 2023

I wrote a fairly long post about my concerns on the changed behaviour, but then I realised that my main question is whether the symlink following changes applies to the entire path or only the bit in pattern? If symlinks in the base path are going to be followed regardless of the option, I'm far less concerned.

However, I'd like to be able to figure this out from the documentation :) So there's a change required there to clarify (and maybe I'll be more concerned if it clarifies the wrong way)

@barneygale
Copy link
Contributor Author

Symlinks in the base path (the p in p.glob(...)) are followed regardless of the option! I'll beef up the docs.

@barneygale
Copy link
Contributor Author

Thanks for the feedback btw!

I'm also nervous about changing behaviour. I could revise this to make follow_symlinks=None the default, and have None mean "follow symlinks except when expanding **" (i.e. current behaviour), but it's a bit icky. Or we could make follow_symlinks=False still follow symlinks when expanding literal or non-recursive wildcards (again preserving existing behaviour), but it feels like misleading naming to me. Or perhaps we introduce a zsh-like *** token that follows symlinks. I'm not totally sure.

@zooba
Copy link
Member

zooba commented Mar 14, 2023

I don't want to add ***. I think we can live without that.

I think the behaviour of path parts before any wildcard are easily assumed to be equivalent of being part of the base path. That is:

>>> Path("a1/b2").glob("c3/**")
# is the same as
>>> Path("a1/b2/c3").glob("**")

Everything from the first wildcard becomes a match pattern. We need to recurse deep enough to create paths with the same number of segments, and then match all the candidates against the pattern. Since we're now recursing, choosing not to follow symlinks by default is justifiable.

Given the current behaviour is to follow "spam*" but not ** (that is, it's inconsistent), choosing one or the other feels more like a bugfix. I think we can describe it as "directory symlinks matched by using a wildcard will only be followed if the flag is set", whereas currently it depends on context, and a symlink may or may not be followed.

I'm pretty much leaning towards needing a deprecation period though. I think that'll be safest, and probably necessary, so the best way to do that is to support =None for the option right now and plan to change it to False in 3.14. Then we can show a deprecation warning when it's used. Hopefully there's a fairly separate way to implement the mixed behaviour that makes it easy to see what to remove later on.

@barneygale
Copy link
Contributor Author

Right! I think I agree that leading literals should always be followed.

What about trailing literals, e.g. path.glob('**/.ci/tox.ini')? There's a nice optimization we can do in these cases: we walk the directory tree once when we hit the ** token, and then filter the results through a regex to match the .ci/tox.ini parts. This avoids us needing to scandir() the same directory multiple times, or needing to keep a set to de-duplicate results.

... but that optimization only works if the ** token walks the same directories as the .ci/tox.ini literal part would hit. If we treat symlinks inconsistently the whole thing breaks down D:

@barneygale
Copy link
Contributor Author

Also agree on the need to preserve existing behaviour and emit deprecation warnings before we change defaults. I'd argue the default should become true, not false, in future. It's the more useful behaviour. glob.glob() only supports following symlinks!

@barneygale
Copy link
Contributor Author

I've switched the default to follow_symlinks=None, which preserves previous behaviour.

Passing None to a boolean parameter wounds me. Wild idea: introduce new functions with better naming?

  • Path.expand(pattern, *, follow_symlinks=True) eventually replaces glob().
  • Path.search(pattern, *, follow_symlinks=True) eventually replaces rglob()

I mention this only because the names "glob" and "rglob" are a bit opaque unless you've done shell scripting before. The suggested names align with PurePath.match(). But feel free to shoot this right back down!

@zooba
Copy link
Member

zooba commented Mar 14, 2023

Passing None to a parameter is fine, especially since we're going to document it as deprecated and you shouldn't do it explicitly.

... but that optimization only works if the ** token walks the same directories as the .ci/tox.ini literal part would hit. If we treat symlinks inconsistently the whole thing breaks down D:

Yeah, and so if we have **/.*/tox.ini we won't be able to follow symlinks matching .* because we never followed them in the first place. We can only match real directories that start with a dot.

Maybe we can optimise the case where the ** is followed by a literal? And if there are any other wildcard characters (and an unspecified follow_symlinks) we revert to a slower algorithm?

@barneygale
Copy link
Contributor Author

barneygale commented Mar 14, 2023

Maybe we can optimise the case where the ** is followed by a literal? And if there are any other wildcard characters (and an unspecified follow_symlinks) we revert to a slower algorithm?

👍 sounds good. As things stand with this PR, follow_symlinks=False and True could be optimized, whereas follow_symlinks=None cannot. I'll leave the optimizations for a future PR as there's a couple other things I need to land first, e.g. #101398.

This PR still breaks backwards compat in two (minor?) ways:

  • ../ tokens in globs no longer navigate to the parent directory. Instead they're matched like other literals among the directory children, and so the match fails.
  • Filesystem case is returned for literal tokens, so if you have "C:/foo/bar.txt", then Path("C:/").glob("FOO*/BAR.TXT") yields Path("C:/foo/bar.txt") and not Path("C:/foo/BAR.TXT").

@zooba
Copy link
Member

zooba commented Mar 14, 2023

Matching filesystem case is an improvement, IMHO, even for back-compat. glob is a query operation, so there should be an expectation that the result comes from the FS and not the arguments.

On the .. case, I assume you mean after a wildcard? Before any wildcard we should be able to interpret it normally, and that is important. But I'm not even sure what behaviour I would expect after a wildcard? Is it meant to mean glob('**/spam/../eggs/*.txt') finds all text files in any eggs directory provided it has a spam directory next to it? That's... clever... I guess? But I don't think we've ever suggested it would work.

I also don't have any real feeling for whether we should normpath before globbing, or split at wildcards and normpath, or do something else. What do other tools do with this kind of pattern? Particularly something that "removes" the recursion marker like glob('**/../')

@barneygale
Copy link
Contributor Author

barneygale commented Mar 14, 2023

I also don't have any real feeling for whether we should normpath before globbing, or split at wildcards and normpath, or do something else. What do other tools do with this kind of pattern? Particularly something that "removes" the recursion marker like glob('**/../')

Pathlib uniformly avoids collapsing .. segments by purely lexical means, as it can change the meaning of a path that involves symlinks. So there's no use of normpath() anywhere!

I've fixed handling of .., so I think this might now conform with your expectations!

An updated table showing whether symlinks are followed:

Literal (e.g. foo/) Wildcard (e.g. foo*/) Recursive wildcard (**/)
glob() (previous behaviour)
glob()
glob(follow_symlinks=True)
glob(follow_symlinks=False)

Note that, for the moment, literals are handled exactly the same whether they appear before or after recursive wildcards.

@zooba
Copy link
Member

zooba commented Mar 14, 2023

Looks good. We should still add the deprecation warning for follow_symlinks=None (I forget if want to skip the warning for 3.12, but even so, add it in and we can comment it out so that at least it's done). The test that uses the default value should suppress the warning.

@barneygale
Copy link
Contributor Author

barneygale commented Mar 14, 2023

Done. But honestly I don't feel great about path.glob('foo') emitting a deprecation warning unless users pass follow_symlinks=False or follow_symlinks=True. It's wordy and annoying and in many cases (e.g. when users don't expect symlinks), the choice doesn't matter. Also makes it harder for users to write backwards-compatible code. I'd love if there's a better way I'm not seeing.

@barneygale
Copy link
Contributor Author

Oh right, I also need to update the docs examples to pass follow_symlinks. I'll do that tomorrow.

@barneygale
Copy link
Contributor Author

Alternative PR that adds support for a *** wildcard:

@barneygale
Copy link
Contributor Author

Closing this PR as I believe #104176 is the right way forward.

@barneygale barneygale closed this May 8, 2023
@barneygale
Copy link
Contributor Author

@zooba I'd be much more comfortable with this PR if we delayed deprecating follow_symlinks=None for a couple of releases. Would that be alright? Otherwise I think users would need to either drop support for 3.11 in their code, or suppress deprecation warnings, right?

@zooba
Copy link
Member

zooba commented May 10, 2023

We can document it as deprecated with no planned removal date (or planned change of default, in this case). Or just strongly recommend the use of True or False and say that None is intended for legacy compatibility only.

If we think (or people start saying) that the default is harmful/a trap, we can deprecate properly at that point (and wait two releases before changing it).

@barneygale barneygale reopened this May 10, 2023
@barneygale
Copy link
Contributor Author

I'll pick this up again once #104512 lands.

@barneygale barneygale marked this pull request as ready for review May 23, 2023 23:11
@barneygale
Copy link
Contributor Author

In fact, it doesn't really matter whether #104512 lands first. Marking as ready for review!

@barneygale
Copy link
Contributor Author

On follow_symlinks=None: I don't consider it wrong (after all, Guido considered it correct when he fixed #70200). And at this stage it's no less performant. When I fix #102613 and implement the walk-and-match stategy, I'll add a warning in the docs that follow_symlinks=None can be much less performant. Around the Python 3.15 timeframe I'll deprecate follow_symlinks=None. Hope that sounds alright, @zooba?

@barneygale barneygale requested a review from zooba May 27, 2023 18:12
@zooba
Copy link
Member

zooba commented May 29, 2023

after all, Guido considered it correct when he fixed #70200

Interesting.

I think if you can clearly explain the different behaviour ("following symlinks matched by ** but not other wildcard matches" would do) and make it sound useful (not just preserving compatibility), I don't think there's a need to deprecate it at all unless we want to change the default.

@barneygale
Copy link
Contributor Author

I think we're in agreement, then!

The docs in this PR say:

   By default, or when the *follow_symlinks* keyword-only argument is set to
   ``None``, this method follows symlinks except when expanding "``**``"
   wildcards. Set *follow_symlinks* to ``True`` to always follow symlinks, or
   ``False`` to treat all symlinks as files.

Does that sound OK to you?

@zooba
Copy link
Member

zooba commented May 29, 2023

Sounds good to me

@barneygale
Copy link
Contributor Author

Mega, thank you. Would you be up for reviewing the PR as a whole, if you have the time? No worries if not, and no urgency from my side.

The patch is somewhat repetitive due to the way globbing is currently implemented. If you find it offensive, we might want to simplify things first via #104512

@barneygale
Copy link
Contributor Author

Woot! Thank you Steve!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants