Python: Extract files in hidden dirs by default #19424

tausbn · 2025-04-30T12:38:16Z

Changes the default behaviour of the Python extractor so that files inside hidden directories are extracted by default.

Also adds an extractor option, skip_hidden_directories, which can be set to true in order to revert to the old behaviour.

Finally, I made the logic surrounding what is logged in various cases a bit more obvious.

Technically this changes the behaviour of the extractor (in that hidden excluded files will now be logged as (excluded)), but I think this makes more sense anyway.

Copilot

Pull Request Overview

This PR changes the Python extractor to include files in hidden directories by default, adds an option to revert to the old behavior, and makes the logging logic more explicit.

Default extraction now includes files in hidden directories.
New skip_hidden_directories extractor option to restore previous behavior.
Refactored logging branches and updated tests, change notes, and schema.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

File	Description
python/ql/test/extractor-tests/filter-option/Test.expected	Added expected line for hidden_foo.py script.
python/ql/test/2/extractor-tests/hidden/test.expected	Updated output to include files under hidden directories.
python/ql/lib/change-notes/2025-04-30-extract-hidden-files-by-default.md	Documented new default behavior and `skip_hidden_directories` option.
python/extractor/semmle/traverser.py	Reordered exclusion checks, added env‑var‑driven `is_hidden` override.
python/extractor/cli-integration-test/hidden-files/test.sh	Added CLI integration tests for default and skipped hidden behavior.
python/extractor/cli-integration-test/hidden-files/repo_dir/foo.py	Base script for the CLI test.
python/extractor/cli-integration-test/hidden-files/query.ql	Query to list extracted file names.
python/extractor/cli-integration-test/hidden-files/query-default.expected	Expected output when hidden files are included.
python/extractor/cli-integration-test/hidden-files/query-skipped.expected	Expected output when `skip_hidden_directories=true`.
python/codeql-extractor.yml	Added `skip_hidden_directories` option to extractor schema.

Comments suppressed due to low confidence (3)

python/codeql-extractor.yml:47

[nitpick] Consider defining skip_hidden_directories as a boolean type instead of a string with a pattern, to align with typical boolean option conventions.

skip_hidden_directories:

python/extractor/semmle/traverser.py:103

There is no test for the Windows-specific is_hidden branch when skip_hidden_directories=true on os.name == 'nt'. Consider adding a Windows-mode test to cover that path.

if os.environ.get("CODEQL_EXTRACTOR_PYTHON_OPTION_SKIP_HIDDEN_DIRECTORIES", "false") == "false":

python/extractor/cli-integration-test/hidden-files/repo_dir/foo.py:1

The CLI integration test expectations reference .hidden_file.py and visible_file_in_hidden_dir.py, but those files are not present in repo_dir. Add these files to match query-default.expected and query-skipped.expected.

print(42)

yoff

This is OK. However, setting is_hidden to always return false is almost "too clever" as it becomes semantically a bit confusing. I would have preferred a more direct structure; something like

is_hidden stays as is
define is_excluded (or is_traversed) to deal with the logic involving exclude_paths, is_hidden and the configuration regarding traversing hidden files
use is_excluded on line 86

is_excluded would have to return a reason for logging so could not be a clean boolean, but it still feels like a "simpler" solution.
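A minimal sketch of the structure suggested above, assuming a helper that returns a reason string (or `None`) instead of a boolean. The names `exclusion_reason` and `skip_hidden`, and the simplified `is_hidden`, are hypothetical and are not taken from `traverser.py`:

```python
from typing import Collection, Optional


def is_hidden(path: str) -> bool:
    # Simplified stand-in: treat a path as hidden when its last component starts with ".".
    return path.rstrip("/").rsplit("/", 1)[-1].startswith(".")


def exclusion_reason(path: str, exclude_paths: Collection[str], skip_hidden: bool) -> Optional[str]:
    # Explicit excludes are checked first, so a path that is both hidden and
    # excluded is reported as "excluded".
    if path in exclude_paths:
        return "excluded"
    if skip_hidden and is_hidden(path):
        return "hidden"
    # None means "traverse this path"; there is no reason to log.
    return None
```

A caller would skip the path whenever the result is not `None` and could log the returned reason. Using `None` rather than an empty string keeps the "not excluded" case distinct from any real reason.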

tausbn · 2025-05-02T21:02:29Z

This is OK. However, setting is_hidden to always return false is almost "too clever" as it becomes semantically a bit confusing.

That's a fair point. In fact, originally my intention was to have the "do we skip hidden dirs or not" logic be located inside the _treewalk function (hence why I decided to refactor this part of the code to only make one call to is_hidden). But then I realised that is_hidden was already a function that we were defining ourselves, and so it seemed less intrusive to just add a third definition for that function.

Really, what I think we should do is get rid of is_hidden entirely, and I'm beginning to wonder if that's not an even better solution overall. If you want to exclude hidden directories, then you should exclude them using the existing mechanisms. I need to double-check the behaviour, but I would hope that excluding **/.*/** or something along those lines would do the trick. (Although the double ** is a bit weird, and for Windows it won't match the current behaviour.)

To me, at least, it seems like a much better default behaviour to say "we extract all Python files" rather than have weird heuristics for which ones to extract or not.

I would have preferred a more direct structure; something like

is_hidden stays as is

define is_excluded (or is_traversed) to deal with the logic involving exclude_paths, is_hidden and the configuration regarding traversing hidden files

use is_excluded on line 86

is_excluded would have to return a reason for logging so could not be a clean boolean, but it still feels like a "simpler" solution.

I feel abstracting the exclusion logic into an is_excluded function would make the code less readable, and I feel like having this function return a reason would make the ergonomics a bit awkward (unless we decide to use the empty string as False, which would work but feels a bit icky).

Alternatively, how would you feel if I renamed is_hidden to is_skipped? Currently we only ever skip hidden directories, so this would have the same result. The reason logged (with debugging enabled) would need to change to (skipped) for consistency, but I don't think this would cause any issues.

I'm going to test if excluding hidden directories can be done using the existing mechanism for file filtering. If it does, then I'll just get rid of is_hidden entirely and rewrite the change note to highlight this new approach.

If you have a filter like `**/foo/**` set in the `paths-ignore` bit of your config file, then currently the following happens:

- First, the CodeQL CLI observes that this string ends in `/**` and strips off the `**` leaving `**/foo/`
- Then the Python extractor strips off leading and trailing `/` characters and proceeds to convert `**/foo` into a regex that is matched against files to (potentially) extract.

The trouble with this is that it leaves us unable to distinguish between, say, a file `foo.py` and a file `foo/bar.py`. In other words, we have lost the ability to exclude only the _folder_ `foo` and not any files that happen to start with `foo`.

To fix this, we instead make a note of whether the glob ends in a forward slash or not, and adjust the regex correspondingly.
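To illustrate the problem described in this commit message, here is a simplified sketch of a glob-to-regex conversion that remembers the trailing slash. It is not the code in `path_filters.py`: `glob_to_regex_sketch` is a hypothetical name, and it only supports `*` and `**` and assumes prefix matching against `/`-separated relative paths.

```python
import re


def glob_to_regex_sketch(glob: str) -> "re.Pattern[str]":
    glob = glob.strip()
    # Record the trailing slash *before* stripping, since stripping loses it.
    ends_in_slash = glob.endswith("/")
    parts = glob.strip("/").split("/")
    regex = ""
    for i, part in enumerate(parts):
        last = i == len(parts) - 1
        if part == "**":
            # Trailing "**" matches anything; elsewhere it matches zero or more directories.
            regex += ".*" if last else "(?:[^/]+/)*"
        else:
            # Escape the literal text, then let "*" match within one path component.
            regex += re.escape(part).replace(r"\*", "[^/]*")
            if not last:
                regex += "/"
    # A glob that named a directory must be followed by a separator in the path.
    if ends_in_slash and parts[-1] != "**":
        regex += "/"
    return re.compile(regex)


without_slash = glob_to_regex_sketch("**/foo")
with_slash = glob_to_regex_sketch("**/foo/")
hidden_dirs = glob_to_regex_sketch("**/.*/")
```

In this sketch `without_slash.match("foo.py")` succeeds, which is the unwanted behaviour, while `with_slash` only matches paths that pass through a directory named `foo`. Likewise `hidden_dirs` matches `src/.hidden/x.py` but not `src/.hidden_file.py`.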

If it is necessary to exclude hidden files, then adding
```
paths-ignore: ['**/.*/**']
```
to the relevant config file is recommended instead.

The second test case now sets the `paths-ignore` setting in the config file in order to skip files in hidden directories.

Removes the previously added extractor option and updates the change note to explain how to use `paths-ignore` to exclude files in hidden directories.

The previous version was tested on a version of the code where we had temporarily removed the `glob.strip("/")` bit, and so the bug didn't trigger then. We now correctly remember if the glob ends in `/`, and add an extra part in that case. This way, if the path ends with multiple slashes, they effectively get consolidated into a single one, which results in the correct semantics.

yoff

One suggestion for the change log, since you have been so precise around this difference in the code. And an observation on readability, but nothing blocking.

python/ql/lib/change-notes/2025-04-30-extract-hidden-files-by-default.md

yoff · 2025-05-16T11:35:12Z

python/extractor/semmle/path_filters.py

+    # When the glob ends in `/`, we need to remember this so that we don't accidentally add an
+    # extra separator to the final regex.
+    end_sep = "" if glob.endswith("/") else SEP
    glob = glob.strip().strip("/")
    parts = glob.split("/")
    #Trailing '**' is redundant, so strip it off.
    if parts[-1] == "**":
        parts = parts[:-1]
        if not parts:
            return ".*"
+    # The `glob.strip("/")` call above will have removed all trailing slashes, but if there was at
+    # least one trailing slash, we want there to be an extra part, so we add it explicitly here in
+    # that case, using the emptyness of `end_sep` as a proxy.
+    if end_sep == "":
+        parts += [""]


As I understand it, this is equivalent to:

    glob = glob.strip().lstrip("/").rstrip_only_repeated("/")
    end_sep = "" if glob.endswith("/") else SEP
    parts = glob.split("/")
    #Trailing '**' is redundant, so strip it off.
    if parts[-1] == "**":
        parts = parts[:-1]
        if not parts:
            return ".*"

which might be simpler, if rstrip_only_repeated was something that existed :-)

I will not insist on rewriting this, though.

Indeed. In hindsight, perhaps I should have done

    end_sep = ...
    glob = glob.strip().strip("/")
    if end_sep == "":
        glob += "/"

Or perhaps an even better idea would be to first transform all runs of / into single occurrences, and then only do lstrip.

I think I'll leave it as-is for now. At some point we can hopefully just get rid of this code altogether.
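For reference, the alternative floated above (collapse runs of `/` into single occurrences, then only `lstrip`) could look roughly like this. It is a sketch of the splitting step only; `split_glob_sketch` is a hypothetical name and the handling of a trailing `**` is left out.

```python
import re
from typing import List


def split_glob_sketch(glob: str) -> List[str]:
    # Collapse every run of "/" into a single "/".
    glob = re.sub(r"/+", "/", glob.strip())
    # Only strip leading slashes, so a single trailing "/" survives.
    glob = glob.lstrip("/")
    # A trailing "/" now shows up as an empty final part, which plays the same
    # role as the explicit `parts += [""]` in the diff.
    return glob.split("/")
```

With this, `split_glob_sketch("**/foo//")` yields `["**", "foo", ""]` and `split_glob_sketch("/foo")` yields `["foo"]`, so no separate `end_sep` bookkeeping is needed to tell the two cases apart.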

Co-authored-by: yoff <yoff@github.com>

yoff

LGTM

github-actions bot added Python documentation labels Apr 30, 2025

tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from eef5a49 to 4d00556 Compare April 30, 2025 12:40

tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from 4d00556 to 5cf3242 Compare May 2, 2025 12:45

tausbn changed the title ~~Python: Extract hidden files/dirs by default~~ Python: Extract files in hidden dirs by default May 2, 2025

tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from c3ac796 to be88e61 Compare May 2, 2025 14:01

tausbn added 3 commits May 2, 2025 14:27

Python: Add integration test

605f2bf

Python: Add change note

67d04d5

Python: Update extractor tests

2ded42c

tausbn force-pushed the tausbn/python-extract-hidden-file-by-default branch from be88e61 to 2ded42c Compare May 2, 2025 14:27

tausbn marked this pull request as ready for review May 2, 2025 14:28

Copilot AI review requested due to automatic review settings May 2, 2025 14:28

tausbn requested a review from a team as a code owner May 2, 2025 14:28

Copilot AI reviewed May 2, 2025

yoff previously approved these changes May 2, 2025

tausbn added 5 commits May 15, 2025 14:48

Python: Remove special casing of hidden files

98388be

Python: Update test

96558b5

Python: Update change note and extractor config

72ae633

Python: Bump extractor version

c8cca12

tausbn dismissed yoff’s stale review via c8cca12 May 15, 2025 14:59

tausbn requested a review from yoff May 16, 2025 10:22

yoff previously approved these changes May 16, 2025

Python: Update change note

9ee3e4c

Co-authored-by: yoff <yoff@github.com>

tausbn dismissed yoff’s stale review via 9ee3e4c May 16, 2025 11:50

yoff requested review from yoff May 16, 2025 12:41

yoff approved these changes May 16, 2025

tausbn merged commit 579cf4a into main May 16, 2025
15 checks passed

tausbn deleted the tausbn/python-extract-hidden-file-by-default branch May 16, 2025 12:43
