Python: Extract files in hidden dirs by default #19424
Changes the default behaviour of the Python extractor so files inside hidden directories are extracted by default. Also adds an extractor option, `skip_hidden_directories`, which can be set to `true` in order to revert to the old behaviour. Finally, I made the logic surrounding what is logged in various cases a bit more obvious. Technically this changes the behaviour of the extractor (in that hidden excluded files will now be logged as `(excluded)`), but I think this makes more sense anyway.
Pull Request Overview
This PR changes the Python extractor to include files in hidden directories by default, adds an option to revert to the old behavior, and makes the logging logic more explicit.
- Default extraction now includes files in hidden directories.
- New `skip_hidden_directories` extractor option to restore previous behavior.
- Refactored logging branches and updated tests, change notes, and schema.
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
File | Description |
---|---|
python/ql/test/extractor-tests/filter-option/Test.expected | Added expected line for hidden_foo.py script. |
python/ql/test/2/extractor-tests/hidden/test.expected | Updated output to include files under hidden directories. |
python/ql/lib/change-notes/2025-04-30-extract-hidden-files-by-default.md | Documented new default behavior and skip_hidden_directories option. |
python/extractor/semmle/traverser.py | Reordered exclusion checks, added env‑var‑driven is_hidden override. |
python/extractor/cli-integration-test/hidden-files/test.sh | Added CLI integration tests for default and skipped hidden behavior. |
python/extractor/cli-integration-test/hidden-files/repo_dir/foo.py | Base script for the CLI test. |
python/extractor/cli-integration-test/hidden-files/query.ql | Query to list extracted file names. |
python/extractor/cli-integration-test/hidden-files/query-default.expected | Expected output when hidden files are included. |
python/extractor/cli-integration-test/hidden-files/query-skipped.expected | Expected output when `skip_hidden_directories=true`. |
python/codeql-extractor.yml | Added skip_hidden_directories option to extractor schema. |
Comments suppressed due to low confidence (3)
python/codeql-extractor.yml:47
- [nitpick] Consider defining `skip_hidden_directories` as a boolean type instead of a string with a pattern, to align with typical boolean option conventions.
skip_hidden_directories:
python/extractor/semmle/traverser.py:103
- There is no test for the Windows-specific `is_hidden` branch when `skip_hidden_directories=true` on `os.name == 'nt'`. Consider adding a Windows-mode test to cover that path.
if os.environ.get("CODEQL_EXTRACTOR_PYTHON_OPTION_SKIP_HIDDEN_DIRECTORIES", "false") == "false":
python/extractor/cli-integration-test/hidden-files/repo_dir/foo.py:1
- The CLI integration test expectations reference `.hidden_file.py` and `visible_file_in_hidden_dir.py`, but those files are not present in `repo_dir`. Add these files to match `query-default.expected` and `query-skipped.expected`.
print(42)
This is OK. However, setting `is_hidden` to always return false is almost "too clever" as it becomes semantically a bit confusing. I would have preferred a more direct structure; something like
- `is_hidden` stays as is
- define `is_excluded` (or `is_traversed`) to deal with the logic involving `exclude_paths`, `is_hidden` and the configuration regarding traversing hidden files
- use `is_excluded` on line 86

`is_excluded` would have to return a reason for logging so could not be a clean boolean, but it still feels like a "simpler" solution.
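For illustration only, here is a minimal sketch of the structure suggested in this comment. It is not the extractor's code: `exclusion_reason`, its parameters, and the reason strings are hypothetical names, and the `is_hidden` shown here is a simplified dot-prefix check that ignores any Windows-specific handling.

```python
import os
import re

def is_hidden(path):
    """Simplified stand-in: a path is hidden if its last component starts with a dot."""
    name = os.path.basename(os.path.normpath(path))
    return name.startswith(".") and name not in (".", "..")

def exclusion_reason(path, exclude_patterns, skip_hidden):
    """Return a short reason string if `path` should be skipped, else None.

    `exclude_patterns` is a list of compiled regexes (hypothetical stand-in
    for the exclude paths); `skip_hidden` is the hidden-directory setting.
    """
    # Explicit exclusions are checked first, so a hidden path that is also
    # excluded is reported as "excluded".
    for pattern in exclude_patterns:
        if pattern.match(path):
            return "excluded"
    if skip_hidden and is_hidden(path):
        return "hidden"
    return None

# Usage: the caller logs the reason and skips the path when it is not None.
reason = exclusion_reason("repo/.git", [], skip_hidden=True)
if reason is not None:
    print("Ignoring repo/.git ({})".format(reason))
```

Returning a reason rather than a boolean keeps the logging decision in one place, which is the trade-off the comment describes.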
That's a fair point. In fact, originally my intention was to have the "do we skip hidden dirs or not" logic be located inside the Really, what I think we should do is get rid of To me, at least, it seems like a much better default behaviour to say "we extract all Python files" rather than have weird heuristics for which ones to extract or not.
I feel abstracting the exclusion logic into an Alternatively, how would you feel if I renamed I'm going to test if excluding hidden directories can be done using the existing mechanism for file filtering. If it does, then I'll just get rid of |
If you have a filter like `**/foo/**` set in the `paths-ignore` bit of your config file, then currently the following happens:
- First, the CodeQL CLI observes that this string ends in `/**` and strips off the `**`, leaving `**/foo/`.
- Then the Python extractor strips off leading and trailing `/` characters and proceeds to convert `**/foo` into a regex that is matched against files to (potentially) extract.

The trouble with this is that it leaves us unable to distinguish between, say, a file `foo.py` and a file `foo/bar.py`. In other words, we have lost the ability to exclude only the _folder_ `foo` and not any files that happen to start with `foo`.

To fix this, we instead make a note of whether the glob ends in a forward slash or not, and adjust the regex correspondingly.
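A rough sketch of why the trailing slash matters, assuming forward-slash paths and prefix matching with `re.match`. The function `sketch_glob_to_regex` is a hypothetical, heavily simplified translator (it only understands `**` and `*`) and is not the extractor's actual implementation.

```python
import re

def sketch_glob_to_regex(glob):
    """Translate a simplified glob into a regex used for prefix matching."""
    glob = glob.strip()
    # Remember the trailing slash before stripping it away.
    dir_only = glob.endswith("/")
    parts = [p for p in glob.strip("/").split("/") if p]
    pieces = []
    for part in parts:
        if part == "**":
            # Zero or more whole directories, each with its separator.
            pieces.append("(?:[^/]+/)*")
        else:
            # `*` matches within a single path component only.
            pieces.append(re.escape(part).replace(r"\*", "[^/]*"))
            pieces.append("/")
    # Without a trailing slash in the glob, the last component is a plain
    # prefix, so drop the separator that was added after it.
    if pieces and pieces[-1] == "/" and not dir_only:
        pieces.pop()
    return re.compile("".join(pieces))

folder_only = sketch_glob_to_regex("**/foo/")
prefix = sketch_glob_to_regex("**/foo")

print(bool(folder_only.match("foo/bar.py")))  # True: inside the folder `foo`
print(bool(folder_only.match("foo.py")))      # False: a file, not the folder
print(bool(prefix.match("foo.py")))           # True: the slash was lost
```

With the slash remembered, `**/foo/` excludes only the folder, whereas the stripped form `**/foo` also catches `foo.py`.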
If it is necessary to exclude hidden files, then adding
```
paths-ignore: ['**/.*/**']
```
to the relevant config file is recommended instead.
The second test case now sets the `paths-ignore` setting in the config file in order to skip files in hidden directories.
Removes the previously added extractor option and updates the change note to explain how to use `paths-ignore` to exclude files in hidden directories.
The previous version was tested on a version of the code where we had temporarily removed the `glob.strip("/")` bit, and so the bug didn't trigger then. We now correctly remember if the glob ends in `/`, and add an extra part in that case. This way, if the path ends with multiple slashes, they effectively get consolidated into a single one, which results in the correct semantics.
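To make the consolidation of trailing slashes concrete, the sketch below wraps the bookkeeping described in this commit message in a hypothetical helper, `split_glob`. The helper name, its return shape, and the `SEP` value are assumptions made for illustration; only the individual steps follow the description.

```python
SEP = "/"  # assumption: stand-in for the extractor's separator constant

def split_glob(glob):
    """Return (parts, end_sep) for a glob, remembering a trailing slash."""
    # Record whether the glob ended in a slash before stripping slashes.
    end_sep = "" if glob.endswith("/") else SEP
    glob = glob.strip().strip("/")
    parts = glob.split("/")
    # A trailing '**' is redundant, so drop it.
    if parts[-1] == "**":
        parts = parts[:-1]
    if not parts:
        return [], end_sep
    # All trailing slashes were stripped above; if there was at least one,
    # add a single empty part to stand in for them.
    if end_sep == "":
        parts += [""]
    return parts, end_sep

print(split_glob("**/foo"))     # (['**', 'foo'], '/')
print(split_glob("**/foo/"))    # (['**', 'foo', ''], '')
print(split_glob("**/foo///"))  # (['**', 'foo', ''], '')
```

One or many trailing slashes give the same parts, which is the consolidation the message refers to.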
One suggestion for the change log, since you have been so precise around this difference in the code. And an observation on readability, but nothing blocking.
python/ql/lib/change-notes/2025-04-30-extract-hidden-files-by-default.md
# When the glob ends in `/`, we need to remember this so that we don't accidentally add an
# extra separator to the final regex.
end_sep = "" if glob.endswith("/") else SEP
glob = glob.strip().strip("/")
parts = glob.split("/")
#Trailing '**' is redundant, so strip it off.
if parts[-1] == "**":
    parts = parts[:-1]
    if not parts:
        return ".*"
# The `glob.strip("/")` call above will have removed all trailing slashes, but if there was at
# least one trailing slash, we want there to be an extra part, so we add it explicitly here in
# that case, using the emptyness of `end_sep` as a proxy.
if end_sep == "":
    parts += [""]
As I understand it, this is equivalent to:
glob = glob.strip().lstrip("/").rstrip_only_repeated("/")
end_sep = "" if glob.endswith("/") else SEP
parts = glob.split("/")
#Trailing '**' is redundant, so strip it off.
if parts[-1] == "**":
    parts = parts[:-1]
    if not parts:
        return ".*"
which might be simpler, if `rstrip_only_repeated` was something that existed :-)
I will not insist on rewriting this, though.
Indeed. In hindsight, perhaps I should have done
end_sep = ...
glob = glob.strip().strip("/")
if end_sep == "":
    glob += "/"
Or perhaps an even better idea would be to first transform all runs of `/` into single occurrences, and then only do `lstrip`.
I think I'll leave it as-is for now. At some point we can hopefully just get rid of this code altogether.
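For reference, the "collapse runs of `/`, then only `lstrip`" idea could look roughly like the sketch below. `normalise_glob` is a hypothetical name and this is not code from the PR.

```python
import re

def normalise_glob(glob):
    """Collapse runs of '/' into one, then drop only the leading slash."""
    collapsed = re.sub(r"/+", "/", glob.strip())
    return collapsed.lstrip("/")

# A single trailing slash survives, so splitting yields the extra empty part.
print(normalise_glob("//**//foo///").split("/"))  # ['**', 'foo', '']
print(normalise_glob("**/foo").split("/"))        # ['**', 'foo']
```

After this normalisation, the trailing-slash case falls out of `split("/")` directly, without separately tracking `end_sep` for that purpose.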
Co-authored-by: yoff <yoff@github.com>
LGTM