@glankk
Collaborator

@glankk glankk commented Jul 6, 2025

No description provided.

@glankk glankk force-pushed the 12821 branch 3 times, most recently from f29f42d to 572dedf, July 6, 2025 07:00
@firewave
Collaborator

firewave commented Jul 6, 2025

See #7540. I had some issues with the behavior which I have not outlined in that PR yet because the code contains a bug which made the tests all pass although they should not have.

Also std::regex has abysmal performance and no code should ever use it. Thus we should not rely on its interface. See #6211 for a wrapper for our existing regex. That is complete and working but has a memory leak. I think the leak should be ignored for now since it is intermediate code anyways. I had a local branch which replaces the implementation with std::regex as another intermediate step to see if our wrapper will work with other implementations.

  proj_dir = tmp_path / 'proj2'
  shutil.copytree(__proj_dir, proj_dir)
- create_gui_project_file(os.path.join(tmp_path, 'test.cppcheck'), root_path='proj2', import_project='proj2/proj2.sln', exclude_paths=['b'])
+ create_gui_project_file(os.path.join(tmp_path, 'test.cppcheck'), root_path='proj2', import_project='proj2/proj2.sln', exclude_paths=['b/'])
Collaborator

I really dislike that we need to rely on some options ending with / so they are treated as a path. That also complicated things while working on my PR.

Collaborator Author

I agree. I think the path matching is currently trying to do too much, which makes the syntax weird and difficult to implement. Ideally I would like to make it simpler, but I've been hesitant to do so because I don't know what might break because of it.

Owner

> Ideally I would like to make it simpler, but I've been hesitant to do so because I don't know what might break because of it.

I dislike it also.. but it is good imho that you try to limit the effects for now.. it's already a big PR anyway.

@firewave
Collaborator

firewave commented Jul 6, 2025

See also https://trac.cppcheck.net/ticket/12268. And if you work on a ticket please assign yourself to it in Trac.

@glankk
Collaborator Author

glankk commented Jul 7, 2025

> See #7540. I had some issues with the behavior which I have not outlined in that PR yet because the code contains a bug which made the tests all pass although they should not have.
>
> Also std::regex has abysmal performance and no code should ever use it. Thus we should not rely on its interface. See #6211 for a wrapper for our existing regex. That is complete and working but has a memory leak. I think the leak should be ignored for now since it is intermediate code anyways. I had a local branch which replaces the implementation with std::regex as another intermediate step to see if our wrapper will work with other implementations.

I only noticed you were also working on PathMatch after I made this, sorry about that.

I agree the standard regex is slow and I'd rather use pcre, but I didn't want to make it a hard dependency just for this. For the common use case of PathMatch, which is matching maybe a few hundred files or so against a handful of patterns, the performance penalty of using std::regex instead is on the order of milliseconds per run. Having a backend-agnostic regex wrapper sounds like a good idea though.

@firewave
Collaborator

firewave commented Jul 7, 2025

> I only noticed you were also working on PathMatch after I made this, sorry about that.

No problem. I started working on it thinking it would be a low-hanging fruit, which wasn't the case. And I forgot to tag it with the ticket. And the one you worked on was not properly triaged.

> I agree the standard regex is slow and I'd rather use pcre, but I didn't want to make it a hard dependency just for this.

Right, PCRE is currently not a hard dependency. So PCRE2 might not be a good replacement.

> For the common use case of PathMatch, which is matching maybe a few hundred files or so against a handful of patterns, the performance penalty of using std::regex instead is on the order of milliseconds per run.

That might depend on the scenario. If there are thousands of files it could be slow. It has been ages since I tried using std::regex, but it was beyond unacceptable, and I doubt it has been improved upon (it seems like it was completely abandoned and nobody has the balls to suggest removing it from the standard). PCRE feels like a nop in comparison.

> Having a backend-agnostic regex wrapper sounds like a good idea though.

I will try to get the regex refactoring in. The first step is awkward but with some help we might get it into better shape fast.

@firewave
Collaborator

firewave commented Jul 7, 2025

With a glob implementation inline in PathMatch we now have two different implementations. The other is matchglob() in utils.cpp. We should only have a single one.

That is probably where my performance concerns might have come from since the latter is also used outside of path handling stuff.

@glankk
Collaborator Author

glankk commented Jul 8, 2025

> With a glob implementation inline in PathMatch we now have two different implementations. The other is matchglob() in utils.cpp. We should only have a single one.
>
> That is probably where my performance concerns might have come from since the latter is also used outside of path handling stuff.

I mostly agree with this. For the three places where matchglob is currently used for something other than paths it makes sense to keep it that way; all other uses of matchglob should be changed to use PathMatch instead.

@firewave
Collaborator

firewave commented Jul 8, 2025

> I mostly agree with this. For the three places where matchglob is currently used for something other than paths it makes sense to keep it that way; all other uses of matchglob should be changed to use PathMatch instead.

I agree since it makes things more explicit. But I would still prefer if we only have a single implementation of the glob logic.

@glankk glankk marked this pull request as ready for review July 11, 2025 16:01
@glankk
Collaborator Author

glankk commented Jul 11, 2025

I've updated this to use PathMatch everywhere a pattern is matched against a file, so that file matching is consistent. While doing so I found that there are places where the performance of PathMatch is actually quite important (especially in suppressions), so I've opted to use a hand-written matcher instead of regex.

Another noteworthy change is that directories can now be matched without a trailing '/'. This has the awkward side-effect that 'foo.cpp' matches 'foo.cpp/bar.cpp' when 'foo.cpp' happens to be a directory, and 'foo/' matches 'foo' even if foo is not a directory (enforcing directories for such patterns would require a file system lookup), but I still think it's preferable.

Regarding matchglob, I think it makes sense to have a more general wildcard implementation in matchglob, and a separate implementation in PathMatch, because I'm not convinced that the rules for path globs should be the same as for matching general wildcards in error IDs and such. Path globs need to match on path component boundaries, and '*' and '?' do not match path separators. For matching path separators as well, '**' is used instead.

These changes fix #12821, #12268, and #13997. Possibly related: #13983.

  ASSERT(!fillSettingsFromArgs(argv));
  ASSERT_EQUALS(1, parser->getIgnoredPaths().size());
- ASSERT_EQUALS("src/file.cpp", parser->getIgnoredPaths()[0]);
+ ASSERT_EQUALS("src\\file.cpp", parser->getIgnoredPaths()[0]);
Owner

Imho it is preferable to always use / as path separator internally. Then we don't have to ensure that backslash is handled properly everywhere. Would it cause problems to keep this behavior?

@danmar
Owner

danmar commented Jul 15, 2025

The ticket and PR are called "GUI: exclude folders in imported project".

I think this PR does so much more. Either we should create a new ticket that better describes the work in this PR, or we should rename the ticket and PR.

@glankk
Collaborator Author

glankk commented Jul 15, 2025

Here's an overview of the current state of this PR. Since the scope has grown beyond the original ticket I think it'd be appropriate to open a new ticket.

Summary of changes:

  • PathMatch changes, notably:
    • Now supports globs. The syntax is specific to path-matching and distinct from matchglob which has been generalized and is only used for e.g. suppression wildcards.
    • Directories don't require trailing slash
    • Better support for relative paths with an explicit syntax
  • The following now use PathMatch instead of matchglob:
    • File filters (with --file-filter=)
    • Ignored paths for projects (with -i <dir> and <exclude>)
    • Suppression filenames
  • The following already used PathMatch and will be affected:
    • Ignored paths for CLI (with -i <dir>)
    • GUI exclude list

New PathMatch behavior:

All match patterns and paths are canonicalized during matching.
/foo/bar/./.. => /foo
/foo/../bar/./.// => /bar
Trailing slashes are removed and thus have no special meaning.
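
The canonicalization step can be modeled in a few lines. This is only an illustrative Python sketch, not the PathMatch code itself; `canonicalize` is a hypothetical helper name, and it relies on the standard `posixpath.normpath` to collapse '.', '..', and repeated or trailing slashes without touching the file system.

```python
import posixpath


def canonicalize(path: str) -> str:
    """Collapse '.', '..', duplicate and trailing slashes (purely syntactic)."""
    return posixpath.normpath(path)


print(canonicalize("/foo/bar/./.."))      # /foo
print(canonicalize("/foo/../bar/././/"))  # /bar
print(canonicalize("/foo/"))              # /foo
```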

Absolute patterns are matched from the start until a path component boundary in the canonicalized path.
Example:

	/foo/
matches
	/foo
	^~~~ OK
	/foo/
	^~~~ OK
	/foo/bar
	^~~~ OK
but not
	/foobar
	^~~~ No match, only partial
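
A minimal sketch of the boundary rule for absolute patterns, assuming POSIX-style paths. `matches_absolute` is a hypothetical name used for illustration only; it is not the function in the PR.

```python
import posixpath


def matches_absolute(pattern: str, path: str) -> bool:
    """True if `pattern` covers `path` from the root up to a component boundary."""
    pattern = posixpath.normpath(pattern)
    path = posixpath.normpath(path)
    # Either the whole path is matched, or the match ends right before a '/'.
    return path == pattern or path.startswith(pattern + "/")


print(matches_absolute("/foo/", "/foo"))      # True
print(matches_absolute("/foo/", "/foo/bar"))  # True
print(matches_absolute("/foo/", "/foobar"))   # False
```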

Relative patterns must be explicit by having their first path component be '.' or '..'. They are translated to absolute patterns and then treated as above. This is a change from the previous behavior, where "relative" names could match starting at any path component (the previous syntax is still supported though, described further down).
Example:

	../test.c
In a project file located at
	/home/foo/test.cppcheck
is translated to
	/home/test.c
and matches
	/home/test.c
	^~~~~~~~~~~~ OK
	/home/test.c/test.h
	^~~~~~~~~~~~ OK
but not
	/home/test.cpp
	^~~~~~~~~~~~ No match, only partial
	/home/project/test.cpp
	^~~~~~ No match, only partial
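
The translation step for the project-file case can be sketched as follows. `resolve_relative` and its `base_file` parameter are hypothetical names; the sketch assumes the pattern is resolved against the directory containing the project file, as in the example above.

```python
import posixpath


def resolve_relative(pattern: str, base_file: str) -> str:
    """Turn a pattern starting with '.' or '..' into an absolute pattern,
    relative to the directory that contains `base_file`."""
    first = pattern.split("/", 1)[0]
    if first not in (".", ".."):
        return pattern  # not an explicit relative pattern; leave untouched
    base_dir = posixpath.dirname(base_file)
    return posixpath.normpath(posixpath.join(base_dir, pattern))


print(resolve_relative("../test.c", "/home/foo/test.cppcheck"))  # /home/test.c
```

The resulting absolute pattern is then matched like any other absolute pattern.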

Other patterns, not explicitly absolute or relative, are matched between any two path component boundaries in the canonicalized path. This works as before, except that matching the middle of a path previously required a trailing slash. Also, the matching is now always done on the canonical absolute path of the specified path name, where previously a match could depend on whether a filename was given as relative or absolute (e.g. /foo/ would match /foo/bar.c, but not ./bar.c even if the current directory is /foo).
Example:

	foo.h
matches
	/foo.h
	 ^~~~~ OK
	/home/foo.h
	      ^~~~~ OK
	/home/foo.h/bar.c
	      ^~~~~ OK	
but not
	/foo.hpp
	 ^~~~~ No match, only partial
	/home/foo.hpp
	      ^~~~~ No match, only partial
	/home/foo.hpp/bar.cpp
	      ^~~~~ No match, only partial
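
For these unanchored patterns, matching between component boundaries amounts to finding the pattern's components as a contiguous run inside the path's components. A rough sketch, assuming the path is already absolute and without globs; `matches_anywhere` is a hypothetical name.

```python
import posixpath


def matches_anywhere(pattern: str, path: str) -> bool:
    """True if the components of `pattern` appear as a contiguous run of
    components somewhere in the canonicalized `path`."""
    want = posixpath.normpath(pattern).split("/")
    have = posixpath.normpath(path).strip("/").split("/")
    return any(
        have[i:i + len(want)] == want
        for i in range(len(have) - len(want) + 1)
    )


print(matches_anywhere("foo.h", "/home/foo.h/bar.c"))  # True
print(matches_anywhere("foo.h", "/home/foo.hpp"))      # False
```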

Patterns can use globs:
** Matches zero or more of any character, including path separators.
* Matches zero or more of any character, not including path separators.
? Matches exactly one character, not including path separators.
Examples:

	foo.*
matches
	/foo./bar
	 ^~~~ OK
	/home/foo.c
	      ^~~~~ OK
	/home/foo.cppcheck/bar
	      ^~~~~~~~~~~~ OK

	/home/cppcheck/**.c
matches
	/home/cppcheck/src/foo.c
	^~~~~~~~~~~~~~~~~~~~~~~~ OK
	/home/cppcheck/.c
	^~~~~~~~~~~~~~~~~ OK (empty glob)

	/home/cppcheck/**/*.c
matches
	/home/cppcheck/src/bar.c
	^~~~~~~~~~~~~~~~~~~~~~~~ OK
	/home/cppcheck/src/foo/bar.c
	^~~~~~~~~~~~~~~~~~~~~~~~~~~~ OK
but not
	/home/cppcheck/.c
	^~~~~~~~~~~~~~~ No match, only partial
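
The glob rules can be checked against the examples with a small model. Note that this sketch translates the pattern to a regular expression for brevity, whereas the PR uses a hand-written matcher; `glob_to_regex` and `glob_matches` are hypothetical names, the path is assumed to be canonical already, and only absolute and unanchored patterns are handled.

```python
import re


def glob_to_regex(pattern: str) -> str:
    """Translate the path-glob syntax described above into a regex fragment."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")     # any characters, including '/'
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")  # any characters except '/'
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")   # exactly one character except '/'
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


def glob_matches(pattern: str, path: str) -> bool:
    """Match `pattern` against `path` on path component boundaries."""
    # Absolute patterns anchor at the root, others at any component boundary.
    start = "^" if pattern.startswith("/") else "(?:^|/)"
    # The match must end at a component boundary: a '/' or the end of the path.
    regex = start + glob_to_regex(pattern) + "(?:/|$)"
    return re.search(regex, path) is not None


print(glob_matches("foo.*", "/home/foo.cppcheck/bar"))             # True
print(glob_matches("/home/cppcheck/**.c", "/home/cppcheck/.c"))    # True
print(glob_matches("/home/cppcheck/**/*.c", "/home/cppcheck/.c"))  # False
```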

@glankk glankk changed the title from "Fix #12821 (GUI: exclude folders in imported project)" to "Fix #14021 (Better path matching)" Jul 15, 2025
@danmar
Owner

danmar commented Jul 15, 2025

I changed my mind; take the manual in a separate PR. The manual needs to describe the path matching much better.

@danmar danmar merged commit 4d5a7f9 into danmar:main Jul 16, 2025
63 checks passed
@danmar
Owner

danmar commented Jul 16, 2025

It's aggressive to merge this now so soon before the release but the old behavior is biting customers; I want to have this in 2.18.

@glankk glankk deleted the 12821 branch July 16, 2025 19:46
danmar added a commit that referenced this pull request Jul 19, 2025
#7645 added support for `**`
matching path separators, but didn't update `isValidGlobPattern` for
`--suppress` to enable it.

This also updates test cases and makes `isValidGlobPattern` more correct
and robust overall. In particular multiple consecutive `?` for a glob is
absolutely valid and frequently useful to ensure exactly or at least
some number of characters.

---------

Co-authored-by: Joel Johnson <mrjoel@stellarscience.com>
Co-authored-by: Daniel Marjamäki <daniel.marjamaki@gmail.com>
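
To illustrate the point about consecutive `?` in the commit message above: the sketch below uses Python's standard `fnmatch.fnmatchcase` purely as a stand-in for a general wildcard matcher. It is not cppcheck's `matchglob`, and the variable names are made up for the example.

```python
from fnmatch import fnmatchcase

exactly_three = "???"    # three characters, no more, no fewer
at_least_three = "???*"  # three characters followed by anything

print(fnmatchcase("abc", exactly_three))    # True
print(fnmatchcase("abcd", exactly_three))   # False
print(fnmatchcase("abcd", at_least_three))  # True
print(fnmatchcase("ab", at_least_three))    # False
```
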
danmar pushed a commit that referenced this pull request Jul 31, 2025
This is an addendum to #7645 that adds the ability to specify
directory-only matching in PathMatch patterns with a trailing slash. I
originally didn't add this because I wanted to keep PathMatch purely
syntactic, and this feature seemed like it would require filesystem
calls in the PathMatch code.

This PR solves that problem by lifting the responsibility of file mode
checking to the caller, thus keeping PathMatch purely syntactic while
still supporting directory-only matching.

Previously `/test/foo/` would match `/test/foo` even if `foo` is a
regular file. With this change the caller can specify the file mode of
the file named by the provided path, and `/test/foo/` will only match
`/test/foo` when the file mode is directory.

The semantics of patterns that do not have a trailing slash is
unchanged, e.g. `/test/foo` still matches `/test/foo/` and
`/test/foo/bar.cpp`, regardless of the file mode.

Also adds some more tests.
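
A rough model of the directory-only rule for absolute patterns, where the caller supplies the file mode so the matcher stays purely syntactic. `matches` and `path_is_dir` are hypothetical names, and treating a path below the pattern as a match is an assumption of this sketch (a path with children is necessarily a directory).

```python
import posixpath


def matches(pattern: str, path: str, path_is_dir: bool) -> bool:
    """Absolute-pattern match where a trailing '/' in `pattern` means
    'directories only'. The caller supplies `path_is_dir`; no file system
    access happens here."""
    dir_only = pattern.endswith("/")
    pattern = posixpath.normpath(pattern)
    path = posixpath.normpath(path)
    if path.startswith(pattern + "/"):
        return True  # the pattern names a parent of `path`
    if path == pattern:
        return path_is_dir or not dir_only
    return False


print(matches("/test/foo/", "/test/foo", path_is_dir=False))  # False
print(matches("/test/foo/", "/test/foo", path_is_dir=True))   # True
print(matches("/test/foo", "/test/foo", path_is_dir=False))   # True
```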