Refactor the generic cataloger to reuse filesystem indexing #1463
Performance enhancement
Several issues have been raised indicating that the cataloging process is slow (e.g. #1355, #1446, #1353).
A large part of this is due to glob pattern searches.
More precisely, each cataloger has a collection of defined glob patterns, many of which search the entire file system (for example, **/*). Now that syft supports multiple catalogers with multiple glob patterns, this makes cataloging OS-bound: the entire file system is searched again and again, once per glob.
This glob pattern search does not reuse the initial filesystem indexing that syft does. This PR refactors the cataloger such that the indexing can be reused.
Note that the generic cataloger supports globs, paths, and MIME types. This PR reuses the index for all three. The globs are handled using a naive mapping of glob->regex.
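A naive glob->regex mapping could look roughly like the sketch below. This is an illustration under assumptions, not the mapping used in this PR: `globToRegex` is a hypothetical helper that only understands `**/`, `**`, `*`, and `?`, and treats every other character as a literal (no character classes or brace expansion).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// globToRegex naively translates a glob into an anchored regular expression.
//   "**/" -> zero or more leading directories
//   "**"  -> any characters, including "/"
//   "*"   -> any characters except "/"
//   "?"   -> a single character except "/"
// Everything else is escaped and matched literally.
func globToRegex(glob string) (*regexp.Regexp, error) {
	var sb strings.Builder
	sb.WriteString("^")
	runes := []rune(glob)
	for i := 0; i < len(runes); i++ {
		switch c := runes[i]; c {
		case '*':
			if i+1 < len(runes) && runes[i+1] == '*' {
				i++ // consume the second '*'
				if i+1 < len(runes) && runes[i+1] == '/' {
					i++ // consume the '/'
					sb.WriteString("(?:.*/)?")
				} else {
					sb.WriteString(".*")
				}
			} else {
				sb.WriteString("[^/]*")
			}
		case '?':
			sb.WriteString("[^/]")
		default:
			sb.WriteString(regexp.QuoteMeta(string(c)))
		}
	}
	sb.WriteString("$")
	return regexp.Compile(sb.String())
}

func main() {
	// Paths as they might come out of an already-built index (made-up data).
	indexed := []string{"/usr/lib/app.jar", "/srv/web/package.json", "/etc/app.conf"}

	re, err := globToRegex("**/*.jar")
	if err != nil {
		panic(err)
	}
	for _, p := range indexed {
		if re.MatchString(p) {
			fmt.Println("match:", p)
		}
	}
}
```

Each glob is compiled once and then evaluated against the indexed paths in memory, so no additional file system traversal is needed per pattern.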
Benchmark against this project vs the current main branch:
Setting parallelism=4 brings this PR's results down even further:
We can see that this PR on its own is about as effective as increasing the parallelism (slightly more so), which is expected given that everything is now in-memory rather than OS-bound.