Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the Generic Cataloger re-uses filesystem indexing #1463

Closed
wants to merge 5 commits into from

Conversation

Mikcl
Copy link
Contributor

@Mikcl Mikcl commented Jan 16, 2023

Perfomance enhancement

There have been issues raised which indicate that the catalogging process is slow (e.g. #1355 , #1446 , #1353 )

A large part of this is due to glob pattern searches.

More precisly, each catagolger has a collection of defined glob patterns, many of which are against the entire file system **/* to search..

Now that syft supports multiple catalogers with multiple glob patterns, this can cause significant os time bound restrictions on the perfomance, the entire file system is searched again and again, once per glob..

This glob pattern search does not reuse the initial filesystem indexing that syft does. This PR refactors the cataloger such that the indexing can be reused.

Note the generic cataloger support globs, paths, and mime-types. This PR reuses the index for all three of the types. The globs are addressed using a naive mapping of glob->regex.

Benchmark against this project vs the current main branch:

8.86s user 2.34s system 90% cpu 12.421 total
17.16s user 2.14s system 117% cpu 16.455 total

Setting the parallism=4 brings this PR's Results down even further:

9.20s user 2.52s system 131% cpu 8.888 total
15.97s user 1.59s system 178% cpu 9.830 total

We can see that this PR is almost as effective (slightly more) as increasing the parallelism, which is expected given that everything now is in-memory rather that os system bound.

@Mikcl Mikcl force-pushed the mikcl/files-concurrent branch from b79dd83 to 54f4eba Compare January 16, 2023 09:37
Mikcl added 5 commits January 16, 2023 13:27
Signed-off-by: mikcl <mikesmikes400@gmail.com>
Signed-off-by: mikcl <mikesmikes400@gmail.com>
Signed-off-by: mikcl <mikesmikes400@gmail.com>
Signed-off-by: mikcl <mikesmikes400@gmail.com>
Signed-off-by: mikcl <mikesmikes400@gmail.com>
@Mikcl Mikcl force-pushed the mikcl/files-concurrent branch from bf91765 to 52b3086 Compare January 16, 2023 13:28
@Mikcl Mikcl changed the title Refactor the Generic Cataloger re-uses filesystem indexing (WIP) Refactor the Generic Cataloger re-uses filesystem indexing Jan 17, 2023
@Mikcl
Copy link
Contributor Author

Mikcl commented Jan 24, 2023

hi @wagoodman

Noticed that you are addressing file system reuse in #1510 as well.

I think youre approach is better (i.e not mapping globs->regex's)

In which case, feel free to insepct this PR for any common refactorings, otherwise, happy for your PR to eclipse this. And for this to be closed

@Mikcl Mikcl marked this pull request as draft January 24, 2023 21:40
@wagoodman
Copy link
Contributor

@Mikcl thanks for your work on this -- I was tempted to head in a similar direction (rewriting glob strings to leverage indexes) but was a little timid on what that might mean troubleshooting-wise in the future. I'm going to close this PR since the aforementioned #1510 has performance benefits over this PR and doesn't have the glob-rewriting concerns. However, we welcome future PRs that nudge us forward!

@wagoodman wagoodman closed this Jan 25, 2023
@Mikcl
Copy link
Contributor Author

Mikcl commented Jan 25, 2023

thanks @wagoodman

glad that this is being addressed directly! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants