[store] Unique reports _before_ storing #4152
Conversation
We used to zip every result file in the result directory and send it to the server. This is tremendously wasteful, as many report files contained the very same reports (for instance, if the report originated from a header file).

In this patch, I unique all reports, dump them in a tmpfile, and store that to the server. For a clang-tidy --enable-all analysis on xerces, this reduced the size of the zipfile drastically:

BEFORE: Compressing report zip file done (1.6GiB / 62.2MiB).
AFTER: Compressing report zip file done (372.6MiB / 7.9MiB).

While this doesn't speed up CodeChecker parse or CodeChecker diff, it still speeds up the server quite a bit, both for storage and query times.
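To make the deduplication idea concrete, here is a minimal sketch of the approach, not the actual CodeChecker implementation; the report objects, their report_hash attribute, and the dump_reports writer are assumed for illustration only.

    import os
    import tempfile
    import zipfile

    def collect_unique_reports(reports):
        # Keep a single instance of each report, keyed by its report hash.
        # 'report_hash' is assumed to identify semantically identical reports
        # that were emitted for several translation units (e.g. from headers).
        unique = {}
        for report in reports:
            unique.setdefault(report.report_hash, report)
        return list(unique.values())

    def store_unique_reports(reports, zip_path, dump_reports):
        # Dump the deduplicated reports into a temporary file and add only
        # that file to the zip, instead of zipping every result file.
        unique_reports = collect_unique_reports(reports)
        fd, tmp_path = tempfile.mkstemp(suffix=".plist")
        os.close(fd)
        try:
            dump_reports(unique_reports, tmp_path)  # assumed plist/JSON writer
            with zipfile.ZipFile(zip_path, "a", zipfile.ZIP_DEFLATED) as archive:
                archive.write(tmp_path, arcname="reports/unique_reports.plist")
        finally:
            os.remove(tmp_path)

The real storage code is more involved, but the size reduction comes from keeping only one copy of each report before anything is compressed.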
Force-pushed from 4c3bb47 to 3ba2130.
tools/report-converter/codechecker_report_converter/report/hash.py (review thread on an outdated diff, resolved)
-    analyzer_result_files)):
+    analyzer_result_files))):
+        if idx % 10 == 0:
+            LOG.debug(f"Parsed {idx}/{len(analyzer_result_files)} files...")
I think we should leave debug logs in the code only if they are valuable when we ask users for them during a debugging session.
I changed the output to print the name of each file as it is parsed. I found this helpful during development, and I don't think it's too verbose.
[DEBUG] store.py:458 assemble_zip() - Processing 3 report files ...
[DEBUG] store.py:412 parse_analyzer_result_files() - [0/3] Parsed '<path>/prune_paths.cpp_clangsa_e17b9269d5755586c9edd407de32f1df.plist' ...
[DEBUG] store.py:412 parse_analyzer_result_files() - [1/3] Parsed '<path>/prune_paths.cpp_clang-tidy_e17b9269d5755586c9edd407de32f1df.plist' ...
[DEBUG] store.py:412 parse_analyzer_result_files() - [2/3] Parsed '<path>/prune_paths.cpp_cppcheck_e17b9269d5755586c9edd407de32f1df.plist' ...
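The logging pattern under discussion is roughly the following. This is a sketch, not the exact store.py code; parse_file stands in for whatever per-file parser is actually used.

    import logging

    LOG = logging.getLogger("store")

    def parse_analyzer_result_files(analyzer_result_files, parse_file):
        # Emit one debug line per result file ("[idx/total] Parsed '<file>' ..."),
        # so a debug-level run shows progress without printing a line per report.
        reports = []
        total = len(analyzer_result_files)
        for idx, result_file in enumerate(analyzer_result_files):
            reports.extend(parse_file(result_file))
            LOG.debug("[%d/%d] Parsed '%s' ...", idx, total, result_file)
        return reports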
I also did a little cleanup on the debug logs -- the one in the report converter spat out a line for literally every report, which felt a little unnecessary. I also added some logs to show that the parsing of result files is in progress, though that arguably makes the output a little more verbose.
edit:
In addition, I needed to fix plist parsing at one point, and I also had a lot of trouble with AnalysisInfo not being set for each report. The problem was in a particularly nasty loop. Since we support storing multiple report directories at once, each of those report dirs gets its own directory inside the zipped file. However, you won't find a clear implementation of that in the form of "for each input dir, create an output dir"; that nasty os.path.dirname part does all of it, and my initial implementation didn't set the dirname right. Since this manifested as errors in seemingly unrelated parts of the code, I was led on a month-long chase to find out that this was the issue.

Also, since we handle all input report dirs at once in a rather cryptic manner, I chose to just brute-force my implementation into the code instead of refactoring it into an easy-to-digest "for each dir in input dir, zip dir" shape. That might be worth a follow-up refactoring patch.
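As an illustration of the "for each input dir, create an output dir" idea, and of what the os.path.dirname bookkeeping has to accomplish, a hypothetical helper might look like this. It is a sketch under assumed names, not the code in this patch.

    import os

    def report_dir_in_zip(result_file_path, input_report_dirs):
        # Each stored input report directory gets its own directory inside the
        # zip, so a result file (and its AnalysisInfo) must be mapped back to
        # the report dir it came from before it is written into the archive.
        result_dir = os.path.dirname(os.path.abspath(result_file_path))
        for report_dir in input_report_dirs:
            report_dir = os.path.abspath(report_dir)
            if os.path.commonpath([result_dir, report_dir]) == report_dir:
                return os.path.join("reports", os.path.basename(report_dir))
        # Fall back to the directory name of the result file itself.
        return os.path.join("reports", os.path.basename(result_dir))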