[plugins] speedup big journal collection #3873
Instead of having journalctl format all the logs, estimate the number of lines of logs needed to fill the log_size limit. On an EL 9.5 server with SSD and 500MB of formatted logs, the logs runtime goes from 32s to 11s. Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Line-count limiting is never going to be accurate, and will almost certainly result in not just wildly different collections with regards to any limiting put in place (either default or by end user usage of options), but also inconsistency between collections within the same run that use the inherited size limit.
There is no control over what a given line's sizing will be, and averaging over a large number of lines doesn't mitigate that - we could have 1000 extremely small lines and then in the next 100, 200, or whatever, have extremely dense lines.
On top of that, this now causes us to execute 2 journalctl commands for every collection, which is inefficient for execution and memory consumption throughout a given sos run.
Runtimes are something we look to improve when it's controlled entirely within our own code, but if an external tool simply takes a certain amount of time to execute when we call it, we shouldn't be jumping through hoops in the name of runtime "optimization" if it means non-deterministic results. At the end of the day, sos is a tool that doesn't have high performance being a more important requirement than stability and predictability.
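The line-size concern raised in this review can be illustrated with a small, self-contained sketch. It is not code from this PR or from sos: the function names, the synthetic journals, and the 100,000-byte limit are all assumptions chosen for illustration, and string length stands in for byte size (ASCII only).

```python
def estimate_lines(sample, size_limit):
    """Estimate how many lines fill size_limit, from the sample's mean length."""
    avg = sum(len(line) for line in sample) / len(sample)
    return int(size_limit / avg)


def tail_size(lines, n):
    """Total size of the newest n lines (newest entries are last)."""
    return sum(len(line) for line in lines[-n:])


SIZE_LIMIT = 100_000  # hypothetical limit, in bytes

# Hypothetical journal A: 1000 short lines, then 200 dense lines.
journal_a = ["x" * 20] * 1000 + ["y" * 2000] * 200
# Estimating from the short lines asks for far too many lines.
n_a = estimate_lines(journal_a[:1000], SIZE_LIMIT)
collected_a = tail_size(journal_a, n_a)

# Hypothetical journal B: 200 dense lines, then 1000 short lines.
journal_b = ["y" * 2000] * 200 + ["x" * 20] * 1000
# Estimating from the dense lines asks for far too few lines.
n_b = estimate_lines(journal_b[:200], SIZE_LIMIT)
collected_b = tail_size(journal_b, n_b)
```

In journal A the estimate requests the whole journal, well over the limit before any truncation; in journal B it requests only 50 lines, about 1% of the limit. How often real journals look like either case is exactly what is being debated in this thread.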
That is why I use *1.5 but then still apply the size_limit, so it still returns 100MB of logs and there is no inconsistency most of the time. Instead of tailing 500MB and keeping 100MB, we tail ~150MB and still keep 100MB. On my dev server with SSD I go from 30s to 10s, and I've shown worse cases, going from 10min to 2min (#3615 (comment)).
That's correct and why I use *1.5, we can end up with less than 100MB, but 100MB is already arbitrary.
It's inefficient, but 3x more efficient than the current solution (or even 5x in more extreme cases).
For context, I have customers in the field who refuse to grab an sos report before reboot because it's too slow. Having their service up is their concern; having data to troubleshoot is my problem :( I have higher hopes of improving sos execution speed than of changing my customers' behavior, so I'll keep thinking about solutions ;)
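The "estimate with a margin, then still truncate to the size limit" idea described in this reply could look roughly like the sketch below. This is not the PR's code: how the PR measures the average line size is not shown in this thread, so `lines_to_request`, `keep_tail`, and the sample input are hypothetical, and only the 1.5 margin comes from the discussion.

```python
import math


def lines_to_request(sample_lines, size_limit, margin=1.5):
    """Number of lines to ask for so the output likely exceeds size_limit.

    sample_lines: a sample of formatted log lines (hypothetical input).
    margin: over-request factor; 1.5 is the value mentioned in the thread.
    """
    sample_bytes = sum(len(line.encode("utf-8")) for line in sample_lines)
    avg_line_bytes = sample_bytes / len(sample_lines)
    return math.ceil(size_limit / avg_line_bytes * margin)


def keep_tail(data, size_limit):
    """Keep only the newest size_limit bytes, as a size limit would."""
    if size_limit <= 0:
        return b""
    return data[-size_limit:] if len(data) > size_limit else data


# With 100-byte lines and a 100,000,000-byte limit, request 1.5x the lines.
requested = lines_to_request(["a" * 100] * 10, 100_000_000)
```

If the estimate overshoots, `keep_tail` trims the result back to the limit; if it undershoots, the collection simply ends up smaller than the limit, which is the trade-off acknowledged above.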
This comment does not assess the particular approach in this PR; it takes a rather generic, philosophical, high-level point of view. There are contradicting requirements here on how to limit the logs to be collected:
As a support engineer analysing a sosreport, my requirements are "limit by time and/or log records". As a support engineer or customer generating a sosreport, my requirement is "limit amount of data", and also execution time. The requirements are contradicting, and the current code follows these priorities, imho in this order:
So we focus much more on the generating part than on the analysing part (while this PR aims to balance the contradicting "time vs space" requirements of the generating part, plus adding more weight to the analysing part, which sounds good in general). Any strategy balancing the requirements will be just a heuristic, with corner cases where it behaves badly. Jake's arguments are valid, but my devil's advocate can easily find a counter-example where the current approach slowly collects insufficient data. Also, @champtar's data come from an experiment on just one system. How generic is that system? How differently would this approach behave on other systems? I don't know. Currently, we do size-correct execution of something like
Yet another alternative idea (kudos to @jcastill) is to have
The idea is that we already have the code in place to collect a number of lines in journalctl via add_journal(), so we could just add a new command line option and pass it through in any plugin we want to implement this limitation.
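As a rough illustration of limiting by line count, the sketch below builds a journalctl argument list. It does not reproduce add_journal() from sos; `journalctl_cmd` is a hypothetical helper, and it relies only on the standard journalctl options `--no-pager`, `--unit`, and `--lines`.

```python
def journalctl_cmd(lines=None, unit=None):
    """Build a journalctl argv, optionally limited to the newest `lines` entries."""
    cmd = ["journalctl", "--no-pager"]
    if unit:
        cmd += ["--unit", unit]
    if lines is not None:
        # --lines N shows only the most recent N journal entries.
        cmd += ["--lines", str(lines)]
    return cmd


# Example: the newest 1000 entries for a single (hypothetical) unit.
example = journalctl_cmd(lines=1000, unit="sshd")
```

A new command line option, as suggested above, would then only need to supply the `lines` value.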
Even if arbitrary, I think a default size limit is the best
I've provided numbers on 3 other systems long ago #3615 (comment) (click on details)
There are 2 issues with this approach I suggested some time ago (#3615 (comment)):
I think I can fix both; I will provide a PR when I have something so we can discuss further.
New version: #3879
Instead of having journalctl format all the logs,
estimate the number of lines of logs needed to fill the log_size limit.
On an EL 9.5 server with SSD and 500MB of formatted logs, the logs runtime goes from 32s to 11s.
Fixes #3615