Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[plugins] speedup big journal collection #3873

Closed
wants to merge 1 commit into from

Conversation

champtar
Copy link
Contributor

@champtar champtar commented Dec 11, 2024

Instead of having journalctl format all the logs,
estimate the number of lines of logs needed to fill the log_size limit.
On a EL 9.5 server with SSD with 500MB of formated logs, the logs runtime goes from 32s to 11s.

Fixes #3615

# journalctl --version
systemd 252 (252-46.el9_5.2.alma.1)
# journalctl --no-pager | wc -c
533252817
# journalctl --disk-usage 
Archived and active journals take up 1.6G in the file system.
# sos report -o logs --batch -vvv
### Before patch
[plugin:logs] collected plugin 'logs' in 32.05628514289856
### After patch
[plugin:logs] collected plugin 'logs' in 10.8226797580719

Please place an 'X' inside each '[]' to confirm you adhere to our Contributor Guidelines

  • Is the commit message split over multiple lines and hard-wrapped at 72 characters?
  • Is the subject and message clear and concise?
  • Does the subject start with [plugin_name] if submitting a plugin patch or a [section_name] if part of the core sosreport code?
  • Does the commit contain a Signed-off-by: First Lastname email@example.com?
  • Are any related Issues or existing PRs properly referenced via a Closes (Issue) or Resolved (PR) line?
  • Are all passwords or private data gathered by this PR obfuscated?

Copy link

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

  • sudo yum install -y dnf-plugins-core on RHEL 8
  • sudo dnf install -y dnf-plugins-core on Fedora
  • dnf copr enable packit/sosreport-sos-3873
  • And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

@champtar champtar marked this pull request as draft December 11, 2024 17:23
Instead of having journalctl format all the logs,
estimate the number of lines of logs needed to fill
the log_size limit.
On a EL 9.5 server with SSD with 500MB of formated logs,
the logs runtime goes from 32s to 11s.

Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
@champtar champtar marked this pull request as ready for review December 11, 2024 18:09
Copy link
Member

@TurboTurtle TurboTurtle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line-count limiting is never going to be accurate, and will almost certainly result in not just wildly different collections with regards to any limiting put in place (either default or by end user usage of options), but also inconsistency between collections within the same run that use the inherited size limit.

There is no control over what a given line's sizing will be, and averaging over a large number of lines doesn't mitigate that - we could have 1000 extremely small lines and then in the next 100, 200, or whatever, have extremely dense lines.

On top of that, this now causes us to execute 2 journalctl commands for every collection, which is inefficient for execution and memory consumption throughout a given sos run.

Runtimes are something we look to improve when it's controlled entirely within our own code, but if an external tool simply takes a certain amount of time to execute when we call it, we shouldn't be jumping through hoops in the name of runtime "optimization" if it means non-deterministic results. At the end of the day, sos is a tool that doesn't have high performance being a more important requirement than stability and predictability.

@champtar
Copy link
Contributor Author

Line-count limiting is never going to be accurate, and will almost certainly result in not just wildly different collections with regards to any limiting put in place (either default or by end user usage of options), but also inconsistency between collections within the same run that use the inherited size limit.

That is why I use *1.5, but then still use the size_limit, so it still returns 100MB of logs, there is no inconsistency most of the time, instead of tailing 500MB and keeping 100MB, we tail ~150MB and still keep 100MB

with my dev server with SSD I go from 30 to 10s, but I've shown worse cases, going from 10min to 2min (#3615 (comment))

There is no control over what a given line's sizing will be, and averaging over a large number of lines doesn't mitigate that - we could have 1000 extremely small lines and then in the next 100, 200, or whatever, have extremely dense lines.

That's correct and why I use *1.5, we can end up with less than 100MB, but 100MB is already arbitrary.
In my case speed is more important than reaching the arbitrary 100MB limit.

On top of that, this now causes us to execute 2 journalctl commands for every collection, which is inefficient for execution and memory consumption throughout a given sos run.

it's inefficient, but 3x more efficient than the current solution (or even 5x in more extreme cases)

Runtimes are something we look to improve when it's controlled entirely within our own code, but if an external tool simply takes a certain amount of time to execute when we call it, we shouldn't be jumping through hoops in the name of runtime "optimization" if it means non-deterministic results. At the end of the day, sos is a tool that doesn't have high performance being a more important requirement than stability and predictability.

For context I have customers in the field who refuse to grab an sos report before reboot because it's too slow, having their service up is their concern, having data to troubleshoot is my problem :(

I have higher hopes improving sos execution speed than my customer behavior, so I'll keep thinking about solutions ;)

@pmoravec
Copy link
Contributor

This comment does not want to assess the particular approach in this PR, it deals with rather generic / philosophical / high level point of view.

There are contradicting requirements here, how to limit logs to be collected:

  • by time ("cover last day")
  • by lines / log records ("collect 100k lines of logs to have some minimal statistically significant sample of system behaviour")
  • by amount of data ("collect at most 100MB as I want small tarball")
  • get some data in reasonably small time

As a support engineer analysing a sosreport, my requirements are "limit by time and/or log records".

As a support engineer or customer generating sosreport, my requirement is "limit amount of data" - and also execution time.

The requirements are contradicting, the current code follows these priorities imho in this ordering:

  • limit by amount of data by default
  • optionally limit by time (--since option)
  • try to achieve some reasonable execution time

So we much more focus on the generating part than on the analysing part (while this PR aims to balance contradicting "time vs space" requirements of generating part, plus adding more weight to the analysing part - which sounds good in general).

Any strategy balancing the requirements will be just a heuristic, with its corner cases where it behaves badly. Jake's arguments are valid - but my devil's advocate can easily find a counter-example where current approach slowly collects insufficient data.

Also @champtar data come from an experiment just on one system. How generic the system is? How much differently this approach would behave on different systems? I dont know..

Currently, we do size-correct execution of something like journalctl --since=boot | tail -c 100MB. Maybe an alternative would be journalctl --since=boot -r | head -c 100MB | tac? It would need the extra tac that can be difficult to orchestrate though. And it is yet another approach I am unable to assess its caveats or generic performance /o.

@pmoravec
Copy link
Contributor

Yet alternate idea (kudos to @jcastill) is to have --journal-lines option that would cause --journal-size option to be ignored. A user - knowing the system specifics well - can use this option to speedup the journal logs collection, with the risk of collecting random-ish amount of MBs of journal logs.

@jcastill
Copy link
Member

The idea is that we already have the code in place to collect a number of lines in journalctl via add_journal() , so we could just add a new command line option and pass it through in any plugin we want to implement this limitation.

@champtar
Copy link
Contributor Author

champtar commented Dec 12, 2024

Even if arbitrary, I think a default size limit is the best
For operator, it can give a predictable execution time and archive size
For dev it gives as much log as possible while keeping operators happy

Also @champtar data come from an experiment just on one system. How generic the system is? How much differently this approach would behave on different systems? I dont know..

I've provided numbers on 3 other systems long ago #3615 (comment) (click on details)
TLDR journalctl is slow, and formatting 1GB of logs to keep 100MB is obviously wasteful
note: if I read sos code correctly the tail implementation (AsyncReader) uses 2xsizelimit of memory, we can avoid that

Maybe an alternative would be journalctl --since=boot -r | head -c 100MB | tac

There are 2 issues with this approach I suggested some time ago (#3615 (comment)):

  • it doesn't handle timeout properly (end up with nothing, right now we end up with some old logs but not the lastest ones)
  • it breaks multiline logs

I think I can fix both, will provide an PR when I have something so we can discuss further

@champtar
Copy link
Contributor Author

New version: #3879

@champtar champtar closed this Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFE] speedup journal export
4 participants