Beats panic if there are more than 32767 pipeline clients #38197
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
How is this a workaround? Do the additional harvesters still execute after waiting for some undefined period of time, or do they possibly never execute? If the latter, this seems more like converting an obvious failure into a silent failure. Of the 32K active harvesters, how many of them are actually sending data concurrently? I wonder if it would be more effective to put the beat.Clients into something like a sync.Pool so that they are only kept around if they are actually used. Is there a way to tell how many of the pipeline clients are idle, and for how long?
Another workaround at the input level would be to figure out a way to have multiple inputs harvest the set of files, essentially sharding the input workload. We could also shard or change the structure of the select cases; it looks like we are only waiting on the done signals. We could just create another select once we go past the limit on how many cases a single select can handle: beats/libbeat/publisher/pipeline/pipeline.go Lines 332 to 342 in ae312c5
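Roughly, the sharding idea could look like the standalone sketch below (not the actual pipeline code; the chunk size, function name, and semantics are illustrative only). Each goroutine waits on its own `reflect.Select` over a slice of channels that stays under the limit:

```go
package sketch

import (
	"reflect"
	"sync"
)

const maxCasesPerSelect = 65000 // stay safely below reflect.Select's 65536-case limit

// waitSharded splits the channels into chunks and runs one reflect.Select per
// chunk in its own goroutine, so no single call ever exceeds the limit. It
// returns once every shard has seen one of its channels fire. The real
// pipeline loop is more involved (cases are added and removed as clients
// connect and close), but the chunking principle is the same.
func waitSharded(channels []chan struct{}) {
	var wg sync.WaitGroup
	for start := 0; start < len(channels); start += maxCasesPerSelect {
		end := start + maxCasesPerSelect
		if end > len(channels) {
			end = len(channels)
		}
		shard := channels[start:end]

		wg.Add(1)
		go func(shard []chan struct{}) {
			defer wg.Done()
			cases := make([]reflect.SelectCase, len(shard))
			for i, ch := range shard {
				cases[i] = reflect.SelectCase{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(ch)}
			}
			reflect.Select(cases) // each goroutine stays under the limit
		}(shard)
	}
	wg.Wait()
}
```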
In my tests, not many, as the files were not being updated. Theoretically, all files could be live-updated, which would also create other issues, like the harvesters that are not running being starved. That is definitely a scale edge case we do not cover well.
I did not investigate the code to see how we could mitigate it. At least in Filestream there is an infinite loop that reads from the file and then publishes the line/event read. The affected inputs are the ones that call `pipeline.ConnectWith`: beats/libbeat/publisher/pipeline/pipeline.go Lines 211 to 215 in ae312c5
beats/libbeat/beat/pipeline.go Lines 47 to 50 in ae312c5
beats/libbeat/beat/pipeline.go Lines 103 to 137 in ae312c5
We probably can re-use the same
I did not look into this. In my test case I had 33000 files with 100 lines in each, so most of the harvesters/clients were definitely idle.
That sounds like a pretty good idea! I can easily see that working.
We likely wouldn't want to do this because the
Agreed, I think this is the idea to pursue first.
This has been tested with Filebeat, but the bug is in libbeat, so it is likely affecting all Beats. The OS tested was Linux, but the issue is not dependent on the OS.
Description
When Filebeat, using the filestream input (other inputs are likely affected as well; the log input is not), is configured to harvest more than 32767 files at once, it will panic. This happens because, for each file, two elements are added to a slice of channels by `Pipeline.runSignalPropagation`; once this slice reaches 65536 elements, a `reflect.Select` on it causes Filebeat to panic. The panic happens here:
beats/libbeat/publisher/pipeline/pipeline.go
Line 324 in ae312c5
For every new client, this infinite for loop adds two elements to the slice of channels:
beats/libbeat/publisher/pipeline/pipeline.go
Lines 332 to 346 in ae312c5
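For reference, here is a standalone sketch (not Beats code, only the standard library) of the underlying Go limitation being hit: a single `reflect.Select` call supports at most 65536 cases and panics beyond that.

```go
package main

import (
	"fmt"
	"reflect"
)

func main() {
	// Build more select cases than reflect.Select can handle. In the pipeline
	// this happens implicitly: two cases are added per connected client, so
	// crossing ~32767 clients crosses the 65536-case threshold.
	cases := make([]reflect.SelectCase, 0, 65537)
	for i := 0; i < 65537; i++ {
		cases = append(cases, reflect.SelectCase{
			Dir:  reflect.SelectRecv,
			Chan: reflect.ValueOf(make(chan struct{})),
		})
	}

	defer func() {
		// Prints something like: reflect.Select: too many cases (max 65536)
		fmt.Println("recovered:", recover())
	}()
	reflect.Select(cases)
}
```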
Once the slice contains 65536 elements or more, Filebeat panics with a message/stack trace like this:
How to reproduce
The easiest way to reproduce this issue is to create 33000 small log files, then start Filebeat to harvest them. While testing, I ran into some issues/OS limitations when trying to keep all files constantly updated.
You can use anything to generate the files; I used the following shell script and flog.
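A minimal stand-in along those lines (not the exact script from the original report; it assumes flog is on the PATH and writes the files under ./logs) looks like this:

```sh
#!/usr/bin/env bash
# Generate <number-of-files> small log files with flog, 100 lines each,
# under ./logs. Adjust the path to match the Filebeat input configuration.
set -euo pipefail

count="${1:?usage: ./gen-logs.sh <number-of-files>}"
mkdir -p logs

for i in $(seq 1 "$count"); do
    flog -n 100 > "logs/${i}.log"
done
```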
Save it as `gen-logs.sh`, make it executable, and run `./gen-logs.sh 33000`.
Use the file output for simplicity and to make it easy to validate that all logs have been ingested. The output and logging configuration ensure there will be a single log file and a single output file.
Start Filebeat and wait until it panics.
If you uncomment the `#harvester_limit: 32000` line, Filebeat will work without issues and ingest all files. If using the script provided, there should be 3300000 events in the output file; you can verify that by counting the events in it.
Workaround
One workaround is to set `harvester_limit` to a number smaller than 32000 when using a single input. When using multiple affected inputs, each input should limit the number of pipeline clients it creates so that a running Filebeat process never has more than 32000 pipeline clients running concurrently.
For the filestream input, here is an example configuration: