Performance Testing of Fluent-bit with several filters shows log processing falling < 5mb/s #9399
Conversation
Signed-off-by: ryanohnemus <ryanohnemus@gmail.com>
Heya, is the file already present prior to read? I'm wondering if iops are a contributing factor to the slowdown. Multiple filters chained together is one we know is not optimal, which is why we've introduced processors. Have you tried the benchmark with processors? Checking the commit, it also looks like the latest? Thought I would double-check though.
hi @agup006,

> Heya, is the file already present prior to read? I'm wondering if iops are a contributing factor to the slowdown.

> Multiple filters chained together is one we know is not optimal, which is why we've introduced processors. Have you tried the benchmark with processors?

Drawbacks:

side note:

> Checking the commit, it also looks like the latest? Thought I would double-check though.
Hi there. Thanks for this PR.
Hi @lecaros! What's your expectation for throughput? Obviously this can't exceed the max that fluent-bit could read from the file off the disk via a tail input, which in my testing I found to be around 41-44MB/s without any parsers defined. Adding the current regex cri multiline parser drops that performance to 18MB/s. I rewrote that parser without regex in #9418, which gives input with cri parsing a 9MB/s perf gain on the hardware I tested on. I am hoping we can keep throughput above 25MB/s. Is filesystem storage out of the picture for any reason? I noticed you are handling the file rotation instead of leaving it to the k8s runtime. Is there any reason for this?
@ryanohnemus super interesting stuff! Did you take into account CPU / memory consumption when testing the different setups? We noticed that higher buffers cause very high memory utilization when having to deal with lots of logs.
@uristernik I wasn't as concerned with memory utilization for my testing, but the test could be repeated looking specifically at that. As for CPU, I was pushing as much log data as I could through fluent-bit, so I expected CPU utilization to be quite high. The main purpose of my test was throughput, but if you're looking for ways to limit memory utilization there are a few things you can try:
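As a general sketch of the usual memory-bounding knobs (not necessarily the specific suggestions meant here): `Mem_Buf_Limit` caps how much a single input may buffer in memory before it is paused, and `storage.type filesystem` spills chunks to disk instead of RAM. The path and values below are illustrative assumptions.

```
[SERVICE]
    # enable filesystem buffering so backlogged chunks go to disk instead of RAM
    storage.path           /var/log/flb-storage/
    storage.max_chunks_up  128

[INPUT]
    Name            tail
    Path            /app/perftest/containers/*.log
    # pause this input once it holds this much data in memory
    Mem_Buf_Limit   64MB
    # buffer chunks on the filesystem
    storage.type    filesystem
```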
This is more of a question/issue, but I created an example test that can be used to reproduce it, so I opened it as a PR.
Background
I have been running into a few performance bottlenecks with my fluent-bit setup in Kubernetes, so I created a k8s_perf_test that can (hopefully) show the issue. My hope is that this results in some discussion about performance tuning, fluent-bit defaults, or possibly even points out some flaws with my own setup 😄.
Test Setup
- `examples/k8s_perf_test/run-test.sh` and `examples/k8s_perf_test/values.yaml` are set up to use the standard fluent-bit helm chart.
- `extraContainers` is used to create a python/ubuntu container (called `logwriter`) that is sidecar'd with fluent-bit.
- `extraFiles` stores my container startup script and `test_runner.py`.
- An `emptyDir` (perftest-volume) provides ephemeral shared storage between `fluent-bit` and `logwriter`, mounted in both at `/app/perftest/containers`; the `/fluent-bit` configmap is also mounted in both containers.
- `run-log-writer-test.sh` passes configuration to `test_runner.py`; specifically, it builds a logfile name that "impersonates" a log filename that would be created by `containerd`. `test_runner.py` creates the logfile in `/app/perftest/containers/`, which is watched by the fluent-bit `tail` input.
- `test_runner.py` has a small bit of logic, but has been performant enough on a macbook pro 2019 and a gcp (gke) n2-standard-8 with ssd boot disks to write >50MB/s to a file. It writes in the `containerd` (cri) format and also does file renames to mimic logrotate.
- Throughput is measured from fluent-bit's `/api/v1/metrics/` endpoint, using the `proc_records` counter of the `null.0` output (the pipeline's output is the `null` plugin).
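For context, the `/api/v1/metrics/` endpoint comes from fluent-bit's built-in HTTP server. A minimal sketch of the service settings that expose it; the listen address and port shown are the common defaults and are an assumption about this test's setup:

```
[SERVICE]
    # expose the monitoring API used to read output counters such as proc_records
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_Port     2020
```

Polling `http://127.0.0.1:2020/api/v1/metrics/` then returns per-plugin counters, including `proc_records` for the `null.0` output.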
Results
I ran this on both a gcp n2-standard-8 host that used ssd for its boot disk as well as a macbook. The results were similar in both cases in terms of fluent-bit throughput. The numbers below are from a macbook pro 2019 2.3GHz i7 running a single-node kind (k8s) on docker.
1. tail input defaults do not seem optimal; setting larger input buffers is more performant, but can then result in downstream issues

A `tail` input that uses no `multiline.parser` and no filters ingests slower with the default buffers than when higher buffers are defined. However, defining higher buffers can lead to output errors like #9374 (out_stackdriver does not batch output records properly if passed a large chunk of records and can drop a majority of records) and #1938 (allow output plugins to configure a max chunk size), as it tends to create larger chunks.

Initial input config:
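A minimal sketch of what such a tail input looks like with the default buffers; the path matches the emptyDir mount described above, and the values are illustrative rather than copied from the test:

```
[INPUT]
    Name               tail
    Path               /app/perftest/containers/*.log
    Tag                kube.*
    # defaults shown explicitly: both buffer settings default to 32k
    Buffer_Chunk_Size  32k
    Buffer_Max_Size    32k

[OUTPUT]
    Name   null
    Match  *
```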
1a. A fluent-bit config that only reads and doesn't parse anything isn't super useful, so I re-tested the above config with `multiline.parser cri`. I changed the buffer settings as follows, with varying results:
This suggests we could add a few MB/s to fluent-bit throughput just by increasing these buffer sizes (the default is only 32k). However, this seems to create oversized chunks, and output plugins cannot handle that well (#1938). Is there any other suggestion for improving the initial parsing speed?
NOTE: for the setup above I used `filters: ""` in values.yaml.

2. Adding common processing filters quickly slows down fluent-bit to a crawl
There are `filters-simple` and `filters-extended` sections in values.yaml. When testing with those you will need to rename the section to just `filters` for it to be activated. For these changes I kept the larger buffers; my input section was:
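A sketch of a tail input along those lines, combining the cri multiline parser with larger buffers; the exact sizes used in the test aren't reproduced here, so these values are assumptions:

```
[INPUT]
    Name               tail
    Path               /app/perftest/containers/*.log
    Tag                kube.*
    multiline.parser   cri
    # larger than the 32k defaults to improve ingest throughput
    Buffer_Chunk_Size  1MB
    Buffer_Max_Size    5MB
```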
2a. Please review the values.yaml `filters-simple` section. I started by adding just a filter that removes the `_p` artifact that comes from cri parsing; this lowered processing to 18.065MB/s (down from the 22.78MB/s with higher buffers and no filter).

2b. Adding the kubernetes filter for namespace labels & annotations and pod labels & annotations. This also used Merge_Log to move the `log` field to `message`, and leads into the `filters-extended` example, which moves k8s and other fields around and potentially removes other fields before being sent to an output.

2c. Please look at `filters-extended` in values.yaml; this has what is in `filters-simple` plus a nest/lift to move the kubernetes meta fields and a modify filter. (See the sketch of these filters below.)
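The following is a sketch of the general shape of this filter chain in classic config syntax. It is not the exact content of `filters-simple`/`filters-extended`; the filter choices, match patterns, and key names are assumptions meant to illustrate the kind of processing being measured:

```
# roughly filters-simple: drop the cri `_p` field
[FILTER]
    Name          record_modifier
    Match         kube.*
    Remove_key    _p

# kubernetes metadata enrichment with labels/annotations,
# merging the parsed log into the record
[FILTER]
    Name          kubernetes
    Match         kube.*
    Labels        On
    Annotations   On
    Merge_Log     On
    Keep_Log      Off

# roughly filters-extended: lift kubernetes metadata to the top level...
[FILTER]
    Name          nest
    Match         kube.*
    Operation     lift
    Nested_under  kubernetes
    Add_prefix    kubernetes_

# ...then rename/remove fields before they reach the output
[FILTER]
    Name          modify
    Match         kube.*
    Rename        log message
    Remove        kubernetes_pod_id
```

In a real cluster the kubernetes filter also needs API access configuration, which is omitted here.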
After using the `filters-extended` config, I ran into several issues with fluent-bit being able to keep up with log rotation, something I have also seen in my production setups. It potentially misses logrotates and does not realize it (switching `Inotify_Watcher` to false does not seem to be an improvement), and it's hard to tell because this is also not reflected in fluent-bit metrics (it doesn't know it missed a rotation, so how can it record it). To address it for this test only, you can change `Rotate_Wait` in the input to an extremely high number like 300. In standard k8s setups you will miss data, as kubelet generally does log rotation when a container log reaches 10MB (usually at 10s interval checks). So as fluent-bit backs up and a container is writing faster than fluent-bit can process, logs are missed with no metrics available to know they've been missed.

The input pauses constantly because the engine thread is backed up, since all filters are executed single-threaded in the engine thread (iirc), and fluent-bit is at a processing rate of 4.9MB/s. (In my actual prod setup I have another lua script that runs between the last 2 filters, and that loses another 1.5MB/s of throughput, to the point the fluent-bit pipeline can only process 3.5MB/s.)
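For the test-only workaround mentioned above, the relevant tail options look roughly like this (values are illustrative; `Rotate_Wait 300` keeps rotated files open far longer than the default of a few seconds):

```
[INPUT]
    Name              tail
    Path              /app/perftest/containers/*.log
    multiline.parser  cri
    # keep a rotated file open for up to 300s so slow processing doesn't drop it
    Rotate_Wait       300
    # optionally fall back to stat-based watching instead of inotify
    Inotify_Watcher   false
```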
Questions
(`filters-extended` version)