Performance Testing of Fluent-bit with several filters shows log processing falling < 5mb/s #9399
Conversation
Signed-off-by: ryanohnemus <ryanohnemus@gmail.com>
Heya, is the file already present prior to read? I'm wondering if iops are a contributing factor to the slowdown. Multiple filters chained together is one we know is not optimal, which is why we've introduced processors. Have you tried the benchmark with processors? Checking the commit, it also looks like the latest? Thought I would double-check though.
hi @agup006,

> Heya, is the file already present prior to read? I'm wondering if iops are a contributing factor to the slowdown.

> Multiple filters chained together is one we know is not optimal, which is why we've introduced processors. Have you tried the benchmark with processors?

Drawbacks:

side note:

> Checking the commit, it also looks like the latest? Thought I would double-check though.
Hi there. Thanks for this PR.
Hi @lecaros! What's your expectation for throughput? Obviously this can't exceed the max that fluent-bit could read from the file off the disk via a tail input, which in my testing I found to be around 41-44MB/s without any parsers defined. Adding the current regex cri multiline parser drops that performance to 18MB/s. I rewrote that parser without regex in #9418, which gives input with cri parsing a 9MB/s perf gain on the hardware I tested on. I am hoping we can keep throughput above 25MB/s. Is filesystem storage out of the picture for any reason? I noticed you are handling the file rotation instead of leaving it to the k8s runtime. Is there any reason for this?
@ryanohnemus super interesting stuff! Did you take into account CPU / memory consumption when testing the different setups? We noticed that higher buffers cause very high memory utilization when having to deal with lots of logs.
@uristernik I wasn't as concerned with memory utilization for my testing, but the test could be repeated looking specifically at that. As for CPU, I was pushing as much log data as I could through fluent-bit, so I expected CPU utilization to be quite high. The main purpose of my test was throughput, but if you're looking for ways to limit memory utilization there are a few things you can try:
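As a general sketch of the usual memory-bounding knobs (not necessarily the specific suggestions meant here): `Mem_Buf_Limit` caps how much a single input may buffer in memory before it is paused, and `storage.type filesystem` spills chunks to disk instead of RAM. The path and values below are illustrative assumptions.

```
[SERVICE]
    # enable filesystem buffering so backlogged chunks go to disk instead of RAM
    storage.path           /var/log/flb-storage/
    storage.max_chunks_up  128

[INPUT]
    Name            tail
    Path            /app/perftest/containers/*.log
    # pause this input once it holds this much data in memory
    Mem_Buf_Limit   64MB
    # buffer chunks on the filesystem
    storage.type    filesystem
```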
This is more of a question/issue, but I created an example test that can be used to reproduce it, so I opened it as a PR.
Background
I have been running into a few performance bottlenecks with my fluent-bit setup in Kubernetes, so I created a k8s_perf_test that can (hopefully) show the issue. My hope is that this results in some discussion about performance tuning, fluent-bit defaults, or possibly even points out some flaws with my own setup 😄.
Test Setup
- `examples/k8s_perf_test/run-test.sh` and `examples/k8s_perf_test/values.yaml` are set up to use the standard fluent-bit helm chart.
- `extraContainers` is used to create a python/ubuntu container (called `logwriter`) that is sidecar'd with fluent-bit.
- `extraFiles` stores my container startup script and `test_runner.py`.
- An `emptyDir` (perftest-volume) provides ephemeral shared storage between `fluent-bit` and `logwriter`, mounted in both at `/app/perftest/containers`; the `/fluent-bit` configmap is also mounted in both containers.
- `run-log-writer-test.sh` passes configuration to `test_runner.py`; specifically, it builds a logfile name that "impersonates" a log filename that would be created by `containerd`. `test_runner.py` creates the logfile in `/app/perftest/containers/`, which is watched by the fluent-bit `tail` input.
- `test_runner.py` has a small bit of logic, but has been performant enough on a macbook pro 2019 and a gcp (gke) n2-standard-8 with ssd boot disks to write >50MB/s to a file. It writes in the `containerd` (cri) format and also does file renames to mimic logrotate.
- Throughput is measured from fluent-bit's `/api/v1/metrics/` endpoint, using the `proc_records` counter of the `null.0` output (the pipeline's output is the `null` plugin).
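For context, the `/api/v1/metrics/` endpoint comes from fluent-bit's built-in HTTP server. A minimal sketch of the service settings that expose it; the listen address and port shown are the common defaults and are an assumption about this test's setup:

```
[SERVICE]
    # expose the monitoring API used to read output counters such as proc_records
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_Port     2020
```

Polling `http://127.0.0.1:2020/api/v1/metrics/` then returns per-plugin counters, including `proc_records` for the `null.0` output.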
Results
I ran this on both a gcp n2-standard-8 host that used ssd for its boot disk as well as a macbook. The results were similar in both cases in terms of fluent-bit throughput. The numbers below are from a macbook pro 2019 2.3GHz i7 running a single-node kind (k8s) on docker.
1. tail input defaults do not seem optimal; setting larger input buffers is more performant, but can then result in downstream issues

A `tail` input that uses no `multiline.parser` and no filters ingests slower with the default buffers than when higher buffers are defined. However, defining higher buffers can lead to output errors like #9374 (out_stackdriver does not batch output records properly if passed a large chunk of records and can drop a majority of records) and #1938 (allow output plugins to configure a max chunk size), as it tends to create larger chunks.

Initial input config:
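A minimal sketch of what such a tail input looks like with the default buffers; the path matches the emptyDir mount described above, and the values are illustrative rather than copied from the test:

```
[INPUT]
    Name               tail
    Path               /app/perftest/containers/*.log
    Tag                kube.*
    # defaults shown explicitly: both buffer settings default to 32k
    Buffer_Chunk_Size  32k
    Buffer_Max_Size    32k

[OUTPUT]
    Name   null
    Match  *
```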
1a. A fluent-bit config that only reads and doesn't parse anything isn't super useful, so I re-tested the above config with `multiline.parser cri`. I changed the buffer settings as follows, with varying results:
This suggests we could add a few MB/s to fluent-bit throughput just by increasing these buffer sizes (the default is only 32k). However, this seems to create oversized chunks, and output plugins cannot handle that well (#1938). Is there any other suggestion for improving the initial parsing speed?
NOTE: for the setup above I used `filters: ""` in values.yaml.

2. Adding common processing filters quickly slows down fluent-bit to a crawl
There are `filters-simple` and `filters-extended` sections in values.yaml. When testing with those you will need to rename the section to just `filters` for it to be activated. For these changes I kept the larger buffers; my input section was:
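A sketch of a tail input along those lines, combining the cri multiline parser with larger buffers; the exact sizes used in the test aren't reproduced here, so these values are assumptions:

```
[INPUT]
    Name               tail
    Path               /app/perftest/containers/*.log
    Tag                kube.*
    multiline.parser   cri
    # larger than the 32k defaults to improve ingest throughput
    Buffer_Chunk_Size  1MB
    Buffer_Max_Size    5MB
```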
2a. Please review the values.yaml `filters-simple` section. I started by adding just a filter that removes the `_p` artifact that comes from cri parsing; this lowered processing to 18.065MB/s (down from the 22.78MB/s with higher buffers and no filter).

2b. Adding the kubernetes filter for namespace labels & annotations and pod labels & annotations. This also used Merge_Log to move the `log` field to `message`, and leads into the `filters-extended` example, which moves k8s and other fields around and potentially removes other fields before being sent to an output.

2c. Please look at `filters-extended` in values.yaml; this has what is in `filters-simple` plus a nest/lift to move the kubernetes meta fields and a modify filter. (See the sketch of these filters below.)
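The following is a sketch of the general shape of this filter chain in classic config syntax. It is not the exact content of `filters-simple`/`filters-extended`; the filter choices, match patterns, and key names are assumptions meant to illustrate the kind of processing being measured:

```
# roughly filters-simple: drop the cri `_p` field
[FILTER]
    Name          record_modifier
    Match         kube.*
    Remove_key    _p

# kubernetes metadata enrichment with labels/annotations,
# merging the parsed log into the record
[FILTER]
    Name          kubernetes
    Match         kube.*
    Labels        On
    Annotations   On
    Merge_Log     On
    Keep_Log      Off

# roughly filters-extended: lift kubernetes metadata to the top level...
[FILTER]
    Name          nest
    Match         kube.*
    Operation     lift
    Nested_under  kubernetes
    Add_prefix    kubernetes_

# ...then rename/remove fields before they reach the output
[FILTER]
    Name          modify
    Match         kube.*
    Rename        log message
    Remove        kubernetes_pod_id
```

In a real cluster the kubernetes filter also needs API access configuration, which is omitted here.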
After using the `filters-extended` config, I ran into several issues with fluent-bit being able to keep up with log rotation, something I have also seen in my production setups. It potentially misses logrotates and does not realize it (switching `Inotify_Watcher` to false does not seem to be an improvement), and it's hard to tell because this is also not reflected in fluent-bit metrics (it doesn't know it missed a rotation, so how can it record it). To address it for this test only, you can change `Rotate_Wait` in the input to an extremely high number like 300. In standard k8s setups you will miss data, as kubelet generally does log rotation when a container log reaches 10MB (usually at 10s interval checks). So as fluent-bit backs up and a container is writing faster than fluent-bit can process, logs are missed with no metrics available to know they've been missed.

The input pauses constantly because the engine thread is backed up, since all filters are executed single-threaded in the engine thread (iirc), and fluent-bit is at a processing rate of 4.9MB/s. (In my actual prod setup I have another lua script that runs between the last 2 filters, and that loses another 1.5MB/s of throughput, to the point the fluent-bit pipeline can only process 3.5MB/s.)
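For the test-only workaround mentioned above, the relevant tail options look roughly like this (values are illustrative; `Rotate_Wait 300` keeps rotated files open far longer than the default of a few seconds):

```
[INPUT]
    Name              tail
    Path              /app/perftest/containers/*.log
    multiline.parser  cri
    # keep a rotated file open for up to 300s so slow processing doesn't drop it
    Rotate_Wait       300
    # optionally fall back to stat-based watching instead of inotify
    Inotify_Watcher   false
```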
Questions
(`filters-extended` version)