
Guide to Debugging Fluent Bit issues

Understanding Error Messages

How do I tell if Fluent Bit is losing logs?

Fluent Bit is a very low-level tool, both in terms of the power over its configuration that it gives you and in the verbosity of its log messages. Thus, it is normal to have errors in your Fluent Bit logs; it is important to understand what these errors mean.

[error] [http_client] broken connection to logs.us-west-2.amazonaws.com:443 ?
[error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events

Here we see error messages associated with a single request to send logs; both come from a single failed HTTP request. It is normal for this to happen occasionally; where other tools might silently retry, Fluent Bit tells you that a connection was dropped. In this case, the HTTP client logged a connection error, which caused the CloudWatch plugin to log an error because the request failed. Seeing error messages like these does not mean that you are losing logs; Fluent Bit will retry.

Retries can be configured: https://docs.fluentbit.io/manual/administration/scheduling-and-retries
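For example, the retry backoff window can be tuned in the [SERVICE] section and the number of retries per output; the values below are only a sketch, pick values that match your own delivery requirements:

[SERVICE]
    # retry wait time starts near scheduler.base seconds and backs off up to scheduler.cap
    scheduler.base 5
    scheduler.cap  30

[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    fluent-bit-cloudwatch
    log_stream_prefix from-fluent-bit-
    # number of times a failed chunk is retried before it is dropped
    Retry_Limit       5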

This is an error message indicating a chunk of logs failed and will be retried:

 [ warn] [engine] failed to flush chunk '1-1647467985.551788815.flb', retry in 9 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.0 (out_id=0)

Even if you see this message, you have not lost logs yet, since Fluent Bit will retry. Retries, however, can cause Log Delay. You have lost logs if you see something like the following message:

[2022/02/16 20:11:36] [ warn] [engine] chunk '1-1645042288.260516436.flb' cannot be retried: task_id=0, input=tcp.3 > output=cloudwatch_logs.1

When you see this message, you have lost logs.

The main other case that indicates log loss is when an input is paused, which is covered in the overlimit error section. Please also review our Log Loss Summary: Common Causes.

Common Network Errors

First, please read the section Network Connection Issues, which explains that Fluent Bit's http client is low level and logs errors that other clients may simply silently retry. That section also explains the auto_retry_requests option which defaults to true.

Your Fluent Bit deployment will occasionally log network connection errors. These are normal and should not cause concern unless they cause log loss. See How do I tell if Fluent Bit is losing logs? to learn how to check for log loss from network issues. If you see network errors happen very frequently or they cause your retries to expire, then please cut us an issue.

See Case Studies: Rate of Network Errors for a real world example of the frequency of these errors.

The following are common network connection errors that we have seen frequently:

[error] [tls] error: error:00000005:lib(0):func(0):DH lib
[error] [src/flb_http_client.c:1199 errno=25] Inappropriate ioctl for device
[error] [tls] error: unexpected EOF
[error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443
[error] [aws_client] connection initialization error
[error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[error] [tls] SSL error: NET - Connection was reset by peer

Tail Input Skipping File

[2023/01/19 04:23:24] [error] [input:tail:tail.6] file=/app/env/paymentservice/var/output/logs/system.out.log.2023-01-22-06 requires a larger buffer size, lines are too long. Skipping file.

Above is an example error message that can occur with the tail input. This happens due to the code here. When Fluent Bit tail skips a log file, the remaining unread logs in that file will not be sent and are lost.

The Fluent Bit tail documentation has settings that are relevant here:

  • Buffer_Max_Size: Defaults to 32k. This is the maximum size of the in-memory buffer Fluent Bit will use to read your log file(s). Fluent Bit reads log files line by line, so this value needs to be larger than the longest log line in your file(s). Fluent Bit will only increase the buffer as needed, therefore it is safe, and recommended, to err on the side of configuring a large value for this field. There is one read buffer per file, and once it is increased, it never decreases.
  • Buffer_Chunk_Size: Defaults to 32k. This setting controls the initial size of the buffer that Fluent Bit uses to read log lines, as well as the increment by which Fluent Bit grows the buffer when needed. Let's understand this via an example. Consider that you have configured a Buffer_Max_Size of 1MB and the default Buffer_Chunk_Size of 32k. If Fluent Bit has to read a log line which is 100k in length, the buffer will start at 32k. Fluent Bit will read 32k of the log line and notice that it hasn't found a newline yet. It will then increase the buffer by another 32k and read more of the line. That still isn't enough, so the buffer will be increased by Buffer_Chunk_Size again and again until it is large enough to capture the entire line. So configure this setting to a value close to the longest lines you have; this way Fluent Bit will not waste CPU cycles on many rounds of re-allocating the buffer. There is one read buffer per file, and once it is increased, it never decreases.
  • Skip_Long_Lines: Defaults to off. When this setting is enabled it means "Skip long log lines instead of skipping the entire file". Fluent Bit will skip long log lines that exceed Buffer_Max_Size and continue reading the file at the next line. Therefore, we recommend always setting this value to On.

To prevent the error and log loss above, we can set our tail input to have:

[INPUT]
    Name tail
    Buffer_Max_Size { A value larger than the longest line your app can produce}
    Skip_Long_Lines On # skip unexpectedly long lines instead of skipping the whole file

If you set Skip_Long_Lines On, then Fluent Bit will emit the following error message when it encounters lines longer than its Buffer_Max_Size:

[warn] file=/var/log/app.log have long lines. Skipping long lines.

Tail Permission Errors

A common error message for ECS FireLens users who are reading log files is the following:

[2022/03/16 19:05:06] [error] [input:tail:tail.5] read error, check permissions: /var/output/logs/service_log*

This error can happen in many cases on startup when you use the tail input. If you are running Fluent Bit under FireLens, remember that it alters the container ordering so that the Fluent Bit container starts first. This means that if it is supposed to scan a file path for logs on a volume mounted from another container, at startup that path might not exist yet.

This error is recoverable and Fluent Bit will keep retrying to scan the path.

There are two cases which might apply to you if you experience this error:

  1. You need to fix your task definition: If you do not add a volume mount between the app container and the FireLens container for your log files, Fluent Bit can't find them and you'll get this error repeatedly. We recommend using ECS Exec to exec into the Fluent Bit and app containers and check each of their filesystems.

  2. This is normal, ignore the error: If you did set up the volume mount correctly, you will still always get this on startup. This is because ECS will start the FireLens container first, and so that path won't exist until the app container starts up, which is after FireLens. Thus it will always complain about not finding the path on startup. This can either be safely ignored or fixed by creating the full directory path for the log files in your Dockerfile so that it exists before any log files are written. The error will only happen on startup and once the log files exist, Fluent Bit will pick them up and begin reading them.

Overlimit warnings

Fluent Bit has storage and memory buffer configuration settings.

When the storage is full the inputs will be paused and you will get warnings like the following:

[input] tail.1 paused (mem buf overlimit)
[input] tail.1 paused (storage buf overlimit)
[input] tail.1 resume (mem buf overlimit)
[input] tail.1 resume (storage buf overlimit)

Search for "overlimit" in the Fluent Bit logs to find the paused and resume messages about storage limits. These can be found in the code in flb_input_chunk.c.

The storage buf overlimit occurs when the number of in memory ("up") chunks exceeds the storage.max_chunks_up and you have set storage.type filesystem and storage.pause_on_chunks_overlimit On.

The mem buf overlimit occurs when the input has exceeded the configured Mem_Buf_Limit and storage.type memory is configured.

When the input is able to receive logs again, you will see one of the resume messages above.

With some inputs, an overlimit warning indicates that you are losing logs- new logs will not be ingested. This is the case with most inputs that stream in data (forward and TCP, for example). If you use ECS FireLens with Fluent Bit, then the stdout/stderr log input is a forward input and an overlimit warning means that new logs will not be ingested and will be lost. The exception to this is the tail input, which can safely pause and resume without losing logs because it tracks its file offset. When it resumes, it can pick back up reading the file at the last offset (assuming the file was not deleted).

How do I tell if Fluent Bit is ingesting and sending data?

This requires that you enable debug logging.

This section contains CloudWatch Insights queries that allow you to check the frequency of data ingestion/send.

Recall the Checking Batch sizes section of this guide; you can check these messages to see if the outputs are active.

fields @timestamp, @message, @logStream
| sort @timestamp desc
| filter @message like 'payload='

The actual debug messages look like:

[2022/05/06 22:43:17] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] cloudwatch:PutLogEvents: events=6, payload=553 bytes
[2022/05/06 22:43:17] [debug] [output:kinesis_firehose:kinesis_firehose.0] firehose:PutRecordBatch: events=10, payload=666000 bytes
[2022/05/06 22:43:17] [debug] [output:kinesis_streams:kinesis_streams.0] kinesis:PutRecords: events=4, payload=4000 bytes

When an input appends data to a chunk, a common message is logged; you can use it to check that the inputs are still active.

fields @timestamp, @message, @logStream
| sort @timestamp desc
| filter @message like 'update output instances with new chunk size'

Most chunk-related messages contain the word chunk, which can be used in queries.

For both of these, checking the frequency of debug messages may be more interesting than the events themselves.

invalid JSON message

Users who followed this tutorial or a similar one to send logs to the TCP input often see messages like:

invalid JSON message, skipping

Or:

[debug] [input:tcp:tcp.0] JSON incomplete, waiting for more data...

These are caused by setting Format json in the TCP input. If you see these errors, then your logs are not actually being sent as JSON. Change the format to none.

For example:

[INPUT]
    Name          tcp
    Listen        0.0.0.0
    Port          5170
# please read docs on Chunk_Size and Buffer_Size: https://docs.fluentbit.io/manual/pipeline/inputs/tcp
    Chunk_Size    32
# this number of kilobytes is the max size of single log message that can be accepted
    Buffer_Size   64
# If you set this to json you might get error: "invalid JSON message, skipping" if your logs are not actually JSON
    Format        none
    Tag           tcp-logs
# input will stop using memory and only use filesystem buffer after 50 MB
    Mem_Buf_Limit 50MB
    storage.type  filesystem 

caught signal SIGTERM

You may see that Fluent Bit exited with one of its last logs as:

[engine] caught signal (SIGTERM)

This means that Fluent Bit received a SIGTERM, which means it was requested to stop; SIGTERM is associated with graceful shutdown. Fluent Bit will be sent a SIGTERM when your container orchestrator/runtime wants to stop its container/task/pod.

When Fluent Bit receives a SIGTERM, it will begin a graceful shutdown and will attempt to send all currently buffered logs. Its inputs will be paused, and it will shut down once the configured Grace period has elapsed. Please note that the default Grace period for Fluent Bit is only 5 seconds, whereas in most container orchestrators the default grace period is 30 seconds.

If you see this in your logs and you are troubleshooting a stopped task/pod, this log indicates that Fluent Bit did not cause the shutdown. If there is another container in your task/pod, check whether it exited unexpectedly.

If you are an Amazon ECS FireLens user and you see your task status as:

STOPPED (Essential container in task exited)

And Fluent Bit logs show this SIGTERM message, then this indicates that Fluent Bit did not cause the task to stop. One of the other containers exited (and was essential), thus ECS chose to stop the task and sent a SIGTERM to the remaining containers in the task.

caught signal SIGSEGV

If you see a SIGSEGV in your Fluent Bit logs, then unfortunately you have encountered a bug! This means that Fluent Bit crashed due to a segmentation fault, an internal memory access issue. Please cut us an issue on GitHub, and consider checking out the SIGSEGV debugging section of this guide.

Too many open files

You might see errors like:

accept4: Too many open files
[error] [input:tcp:tcp.4] could not accept new connection

Upstream issue for reference.

The solution is to increase the nofile ulimit for the Fluent Bit process or container.

In Amazon ECS, ulimit can be set in the container definition.

In Kubernetes with Docker, you need to edit the /etc/docker/daemon.json to set the default ulimits for containers. In containerd Kubernetes, containers inherit the default ulimits from the node.
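For reference, here is a sketch of raising the nofile limit; the 65535 value is illustrative and should be sized for your connection volume. In an ECS container definition:

"ulimits": [
    {
        "name": "nofile",
        "softLimit": 65535,
        "hardLimit": 65535
    }
]

And in /etc/docker/daemon.json on Docker-based nodes:

{
    "default-ulimits": {
        "nofile": {
            "Name": "nofile",
            "Soft": 65535,
            "Hard": 65535
        }
    }
}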

storage.path cannot initialize root path

[2023/02/03 23:38:21] [error] [storage] [chunkio] cannot initialize root path /var/log/flb-storage/
[2023/02/03 23:38:21] [error] [lib] backend failed
[2023/02/03 23:38:21] [error] [storage] error initializing storage engine

This happens when the storage.path in the [SERVICE] section is set to a path that is on a read-only volume.
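If you run Fluent Bit as a Kubernetes daemonset, a minimal sketch of a writable hostPath mount for the storage path (assuming storage.path /var/log/flb-storage/) looks like:

    containers:
      - name: fluent-bit
        volumeMounts:
          - name: flb-storage
            mountPath: /var/log/flb-storage/
            readOnly: false   # the storage.path volume must be writable
    volumes:
      - name: flb-storage
        hostPath:
          path: /var/log/flb-storage/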

multiline parser 'parser_name' not registered

[2023/05/22 04:03:11] [error] [multiline] parser 'parser_name' not registered
[2023/05/22 04:03:11] [error] [input:tail:tail.0] could not load multiline parsers
[2023/05/22 04:03:11] [error] Failed initialize input tail.0
[2023/05/22 04:03:11] [error] [lib] backend failed

This error happens with the following config in the tail input or the multiline filter:

multiline.parser parser_name

The causes are:

  1. The name referenced is incorrect.
  2. Fluent Bit could not load your multiline parser file or you did not specify your multiline parser file with parsers_file in the [SERVICE] section.
  3. You attempted to use a standard [PARSER] as a [MULTILINE_PARSER]. The two are completely different and must be loaded from different files. See the documentation.
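As a sketch, a multiline parser must be defined in its own parser file that is loaded via parsers_file; the file path and parser name below are hypothetical:

# /fluent-bit/etc/parsers_multiline.conf
[MULTILINE_PARSER]
    name          multiline-java
    type          regex
    flush_timeout 1000
    # rules: state name | regex pattern | next state
    rule          "start_state" "/^\d{4}-\d{2}-\d{2}/" "cont"
    rule          "cont"        "/^\s+at\s/"           "cont"

# main configuration
[SERVICE]
    parsers_file /fluent-bit/etc/parsers_multiline.conf

[INPUT]
    Name             tail
    Path             /var/log/app/*.log
    multiline.parser multiline-java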

CloudWatch output parsing template error

Go cloudwatch plugin error:

time="2023-05-18T15:54:43Z" level=error msg="[cloudwatch 0] parsing log_group_name template '/aws/containerinsights/eksdemo1/$(kubernetes['namespace_name'])' (using value of default_log_group_name instead): kubernetes['namespace_name']: tag name not found" 

C cloudwatch_logs plugin error:

[record accessor] translation failed, root key=kubernetes

Both CloudWatch outputs have templating support for log group and stream names.

A typical problem is failure in the templating. These errors mean that Fluent Bit could not find the requested metadata keys in a log message.

This most commonly happens due to the tag match pattern on the output. The output matches logs that do not contain the desired metadata.

Consider the example configuration below which has a tail input for kubernetes pod logs and a systemd input for kubelet logs. The kubernetes filter adds metadata to the pod logs from the tail input. However, kubernetes pod metadata will not be added to the kubelet logs. The output matches * which means all logs, and thus templating fails when the systemd kubelet logs are processed by the output.

[INPUT]
    name              tail
    path              /var/log/containers/*.log
    tag               app.*
    multiline.parser  docker, cri

[INPUT]
    Name            systemd
    Tag             host.*
    Systemd_Filter  _SYSTEMD_UNIT=kubelet.service

[FILTER]
    Name             kubernetes
    Match            app.*
    Kube_URL         https://kubernetes.default.svc:443
    Kube_Tag_Prefix  app.var.log.containers.

[OUTPUT]
    Name              cloudwatch
    Match             *
    region            us-east-1
    log_group_name    /eks/$(kubernetes['namespace_name'])/$(kubernetes['pod_name'])
    log_stream_name   $(kubernetes['namespace_name'])/$(kubernetes['container_name'])
    auto_create_group true

This config can be fixed by changing the output match pattern to Match app.* and adding a second output for the systemd logs:

[INPUT]
    name              tail
    path              /var/log/containers/*.log
    tag               app.*
    multiline.parser  docker, cri

[INPUT]
    Name            systemd
    Tag             host.*
    Systemd_Filter  _SYSTEMD_UNIT=kubelet.service

[FILTER]
    Name             kubernetes
    Match            app.*
    Kube_URL         https://kubernetes.default.svc:443
    Kube_Tag_Prefix  app.var.log.containers.

[OUTPUT]
    Name              cloudwatch
    Match             app.*
    region            us-east-1
    log_group_name    /eks/$(kubernetes['namespace_name'])/$(kubernetes['pod_name'])
    log_stream_name   $(kubernetes['namespace_name'])/$(kubernetes['container_name'])
    auto_create_group true

[OUTPUT]
    Name              cloudwatch
    Match             host.*
    region            us-east-1
    log_group_name    /eks/host
    log_stream_prefix ${HOSTNAME}-
    auto_create_group true

Log4j TCP Appender Write Failure

Many customers use log4j to emit logs, and use the TCP appender to write to a Fluent Bit TCP input. ECS customers can follow this ECS Log Collection Tutorial.

Unfortunately, log4j can sometimes fail to write to the TCP socket and you may get an error like:

2022-09-05 14:18:49,694 Log4j2-TF-1-AsyncLogger[AsyncContext@6fdb1f78]-1 ERROR Unable to write to stream TCP:localhost:5170 for appender ApplicationTcp org.apache.logging.log4j.core.appender.AppenderLoggingException: Error writing to TCP:localhost:5170 after reestablishing connection for localhost/127.0.0.1:5170

Or:

Caused by: java.net.SocketException: Connection reset by peer (Write failed)

If this occurs, your first step is to check if it is simply a misconfiguration. Ensure you have configured a Fluent Bit TCP input for the same port that log4j is using.

Unfortunately, AWS has received occasional reports that log4j TCP appender will fail randomly in a previously working setup. AWS is tracking reports of this problem here: aws#294

While we can not prevent connection failures from occurring, we have found several mitigations.

Mitigation 1: Enable Workers for all outputs

AWS testing has found that TCP input throughput is maximized when Fluent Bit is performing well. If you enable at least one worker for all outputs, this ensures your outputs each have their own thread(s). This frees up the main Fluent Bit scheduler/input thread to focus on the inputs.

It should be noted that starting in Fluent Bit 1.9, most outputs have 1 worker enabled by default. This means that if you do not specify workers in your output config, you still get 1 worker. You must set workers 0 to disable workers. All AWS outputs have had 1 worker enabled by default since v1.9.4.

Mitigation 2: Log4J Failover to STDOUT with Appender Pattern

If writes to the Fluent Bit TCP input fail, you can configure a failover appender. The Log4j XML below configures failed logs to be re-emitted to STDOUT. It can be difficult to find these failed logs if STDOUT is cluttered with other non-log4j output, so you might prefer a special pattern for logs that fail over and go to stdout:

<Console name="STDOUT" target="SYSTEM_OUT" ignoreExceptions="false">
    <PatternLayout>
    <Pattern> {"logtype": "tcpfallback", "message": "%d %p %c{1.} [%t] %m%n" }</Pattern>
    </PatternLayout>
</Console>

You can then configure this appender as the failover appender:

<Failover name="Failover" primary="RollingFile">
    <Failovers>
        <AppenderRef ref="STDOUT"/>
    </Failovers>
</Failover>
Mitigation 3: Log4J Failover to a File

Alternatively, you can use a failover appender which just writes failed logs to a file. A Fluent Bit Tail input can then collect the failed logs.
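As a rough sketch (the file path and appender name here are hypothetical), the log4j file appender and a matching Fluent Bit tail input could look like:

<File name="FailoverFile" fileName="/var/output/logs/log4j-failover.log">
    <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
</File>

[INPUT]
    Name            tail
    Path            /var/output/logs/log4j-failover.log
    Tag             log4j-failover
    Skip_Long_Lines On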

Basic Techniques

First things to try when something goes wrong?

When you experience a problem, before you cut us an issue, try the following.

  1. Search this guide for similar errors or problems.
  2. Search GitHub issues on this repo and upstream for similar problems or reports.
  3. Enable debug logging
  4. If you use Amazon ECS FireLens, enable the AWSLogs driver for Fluent Bit as shown in all of our examples.
  5. Try upgrading to our latest or latest stable version.
  6. If you are using a Core Fluent Bit C plugin, consider switching to the Go plugin version if possible. Or switch from Go to C. See our guide on C vs Go plugins. Often an issue will only affect one version of the plugin.
  7. See our Centralized Issue Runbook for deeper steps and links for investigating any issue.

Enable Debug Logging

Many Fluent Bit problems can be easily understood once you have full log output. Also, if you want help from the aws-for-fluent-bit team, we generally request/require debug log output.

The log level for Fluent Bit can be set in the Service section, or by setting the env var FLB_LOG_LEVEL=debug.
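For example, in the config file:

[SERVICE]
    Log_Level debug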

Enabling Monitoring for Fluent Bit

Kubernetes users should use the Fluent Bit prometheus endpoint and their preferred prometheus compatible agent to scrape the endpoint.

FireLens users should check out the FireLens example for sending Fluent Bit internal metrics to CloudWatch: amazon-ecs-firelens-examples/send-fb-metrics-to-cw.

The FireLens example technique shown in the link above can also be used outside of FireLens if desired.

DNS Resolution Issues

Fluent Bit has network settings that can be used with all official upstream output plugins: Fluent Bit networking.

If you experience network issues, try setting the net.dns.mode setting to either TCP or UDP. That is, try both and see which one resolves your issues. We have found that in different cases, one setting works better than the other.

Here is an example of what this looks like with TCP:

[OUTPUT]
    Name cloudwatch_logs
    Match   *
    region us-east-1
    log_group_name fluent-bit-cloudwatch
    log_stream_prefix from-fluent-bit-
    auto_create_group On
    workers 1
    net.dns.mode TCP

And here is an example with UDP:

[OUTPUT]
    Name cloudwatch_logs
    Match   *
    region us-east-1
    log_group_name fluent-bit-cloudwatch
    log_stream_prefix from-fluent-bit-
    auto_create_group On
    workers 1
    net.dns.mode UDP

There is also another setting, called net.dns.resolver which takes values LEGACY and ASYNC. ASYNC is the default. We recommend trying out the settings above and if that does not resolve your issue, try changing the resolver.

Here's an example with LEGACY:

[OUTPUT]
    Name cloudwatch_logs
    Match   *
    region us-east-1
    log_group_name fluent-bit-cloudwatch
    log_stream_prefix from-fluent-bit-
    auto_create_group On
    workers 1
    net.dns.resolver LEGACY

These settings work with the cloudwatch_logs, s3, kinesis_firehose, kinesis_streams, and opensearch AWS output plugins.

Credential Chain Resolution Issues

The same as other AWS tools, Fluent Bit has a standard chain of resolution for credentials. It checks a series of sources for credentials and uses the first one that returns valid credentials. If you enable debug logging (covered further up in this guide), it will print verbose information on the sources of credentials it checked and why they failed.

See our documentation on the credential sources Fluent Bit supports and the order of resolution.

EC2 IMDSv2 Issues

If you use EC2 instance role as your source for credentials you might run into issues with the new IMDSv2 system. Check out the EC2 docs.

If you run Fluent Bit in a container, you will most likely need to set the hop limit to two if tokens are required (IMDSv2 style):

aws ec2 modify-instance-metadata-options \
     --instance-id <my-instance-id> \
     --http-put-response-hop-limit 2

The AWS cli can also be used to require IMDSv2 for security purposes:

aws ec2 modify-instance-metadata-options \
     --instance-id <my-instance-id> \
     --http-tokens required \
     --http-endpoint enabled

On the EKS platform, another way to work around the IMDS issue, without setting the hop limit to 2, is to set hostNetwork to true with dnsPolicy set to ClusterFirstWithHostNet:

    spec:
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true

Checking Batch sizes

The AWS APIs for CloudWatch, Firehose, and Kinesis have limits on the number of events and number of bytes that can be sent in each request. If you enable debug logging with AWS for Fluent Bit version 2.26.0+, then Fluent Bit will log debug information as follows for each API request:

[2022/05/06 22:43:17] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] cloudwatch:PutLogEvents: events=6, payload=553 bytes
[2022/05/06 22:43:17] [debug] [output:kinesis_firehose:kinesis_firehose.0] firehose:PutRecordBatch: events=10, payload=666000 bytes
[2022/05/06 22:43:17] [debug] [output:kinesis_streams:kinesis_streams.0] kinesis:PutRecords: events=4, payload=4000 bytes

You can then experiment with different settings for the Flush interval and find the lowest value that leads to efficient batching (so that in general, batches are close to the max allowed size):

[Service]
    Flush 2

Searching old issues

When you first see an issue that you don't understand, the best option is to check open and closed issues in both this repo (aws-for-fluent-bit) and the upstream fluent/fluent-bit repo.

Lots of common questions and issues have already been answered by the maintainers.

Downgrading or upgrading your version

If you're running on an old version, it may have been fixed in the latest release: https://github.com/aws/aws-for-fluent-bit/releases

Or, you may be facing an unresolved or not-yet-discovered issue in a newer version. Downgrading to our latest stable version may work: https://github.com/aws/aws-for-fluent-bit#using-the-stable-tag

In the AWS for Fluent Bit repo, we always attempt to tag bug reports with which versions are affected, for example see this query.

Network Connection Issues

One of the simplest causes of network connection issues is throttling; some AWS APIs will block new connections from the same IP when throttling (rather than wasting effort returning a throttling error in the response). We have seen this with the CloudWatch Logs API. So, the first thing to check when you experience network connection issues is your log ingestion/throughput rate and the limits for your destination.

Please see Common Network Errors for a list of common network errors outputted by Fluent Bit.

See Case Studies: Rate of Network Errors for a real world example of the frequency of these errors.

In addition, Fluent Bit's networking library and http client is very low level and simple. It can be normal for there to be occasional dropped connections or client issues. Other clients may silently auto-retry these transient issues; Fluent Bit by default will always log an error and issue a full retry with backoff for any networking error.

This is only relevant for the core outputs, which for AWS are:

  • cloudwatch_logs
  • kinesis_firehose
  • kinesis_streams
  • s3

Remember that we also have older Golang output plugins for AWS (no longer recommended since they have poorer performance). These do not use the core fluent bit http client and networking library:

  • cloudwatch
  • firehose
  • kinesis

Consequently, the AWS team contributed an auto_retry_requests config option for each of the core AWS outputs. This will issue an immediate retry for any connection issues.
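The option defaults to true on these outputs; a minimal sketch of setting it explicitly:

[OUTPUT]
    Name                cloudwatch_logs
    Match               *
    region              us-east-1
    log_group_name      fluent-bit-cloudwatch
    log_stream_prefix   from-fluent-bit-
    # immediately retry a request once on a connection error before
    # handing the chunk back to the engine for a scheduled retry
    auto_retry_requests true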

Memory Leaks or high memory usage

High Memory usage does not always mean there is a leak/bug

Fluent Bit is efficient and performant, but it isn't magic. Here are common/normal causes of high memory usage:

  • Sending logs at high throughput
  • The Fluent Bit configuration and components used. Each component and extra enabled feature may incur some additional memory. Some components are more efficient than others. To give a few examples- Lua filters are powerful but can add significant overhead. Multiple outputs means that each output maintains its own request/response/processing buffers. And the rewrite_tag filter is actually implemented as a special type of input plugin that gets its own input mem_buf_limit.
  • Retries. This is the most common cause of complaints of high memory usage. When an output issues a retry, the Fluent Bit pipeline must buffer that data until it is successfully sent or the configured retries expire. Since retries have exponential backoff, a series of retries can very quickly lead to a large number of logs being buffered for a long time.

Please see the Fluent Bit buffering documentation: https://docs.fluentbit.io/manual/administration/buffering-and-storage

FireLens OOMKill Prevention Guide

Check out the FireLens example for preventing/reducing out of memory exceptions: amazon-ecs-firelens-examples/oomkill-prevention

Testing for real memory leaks using Valgrind

If you do think the cause of your high memory usage is a bug in the code, you can optionally help us out by attempting to profile Fluent Bit for the source of the leak. To do this, use the Valgrind tool.

There are some caveats when using Valgrind. It's a debugging tool, and so it significantly reduces the performance of Fluent Bit. Fluent Bit will consume more memory and CPU and generally be slower when run under Valgrind. Therefore, Valgrind images may not be safe to deploy to prod.

Option 1: Build from prod release (Easier but less robust)

Use the Dockerfile below to create a new container image that invokes your Fluent Bit version under Valgrind. Replace the "builder" image with whatever AWS for Fluent Bit release you are using; in this case, we are testing the latest image. The Dockerfile here copies the original binary into a new image with Valgrind. Valgrind is a little bit like a mini-VM that runs the Fluent Bit binary and inspects the code at runtime. When you terminate the container, Valgrind will output diagnostic information on shutdown, with a summary of the memory leaks it detected. This output will allow us to determine which part of the code caused the leak. Valgrind can also be used to find the source of segmentation faults.

FROM public.ecr.aws/aws-observability/aws-for-fluent-bit:latest as builder

FROM public.ecr.aws/amazonlinux/amazonlinux:2
RUN yum upgrade -y \
    && yum install -y openssl11-devel \
          cyrus-sasl-devel \
          pkgconfig \
          systemd-devel \
          zlib-devel \
          libyaml \
          valgrind \
          nc && rm -fr /var/cache/yum

COPY --from=builder /fluent-bit /fluent-bit
CMD valgrind --leak-check=full --error-limit=no /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
Option 2: Debug Build (More robust)

The best option, which is most likely to catch any leak or segfault is to create a fresh build of the image using the make debug-valgrind target. This will create a fresh build with debug mode and valgrind support enabled, which gives the highest chance that Valgrind will be able to produce useful diagnostic information about the issue.

  1. Check out the git tag for the version that saw the problem
  2. Make sure the FLB_VERSION at the top of the scripts/dockerfiles/Dockerfile.build is set to the same version as the main Dockerfile for that tag.
  3. Build this dockerfile with the make debug-valgrind target. The image will be tagged with the amazon/aws-for-fluent-bit:debug-valgrind tag.
Other Options: Other Debug Builds

Several debug targets can be built via make <target-name> Here's a list of debug targets and what they do:

  • debug and debug-s3: these debug targets upload crash symbols, such as a compressed core dump, stack traces, and crashed Fluent Bit executable to an S3 bucket of your choosing. Symbols are also stored in the /cores directory. You can access this via a mounted volume. Stack trace is also printed out on crash. More information on how to use the debug image can be found in: Tutorial: Debugging Fluent Bit with GDB.
  • debug-fs: this is a light weight debug target that simply outputs the coredump crash file to the /cores directory which can be mounted to by a Docker volume. This image does not upload crash symbols to S3, and does not process stack traces.
  • debug-valgrind: this is an image that runs Fluent Bit with Valgrind to catch leaks and segfaults.
  • init-debug and init-debug-s3: same as debug and debug-s3 images but with the init process
  • init-debug-fs: same as debug-fs but with the init process.
  • main-debug-all: builds debug-s3, debug-fs, and debug-valgrind efficiently without rebuilding dependencies.
  • init-debug-all: builds init-debug-s3, and init-debug-fs efficiently without rebuilding dependencies.

Segfaults and crashes (SIGSEGV)

One technique is to use the Option 2 shown above for memory leak checking with Valgrind. It can also find the source of segmentation faults; when Fluent Bit crashes Valgrind will output a stack trace that shows which line of code caused the crash. That information will allow us to fix the issue. This requires the use of Option 2: Debug Build above where you re-build AWS for Fluent Bit in debug mode.

Alternatively, the best and most effective way to debug a crash is to get a core dump.

Tutorial: Obtaining a Core Dump from Fluent Bit

Follow our standalone core dump debugging tutorial which supports multiple AWS platforms: Tutorial: Debugging Fluent Bit with GDB.

Log Loss

The most common cause of log loss is failed retries. Check your Fluent Bit logs for those error messages. Figure out why retries are failing, and then solve that issue.

If you find that you are losing logs and failed retries do not seem to be the issue, then the technique commonly used by the aws-for-fluent-bit team is to use a data generator that creates logs with unique IDs. At the log destination, you can then count the unique IDs and know exactly how many logs were lost. These unique IDs should ideally be some sort of number that counts up. You can then determine patterns in the lost logs, for example: are the lost logs grouped in chunks together, do they all come from a single log file, etc.

We have uploaded some tools that we have used for past log loss investigations in the troubleshooting/tools directory.

Finally, we should note that the AWS plugins go through log loss testing as part of our release pipeline, and log loss will block releases. We are working on improving our test framework; you can find it in the integ/ directory.

We also now have log loss and throughput benchmarks as part of our release notes.

Log Loss Summary: Common Causes

The following are typical causes of log loss. Each of these have clear error messages associated with them that you can find in the Fluent Bit log output.

  1. Exhausted Retries
  2. Input paused due to buffer overlimit
  3. Tail input skips files or lines
  4. Log Driver Buffer Limit in Amazon ECS FireLens: In some cases, the default buffer limit may not be large enough for high throughput logs. Pick a value based on your throughput that ensures the log driver can buffer all log output from your container for at least 5 seconds. In the AWS for Fluent Bit load tests, we set the buffer to the max allowed value. On EC2, if the buffer overflows the Docker Daemon logs in the system journal will have messages like: fluent#appendBuffer: Buffer full, limit 1048576.
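For the FireLens log driver buffer limit above, here is a sketch of the relevant piece of the application container's logConfiguration; the option values are illustrative and log-driver-buffer-limit is specified in bytes:

"logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
        "Name": "cloudwatch_logs",
        "region": "us-east-1",
        "log_group_name": "my-app-logs",
        "log_stream_prefix": "app-",
        "log-driver-buffer-limit": "536870912"
    }
}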

In rare cases, we have also seen that a lack of log file deletion and suboptimal tail settings can cause a slowdown in Fluent Bit and loss of logs; see the tail input sections under Plugin Specific Issues.

Log Delay

Sometimes, Fluent Bit may deliver log messages many minutes after they were produced. If you are sending EMF metrics as logs to CloudWatch this may cause missing data points.

Log delay can happen for the following reasons:

  1. Fluent Bit is unable to handle your throughput of logs or a burst in throughput of logs.
  2. Retries that succeed after backoff cause logs to be delivered some time after they were collected. Our How do I tell if I am losing logs? section shows retry messages. The delay/backoff for retries can be configured in the [SERVICE] section.

Scaling

While the Fluent Bit maintainers are constantly working to improve its max performance, there are limitations. Carefully architecting your Fluent Bit deployment can ensure it can scale to meet your required throughput.

If you are handling very high throughput of logs, consider the following:

  1. Switch to a sidecar deployment model if you are using a Daemon/Daemonset. In the sidecar model, there is a dedicated Fluent Bit container for each application container/pod/task. This means that as you scale out to more app containers, you automatically scale out to more Fluent Bit containers as well.
  2. Enable Workers. Workers is a feature for multi-threading in core outputs. Even enabling a single worker can help, as it is a dedicated thread for that output instance. All of the AWS core outputs (cloudwatch_logs, kinesis_firehose, kinesis_streams, and s3) support workers; see the example after this list.
  3. Use a log streaming model instead of tailing log files. The AWS team has repeatedly found that complaints of scaling problems usually are cases where the tail input is used. Tailing log files and watching for rotations is inherently costly and slower than a direct streaming model. Amazon ECS FireLens is one good example of direct streaming model (that also uses sidecar for more scalability)- logs are sent from the container stdout/stderr stream via the container runtime over a unix socket to a Fluent Bit forward input. Another option is to use the TCP input which can work with some loggers, including log4j.
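For item 2 above, a minimal sketch of enabling a worker on an output:

[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    fluent-bit-cloudwatch
    log_stream_prefix from-fluent-bit-
    # dedicated output thread; most outputs already default to 1 worker in 1.9+
    workers           1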

Throttling, Log Duplication & Ordering

To understand how to deal with these issues, it's important to understand the Fluent Bit log pipeline. Logs are ingested from inputs and buffered in chunks, and then chunks are delivered to the output plugin. When an output plugin receives a chunk, it only has 3 options for its return value to the core pipeline:

  • FLB_OK: the entire chunk was successfully sent/handled
  • FLB_RETRY: retry the entire chunk up to the max user configured retries
  • FLB_ERROR: something went very wrong, don't retry

Because AWS APIs have a max payload size, and the size of the chunks sent to the output is not directly user configurable, a single chunk may be sent in multiple API requests. If one of those requests fails, the output plugin can only tell the core pipeline to retry the entire chunk. This means that send failures, including throttling, can lead to duplicated data, because when the retry happens, the output plugin has no way to know that part of the chunk was already uploaded.

The following is an example of Fluent Bit log output indicating that a retry is scheduled for a chunk:

 [ warn] [engine] failed to flush chunk '1-1647467985.551788815.flb', retry in 9 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.0 (out_id=0)

Additionally, while a chunk that got an FLB_RETRY will be retried with exponential backoff, the pipeline will continue to send newer chunks to the output in the meantime. This means that data ordering can not be strictly guaranteed. The exception to this is the S3 plugin with the preserve_data_ordering feature, which works because the S3 output maintains its own separate buffer of data in the local file system.

Finally, this behavior means that if a chunk fails because of throttling, that specific chunk will back off, but the engine will not back off from sending new logs to the output. Consequently, it's very important to ensure that your destination can handle your log ingestion rate and that you will not be throttled.

The AWS team understands that the current behavior of the core log pipeline is not ideal for all use cases, and we have taken long term action items to improve it.

Log Duplication Summary

To summarize, log duplication is known to occur for the following reasons:

  1. Retries for chunks that already partially succeeded (explained above). This type of duplication is much more likely when you use Kinesis Streams or Kinesis Firehose as a destination, as explained in the kinesis_streams or kinesis_firehose partially succeeded batches or duplicate records section of this guide.
  2. If you are using the tail input and your Path pattern matches rotated log files, this misconfiguration can lead to duplicated logs.

Recommendations for throttling

  • If you are using Kinesis Streams or Kinesis Firehose, scale your stream to handle your max required throughput.
  • CloudWatch Logs no longer has a per-log-stream ingestion limit.
  • Read our Checking Batch Sizes section to optimize how Fluent Bit flushes logs. This optimization can help ensure Fluent Bit does not make more requests than is necessary. However, please understand that in most cases of throttling, there is no setting change in the Fluent Bit client that can eliminate (or even reduce) throttling. Fluent Bit will and must send logs as quickly as it ingests them otherwise there will be a backlog/backpressure.
  • The most ideal solution to throttling is almost always to scale up your destination to handle more incoming log records. If you can do that, then you do not need to modify your deployment.
  • Scale out to more Fluent Bit instances. If you deployed a side-car model, this would mean scaling out to more tasks/pods so that each performs less work, emits logs at a slower rate, and thus each Fluent Bit has less work to process. If you deployed a daemon, then scaling out to more nodes/instances would have the same effect. This only works for throttling that is based on the rate at which a single Fluent Bit instance/connection sends data (vs throttling based on the total rate that all instances are sending at).

Plugin Specific Issues

Tail input Ignore_Older

The tail input has a setting called Ignore_Older:

Ignores files which modification date is older than this time in seconds. Supports m,h,d (minutes, hours, days) syntax.

We have seen that if there are a large number of files on disk matched by the Fluent Bit tail Path, then Fluent Bit may become slow and may even lose logs. This happens because it must process watch events for each file. Consider setting up log file deletion and set Ignore_Older so that Fluent Bit can stop watching older files.

We have a Log File Deletion example. Setting up log file deletion is required to ensure Fluent Bit tail input can achieve ideal performance. While Ignore_Older will ensure that it does not process events for older files, the input will still continuously scan your configured Path for all log files that it matches. If you have lots of old files that match your Path, this will slow it down. See the log file deletion link below.

Tail input performance issues and log file deletion

As noted above, setting up log file deletion is required to ensure Fluent Bit tail input can achieve ideal performance. If you have lots of old files that match your Path, this will slow it down.

We have a Log File Deletion example.

rewrite_tag filter and cycles in the log pipeline

The rewrite_tag filter is a very powerful part of the log pipeline. It allows you to define a filter that is actually an input that will re-emit log records with a new tag.

However, with great power comes great responsibility... a common mistake that causes confusing issues is for the rewrite_tag filter to match the logs that it re-emits. This will make the log pipeline circular, and Fluent Bit will just get stuck. Your rewrite_tag filter should never match "*" and you must think very carefully about how you are configuring the Fluent Bit pipeline.
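As an illustration, the sketch below re-tags error records from the app.* tag into an error.* tag; because the new tag does not match the filter's own Match pattern, the re-emitted records are not processed by the filter again (the rule regex and tag names here are hypothetical):

[FILTER]
    Name          rewrite_tag
    Match         app.*
    # Rule format: $KEY  REGEX  NEW_TAG  KEEP
    Rule          $log ^.*ERROR.* error.$TAG false
    Emitter_Name  re_emitted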

Use Rubular site to test regex

To experiment with regex for parsers, the Fluentd and Fluent Bit community recommends using this site: https://rubular.com/

Chunk_Size and Buffer_Size for large logs in TCP Input

Many users will have seen our tutorial on sending logs over the Fluent Bit TCP input.

It should be noted that if your logs are very large, they can be dropped by the TCP input unless the appropriate settings are set.

For the TCP input, Buffer_Size is the max size of a single log event that the input can accept. Chunk_Size controls rounds of memory allocation; it should be set based on your average/median log size. This setting is for performance and exposes the low level memory allocation strategy. https://docs.fluentbit.io/manual/pipeline/inputs/tcp

The following configuration may be more efficient in terms of memory consumption, if logs are normally smaller than Chunk_Size and TCP connections are reused to send logs many times. It uses smaller amounts of memory by default for the TCP buffer, and reallocates the buffer in increments of Chunk_Size if the current buffer size is not sufficient. It will be slightly less computationally efficient if logs are consistently larger than Chunk_Size, as reallocation will be needed with each new TCP connection. If the same TCP connection is reused, the expanded buffer will also be reused. (TCP connections are probably kept open for a long time so this is most likely not an issue).

    Buffer_Size 64 # KB, this is the max log event size that can be accepted
    Chunk_Size 32 # allocation size each time buffer needs to be increased

kinesis_streams or kinesis_firehose partially succeeded batches or duplicate records

As noted in the duplication section of this guide, unfortunately Fluent Bit can sometimes upload the same record multiple times due to retries that partially succeeded. While this is a problem with Fluent Bit outputs in general, it is especially a problem for Firehose and Kinesis since their service APIs allow a request to upload data to partially succeed:

Each API accepts 500 records in a request, and some of these records may be uploaded to the stream and some may fail. As noted in the duplication section of this guide, when a chunk partially succeeds, Fluent Bit must retry the entire chunk. That retry could again fail with partial success, leading to another retry (up to your configured Retry_Limit). Thus, log duplication can be common with these outputs.

When a batch partially fails, you will get a message like:

[error] [output:kinesis_streams:kinesis_streams.0] 132 out of 500 records failed to be delivered, will retry this batch, {stream name}

Or:

[error] [output:kinesis_firehose:kinesis_firehose.0] 6 out of 500 records failed to be delivered, will retry this batch, {stream name}

To see the reason why each individual record failed, please enable debug logging. These outputs will print the exact error message for each record as a debug log.

The most common cause of failures for kinesis and firehose is a Throughput Exceeded Exception. This means you need to increase the limits/number of streams for your Firehose or Kinesis stream.

Always use multiline in the tail input

Fluent Bit supports a multiline filter which can concatenate logs.

However, Fluent Bit also supports multiline parsers in the tail input directly. If your logs are ingested via tail, you should always use the multiline settings in the tail input; that is much more performant. Fluent Bit can concatenate the log lines as it reads them, which is more efficient than ingesting them as individual records and then using the filter to re-concatenate them.
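For example, the built-in docker and cri multiline parsers can be applied directly in the tail input:

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               app.*
    # concatenate multiline records at read time instead of using a separate multiline filter
    multiline.parser  docker, cri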

Tail input duplicates logs because of rotation

As noted in the Fluent Bit upstream docs, your tail Path patterns should not match rotated log files. If they do, the contents of a log file may be read a second time once its name is changed.

For example, if you have this tail input:

[INPUT]
    Name                tail
    Tag                 RequestLogs
    Path                /logs/request.log*
    Refresh_Interval    1

Assume that after the request.log file reaches its max size it is rotated to request.log1 and then new logs are written to a new file called request.log. This is the behavior of the log4j DefaultRolloverStrategy.

If this is how your app writes log files, then you will get duplicate log events with this configuration, because the Path pattern matches the same log file both before and after it is rotated. Fluent Bit may read request.log1 as a new log file and thus its contents can be sent twice. In this case, the solution is to change the path to:

Path                /logs/request.log

In other cases, the original config would be correct. For example, assume your app first produces a log file request.log, then creates and writes new logs to a new file named request.log1, and then request.log2. When the max number of files is reached, the original request.log is deleted and a new request.log is written. If that is the scenario, then the original Path pattern with a wildcard at the end is correct, since it is necessary to match all log files where new logs are written. This method of log rotation is the behavior of the log4j DirectWriteRolloverStrategy, where there is no file renaming.

Both CloudWatch Plugins: Create failures consume retries

In Fluent Bit, retries are handled by the core engine. Whenever the plugin encounters any error, it returns a retry to the engine which schedules a retry. This means that log group creation, log stream creation or log retention policy calls can consume a retry if they fail.

This applies to both the CloudWatch Go Plugin and the CloudWatch C Plugin.

This means that before the plugin even tries to send a chunk of log records, it may consume a retry on a random temporary networking error trying to call CreateLogStream.

In general, this should not cause problems, because log stream and group creation is rare and unlikely to fail.

If you are concerned about this, the solution is to set a higher value for Retry_Limit.

If you use the log_group_name or log_stream_name options, creation only happens on startup. If you use the log_stream_prefix option, creation happens the first time logs are sent with a new tag. Thus, in most cases resource creation is rare/only on startup. If you use the log group and stream templating options (these are different for each CloudWatch plugin, see Migrating to or from cloudwatch_logs C plugin to or from cloudwatch Go Plugin), then log groups and log streams are created on demand based on your templates.

EKS

Filesystem buffering and log files saturate node disk IOPs

Fluent Bit supports storage.type filesystem to buffer pending logs on disk for greater reliability. If Fluent Bit goes down, when it restarts it can pick up the pending disk buffers.

Thus, on Kubernetes most users run Fluent Bit as a daemonset pod with filesystem buffering. In Kubernetes, pod logs are written to files in /var/log/containers. Fluent Bit reads these log files, and buffers them to disk before sending. Remember that when you enable storage.type filesystem, Fluent Bit always buffers all collected logs to disk to ensure reliability. Consequently, each log emitted by pods on your node is written to the filesystem once by the container runtime, and then a second time by Fluent Bit.

In cases of high log output rate from pods, or a large number of pods per node, this can lead to saturation of IOPs on your nodes. This can freeze your nodes and make them non-responsive.

Consequently, we recommend carefully monitoring your EBS metrics in CloudWatch and considering switching to storage.type memory.

Best Practices

Configuration changes which either help mitigate/prevent common issues, or help in debugging issues.

Set Aliases on all Inputs and Outputs

Fluent Bit supports configuring an alias name for each input and output.

The Fluent Bit internal metrics and most error/warn/info/debug log messages use either the alias or the Fluent Bit defined input name for context. The Fluent Bit input name is the plugin name plus an ID number; for example, if you have 4 tail inputs you would have tail.0, tail.1, tail.2, and tail.3. The numbers are assigned based on the order in your configuration file. However, numbers are not easy to understand and follow; if you configure a human-meaningful Alias, the log messages and metrics will be easier to follow.

Example input definition with Alias:

[INPUT]
    Name  tail
    Alias ServiceMetricsLog
    ...

Example output definition with Alias:

[OUTPUT]
    Name cloudwatch_logs
    Alias SendServiceMetricsLog

Tail Config with Best Practices

First take a look at the sections of this guide specific to tail input:

A Tail configuration following the above best practices might look like:

[INPUT]
    Name tail
    Path { your path } # Make sure path does NOT match log files after they are renamed
    Alias VMLogs # A useful name for this input, used in log messages and metrics
    Buffer_Max_Size { A value larger than the longest line your app can produce}
    Skip_Long_Lines On # skip unexpectedly long lines instead of skipping the whole file
    multiline.parser # add if you have any multiline records to parse. Use this key in tail instead of the separate multiline filter for performance. 
    Ignore_Older 1h # ignore files not modified for 1 hour (->pick your own time)

Set Grace to 30

The Fluent Bit [SERVICE] section Grace setting defaults to 5 seconds. This is the grace period for shutdown. When Fluent Bit receives a SIGTERM, it will pause inputs and attempt to finish flushing all pending data. After the Grace period expires, Fluent Bit will attempt to shut itself down. Shutdown can take longer if there is still pending data that it is trying to flush.

In Amazon EKS and Amazon ECS, the shutdown grace period for containers (the time between a SIGTERM and a SIGKILL) defaults to 30 seconds. Therefore, we recommend setting:

[SERVICE]
    Grace 30

Fluent Bit Windows containers

Enabling debug mode for Fluent Bit Windows images

Debug mode can be enabled for Fluent Bit Windows images so that when Fluent Bit crashes, we are able to generate crash dumps which help in debugging the issue. This can be accomplished by overriding the entrypoint command to include the -EnableCoreDump flag. For example:

# If you are using the config file located at C:\fluent-bit\etc\fluent-bit.conf 
Powershell -Command C:\\entrypoint.ps1 -EnableCoreDump

# If you are using a custom config file located at other location say C:\fluent-bit-config\fluent-bit.conf. 
Powershell -Command C:\\entrypoint.ps1 -EnableCoreDump -ConfigFile C:\\fluent-bit-config\\fluent-bit.conf

The crash dumps will be generated at C:\fluent-bit\CrashDumps inside the container. Since the container will exit when Fluent Bit crashes, you need to bind mount this folder on the host so that the crash dumps are available even after the container exits.

Note: This functionality is available in all aws-for-fluent-bit Windows images with version 2.31.0 and above.

Caveat with debug mode

Please note that when you are running the Windows images in debug mode, the Golang plugins are not supported. This means that the cloudwatch, kinesis, and firehose plugins cannot be used at all in debug mode.

This also means that you can generate crash dumps only for core plugins when using Windows images. Therefore, we recommend that you use the corresponding core plugins cloudwatch_logs, kinesis_streams, and kinesis_firehose instead, which are available in both normal and debug mode.

Networking issue with Windows containers when using async DNS resolution by plugins

Some Fluent Bit core plugins use the default async mode for improved performance. Ideally, when there are multiple nameservers present in a network namespace, the DNS resolver should retry with the other nameservers if it receives a failure response from the first one. However, Fluent Bit version 1.9 had a bug wherein DNS resolution failed after the first failure. The Fluent Bit GitHub issue link is here.

This is an issue for Windows containers running in the default network, where the first DNS server is only used for DNS resolution within the container network (i.e. resolving other containers connected to the network). Therefore, all other DNS queries fail with the first nameserver.

To work around this issue, we suggest setting the following option so that Fluent Bit uses the system DNS resolver, which works as expected.

net.dns.mode LEGACY
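net.dns.mode is typically set per output, so add it to each output that performs DNS resolution. A minimal sketch (the group and prefix names are placeholders):

[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-west-2
    log_group_name    my-log-group
    log_stream_prefix my-prefix-
    auto_create_group true
    net.dns.mode      LEGACY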

Runbooks

Centralized Issue Runbook

This runbook collects all of the other runbooks and guidance for handling issue reports into a single guide. If you are a customer, this runbook covers the information we need from you, the mitigations we will recommend, and the steps we'll take to investigate your issue.

1. Information to request

The following information should be obtained from the customer.

  1. Full Fluent Bit configuration, including parser files, multiline parser files, stream processing files.
  2. Where/how is Fluent Bit deployed?
    • Deployment configuration: ECS Task Definition, K8s config map and deployment yaml, etc
  3. Which image version was used? AWS release or upstream release or custom build? Do you customize the Fluent Bit container entrypoint command?
    1. From logs: what version did I deploy?
    2. From Dockerfile:
      1. Dockerfile may directly specify a tag version
      2. Dockerfile may specify a SHA: How to find image tag version from SHA or image SHA for tag using regional ECR repos?
      3. Internal build: See internal runbook notes.
  4. Fluent Bit log output
    • Ideally we want log output from some time before the issue occurred. Debug log output is most ideal. How to enable debug logging.
    • What error or warn logs do you see?
  5. Investigate deep details on the reported issue.
    • How did you determine that the issue occurred? (ex: How many logs were lost, from which output, from what time frame, was it before or after Fluent Bit stopped/restarted?).
    • Is the issue associated with a specific plugin/set of plugins used in your configuration?
    • What steps have you attempted so far to mitigate/troubleshoot?
  6. Fluent Bit load/throughput: how much throughput does each output instance have to process? How to find log throughput?

2. Attempt to Mitigate First

The following options require thought and consideration; they should not simply be offered as rote suggestions.

Out of Memory Kill (OOMKill)
  1. Recommend Memory Limiting Guidance
  2. Follow General Issue Mitigations to check if high memory usage/OOMKill is associated with any known bugs.
Log loss report
  1. Check and recommend the notes in this guide first: Log Loss Common Causes.
  2. Follow General Issue Mitigations.
  3. Obtaining the Fluent Bit log send throughput is key: How to find log throughput? Our release notes contain log loss benchmarks.
Use Case Questions

Be careful to search these links for multiple terms that may match your need. We probably have an answer/example which is similar to what you want but it may not be titled with the exact question/problem you are experiencing.

Examples

Best Practices

General Issue Mitigations

Be careful to search these links for multiple terms that may match your need. We probably have an answer/example which is similar to what you want but it may not be titled with the exact question/problem you are experiencing.

  1. Search aws-for-fluent-bit/issues for similar reports.
  2. Search this guide for similar error messages or problems. Most common configuration errors and best practices that can solve issues are found in this document. Be careful to search for multiple terms, including parts of your error message and descriptions of the problem.
  3. Search aws-for-fluent-bit/releases for fixes that are related to the problem description.
  4. If internal, search the current known issues tracking document.
  5. Search upstream fluent/fluent-bit/issues for similar reports with clear mitigations/resolutions/analysis. Be careful because many upstream issues are the result of confusion among users and many issue reports that sound similar have different root causes and mitigations.
  6. If the customer is experiencing a problem with a Go or C AWS output plugin, recommend trying to migrate to the other version of the plugin. See Go vs C plugins.
  7. Try upgrading to a newer version: If the customer uses an older version, a newer version may have fixed the issue even if the issue is not explicitly called out in our release notes.
  8. Try downgrading to an older version. The issue may be a recently introduced bug that has not yet been discovered. Ideally this should be a previous stable version that was/is still widely used. Take care to test/check that the customer's configuration will still work with the older version and that it does not lack any features they use.
  9. Check/recommend known best practices. Even if these do not appear directly related to the issue, they should always be recommended.

3. Investigate Root Cause

Determining the root cause may take significant time and some expertise on Fluent Bit. I reported an issue, how long will it take to get fixed?

  1. Log Loss Investigation Runbook
  2. Crash Report Runbook
  3. Others: The links above in the mitigation section should give a hint as to where the root cause may be. For use case questions or common errors, the links in that section should have at least some guidance. For other problems, you may need to set up an isolated reproduction and investigate from there.

Crash Report Runbook

When you receive a SIGSEGV/crash report from a customer, perform the following steps.

1. Build and distribute a core dump S3 uploader image

For debug images, we update the debug-latest tag and add a tag as debug-<Version>. You can find them in Docker Hub, Amazon ECR Public Gallery and Amazon ECR. If you need a customized image build for the specific version/case you are testing, make sure the ENV FLB_VERSION is set to the right version in the Dockerfile.build and that the AWS_FLB_CHERRY_PICKS file has the right contents for the release you are testing.

Then simply run:

make debug

The resulting image will be tagged amazon/aws-for-fluent-bit:debug and amazon/aws-for-fluent-bit:debug-s3.

Push this image to AWS (ideally public ECR) so that you and the customer can download it.

Send the customer a comment like the following:

If you can deploy this image, in an env which easily reproduces the issue, we can obtain a "core dump" when Fluent Bit crashes. This image will run normally until Fluent Bit crashes, then it will run an AWS CLI command to upload a compressed "core dump" to an S3 bucket. You can then send that zip file to us, and we can use it to figure out what's going wrong.

If you choose to deploy this, for the S3 upload on shutdown to work, you must:

  1. Set the following env vars: a. S3_BUCKET => an S3 bucket that your task can upload to. b. S3_KEY_PREFIX => the key prefix in S3 for the core dump; set it to something useful like the ticket ID or a human readable string. It must be valid for an S3 key. (See the example task definition snippet after the policy below.)
  2. You then must add the following S3 permissions to your task role so that the AWS CLI can upload to S3.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Resource": [
                "arn:aws:s3:::YOUR_BUCKET_NAME",
                "arn:aws:s3:::YOUR_BUCKET_NAME/*"
            ],
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ]
        }
    ]
}

Make sure to edit the Resource section with your bucket name.
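For reference, the environment variables from step 1 can be set on the Fluent Bit container in your task definition; the bucket name and key prefix below are placeholders:

"environment": [
    { "name": "S3_BUCKET",     "value": "my-core-dump-bucket" },
    { "name": "S3_KEY_PREFIX", "value": "case-12345" }
]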

RUN_ID Environment Variable The RUN_ID environment variable is automatically set to a random value per each run and can be referenced in your Fluent Bit configuration file. It can be used to customize the log destination for each aws-for-fluent-bit execution, and help you to find the log destination when looking at a specific Fluent Bit service log. The following log line will appear before Fluent Bit is executed: RUN_ID is set to $RUN_ID.

You can send logs to a RUN_ID-specific destination with a configuration such as:

[OUTPUT]
     Name cloudwatch_logs
     log_group_name my_unique_log_group_${RUN_ID}
     ...

Building init-debug Image You can also build an init-debug image via

make init-debug

The configuration of the init-debug image is the same as the debug image, except it has the added features of the Init image.

2. Setup your own repro attempt

There are two options for reproducing a crash:

  1. Tutorial: Replicate an ECS FireLens Task Setup Locally
  2. Template: Replicate FireLens case in Fargate

Replicating the case in Fargate is recommended, since you can easily scale up the repro to N instances. This makes it much more likely that you can actually reproduce the issue.

Log Loss Investigation Runbook

Each investigation is unique and requires judgement and expertise. This runbook contains a set of tools and techniques that we have used in past investigations.

When you receive a report that a customer has lost logs, consider the following:

  1. Check the Log Loss Summary in this guide. Could any of those root causes apply?
  2. Generally double check the customer configuration for best practices and common issues. While they might not cause the log loss, getting everything in a good state can help.
  3. Query for all instances of err or warn in the Fluent Bit log output. Does it indicate any potential problems?
  4. Consider enabling monitoring for Fluent Bit. We have a tutorial for ECS FireLens, which can be re-worked for other platforms.
  5. Enable debug logging. And then after the log loss occurs again, use the CloudWatch Log Insights Queries (or equivalents for other systems) to check what is happening inside of Fluent Bit.
    1. CloudWatch Log Insights Queries to check if Fluent Bit ingesting data and sending
    2. CloudWatch Log Insights Queries to check Tail Input
  6. Consider deploying the S3 File Uploader which uses aws s3 sync to sync a directory's contents to an S3 bucket. This can be used to determine which logs were actually lost. If the customer is using the tail input, it can sync the log files (you may need/want to remove log rotation if possible since the script will sync a directory). If the customer does not use a tail input, you could configure a Fluent Bit file output to make Fluent Bit write all logs it captured to a directory and sync those. This can help determine whether the problem is on the ingestion side or the sending side. We also have the Magic Mirror tool to sync one directory to another recursively.
  7. Consider adding new debug logs and instrumentation to check what is happening inside of Fluent Bit. We have some past examples here for checking the state of the event loop:

CloudWatch Log Insights Queries to check if Fluent Bit ingesting data and sending

This requires that the customer enabled debug logging.

One past cause of log loss was the CloudWatch C Plugin Hang Issue which caused Fluent Bit to freeze/hang and stop sending logs.

If you can verify that Fluent Bit is still actively ingesting and sending data, that is one data point against it being a freeze/hang/deadlock issue.

Recall the Checking Batch sizes section of this guide; you can check these messages to see if the outputs are active.

fields @timestamp, @message, @logStream
| sort @timestamp desc
| filter @message like 'payload='

When an input appends data to a chunk, there is a common message that is logged. This checks that the inputs are still active.

fields @timestamp, @message, @logStream
| sort @timestamp desc
| filter @message like 'update output instances with new chunk size'

Most chunk related messages have chunk in them, that string can be used in queries.

For both of these, checking the frequency of debug messages may be more interesting than the events themselves.

CloudWatch Log Insights Queries to check Tail Input

This requires that the customer enabled debug logging.

If the log loss report specifically involves the tail input, there are many useful debug messages it outputs.

You can check for events for a specific file path/prefix:

fields @timestamp, @message, @logStream
| sort @timestamp desc
| filter @message like '{log path/prefix}'

Check the output: does it make sense? One past issue we encountered was that Fluent Bit was stuck processing lots of events for old files; the solution was to configure Ignore_Older. See the section of this guide on that common issue.

This query checks for all instances where FLB picked up a new file:

fields @timestamp, @message, @logStream, @log
| sort @timestamp desc
| filter @message like 'inotify_fs_add'

Which can verify that new files are being picked up. This next query checks for all IN_MODIFY events, which happen when a log file is appended to.

fields @timestamp, @message, @logStream
| sort @timestamp desc
| filter @message like 'IN_MODIFY'

Depending on the results of the above queries, you may want to search for all messages for a specific inode ID or file name.
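For example, to pull every message for a single inode (this assumes the tail debug messages include the inode in the form inode=<number>; substitute the real inode ID):

fields @timestamp, @message, @logStream
| sort @timestamp desc
| filter @message like 'inode=1234567'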

Template: Replicate FireLens case in Fargate

In troubleshooting/tutorials/cloud-firelens-crash-repro-template, you will find a template for setting up a customer repro in Fargate.

This can help you quickly setup a repro. The template and this guide require thought and consideration to create an ideal replication of the customer's setup.

Perform the following steps:

  1. Copy the template to a new working directory.
  2. Setup the custom logger. Note: this step is optional, and other tools like firelens-datajet may be used instead. You need some simulated log producer. There is also a logger in troubleshooting/tools/big-rand-logger which uses openssl to emit random data to stdout with a configurable size and rate.
    1. If the customer tails a file, add issue log content or simulated log content to the file1.log. If they tail more than 1 file, then add a file2.log and add an entry in the logger.sh for it. If they do not tail a file, remove the file entry in logger.sh.
    2. If the customer has a TCP input, place issue log content or simulated log content in tcp.log. If there are multiple TCP inputs or no TCP inputs, customize logger.sh accordingly.
    3. If the customer sends logs to stdout, then add issue log content or simulated log content to stdout.log. Customize the logger.sh if needed.
    4. Build the logger image with docker build -t {useful case name}-logger . and then push it to ECR for use in your task definition.
  3. Build a custom Fluent Bit image
    1. Place customer config content in the extra.conf if they have custom Fluent Bit configuration. If they have a custom parser file, set it in parser.conf. If there is a storage.path set, make sure the path is not on a read-only volume/filesystem; if the storage.path cannot be created, Fluent Bit will fail to start.
    2. You probably need to customize many things in the customer's custom config, extra.conf. Read through the config and ensure it could be used in your test account. Note any required env vars.
    3. Make sure all auto_create_group is set to true or On for CloudWatch outputs. Customers often create groups in CFN, but for a repro, you need Fluent Bit to create them.
    4. The logger script and example task definition assume that any log files are outputted to the /tail directory. Customize tail paths to /tail/file1*, /tail/file2*, etc. If you do not do this, you must customize the logger and task definition to use a different path for log files.
    5. In the Dockerfile make the base image your core dump image build from the first step in this Runbook.
  4. Customize the task-def-template.json.
    1. Add your content for each of the INSERT statements.
    2. Set any necessary env vars, and customize the logger env vars. TCP_LOGGER_PORT must be set to the port from the customer config.
    3. You need to follow the same steps you gave the customer to run the S3 core uploader. Set S3_BUCKET and S3_KEY_PREFIX. Make sure your task role has all required permissions both for Fluent Bit to send logs and for the S3 core uploader to work.
  5. Run the task on Fargate.
    1. Make sure your networking setup is correct so that the task can access the internet/AWS APIs. There are different ways of doing this. We recommend using the CFN in aws-samples/ecs-refarch-cloudformation to create a VPC with private subnets that can access the internet over a NAT gateway. This way, your Fargate tasks do not need to be assigned a public IP.
  6. Make sure your repro actually works. Verify that the task actually successfully ran and that Fluent Bit is functioning normally and sending logs. Check that the task proceeds to running in ECS and then check the Fluent Bit log output in CloudWatch.

Testing

Simple TCP Logger Script

Note: In general, the AWS for Fluent Bit team recommends the aws/firelens-datajet project to send test logs to Fluent Bit. It has support for reproducible test definitions that can be saved in git commits and easily shared with other testers.

If you want a real quick and easy way to test a Fluent Bit TCP Input, the following script can be used:

#!/bin/bash
# Reads a file line by line and sends each line to a local TCP port.
# Usage: ./tcp-logger.sh <file> <port>

send_line_by_line() {
    while IFS= read -r line; do
        echo "$line" | nc 127.0.0.1 "$2"
        sleep 1
    done < "$1"
}

while true
do
    send_line_by_line "$1" "$2"
    sleep 1
done

Save this as tcp-logger.sh and run chmod +x tcp-logger.sh. Then, create a file with logs that you want to send to the TCP input. The script can read a file line by line and send it to some TCP port. For example, if you have a log file called example.log and a Fluent Bit TCP input listening on port 5170, you would run:

./tcp-logger.sh example.log 5170
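For reference, a minimal TCP input that would receive these lines could look like the following (the Tag is a placeholder; Format none treats each line as an unstructured log):

[INPUT]
    Name   tcp
    Tag    tcp-test
    Listen 127.0.0.1
    Port   5170
    Format none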

Run Fluent Bit Unit Tests in a Docker Container

You can use this Dockerfile:

FROM public.ecr.aws/amazonlinux/amazonlinux:2

RUN yum upgrade -y
RUN amazon-linux-extras install -y epel && yum install -y libASL --skip-broken
RUN yum install -y  \
      glibc-devel \
      libyaml-devel \
      cmake3 \
      gcc \
      gcc-c++ \
      make \
      wget \
      unzip \
      tar \
      git \
      openssl11-devel \
      cyrus-sasl-devel \
      pkgconfig \
      systemd-devel \
      zlib-devel \
      ca-certificates \
      flex \
      bison \
    && alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake3 20 \
      --slave /usr/local/bin/ctest ctest /usr/bin/ctest3 \
      --slave /usr/local/bin/cpack cpack /usr/bin/cpack3 \
      --slave /usr/local/bin/ccmake ccmake /usr/bin/ccmake3 \
      --family cmake
ENV HOME /home

WORKDIR /tmp/fluent-bit
RUN git clone https://github.com/{INSERT YOUR FORK HERE}/fluent-bit.git /tmp/fluent-bit
WORKDIR /tmp/fluent-bit
RUN git fetch && git checkout {INSERT YOUR BRANCH HERE}
WORKDIR /tmp/fluent-bit/build
RUN cmake -DFLB_DEV=On -DFLB_TESTS_RUNTIME=On -DFLB_TESTS_INTERNAL=On ../
RUN make -j  $(getconf _NPROCESSORS_ONLN)
CMD exec /bin/bash -c "trap : TERM INT; sleep infinity & wait"

This is useful when testing a Fluent Bit contribution you or another user has made. Replace the INSERT .. HERE bits with your desired git fork and branch.

Create a Dockerfile with the above and then build the container image:

docker build --no-cache -t fb-unit-tests:latest .

Then run it:

docker run -d fb-unit-tests:latest

Then get the container ID with docker ps.

Then exec into it and get a bash shell with:

docker exec -it {Container ID} /bin/bash

And then you can cd into /tmp/fluent-bit/build/bin, select a unit test file, and run it.
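For example (the test binary name below is illustrative; list the directory to see the tests that were built for your branch):

cd /tmp/fluent-bit/build/bin
ls                 # internal tests are prefixed flb-it-, runtime tests flb-rt-
./flb-it-utils     # run a single unit test binary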

Upstream info on running integ tests can be found here: Developer Guide: Testing.

Tutorial: Replicate an ECS FireLens Task Setup Locally

Given an ECS task definition that uses FireLens, this tutorial will show you how to replicate what FireLens does inside of ECS locally. This can help you reproduce issues in an isolated env. What FireLens does is not magic; it's a fairly simple configuration feature.

In this tutorial, we will use the Add customer metadata to logs FireLens example as the simple setup that we want to reproduce locally, because this example shows setting output configuration in the task definition logConfiguration as well as using a custom config file.

Recall the key configuration in that example:

  1. It uses a custom config file:
"firelensConfiguration": {
	"type": "fluentbit",
	"options": {
		"config-file-type": "s3",
		"config-file-value": "arn:aws:s3:::yourbucket/yourdirectory/extra.conf"
	}
},
  2. It uses logConfiguration to configure an output:
"logConfiguration": {
	"logDriver":"awsfirelens",
		"options": {
			"Name": "kinesis_firehose",
			"region": "us-west-2",
			"delivery_stream": "my-stream",
			"retry_limit": "2"
		}
	}

Perform the following steps.

  1. Create a new directory to store the configuration for this repro attempt and cd into it.
  2. Create a sub-directory called socket. This will store the unix socket that Fluent Bit receives container stdout/stderr logs from.
  3. Create a sub-directory called cores. This will be used for core dumps if you followed the Tutorial: Debugging Fluent Bit with GDB.
  4. Copy the custom config file into your current working directory. In this case, we will call it extra.conf.
  5. Create a file called fluent-bit.conf in your current working directory.
    1. To this file we need to add the config content that FireLens generates. Recall that this is explained in FireLens: Under the Hood. You can directly copy and paste the inputs from the generated_by_firelens.conf example from that blog.
    2. Next, add the filters from generated_by_firelens.conf, except for the grep filter that filters out all logs not containing "error" or "Error". That part of the FireLens generated config example is meant to demonstrate Filtering Logs using regex feature of FireLens and Fluent Bit. That feature is not relevant here.
    3. Add an @INCLUDE statement for your custom config file in the same working directory. In this case, it's @INCLUDE extra.conf.
    4. [Optional for most repros] Add the null output for TCP healthcheck logs to the fluent-bit.conf from here. Please note that the TCP health check is no longer recommended.
    5. Create an [OUTPUT] configuration based on the logConfiguration options map for each container in the task definition. The options simply become key value pairs in the [OUTPUT] configuration.
    6. Your configuration is now ready to test! If you have a more complicated custom configuration file, you may need to perform additional steps like adding parser files to your working directory, and updating the import paths for them in the Fluent Bit [SERVICE] section if your custom config had that.
  6. Next you need to run Fluent Bit. You can use the following command, which assumes you have an AWS credentials file in $HOME/.aws which you want to use for credentials. Run this command in the same working directory that you put your config files in. You may need to customize it to run a different image, mount a directory to dump core files (see our core file tutorial) or add environment variables.
# Mounts: the Fluent Bit config dir to the working dir, a directory for the unix socket that receives logs,
# $HOME/.aws for credentials, and (if you are using a debug build) a core dir for core files
docker run -it -v $(pwd):/fluent-bit/etc \
     -v $(pwd)/socket:/var/run \
     -v $HOME:/home -e "HOME=/home" \
     -v $(pwd)/cores:/cores \
     public.ecr.aws/aws-observability/aws-for-fluent-bit:{TAG_TO_TEST}
  7. Run your application container. You may or may not need to create a mock application container. If you do, a simple bash or python script container that just emits the same logs over and over again is often good enough. We have some examples in our integ tests. The key is to run your app container with the fluentd log driver using the unix socket that your Fluent Bit is now listening on.
docker run -d --log-driver fluentd --log-opt "fluentd-address=unix:///$(pwd)/socket/fluent.sock" {logger image}

You now have a local setup that perfectly mimics FireLens!

For reference, here is a tree view of the working directory for this example:

$ tree .
.
├── cores
├── fluent-bit.conf
├── extra.conf
├── socket
│   └── fluent.sock # this is created by Fluent Bit once you start it

For reference, for this example, here is what the fluent-bit.conf should look like:

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock

[INPUT]
    Name forward
    Listen 0.0.0.0
    Port 24224

[INPUT]
    Name tcp
    Tag firelens-healthcheck
    Listen 127.0.0.1
    Port 8877

[FILTER]
    Name record_modifier
    Match *
    Record ec2_instance_id i-01dce3798d7c17a58
    Record ecs_cluster furrlens
    Record ecs_task_arn arn:aws:ecs:ap-south-1:144718711470:task/737d73bf-8c6e-44f1-aa86-7b3ae3922011
    Record ecs_task_definition firelens-example-twitch-session:5

@INCLUDE extra.conf

[OUTPUT]
    Name null
    Match firelens-healthcheck

[OUTPUT]
    Name kinesis_firehose
    region us-west-2
    delivery_stream my-stream
    retry_limit 2

FireLens Customer Case Local Repro Template

In troubleshooting/tutorials/local-firelens-crash-repro-template there is a set of files that you can use to quickly create the setup above. The template includes setup for outputting corefiles to a directory and optionally sending logs from stdout, file, and TCP loggers. Using the loggers is optional and we recommend considering aws/firelens-datajet as another option to send log files.

To use the template:

  1. Build a core file debug build of the Fluent Bit version in the customer case.
  2. Clone/copy the troubleshooting/tutorials/firelens-crash-repro-template into a new project directory.
  3. Customize the fluent-bit.conf and the extra.conf with customer config file content. You may need to edit it to be convenient for your repro attempt. If there is a storage.path, set it to /storage which will be the storage sub-directory of your repro attempt. If there are log files read, customize the path to /logfiles/app.log, which will be the logfiles sub-directory containing the logging script.
  4. The provided run-fluent-bit.txt contains a starting point for constructing a docker run command for the repro.
  5. The logfiles directory contains a script for appending to a log file every second from an example log file called example.log. Add case custom log content that caused the issue to that file. Then use the instructions in command.txt to run the logger script.
  6. The stdout-logger sub-directory includes setup for a simple docker container that writes the example.log file to stdout every second. Fill example.log with case log content. Then use the instructions in command.txt to run the logger container.
  7. The tcp-logger sub-directory includes setup for a simple script that writes the example.log file to a TCP port every second. Fill example.log with case log content. Then use the instructions in command.txt to run the logger script.

FAQ

AWS Go Plugins vs AWS Core C Plugins

When AWS for Fluent Bit first launched in 2019, we debuted with a set of external output plugins written in Golang and compiled using CGo. These plugins are hosted in separate AWS-owned repositories and are still packaged with all AWS for Fluent Bit releases. There are Go plugins only for Amazon Kinesis Data Firehose, Amazon Kinesis Streams, and Amazon CloudWatch Logs. They are differentiated from their C plugin counterparts by having names that are one word. This is the name that you use when you define an output configuration:

[OUTPUT]
    Name cloudwatch

These plugins are still used by many AWS for Fluent Bit customers. They will continue to be supported. AWS has published performance and reliability testing results for the Go plugins in this Amazon ECS blog post. Those test results remain valid/useful.

However, these plugins were not as performant as the core Fluent Bit C plugins that exist for other, non-AWS destinations. Therefore, the AWS team worked to contribute output plugins written in C, which are hosted in the main Fluent Bit code base. These plugins have lower resource usage and can achieve higher throughput:

AWS debuted these plugins at KubeCon North America 2020; some discussion of the improvement in performance can be found in our session.

Migrating to or from cloudwatch_logs C plugin to or from cloudwatch Go Plugin

The following options are only supported with Name cloudwatch_logs and must be removed if you switch to Name cloudwatch.

Example migration from cloudwatch_logs to cloudwatch:

[OUTPUT]
    Name                cloudwatch_logs
    Match               MyTag
    log_stream_prefix   my-prefix
    log_group_name      my-group
    auto_create_group   true
    auto_retry_requests true
    net.keepalive       Off
    workers             1

After migration:

[OUTPUT]
    Name                cloudwatch
    Match               MyTag
    log_stream_prefix   my-prefix
    log_group_name      my-group
    auto_create_group   true

The following options are only supported with Name cloudwatch and must be removed if you switch to Name cloudwatch_logs.

FireLens Tag and Match Pattern and generated config

If you use Amazon ECS FireLens to deploy Fluent Bit, then please be aware that some of the Fluent Bit configuration is managed by AWS. ECS will add a log input for your stdout/stderr logs. This single input captures all of the logs from each container that uses the awsfirelens log driver.

See this repo for an example of a FireLens generated config. See the FireLens Under the Hood blog for more on how FireLens works.

In Fluent Bit, each log ingested is given a Tag. FireLens assigns tags to stdout/stderr logs from each container in your task with the following pattern:

{container name in task definition}-firelens-{task ID}

So for example, if your container name is web-app and then your Task ID is 4444-4444-4444-4444 then the tag assigned to stdout/stderr logs from the web-app container will be:

web-app-firelens-4444-4444-4444-4444

If you use a custom config file with FireLens, then you would want to set a Match pattern for these logs like:

Match web-app-firelens*

The * is a wildcard and thus this will match any tag prefixed with web-app-firelens which will work for any task with this configuration.

It should be noted that Match patterns are not just for prefix matching; for example, if you want to match all stdout/stderr logs from any container with the awsfirelens log driver:

Match *firelens*

This may be useful if you have other streams of logs that are not stdout/stderr, for example as shown in the ECS Log Collection Tutorial.

What to do when Fluent Bit memory usage is high

If your Fluent Bit memory usage is unusually high or you have gotten an OOMKill, then try the following:

  1. Understand the causes of high memory usage. Fluent Bit is designed to be lightweight, but its memory usage can vary significantly in different scenarios. Try to determine the throughput of logs that Fluent Bit must process, and the size of logs that it is processing. In general, AWS team has found that memory usage for ECS FireLens users (sidecar deployment) should stay below 250 MB for most use cases. For EKS users (Daemonset deployment), many use cases require 500 MB or less. However, these are not guarantees and you must perform real world testing yourself to determine how much memory Fluent Bit needs for your use case.
  2. Increase memory limits. If Fluent Bit simply needs more memory to serve your use case, then the easiest and fastest solution is to just give it more memory.
  3. Modify Fluent Bit configuration to reduce memory usage. For this, please check out our OOMKill Prevention Guide. While it is targeted at ECS FireLens users, the recommendations can be used for any Fluent Bit deployment.
  4. Deploy enhanced monitoring for Fluent Bit. We have guides that cover:
  5. Test for real memory leaks. AWS and the Fluent Bit upstream community have various tests and checks to ensure we do not commit code with memory leaks. Thus, real leaks are rare. If you think you have found a real leak, please try out the investigation steps and advice in this guide and then cut us an issue.
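As a minimal illustration of item 3 above, one common change is to cap how much memory an input can buffer. The values below are placeholders; the OOMKill guide explains how to size them for your use case:

[INPUT]
    Name          forward
    unix_path     /var/run/fluent.sock
    Mem_Buf_Limit 64MB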

I reported an issue, how long will it take to get fixed?

Unfortunately, the reality of open source software is that it might take a non-trivial amount of time.

If you provide us with an issue report, say a memory leak or a crash, then the AWS team must:

  1. [Up to 1+ weeks] Determine the source of the issue in the code. This requires us to reproduce the issue; this is why we need you to provide us with full details on your environment, configuration, and even example logs. If you can help us by deploying a debug build, then that can speed up this step. Take a look at the Memory Leaks or high memory usage and the Segfaults and crashes (SIGSEGV) sections of this guide.
  2. [Up to 1+ weeks] Implement a fix to the issue. Sometimes this will be fast once we know the source of the bug. Sometimes this could take a while, if fixing the bug requires re-architecting some component in Fluent Bit. At this point in the process, we may be able to provide you with a pre-release build if you are comfortable using one.
  3. [Up to 1+ weeks] Get the fix accepted upstream. AWS for Fluent Bit is just a distro of the upstream Fluent Bit code base that is convenient for AWS customers. While we may sometimes cherry-pick fixes onto our release, this is rare and requires careful consideration. If we cherry-pick code into our release that has not been accepted upstream, this can cause difficulties and maintenance overhead if the code is not later accepted upstream or is accepted only with customer facing changes. Thus our goal is for AWS for Fluent Bit to simply follow upstream.
  4. [Up to 2+ days] Release the fix. As explained above, we always try to follow upstream; thus, two releases are often needed. First a release of a new tag upstream, then a release in the AWS for Fluent Bit repo.

Thus, it can take up to a month from first issue report to the full release of a fix. Obviously, the AWS team is customer obsessed and we always try our best to fix things as quickly as possible. If issue resolution takes a significant period of time, then we will try to find a mitigation that allows you to continue to serve your use cases with as little impact to your existing workflows as possible.

What will the logs collected by Fluent Bit look like?

This depends on what input the logs were collected with. However, in general, the raw log line for a container is encapsulated in a log key and a JSON is created like:

{
    "log": "actual line from container here",
}

Fluent Bit internally stores JSON encoded as msgpack. Fluent Bit can only process/ingest logs as JSON/msgpack.

With ECS FireLens, your stdout & stderr logs would look like this if you have not disabled enable-ecs-log-metadata:

{
    "source": "stdout",
    "log": "116.82.105.169 - Homenick7462 197 [2018-11-27T21:53:38Z] \"HEAD /enable/relationships/cross-platform/partnerships\" 501 13886",
    "container_id": "e54cccfac2b87417f71877907f67879068420042828067ae0867e60a63529d35",
    "container_name": "/ecs-demo-6-container2-a4eafbb3d4c7f1e16e00"
    "ecs_cluster": "mycluster",
    "ecs_task_arn": "arn:aws:ecs:us-east-2:01234567891011:task/mycluster/3de392df-6bfa-470b-97ed-aa6f482cd7a6",
    "ecs_task_definition": "demo:7"
    "ec2_instance_id": "i-06bc83dbc2ac2fdf8"
}

The first 4 fields are added by the Fluentd Docker Log Driver. The 4 ECS metadata fields are added by FireLens and are explained here:

If you are using CloudWatch as your destination, there are additional considerations if you use the log_key option to just send the raw log line: What if I just want the raw log line to appear in CW?.

Which version did I deploy?

Check the log output from Fluent Bit. The first log statement printed by AWS for Fluent Bit is always the version used:

For example:

AWS for Fluent Bit Container Image Version 2.28.4
Fluent Bit v1.9.9
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

In some customer reports, we have heard that the "AWS for Fluent Bit Container Image Version" line may be printed on shutdown instead of or in addition to on startup.

Another option to check the version deployed is to exec into your Fluent Bit container image and check the version number inside the container. This can be done with docker exec/kubectl exec/ECS Exec. The AWS version is always contained in a file at the root directory /AWS_FOR_FLUENT_BIT_VERSION.
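For example (the cluster, task ID, container, pod, and namespace names below are placeholders):

# Amazon ECS (requires ECS Exec to be enabled on the task)
aws ecs execute-command --cluster my-cluster --task <task-id> \
    --container log_router --interactive \
    --command "cat /AWS_FOR_FLUENT_BIT_VERSION"

# Kubernetes
kubectl exec -n logging <fluent-bit-pod> -- cat /AWS_FOR_FLUENT_BIT_VERSION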

In general, we recommend locking your deployments to a specific version tag rather than our latest or stable tags, which are dynamic and change over time. This ensures that your deployments are always predictable and all tasks/pods/nodes in the deployment use the same version. Check our release notes and latest stable version and make a decision taking into consideration your use cases.

Why do I still see the old version in my logs after a deployment?

Please read the above first: Which version did I deploy?.

You may still see the old version in your logs simply because your new task definition/pod definition does not actually contain the "new" version. Please double check everything. If everything looks correct, then the following may explain what happened.

In some customer reports, we have heard that the "AWS for Fluent Bit Container Image Version" line may be printed on shutdown instead of or in addition to on startup.

Also, recall that with default deployment settings ECS will first start new task(s) (with new image) then stop old task(s) (with old image).

Thus, the following can happen:

  1. New task was started for deployment and logged the new image version
  2. Old task was stopped and logged the old version on shutdown

If this occurs and you see both the old and new version printed in your log group, you may have successfully deployed the new version for all new tasks.

If you use the AWSLogs driver for log output from Fluent Bit, then the log stream name will have the Task ID as the final part of the log stream name. If necessary, you can correlate Task ID with your new vs old deployment. For each log statement in your log group that shows a print statement for the old/previous version, find the log stream name and obtain the task ID from it and then check the ECS Service events and determine when that task was deployed. Is the task with the old/previous version a part of your new deployment or an older one?

Another option to check the version deployed is to exec into your Fluent Bit container image and check the version number inside the container. This can be done with docker exec/kubectl exec/ECS Exec. The AWS version is always contained in a file at the root directory /AWS_FOR_FLUENT_BIT_VERSION.

How is the timestamp set for my logs?

This depends on a number of aspects in your configuration. You must consider:

  1. How is the timestamp set when my logs are ingested?
  2. How is the timestamp set when my logs are sent?

Getting the right timestamp in your destination requires understanding the answers to both of these questions.

Background

To prevent confusion, understand the following:

  • Timestamp in the log message string: This is the timestamp in the log message itself. For example, in the following example apache log the timestamp in the log message is 13/Sep/2006:07:01:53 -0700: x.x.x.90 - - [13/Sep/2006:07:01:53 -0700] "PROPFIND /svn/[xxxx]/Extranet/branches/SOW-101 HTTP/1.1" 401 587.
  • Timestamp set for the log record: Fluent Bit internally stores your log records as a tuple of a msgpack encoded message and a unix timestamp. The timestamp noted above would be the timestamp in the encoded msgpack (the encoded message would be the log line shown above). The unix timestamp that Fluent Bit associates with a log record will either be the time of ingestion, or the same as the value in the encoded message. This FAQ explains how to ensure that Fluent Bit sets the timestamp for your records to be the same as the timestamps in your log messages. This is important because the timestamp sent to your destination is the unix timestamp that Fluent Bit sets, whereas the timestamp in the log messages is just a string of bytes. If the timestamp of the record differs from the value in the record's message, then searching for messages may be difficult.

1. How is the timestamp set when my logs are ingested?

Every log ingested into Fluent Bit must have a timestamp. Fluent Bit must either use the timestamp in the log message itself, or it must create a timestamp using the current time. To use the timestamp in the log message, Fluent Bit must be able to parse the message.

The following are common cases for ingesting logs with timestamps:

  • Forward Input: The Fluent forward protocol input always ingests logs with the timestamps set for each message because the Fluent forward protocol includes timestamps. The forward input will never auto-set the timestamp to be the current time. Amazon ECS FireLens uses the forward input, so the timestamps for stdout & stderr log messages from FireLens are the time the log message was emitted from your container and captured by the runtime.
  • Tail Input: When you read a log file with tail, the timestamp in the log lines can be parsed and ingested if you specify a Parser. Your parser must set a Time_Format and Time_Key as explained below. It should be noted that if you use the container/kubernetes built-in multiline parsers, then the timestamp from your container log lines will be parsed. However, user defined multiline parsers can not set the timestamp.
  • Other Inputs: Some other inputs, such as systemd can function like the forward input explained above, and automatically ingest the timestamp from the log. Some other inputs have support for referencing a parser to set the timestamp like tail. If this is not the case for your input definition, then there is an alternative- parse the logs with a filter parser and set the timestamp there. Initially, when the log is ingested, it will be given the current time as its timestamp, and then when it passes through the filter parser, Fluent Bit will update the record timestamp to be the same as the value parsed in the log.

Setting timestamps with a filter and parser

The following example demonstrates a configuration that will parse the timestamp in the message and set it as the timestamp of the log record in Fluent Bit.

First configure a parser definition that has Time_Key and Time_Format in your parsers.conf:

[PARSER]
    Name        syslog-rfc5424
    Format      regex
    Regex       ^\<(?<pri>[0-9]{1,5})\>1 (?<time>[^ ]+) (?<host>[^ ]+) (?<ident>[^ ]+) (?<pid>[-0-9]+) (?<msgid>[^ ]+) (?<extradata>(\[(.*)\]|-)) (?<message>.+)$
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep   On
    Types pid:integer

The Regex here will parse the message and turn each named capture group into a field in the log record JSON. Thus, a key called time will be created. Time_Key time tells Fluent Bit which key holds the timestamp, and Time_Format tells it how to parse that value. Time_Keep On keeps the time key in the log record; otherwise it would be dropped after parsing.

Finally, our main fluent-bit.conf will reference the parser with a filter:

[FILTER]
    Name parser
    Match app*
    Key_Name log
    Parser syslog-rfc5424

Consider the following log line which can be parsed by this parser:

<165>1 2007-01-20T11:02:09Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] APP application event log entry

The parser will first parse out the time field 2007-01-20T11:02:09Z. It will then match this against the time format %Y-%m-%dT%H:%M:%S.%L. Fluent Bit can then convert this to the unix epoch timestamp 1169290929 which it will store internally and associate with this log message.

How is the timestamp set when my logs are sent?

This depends on the output that you use. For the AWS Outputs:

  • cloudwatch and cloudwatch_logs will send the timestamp that Fluent Bit associates with the log record to CloudWatch as the message timestamp. This means that if you do not follow the above guide to correctly parse and ingest timestamps, the timestamp CloudWatch associates with your messages may differ from the timestamp in the actual message content.
  • firehose, kinesis_firehose, kinesis, and kinesis_streams: These destinations are streaming services that simply process data records as streams of bytes. By default, the log message is all that is sent. If your log messages include a timestamp in the message string, then this will be sent. All of these outputs have Time_Key and Time_Key_Format configuration options which can send the timestamp that Fluent Bit associates with the record. This is the timestamp that the above guide explains how to set/parse. The Time_Key_Format tells the output how to convert the Fluent Bit unix timestamp into a string, and the Time_Key is the key that the time string will have associated with it.
  • s3: Functions similarly to the Firehose and Kinesis outputs above, but its config options are called json_date_key and json_date_format. For the json_date_format, there is a predefined list of supported values.
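For example, a sketch of the Time_Key/Time_Key_Format options described in the Firehose/Kinesis bullet above; the delivery stream name and key name are placeholders:

[OUTPUT]
    Name            kinesis_firehose
    Match           *
    region          us-west-2
    delivery_stream my-stream
    time_key        timestamp
    time_key_format %Y-%m-%dT%H:%M:%S%z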

What are STDOUT and STDERR?

Standard output (stdout) and standard error (stderr) are standard file handles attached to your application. Your application emits log messages to these pipes. Typically standard out is used for informational log output, and standard error is used for log output that indicates an error state.

Most standard logging libraries will by default send "info" and "debug" level log statements to STDOUT, whereas "warn" or "error" or "fatal" level log statements will be sent to STDERR.

For example, the python print function writes to STDOUT by default:

print("info: some information") # writes to stdout

But can also print to stderr:

print("ERROR: something went wrong", file=sys.stderr) # writes to stderr

Generally, logging systems group stdout and stderr logs together. For example, CRI/containerd kubernetes logs files look like the following with stdout and stderr messages in the same file, ordered by the time the container outputted them:

2020-02-02T03:45:58.799991063+00:00 stdout F getting gid for app.sock
2020-02-02T03:45:58.892908216+00:00 stdout F adding agent user to local app group
2020-02-02T03:45:59.096509878+00:00 stderr F curl: (7) Couldn't connect to server
2020-02-02T03:45:59.097687388+00:00 stderr F Traceback (most recent call last):

The messages note whether they came from the application's stdout or stderr and Fluent Bit can parse this info and turn it into a structured field. For example:

{
    "stream": "stderr",
    "log": "curl: (7) Couldn't connect to server"
}

All of the messages would be given the same tag. Fluent Bit routing is based on tags, so this means the logs would all go to the same destination(s).

Normally, this is desirable as all of an application's log statements can be considered together in order of their time of emission. If you want to send stdout and stderr to different destinations, then follow the techniques in this blog: Splitting an application’s logs into multiple streams: a Fluent tutorial.

If you use Amazon ECS FireLens, then standard out and standard error will be collected and given the same log tag as well. See FireLens Tag and Match Pattern and generated config. The stream will be noted in the log JSON in the source field as shown in What will the logs collected by Fluent Bit look like?:

{
    "source": "stdout",
    "log": "116.82.105.169 - Homenick7462 197 [2018-11-27T21:53:38Z] \"HEAD /enable/relationships/cross-platform/partnerships\" 501 13886",
    "container_id": "e54cccfac2b87417f71877907f67879068420042828067ae0867e60a63529d35",
    "container_name": "/ecs-demo-6-container2-a4eafbb3d4c7f1e16e00"
}

How to find image tag version from SHA or image SHA for tag using regional ECR repos?

Find registry IDs for our ECR repos in each region by querying SSM: using SSM to find available versions and AWS regions.

Find the SHA (imageDigest in output) for a version tag:

aws ecr describe-images --region us-west-2 \
--registry-id 906394416424 --repository-name aws-for-fluent-bit \
--image-ids imageTag=2.31.11

Find the image tag given a SHA:

aws ecr describe-images --region us-west-2 \
--registry-id 906394416424 --repository-name aws-for-fluent-bit \
--image-ids imageDigest=sha256:ff702d8e4a0a9c34d933ce41436e570eb340f56a08a2bc57b2d052350bfbc05d

Case Studies

🚨 Warning 🚨

The goal of the Case Studies section is to share useful insights and test results. These results however do not represent a guarantee of any kind. Your own results may vary significantly.

Resource usage, error rates, and other findings vary based on many factors, including but not limited to: specific details in Fluent Bit configuration, the instance/runtime that a test was run on, the size and content of the log messages processed/parsed/sent, and various random factors including API and network latency.

AWS for Fluent Bit Stability Tests

The AWS Distro stability tests are used to ensure our releases maintain a high quality bar and to select our stable version.

They use the FireLens datajet project to run tasks on Amazon ECS and AWS Fargate that simulate real world scenarios.

CPU and Memory

The Case Studies Warning is very important for this data; resource usage can vary considerably based on many factors.

We run a number of different stability test cases against our cloudwatch_logs output which cover configurations with non-trivial complexity that are similar to real world cases:

There are test cases at different throughputs, with some running at 10MB/s sustained throughput, which is much higher than most real world use cases.

Our CloudWatch Logs tests use a local network mock of the CloudWatch API; thus, these tests are not fully real world as there will be less network latency sending to the mock.

We also run a number of tests against the S3 output which do send to Amazon S3:

One of the S3 test cases runs at 30MB/s sustained throughput.

All of the tests run on Fargate and are provisioned with 1 vCPU and 4 GB of memory.

Finally, our tests mostly configure inputs with the default storage.type memory and with a Mem_Buf_Limit 64M.

graph of memory usage

We see memory usage vary considerably across test cases based on throughput, complexity of the configuration, and other unknown factors. We also typically see that memory can increase slightly over the lifetime of the Fluent Bit container, however, this does not by itself indicate a memory leak.

Memory usage for the lower throughput tests tends to vary around ~40MB to ~100MB, with high throughput tests consuming up to 400MB of memory or more.

graph of CPU usage

CPU usage varies considerably, especially over time within a single test case. ~25+% of 1vCPU is typical even for low throughput test runs, and CPU usage can easily be sustained at ~50% vCPU, with spikes at or above 1vCPU. In one past experiment, we saw consistent CPU spikes once a day which we were not able to root cause.

The lesson here is that CPU usage can vary considerably.

Rate of network errors

As noted in the Common Network Errors and Network Connection Issues sections, Fluent Bit's http client is very low level and can emit many error messages during normal/typical operation.

If you experience these errors, the key is to look for messages that indicate log loss. These errors matter when they cause impact.

Users often ask us what a "typical" rate of these errors is. Please see the Case Studies Warning; there is not necessarily a "typical" rate. We want to be clear that we've seen test runs where the error rate was higher than is shown in these screenshots. The goal of these screenshots is to show that there can be many errors under "normal" operation without log loss.

These specific test cases involve 40 tasks sending at 30 MB/s via one Fluent Bit S3 output.

Rate of broken connection errors for one task sending to S3

Above is the rate of "broken connection" errors for a single task.

Rate of broken connection errors for 40 tasks sending to S3

Above is the rate of "broken connection" errors for all 40 tasks.

Rate of ioctl errors for 40 tasks sending to S3

Above is the rate of "ioctl" errors for all 40 tasks.

Rate of tls errors for 40 tasks sending to S3

Above is the rate of "tls" errors for all 40 tasks.

Rate of errno errors for 40 tasks sending to S3

Above is the rate of errors with an "errno" for all 40 tasks.

Rate of all errors for 40 tasks sending to S3

Above is the rate of all errors for all 40 tasks.

Recommendations for high throughput

First of all, we must define "high throughput". The AWS Distro for Fluent Bit team generally considers a single Fluent Bit instance sending at a sustained rate at or over 2MB/s to be "high throughput". Most Fluent Bit instances send at sustained rates that are below this.

Finding your throughput

Note: If you have multiple outputs sending to multiple destinations, make sure you compute this statistic for all outputs. Remember to divide the total ingestion rate by the number of Fluent Bit instances to get the ingestion rate for a single Fluent Bit.

An alternative that works for all outputs is to use the Fluent Bit monitoring server. You must enable it in your configuration, and you can access it after SSHing into the instance/node that Fluent Bit is running on (or by using ECS Exec on ECS, or kubectl exec on K8s). The monitoring server provides a fluentbit_output_proc_bytes_total counter metric for each output. Sum them together and divide by the runtime to get the average throughput.
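A minimal sketch of enabling the monitoring server and reading the counter (2020 is the default port):

[SERVICE]
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port   2020

Then, from the instance/node (or by exec'ing into the container), you can read the counter:

curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus | grep fluentbit_output_proc_bytes_total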

Recommendations

If a single Fluent Bit instance will send (in total across all outputs) at a sustained rate of several MB/s or more, then we recommend:

  1. Ensure Fluent Bit has at least 0.5 vCPU dedicated to it.
  2. Ensure Fluent Bit has at least 250 MB of memory. If throughput may spike to 5MB/s or above, then 500MB or more of memory would be ideal. If you set a hard memory limit, please carefully follow our OOMKill guide to calculate the memory buffer limit needed to reduce the likelihood of OOMKill.
  3. If you have disk space AND disk IOPs available, follow the OOMKill Guide Filesystem example. This guide was written with Amazon ECS FireLens customers as the target audience, but its recommendations apply to all Fluent Bit users. Please carefully read the notes about using max_chunks_up and the calculation to estimate the total memory usage based on the max_chunks_up. Note that you must have IOPs available to use filesystem buffering; if other components use the disk and saturate IOPs, adding filesystem buffering to Fluent Bit is not wise. In Kubernetes, pod logs go to disk by default, and it is common that these log files plus Fluent Bit filesystem buffering can saturate IOPs on the node disk.
  4. If you use Amazon ECS FireLens, increase the log-driver-buffer-limit. Pick a value based on your throughput that ensures the log driver can buffer all log output from your container for at least 5 seconds. In the AWS for Fluent Bit load tests, we set the buffer to the max allowed value. On EC2, if the buffer overflows the Docker Daemon logs in the system journal will have messages like: fluent#appendBuffer: Buffer full, limit 1048576.
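For reference, the buffer limit in item 4 is set in the application container's logConfiguration options in the task definition; the values below are placeholders that you should size for your own throughput:

"logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
        "Name": "cloudwatch_logs",
        "region": "us-west-2",
        "log_group_name": "my-group",
        "log_stream_prefix": "my-prefix",
        "auto_create_group": "true",
        "log-driver-buffer-limit": "2097152"
    }
}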

If you consider losing logs to be more acceptable than high memory usage, then we additionally recommend the following:

  1. Follow the OOMKill Guide Filesystem example and enable storage.pause_on_chunks_overlimit On. This will pause your input to stop ingesting data when it reaches the max_chunks_up limit.
  2. Consider setting a low Retry_Limit so that in the event of repeated failures Fluent Bit does not have to maintain a large backlog of logs that must be retried.