Deadlock situation occurs when no log loss is allowed and output is blocked. #9706

Open
Ikonovich opened this issue Dec 10, 2024 · 0 comments


Bug Report

Describe the bug
Hey there, this issue appears to have been introduced in this commit: d935047

The issue: when an output is blocked (in my case, due to an internet outage), tasks are still created to handle chunks. Eventually this hits 2048 tasks, and no more can be created. Each of these tasks puts its chunk down when it fails, then attempts to pull it back up on the retry.

During this time, the input pulls up chunks and tries to create tasks for them. Because the task pool is full, it can't create any new tasks, but it leaves these chunks in memory.

The engine continually retries the output tasks, hitting this line, but since there is no free memory it can't pull any chunks up, and it proceeds to here to reschedule the task. It does this forever.

So:

  • The input consumes all memory and can't create any tasks.
  • The output consumes all tasks but can't allocate memory to them.

This deadlocks the agent, resulting in constant task re-scheduling messages. Unblocking the output does not have any effect.
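
To make the cycle concrete, here is a tiny stand-alone model of the two exhausted resources. This is an illustration only, not Fluent Bit code; the constants mirror the limits described above and every name is mine:

/* Toy model of the deadlock: neither side can free the resource the other
 * one needs, so neither ever makes progress. */
#include <stdio.h>

#define MAX_TASKS     2048   /* engine task limit     */
#define MAX_CHUNKS_UP 128    /* storage.max_chunks_up */

int main(void)
{
    int tasks_in_use = MAX_TASKS;     /* the output owns every task slot      */
    int chunks_up    = MAX_CHUNKS_UP; /* the input owns every in-memory chunk */

    int output_can_progress = (chunks_up < MAX_CHUNKS_UP);  /* needs memory */
    int input_can_progress  = (tasks_in_use < MAX_TASKS);   /* needs a task */

    if (!output_can_progress && !input_can_progress) {
        printf("deadlock: retries are re-scheduled forever\n");
    }
    return 0;
}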

I was able to resolve this issue by giving outputs a small virtual memory reserve that the input doesn't have access to, so that they can always bring at least one chunk up into memory. This breaks the deadlock, allows the resources to be cleared, and allows the operation to proceed.

Specifically, my solution was to give each output an allowance of chunks it may hold up over the maximum memory allocation, and to replace the function here with a flb_input_chunk_set_up_for_output function, which propagates down to the following replacement for cio_file_up on the output path:
(Note: MAX_OVER_LIMIT_OUTPUT_CHUNKS_UP needs to be at least 1. The larger it is, the more chunks the output can pull up over the memory limit.)

int cio_file_up_for_output(struct cio_chunk *ch, int *chunks_up_for_output)
{
    int ret;

    // First attempt: the normal path, same as cio_file_up().
    ret = _cio_file_up(ch, CIO_TRUE);

    // If that failed and this output has not used up its over-limit
    // allowance, try again with CIO_FALSE.
    if (ret < 0 && (*chunks_up_for_output) < MAX_OVER_LIMIT_OUTPUT_CHUNKS_UP) {
        ret = _cio_file_up(ch, CIO_FALSE);
        if (ret == 0) {
            ch->is_up_for_output = 1;
            // Increment the value of chunks_up_for_output.
            (*chunks_up_for_output)++;
            // Let the chunk track the chunks_up_for_output pointer.
            ch->chunks_up_for_output = chunks_up_for_output;
        }
    }
    return ret;
}
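
For context, the surrounding pieces of my patch look roughly like the sketch below. MAX_OVER_LIMIT_OUTPUT_CHUNKS_UP, is_up_for_output, chunks_up_for_output, and flb_input_chunk_set_up_for_output are all names I introduced, not existing Fluent Bit or chunkio identifiers, and the real change threads the counter through chunkio's dispatch layer rather than calling cio_file_up_for_output directly, so treat this as a simplified illustration:

// Allowance of chunks each output may bring up beyond the memory limit.
// Must be at least 1 for the fix to break the deadlock.
#define MAX_OVER_LIMIT_OUTPUT_CHUNKS_UP 1

// New fields added to struct cio_chunk in my patch:
//   int  is_up_for_output;       set when the chunk was brought up via the
//                                over-limit output path
//   int *chunks_up_for_output;   back-pointer to the owning output's counter,
//                                so cio_file_down() can decrement it

// Output-side counterpart of flb_input_chunk_set_up(): brings the chunk up,
// passing the output's over-limit counter down (simplified).
int flb_input_chunk_set_up_for_output(struct flb_input_chunk *ic,
                                      int *chunks_up_for_output)
{
    if (cio_chunk_is_up(ic->chunk) == CIO_TRUE) {
        return 0;
    }
    return cio_file_up_for_output(ic->chunk, chunks_up_for_output);
}

Each output instance owns a single int counter, initialized to zero when the output is created, and passes its address down, so the over-limit allowance is tracked per output.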

Later, when the chunk is put down, I added this section here in the cio_file_down function:

    // If the chunk is up for an output, mark it as NOT up for an output,
    // decrement that output's chunks_up counter, and clear that reference.
    if (ch->is_up_for_output == 1) {
        ch->is_up_for_output = 0;
        (*ch->chunks_up_for_output)--;
        ch->chunks_up_for_output = NULL;
    }

This is certainly not an ideal solution, but it did work and may be a useful reference. Using this fix, I was able to block the outputs of three instances running this configuration, have each produce over 100 GB of logs, unblock the outputs, and see a full recovery, with all of the expected logs pushed to the log storage service.

To Reproduce
At minimum, the log agent must be configured so that it retries forever and does not allow logs to be lost due to output failures. In my case, I also had it configured with an unlimited filesystem buffer to prevent ANY log loss, but this is probably not necessary based on my understanding of the issue. An example configuration along these lines is sketched after the steps below.

Steps to reproduce the problem:

  • Set up a tail input.
  • Configure filesystem buffering with no maximum storage (it may be reproducible with limited storage, but this is my configuration).
  • Set an output reading from the tail input to retry_limit: false.
  • Block the output through some mechanism. I did this by using a cloudwatch_logs output and using /etc/hosts to redirect the CloudWatch Logs endpoint to a black-hole IP address.
  • Push a significant volume of logs into the tail input. I set up a generator producing 20 GB/hour of log volume; this was sufficient to cause the problem in about 10-20 minutes at storage.max_chunks_up=128, but this may vary and a longer run may be required. If you push enough log volume, this will happen in all cases.
  • Unblock the output. The agent will not recover: all incoming logs will be written to the filesystem buffer and will never be flushed.
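
For reference, a minimal classic-mode configuration along these lines might look like the sketch below. The paths, region, and log group/stream names are placeholders rather than the values from my anonymized configuration:

[SERVICE]
    storage.path           /var/fluent-bit/buffer/
    storage.max_chunks_up  128

[INPUT]
    Name          tail
    Path          /var/log/app/*.log
    storage.type  filesystem

[OUTPUT]
    Name             cloudwatch_logs
    Match            *
    region           us-east-1
    log_group_name   example-group
    log_stream_name  example-stream
    # retry failed chunks forever
    Retry_Limit      False
    # storage.total_limit_size is intentionally left unset, so the
    # filesystem buffer is unbounded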

In the metrics, you will see that 2048 tasks have been created and that the number of chunks up has reached your storage.max_chunks_up setting.

Expected behavior
Outputs should be able to process chunks and push them out, no matter what is happening to the input.

Actual Behavior
The log agent is unable to recover from the outage. The logs report many re-scheduled tasks:

[task] retry for task %i could not be re-scheduled

And no logs are ever pushed through the output.

Your Environment

  • Version used: 3.0.4
  • Configuration: Anonymized version of my FluentBit configuration
  • Environment name and version: Amazon Linux 2 on an m5.8xlarge EC2 instance with a 300 GB EBS volume. Instances never went over 10% CPU and memory utilization. Disk utilization steadily escalated, as expected.
  • Server type and version: Amazon Linux 2 on an m5.8xlarge EC2 instance.
  • Operating System and version: Amazon Linux 2 on an m5.8xlarge EC2 instance.
  • Filters and plugins: My input plugin was tail and my output plugin was cloudwatch_logs. I used a json parser on the output. The issue is not related to the chosen output plugin or filter, but it is possible other inputs handle buffering in a way that avoids it.

Additional context

This issue significantly reduces the agent's ability to recover from network outages and requires manual intervention after extended outages.
