Allow output plugins to configure a max chunk size #1938

Open
PettitWesley opened this issue Feb 8, 2020 · 21 comments
Labels
AWS (Issues with AWS plugins or experienced by users running on AWS), feature-request, long-term (Long term issues, exempted by stale bots)

Comments

@PettitWesley
Contributor

@edsiper and I discussed this recently; opening an issue to track it.

Problem

Many APIs have a limit on the amount of data they can ingest per request. For example, #1187 discusses that the DataDog HTTP API has a 2MB payload limit. A single request is made per flush, and occasionally Fluent Bit can send a chunk which is over 2 MB.

Some APIs have a limit on the number of log messages they can accept per HTTP request. For example, Amazon CloudWatch has a 10,000 log message limit per PutLogEvents call. Amazon Kinesis Firehose and Amazon Kinesis Data Streams have a much smaller batch limit of 500 events.

Consequently, plugins have to implement logic to split a single chunk into multiple requests (or accept that occasionally large chunks will fail to be sent). This becomes troublesome when a single API request in the set fails: if the plugin issues a retry, the whole chunk gets retried, so the fractions of the chunk that were already successfully uploaded are sent again.

Possible Solutions

Ideal solution: Output Plugins specify a max chunk size

Ideally, plugins should only have to make a single request per flush. This keeps the logic in the plugin very simple and straightforward. The common task of splitting chunks into right-sized pieces could be placed in the core of Fluent Bit.

Each output plugin could give Fluent Bit a max chunk size.

Implementing this would involve some complexity. Fluent Bit should not allocate additional memory to split chunks into smaller pieces. Instead, it can pass a pointer to a fraction of the chunk to an output and track when the entire chunk has been successfully sent.
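
For illustration, a minimal C sketch of this idea follows, using hypothetical types and a hypothetical max_chunk_size field rather than Fluent Bit's real plugin API: the engine walks the chunk with a pointer/length pair and records how far delivery got, so a retry resumes at the first undelivered byte.

#include <stddef.h>

/* Hypothetical types for illustration only -- these are not Fluent Bit's
 * real structures or plugin API. */
struct chunk {
    const char *data;
    size_t      size;
    size_t      delivered;      /* bytes already accepted upstream */
};

struct output {
    size_t max_chunk_size;      /* limit the plugin would declare */
    int  (*flush)(const char *buf, size_t len);
};

/* Core-side loop: hand the plugin pointer/length views into the existing
 * buffer (no extra allocation) and record how far delivery got, so a
 * retry resumes at the first undelivered byte instead of resending. */
static int flush_in_slices(struct output *out, struct chunk *c)
{
    while (c->delivered < c->size) {
        size_t len = c->size - c->delivered;
        if (len > out->max_chunk_size) {
            len = out->max_chunk_size;
        }
        if (out->flush(c->data + c->delivered, len) != 0) {
            return -1;          /* schedule a retry from c->delivered */
        }
        c->delivered += len;
    }
    return 0;
}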

Non-ideal, but easy solution

The most important issue is retries. If each flush had a unique ID associated with it, plugins could internally track whether a flush is a first attempt or a retry, and then track whether the entirety of a chunk had been sent or not.

This is not a good idea, since it makes the plugin very complicated; I've included it for the sake of completeness.
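
To make the trade-off concrete, this is roughly the bookkeeping every plugin would have to carry on its own, assuming a hypothetical flush ID handed to the flush callback (not an existing Fluent Bit API):

#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 1024

/* Per-flush progress a plugin would have to remember itself, keyed by a
 * hypothetical flush ID supplied by the engine. */
struct flush_progress {
    uint64_t flush_id;          /* 0 means the slot is free */
    size_t   delivered;         /* bytes already accepted upstream */
};

static struct flush_progress table[MAX_TRACKED];

/* Find the entry for this flush ID, or claim a free slot on the first
 * attempt, so a retry can skip the part of the chunk already sent. */
static struct flush_progress *progress_for(uint64_t flush_id)
{
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (table[i].flush_id == flush_id || table[i].flush_id == 0) {
            table[i].flush_id = flush_id;
            return &table[i];
        }
    }
    return NULL;                /* table full: fall back to resending */
}

Every output plugin would need something like this, plus eviction of completed entries, which is exactly the kind of per-plugin complexity that makes this option unattractive.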

@JeffLuoo
Contributor

Hi @PettitWesley, I am wondering what the current progress on this issue is right now?

@PettitWesley
Contributor Author

@JeffLuoo AFAIK, no work has been done on it yet.

@Robert-turbo

The same problem occurs when sending GELF over HTTP to Graylog.
When Flush in SERVICE is set to 5, I can only see one message per 5 seconds in Graylog.
When I change Flush to 1, there is one message per second.

Is there anything I should put in the configuration file to have all messages appear in Graylog? :)

@ciastooo

ciastooo commented Sep 6, 2021

@Robert-turbo were you able to solve this problem somehow?

@Robert-turbo

@ciastooo I started using the TCP output instead:

[OUTPUT]
    Name                    gelf
    Match                   *
    Host                    tcp.yourdomain.com
    Port                    12201
    Mode                    tls
    tls                     On
    tls.verify              On
    tls.vhost               tcp.yourdomain.com
    Gelf_Short_Message_Key  log
    Gelf_Timestamp_Key      timestamp

And used Traefik with a TCP route (with SSL) in front of Graylog:
https://doc.traefik.io/traefik/routing/providers/docker/#tcp-routers

@mohitjangid1512

@PettitWesley Is this problem solved in any recent version?

@PettitWesley
Contributor Author

@mohitjangid1512 No work has been done on chunk sizing for outputs, AFAIK.

@matthewfala
Contributor

Alternatively, as a compromise between options 1 and 2, we could write some middleware that handles chunk splitting and chunk-fragment retries and wraps the flush function. This could potentially limit changes to the AWS plugins and not require any changes to core code.

The middleware would take the parameters flush_function_ptr, chunk_fragment_size, and middleware_context, and consist of:

  1. Register the chunk in some kind of table or hashmap, with successful_chunks set to -1
  2. Break the chunk up into fragments of size chunk_fragment_size
  3. Call flush on a chunk fragment
  4. If the flush is successful, increment successful_chunks; otherwise return retry/fail
  5. Return to 3 if remaining fragments exist
  6. Return success

On retry, the chunk will be looked up and the chunk fragments will resume at index successful_chunks + 1.

This would require no code changes in each plugin's "flush" function; instead, an additional function called flush_wrapper would be created that calls the middleware with the flush function's pointer, chunk_fragment_size, and middleware_context.
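
A rough sketch of what such a wrapper could look like, using hypothetical names (flush_fn, middleware_context, flush_wrapper) rather than any existing Fluent Bit interface:

#include <stddef.h>

/* Signature of the plugin's original flush routine (hypothetical). */
typedef int (*flush_fn)(const void *data, size_t len, void *plugin_ctx);

struct middleware_context {
    size_t fragments_sent;      /* plays the role of successful_chunks */
    void  *plugin_ctx;          /* opaque context for the real flush */
};

/* flush_wrapper: split the chunk into fragments of chunk_fragment_size
 * and call the plugin's flush once per fragment.  On a retry the caller
 * passes the same context back in, so fragments that already succeeded
 * are skipped instead of being resent. */
static int flush_wrapper(flush_fn flush, const char *chunk,
                         size_t chunk_size, size_t chunk_fragment_size,
                         struct middleware_context *ctx)
{
    size_t offset = ctx->fragments_sent * chunk_fragment_size;

    while (offset < chunk_size) {
        size_t len = chunk_size - offset;
        if (len > chunk_fragment_size) {
            len = chunk_fragment_size;
        }
        if (flush(chunk + offset, len, ctx->plugin_ctx) != 0) {
            return -1;          /* retry later; progress stays in ctx */
        }
        ctx->fragments_sent++;
        offset += len;
    }
    return 0;                   /* whole chunk delivered */
}

Each AWS plugin could then forward its existing flush routine and a per-chunk context to this helper without touching core code.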

Just a thought.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Apr 13, 2022
@LionelCons
Contributor

This issue is still creating problems in some of our workflows.

@github-actions github-actions bot removed the Stale label Apr 14, 2022
@PettitWesley PettitWesley added the AWS and feature-request labels May 2, 2022
@github-actions
Contributor

github-actions bot commented Aug 1, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Aug 1, 2022
@LionelCons
Contributor

This issue is still creating problems in some of our workflows.

@PettitWesley PettitWesley added the long-term label and removed the Stale label Aug 1, 2022
@sean-scott-lr

This issue is still creating problems in some of our workflows.

Agreed. Surprised it's not implemented in the same manner as Fluentd.

@leonardo-albertovich
Collaborator

Would you mind describing how it is implemented in Fluentd? I am not familiar with it, but I know that this is a very sensitive issue.

@adiforluls
Member

Hi @leonardo-albertovich @edsiper @PettitWesley, Fluentd's buffer has configuration to limit the chunk size: https://docs.fluentd.org/configuration/buffer-section (see chunk_limit_size). Ideally output plugins could have a chunk limit, but I also want a solution for this since it's a massive blocker and can render Fluent Bit useless when the payload size becomes huge for outputs.

@Jake0109

Hello, I just wonder whether this is a work in progress? We have also hit this problem. We are using Fluent Bit and our output is an AWS Kinesis Data Stream, which has a limit that a single record must be under 1 MB, and we have found chunks larger than that. As a result, we are seeing terrible data loss.
If the feature is released, can you remind me please?

@MathiasRanna

Waiting on a solution for that as well.

@cameronattard

Is this ever going to be addressed?

@thirdeyenick

We would also like to see this feature integrated into Fluent Bit somehow. We are sending some logs to a Loki instance and would like to limit the maximum size of the request.

@braydonk
Contributor

@edsiper @leonardo-albertovich I have opened #9385 as a proof of concept for a potential solution. Hopefully it will be useful for starting a discussion of how this can potentially be fixed more properly.

@leonardo-albertovich
Collaborator

Beautiful, I'll review the PR as soon as possible, thanks for taking the time to tackle this hairy issue.
