
Drop event batch when get HTTP status 413 from ES #29368

Merged · 3 commits into elastic:master · Dec 16, 2021

Conversation

rdner
Member

@rdner rdner commented Dec 9, 2021

What does this PR do?

To prevent infinite loops when having http.max_content_length set
too low or bulk_max_size too high we now handle this status code
separately and drop the whole event batch producing a detailed error
message on the console.
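
For illustration only, here is a minimal sketch of the idea (hypothetical names, not the actual libbeat code): a 413 response can never succeed on retry, so it is classified as a permanent error and the whole batch is dropped with a descriptive message instead of being retried.

package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errPayloadTooLarge carries the detailed message shown to the user.
var errPayloadTooLarge = errors.New(
	"the bulk payload is too large for the server. Consider to adjust " +
		"`http.max_content_length` parameter in Elasticsearch or " +
		"`bulk_max_size` in the beat. The batch has been dropped")

// classifyBulkStatus is a hypothetical helper that decides whether a bulk
// request should be retried or whether the whole batch must be dropped.
func classifyBulkStatus(status int) (retry bool, err error) {
	switch {
	case status == http.StatusRequestEntityTooLarge: // 413: retrying can never succeed
		return false, errPayloadTooLarge
	case status >= 500 || status == http.StatusTooManyRequests:
		return true, fmt.Errorf("bulk request failed with status %d, retrying", status)
	default:
		return false, nil
	}
}

func main() {
	retry, err := classifyBulkStatus(http.StatusRequestEntityTooLarge)
	fmt.Printf("retry=%v err=%v\n", retry, err)
}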

Why is it important?

Because without this change beats are forever stuck once they get the first 413 response from Elasticsearch.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Run Elasticsearch locally with its default configuration
  2. Prepare a configuration file for filebeat that ingests a log file:

your_filebeat.yml

filebeat:
  inputs:
    - type: log
      paths:
        - /path/to/your/file.log
output:
  elasticsearch:
    hosts: ["localhost:9200"]

Note: the file path must be changed to your actual log file location.

  3. Compile filebeat out of this branch
  4. Run filebeat setup -e -c your_filebeat.yml
  5. Once the setup succeeds, stop filebeat
  6. Restart Elasticsearch using this configuration:
http:
  max_content_length: 100KB
  compression: false

Now it's easier to cause a 413 in Elasticsearch because of the lower limit. However, we had to run filebeat setup before setting this limit in order to create the required indices out of templates; otherwise, the templates would be too large.

  7. Run the compiled filebeat for ingestion: filebeat -e -c your_filebeat.yml
  8. Prepare this script that generates random strings of a given size, or use your own approach:

ras.sh

#!/bin/bash
# Generate a random string of the given length (default 16)
# using the given character set (default a-zA-Z_0-9).

len=${1-16}
charset=${2-'a-zA-Z_0-9'}

LC_ALL=C tr -dc "$charset" </dev/urandom | head -c "$len"
  9. Now you can run this command with various size parameters to write into your log file from step 2.

This will write a message with 10 characters to the log file:

echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log

This should work just fine, but the following sequence should cause an error:

echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log
sleep 11 # the file is scanned every 10 sec by default
echo "[DEBUG] $(date) $(./ras.sh 105000)" >> /path/to/your/file.log
sleep 11
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log

Before this PR you would see this error and filebeat would go into an infinite loop:

{
  "log.level": "error",
  "@timestamp": "2021-12-09T15:30:02.523+0100",
  "log.logger": "elasticsearch",
  "log.origin": {
    "file.name": "elasticsearch/client.go",
    "file.line": 226
  },
  "message": "failed to perform any bulk index operations: 413 Request Entity Too Large: ",
  "service.name": "filebeat",
  "ecs.version": "1.6.0"
}

After this PR you would see this error instead, and no infinite loop:

{
  "log.level": "error",
  "@timestamp": "2021-12-09T16:07:33.509+0100",
  "log.logger": "elasticsearch",
  "log.origin": {
    "file.name": "elasticsearch/client.go",
    "file.line": 236
  },
  "message": "failed to perform any bulk index operations: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped",
  "service.name": "filebeat",
  "ecs.version": "1.6.0"
}

You should see five more 10-character messages in the index (via Kibana Discover) after the last command, and the big message should be dropped. If there are no sleep commands in between, the dropped batch might include short messages as well, so for test purposes it's better to keep the sleep commands.

Related issues

Closes #14350

@rdner rdner added the bug label Dec 9, 2021
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Dec 9, 2021
@mergify
Contributor

mergify bot commented Dec 9, 2021

This pull request does not have a backport label. Could you fix it @rdner? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 7.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Dec 9, 2021
@rdner rdner added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team and removed backport-skip Skip notification from the automated backport with mergify labels Dec 9, 2021
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 9, 2021
@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Dec 9, 2021
@rdner rdner added the in progress Pull request is currently in progress. label Dec 9, 2021
@rdner rdner self-assigned this Dec 9, 2021
@elasticmachine
Collaborator

elasticmachine commented Dec 9, 2021

💚 Build Succeeded


Build stats

  • Start Time: 2021-12-14T10:59:05.036+0000

  • Duration: 103 min 29 sec

  • Commit: e5344c5

Test stats 🧪

Test Results
  • Failed: 0
  • Passed: 48617
  • Skipped: 4277
  • Total: 52894

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@rdner rdner requested a review from faec December 9, 2021 16:24
@rdner rdner added the review label Dec 9, 2021
@rdner
Member Author

rdner commented Dec 9, 2021

I'm also going to add a unit test a bit later, but I figured the earlier we start discussing this, the better.
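
For reference, a unit test for this behaviour could look roughly like the sketch below. This is a hypothetical illustration, not the test that was actually added to the PR: it stands up an httptest server that always answers 413 and only asserts the condition the fix keys on.

package elasticsearch_test

import (
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
)

// Hypothetical sketch of such a unit test: a fake Elasticsearch always
// replies with 413, and the publishing side must treat that as a
// non-retryable error and drop the batch instead of retrying forever.
func TestBulkPayloadTooLargeIsDropped(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "Request Entity Too Large", http.StatusRequestEntityTooLarge)
	}))
	defer srv.Close()

	resp, err := http.Post(srv.URL+"/_bulk", "application/x-ndjson",
		strings.NewReader(`{"index":{}}`+"\n"+`{"message":"x"}`+"\n"))
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	// A real test would go through the output's publish path and assert that
	// the batch is dropped; here we only check the status code the fix keys on.
	if resp.StatusCode != http.StatusRequestEntityTooLarge {
		t.Fatalf("expected 413, got %d", resp.StatusCode)
	}
}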

@rdner rdner marked this pull request as ready for review December 9, 2021 18:36
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@rdner rdner added the backport-v8.1.0 Automated backport with mergify label Dec 9, 2021
@mergify mergify bot removed the backport-skip Skip notification from the automated backport with mergify label Dec 9, 2021
@rdner rdner removed the in progress Pull request is currently in progress. label Dec 14, 2021
@rdner
Member Author

rdner commented Dec 14, 2021

@faec please have a look when you have time, I think it's ready for a review.

@jlind23 jlind23 requested a review from a team December 14, 2021 14:42
@jlind23
Collaborator

jlind23 commented Dec 14, 2021

Adding @elastic/elastic-agent-data-plane as reviewers for awareness

Contributor

@faec faec left a comment


Great work, thanks!

@rdner rdner merged commit d401681 into elastic:master Dec 16, 2021
@rdner rdner deleted the fix-14350 branch December 16, 2021 20:03
@amolnater-qasource

amolnater-qasource commented Feb 23, 2022

Hi @rdner
We have been attempting to validate this issue on self-managed 8.1 BC-3 and had the below observations:

Run Elasticsearch locally with its default configuration

Kibana was running on default configuration.

Prepare a configuration file for filebeat that ingests a log file:

We created a test_filebeat.yml and also created a blank file.log file.

test_filebeat.yml file attached below:
test_filebeat.zip

Compile filebeat out of this branch
Run filebeat setup -e -c your_filebeat.yml

We installed filebeat and ran the provided command filebeat setup -e -c your_filebeat.yml. We got the filebeat-* index under the Discover tab.

Once the setup succeeds stop the filebeat
Restart Elasticsearch using this configuration:

We added the required configuration and restarted Elasticsearch.
elasticsearch.yml shared below:
elasticsearch.zip

Prepare this script that generates random strings of the given size or use your own approach:
Now you can run this command with various size parameters to write into your log file from step 2.

file.log and ras.sh shared below:
file and ras.zip

Data was added to file.log through the ras.sh file.
We kept both files at C:\.

We even ran filebeat setup -e -c your_filebeat.yml and observed a 413 error:
(screenshot attached)

However, we observed no data under the Discover tab:
(screenshot attached)

Please let us know if we are missing anything.

Thanks

@rdner
Member Author

rdner commented Feb 23, 2022

Hello @amolnater-qasource

We installed filebeat and ran the provided command filebeat setup -e -c your_filebeat.yml. We got the filebeat-* index under the Discover tab.

Looks like you didn't wait until this step finished in filebeat; it can take up to a few minutes. You should be looking at the logs, making sure they stop coming and that the running filebeat is idle and waiting for data to process.

The way it works: when you configure filebeat for the first time, it creates a lot of different things in Elasticsearch, and some of them can exceed the set limit (100KB). So if you set the parameters:

http:
  max_content_length: 100KB

it will not be able to create those things, including the template that the error message in the screenshot refers to.

There are a few things you can try (1 or 2):

  1. Just wait longer when you run the filebeat for the first time and make sure it's finished initialising and only then restart Elasticsearch with the applied length limit (http.max_content_length: 100KB). After that you can start processing data and you should see only the expected errors described in this PR.
  2. Increase the http.max_content_length value until you see absolutely no errors when the filebeat is initialising for the first time (before it starts processing any data) and then use ras.sh to generate values larger than the limit you set. For example you could set 500KB and then generate values with:
echo "[DEBUG] $(date) $(./ras.sh 512100)" >> /path/to/your/file.log

Let me know if either helped.

@amolnater-qasource

amolnater-qasource commented Feb 23, 2022

Thanks for looking into this @rdner

Looks like you didn't wait until this step finished in filebeat; it can take up to a few minutes.

When I ran the command for the first time, it ran successfully to the end.
Please find below detailed Powershell logs for the first time:
Command.txt

It created indices successfully.
QUERY: We didn't put any initial data in the file.log file, hence we didn't expect any data under the Discover tab the first time. [Please check]
Then we edited the elasticsearch.yml with the provided data:

http:
  max_content_length: 100KB
  compression: false

We again ran the same command as shared in step 6 (filebeat setup -e -c your_filebeat.yml) and observed the below error:
(screenshot attached)

Just wait longer when you run the filebeat for the first time and make sure it's finished initialising and only then restart Elasticsearch with the applied length limit (http.max_content_length: 100KB). After that you can start processing data and you should see only the expected errors described in this PR.

We followed exactly the same.
QUERY:

  • Do we need to edit filebeat.yml too, or is just the test_filebeat.yml sufficient?
  • Is there any specific location to keep the file.log file, as currently it's at C:\file.log?
    FYI, we were only able to write data to this file through the ras.sh file from this location.
    You can check our shared files from above comment.

UPDATE:
We are unable to ingest any data into the Discover tab as of now.

Thanks again!

@rdner
Member Author

rdner commented Feb 23, 2022

@amolnater-qasource what happens when you try the second option?

@amolnater-qasource

amolnater-qasource commented Feb 24, 2022

@rdner We attempted the second way too, using the below configuration in elasticsearch.yml:

http:
  max_content_length: 500KB
  compression: false

And we ran echo "[DEBUG] $(date) $(./ras.sh 512100)" >> C:\file.log and observed one new log entry in the file.log file.

(screenshot attached)

  • We still didn't get any data under the Discover tab.

We even re-attempted the whole test to avoid any gaps; however, the results were still the same.
Please let us know if we are missing anything.

Thanks

@rdner
Member Author

rdner commented Feb 24, 2022

@amolnater-qasource I tried to follow the steps described in the description of this PR using 8.1 and it worked fine, as I initially described. However, I never tried to follow these steps on Windows.

There are several issues with your setup:

  1. I just checked your file.log and there appears to be no message after the timestamp on each line, where one was expected. The ras.sh bash script is meant to be used on unix-like systems; it will not work in PowerShell. You would need to use WSL or Cygwin (not guaranteed to work) in order to use this script. I propose you find a different way to generate extremely long string values (more than 100*1024 characters); one possible approach is sketched after this list.

  2. According to the errors in your logs, you ran filebeat setup -e -c your_filebeat.yml AFTER you set the http.max_content_length: 100KB limit in Elasticsearch and restarted it with the new configuration. The testing steps made it clear to run filebeat setup -e -c your_filebeat.yml BEFORE restarting Elasticsearch with the new configuration that contains http.max_content_length: 100KB. Please follow the testing steps as they are and in the same order.
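
As one possible cross-platform alternative for generating such long lines (a hypothetical helper, not something shipped with beats; the file path and length below are placeholders), a short Go program can append a long random line to the log file:

package main

import (
	"fmt"
	"math/rand"
	"os"
	"time"
)

const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789"

// Appends one long random line to the given log file, for example:
//   go run longline.go C:\file.log 105000
func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: go run longline.go <logfile> [length]")
		return
	}
	path, size := os.Args[1], 105000
	if len(os.Args) > 2 {
		fmt.Sscan(os.Args[2], &size)
	}

	// Build a random string of the requested length.
	buf := make([]byte, size)
	for i := range buf {
		buf[i] = letters[rand.Intn(len(letters))]
	}

	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// One long line terminated by a single newline, similar to the ras.sh output.
	fmt.Fprintf(f, "[DEBUG] %s %s\n", time.Now().Format(time.RFC1123), buf)
}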

NOTE

You're not supposed to see any data in Kibana after filebeat setup -e -c your_filebeat.yml -- this is just the initial setup that should emit the Loaded Ingest pipelines message and exit with 0. As the testing steps say, you need to start filebeat -e -c your_filebeat.yml (no setup) in order to start ingesting data from the file into Elasticsearch.

Also, Kibana MUST be available, not degraded, when you run filebeat setup -e -c your_filebeat.yml (you should see the Kibana is now available message on the console).

If you are still not able to make it work with your current setup, you might want to try a different approach using Docker; here is how it's done:

  1. Make sure you have Docker installed
  2. Make sure you're using Cygwin
  3. Make sure you have git installed
  4. Clone the repository: git clone https://github.com/elastic/beats
  5. Open the repository folder (beats) with Cygwin
  6. Execute git checkout 8.1 to switch the version to 8.1
  7. Execute cd testing/environments
  8. Execute make start ENV=snapshot.yml to start the testing environment with Docker
     You should see something like this (it can take a while):

(screenshot of the expected output)

  9. Use the following config for filebeat, your_filebeat.yml:
filebeat.inputs:
  - type: log
    paths:
      - "path/to/your/input.log"

output:
  elasticsearch:
    hosts: ["http://localhost:9200"]
    username: "admin"
    password: "testing"

setup.kibana:
  host: "localhost:5601"

you need to change "path/to/your/input.log" to your actual filename

  10. Execute filebeat setup -e -c your_filebeat.yml. It should end with the Loaded Ingest pipelines message and then exit with code 0.
  11. Execute the command make stop to stop the testing environment
  12. Edit snapshot.yml in this folder and add - "http.max_content_length=100KB" right after this line
    - "http.host=0.0.0.0"

    the whole section should look as follows:
environment:
    - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    - "network.host="
    - "transport.host=127.0.0.1"
    - "http.host=0.0.0.0"
    - "http.max_content_length=100KB"
    - "xpack.security.enabled=true"
      # We want something as unlimited compilation rate, but 'unlimited' is not valid.
    - "script.max_compilations_rate=100000/1m"
    - "action.destructive_requires_name=false"
    # Disable geoip updates to prevent golden file test failures when the database
    # changes and prevent race conditions between tests and database updates.
    - "ingest.geoip.downloader.enabled=false"
  13. Execute make start ENV=snapshot.yml to start the testing environment again
  14. Start filebeat with filebeat -e -c your_filebeat.yml (no setup)
  15. Now you can append lines to your input.log file that you configured in step 9.
  16. Try to append a short line like [DEBUG] Thu Feb 24 11:00:42 CET 2022 yo9bRwTw7N; feel free to append multiple lines with different messages
  17. There must be no errors in the filebeat logs on the console
  18. Go to Kibana on http://localhost:5601/app/management/kibana/dataViews (login: admin, password: testing) and create a new data view with the filebeat-* pattern
  19. Now you can go to Discover and, if you choose the date range "last 15 minutes", you should see documents with the messages that you appended in step 16.
  20. Append another line to the file but use a long message instead: [DEBUG] Thu Feb 24 11:00:42 CET 2022 <very long, more than 105000 characters string here>. Feel free to use any way to generate a giant message, but it should not contain a newline character.
  21. You should see the following in the filebeat logs on the console now:
{"log.level":"error","@timestamp":"2022-02-24T13:27:52.997+0100","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":240},"message":"failed to perform any bulk index operations: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-02-24T13:27:54.504+0100","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/client_worker.go","file.line":176},"message":"failed to publish events: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped","service.name":"filebeat","ecs.version":"1.6.0"}
  22. This means the document was too large and it was dropped, which is what these steps are testing.
  23. If you check Discover again, you will not find the new document with the long message.
  24. If you keep appending new short lines to the file like in step 16, they will appear in Discover as in step 19.

Let me know whether you were able to fix your current environment or whether the new approach worked for you.

@amolnater-qasource

Hi @rdner
Thanks for the feedback and detailed steps. All of this information is very helpful.
We have executed this test on an Ubuntu 20.04 host with 8.1 BC-5 self-managed Kibana.

Build details:

VERSION: 8.1.0
BUILD: 50459
COMMIT: 23423b0db7d5ffae1d0578e8d9e2c1afab90cdcf
Artifact link: https://staging.elastic.co/8.1.0-d9cd3df6/summary-8.1.0.html

Steps followed:

  1. Set up Kibana with the default configuration.
  2. We created a test_filebeat.yml and also created a blank file.log file at the /home/zeus/Filebeat/log/ location.
  3. Ran ./filebeat setup -e -c test_filebeat.yml. Waited for it to complete, and we got the filebeat-* index under the Discover tab.
  4. Restarted Elasticsearch using this configuration:
http:
  max_content_length: 500KB
  compression: false
  5. We reran the ./filebeat setup -e -c test_filebeat.yml command and got the expected 413 error.
  6. We created ras.sh and placed it at /usr/local/bin.
  7. We ran this shell script with sudo echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
  8. This time we got some different data in file.log.
    (screenshot attached)

file.log is attached below:
file.log

  9. Now we ran ./filebeat -e -c test_filebeat.yml.
  10. We observed the below data on the Discover tab.
    (screenshot attached)

We even ran the below set of commands and observed the relevant data under the Discover tab.

echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
sleep 11 
echo "[DEBUG] $(date) $(./ras.sh 512100)" >> /home/zeus/Filebeat/log/file.log
sleep 11
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log

We have run the same without sleep 11 too; however, we didn't get any error on the Discover tab.
Could you please confirm if this is the expected outcome?

Thanks

@rdner
Member Author

rdner commented Mar 2, 2022

We have run the same without sleep 11 too; however, we didn't get any error on the Discover tab.
Could you please confirm if this is the expected outcome?

@amolnater-qasource according to step 21 that I wrote earlier, you should have seen the expected error message in the logs (console output), not in Discover. This error is not displayed in any way in Kibana Discover. The messages that are too long are just dropped, and you should see the error message (step 21) accordingly.

@amolnater-qasource

Hi @rdner
Thanks for helping out in testing this ticket.
We have run below commands:

echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 512100)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log

We have observed the expected error logs on the filebeat console:
(screenshot attached)

Build details:

VERSION: 8.1.0
BUILD: 50459
COMMIT: 23423b0db7d5ffae1d0578e8d9e2c1afab90cdcf
Artifact link: https://staging.elastic.co/8.1.0-d9cd3df6/summary-8.1.0.html
HOST: Ubuntu 20.04 Desktop version [vSphere]

Please let us know if anything else is required from our end.
cc: @jlind23

Thanks!

Labels
backport-v8.1.0 (Automated backport with mergify) · bug · libbeat · review · Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) · v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Infinite ingestion retry when batches are too large and using GuaranteedSend
5 participants