
Drop event batch when get HTTP status 413 from ES #29368

Merged · 3 commits into elastic:master · Dec 16, 2021

Conversation

rdner
Member

@rdner rdner commented Dec 9, 2021

What does this PR do?

To prevent infinite loops when having http.max_content_length set
too low or bulk_max_size too high we now handle this status code
separately and drop the whole event batch producing a detailed error
message on the console.
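
For illustration only, here is a minimal sketch of the idea (hypothetical names, not the actual libbeat code): a 413 response can never succeed on retry, so it is classified as a permanent error and the whole batch is dropped with a descriptive message instead of being retried.

package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errPayloadTooLarge carries the detailed message shown to the user.
var errPayloadTooLarge = errors.New(
	"the bulk payload is too large for the server. Consider to adjust " +
		"`http.max_content_length` parameter in Elasticsearch or " +
		"`bulk_max_size` in the beat. The batch has been dropped")

// classifyBulkStatus is a hypothetical helper that decides whether a bulk
// request should be retried or whether the whole batch must be dropped.
func classifyBulkStatus(status int) (retry bool, err error) {
	switch {
	case status == http.StatusRequestEntityTooLarge: // 413: retrying can never succeed
		return false, errPayloadTooLarge
	case status >= 500 || status == http.StatusTooManyRequests:
		return true, fmt.Errorf("bulk request failed with status %d, retrying", status)
	default:
		return false, nil
	}
}

func main() {
	retry, err := classifyBulkStatus(http.StatusRequestEntityTooLarge)
	fmt.Printf("retry=%v err=%v\n", retry, err)
}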

Why is it important?

Because without this change beats are forever stuck once they get the first 413 response from Elasticsearch.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Run Elasticsearch locally with its default configuration
  2. Prepare a configuration file for filebeat that ingests a log file:

your_filebeat.yml

filebeat:
  inputs:
    - type: log
      paths:
        - /path/to/your/file.log
output:
  elasticsearch:
    hosts: ["localhost:9200"]

Note: the file path must be changed to your actual log file location.

  3. Compile filebeat out of this branch
  4. Run filebeat setup -e -c your_filebeat.yml
  5. Once the setup succeeds, stop filebeat
  6. Restart Elasticsearch using this configuration:
http:
  max_content_length: 100KB
  compression: false

Now it's easier to cause a 413 in Elasticsearch because of the lower limit. However, we had to run filebeat setup before setting this limit in order to create the required indices out of templates; otherwise, the templates would be too large.

  7. Run the compiled filebeat for ingestion: filebeat -e -c your_filebeat.yml
  8. Prepare this script that generates random strings of a given size, or use your own approach:

ras.sh

#!/bin/bash
# Generate a random string of the given length (default 16)
# using the given character set (default a-zA-Z_0-9).

len=${1-16}
charset=${2-'a-zA-Z_0-9'}

LC_ALL=C tr -dc "$charset" </dev/urandom | head -c "$len"
  9. Now you can run this command with various size parameters to write into your log file from step 2.

This will write a message with 10 characters to the log file:

echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log

This should work just fine, but the following sequence should cause an error:

echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log
sleep 11 # the file is scanned every 10 sec by default
echo "[DEBUG] $(date) $(./ras.sh 105000)" >> /path/to/your/file.log
sleep 11
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /path/to/your/file.log

Before this PR you would see this error and filebeat would go into an infinite loop:

{
  "log.level": "error",
  "@timestamp": "2021-12-09T15:30:02.523+0100",
  "log.logger": "elasticsearch",
  "log.origin": {
    "file.name": "elasticsearch/client.go",
    "file.line": 226
  },
  "message": "failed to perform any bulk index operations: 413 Request Entity Too Large: ",
  "service.name": "filebeat",
  "ecs.version": "1.6.0"
}

After this PR you would see this error instead, and no infinite loop:

{
  "log.level": "error",
  "@timestamp": "2021-12-09T16:07:33.509+0100",
  "log.logger": "elasticsearch",
  "log.origin": {
    "file.name": "elasticsearch/client.go",
    "file.line": 236
  },
  "message": "failed to perform any bulk index operations: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped",
  "service.name": "filebeat",
  "ecs.version": "1.6.0"
}

You should see five more 10-character messages in the index (via Kibana Discover) after the last command, and the big message should be dropped. If there are no sleep commands in between, the dropped batch might include short messages as well, so for test purposes it's better to keep the sleep commands.

Related issues

Closes #14350

@rdner rdner added the bug label Dec 9, 2021
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Dec 9, 2021
@mergify
Contributor

mergify bot commented Dec 9, 2021

This pull request does not have a backport label. Could you fix it @rdner? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 7.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Dec 9, 2021
@rdner rdner added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team and removed backport-skip Skip notification from the automated backport with mergify labels Dec 9, 2021
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 9, 2021
@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Dec 9, 2021
@rdner rdner added the in progress Pull request is currently in progress. label Dec 9, 2021
@rdner rdner self-assigned this Dec 9, 2021
@elasticmachine
Collaborator

elasticmachine commented Dec 9, 2021

💚 Build Succeeded


Build stats

  • Start Time: 2021-12-14T10:59:05.036+0000

  • Duration: 103 min 29 sec

  • Commit: e5344c5

Test stats 🧪

Test Results
  • Failed: 0
  • Passed: 48617
  • Skipped: 4277
  • Total: 52894

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@rdner rdner requested a review from faec December 9, 2021 16:24
@rdner rdner added the review label Dec 9, 2021
@rdner
Member Author

rdner commented Dec 9, 2021

I'm also going to add a unit test a bit later, but I figured the earlier we start discussing this, the better.
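
For reference, a unit test for this behaviour could look roughly like the sketch below. This is a hypothetical illustration, not the test that was actually added to the PR: it stands up an httptest server that always answers 413 and only asserts the condition the fix keys on.

package elasticsearch_test

import (
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
)

// Hypothetical sketch of such a unit test: a fake Elasticsearch always
// replies with 413, and the publishing side must treat that as a
// non-retryable error and drop the batch instead of retrying forever.
func TestBulkPayloadTooLargeIsDropped(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "Request Entity Too Large", http.StatusRequestEntityTooLarge)
	}))
	defer srv.Close()

	resp, err := http.Post(srv.URL+"/_bulk", "application/x-ndjson",
		strings.NewReader(`{"index":{}}`+"\n"+`{"message":"x"}`+"\n"))
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	// A real test would go through the output's publish path and assert that
	// the batch is dropped; here we only check the status code the fix keys on.
	if resp.StatusCode != http.StatusRequestEntityTooLarge {
		t.Fatalf("expected 413, got %d", resp.StatusCode)
	}
}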

@rdner rdner marked this pull request as ready for review December 9, 2021 18:36
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@rdner rdner added the backport-v8.1.0 Automated backport with mergify label Dec 9, 2021
@mergify mergify bot removed the backport-skip Skip notification from the automated backport with mergify label Dec 9, 2021
@rdner rdner removed the in progress Pull request is currently in progress. label Dec 14, 2021
@rdner
Member Author

rdner commented Dec 14, 2021

@faec please have a look when you have time, I think it's ready for a review.

@jlind23 jlind23 requested a review from a team December 14, 2021 14:42
@jlind23
Collaborator

jlind23 commented Dec 14, 2021

Adding @elastic/elastic-agent-data-plane as reviewers for awareness

Contributor

@faec faec left a comment


Great work, thanks!

@rdner rdner merged commit d401681 into elastic:master Dec 16, 2021
@rdner rdner deleted the fix-14350 branch December 16, 2021 20:03
@amolnater-qasource

amolnater-qasource commented Feb 23, 2022

Hi @rdner
We have been attempting to validate this issue on self-managed 8.1 BC-3 and had the below observations:

Run Elasticsearch locally with its default configuration

Kibana was running on default configuration.

Prepare a configuration file for filebeat that ingests a log file:

We created a test_filebeat.yml and also created a blank file.log file.

test_filebeat.yml file attached below:
test_filebeat.zip

Compile filebeat out of this branch
Run filebeat setup -e -c your_filebeat.yml

We installed filebeat and ran the provided command filebeat setup -e -c your_filebeat.yml. We got the filebeat-* index under the Discover tab.

Once the setup succeeds stop the filebeat
Restart Elasticsearch using this configuration:

We added the required configuration and restarted Elasticsearch.
elasticsearch.yml shared below:
elasticsearch.zip

Prepare this script that generates random strings of the given size or use your own approach:
Now you can run this command with various size parameters to write into your log file from step 2.

file.log and ras.sh shared below:
file and ras.zip

Data was added to file.log through the ras.sh file.
We kept both files at C:\.

We even ran filebeat setup -e -c your_filebeat.yml and observed a 413 error:
(screenshot attached)

However, we observed no data under the Discover tab:
(screenshot attached)

Please let us know if we are missing anything.

Thanks

@rdner
Member Author

rdner commented Feb 23, 2022

Hello @amolnater-qasource

We installed filebeat and ran the provided command filebeat setup -e -c your_filebeat.yml. We got the filebeat-* index under the Discover tab.

Looks like you didn't wait until this step finished in filebeat; it can take up to a few minutes. You should be looking at the logs, making sure they stop coming and that the running filebeat is idle and waiting for data to process.

The way it works: when you configure filebeat for the first time, it creates a lot of different things in Elasticsearch, and some of them can exceed the set limit (100KB). So if you set the parameters:

http:
  max_content_length: 100KB

it will not be able to create those things, including the template that the error message in the screenshot refers to.

There are a few things you can try (1 or 2):

  1. Just wait longer when you run the filebeat for the first time and make sure it's finished initialising and only then restart Elasticsearch with the applied length limit (http.max_content_length: 100KB). After that you can start processing data and you should see only the expected errors described in this PR.
  2. Increase the http.max_content_length value until you see absolutely no errors when the filebeat is initialising for the first time (before it starts processing any data) and then use ras.sh to generate values larger than the limit you set. For example you could set 500KB and then generate values with:
echo "[DEBUG] $(date) $(./ras.sh 512100)" >> /path/to/your/file.log

Let me know if either helped.

@amolnater-qasource

amolnater-qasource commented Feb 23, 2022

Thanks for looking into this @rdner

Looks like you didn't wait until this step finished in filebeat; it can take up to a few minutes.

When I ran the command for the first time, it ran successfully to the end.
Please find below detailed Powershell logs for the first time:
Command.txt

It created indices successfully.
QUERY: We didn't put any initial data in the file.log file, hence we didn't expect any data under the Discover tab the first time. [Please check]
Then we edited the elasticsearch.yml with the provided data:

http:
  max_content_length: 100KB
  compression: false

We again ran the same command as shared in step 6 (filebeat setup -e -c your_filebeat.yml) and observed the below error:
(screenshot attached)

Just wait longer when you run the filebeat for the first time and make sure it's finished initialising and only then restart Elasticsearch with the applied length limit (http.max_content_length: 100KB). After that you can start processing data and you should see only the expected errors described in this PR.

We followed exactly the same.
QUERY:

  • Do we need to edit filebeat.yml too, or is just the test_filebeat.yml sufficient?
  • Is there any specific location to keep the file.log file, as currently it's at C:\file.log?
    FYI, we were only able to write data to this file through the ras.sh file from this location.
    You can check our shared files from above comment.

UPDATE:
We are unable to ingest any data into the Discover tab as of now.

Thanks again!

@rdner
Member Author

rdner commented Feb 23, 2022

@amolnater-qasource what happens when you try the second option?

@amolnater-qasource

amolnater-qasource commented Feb 24, 2022

@rdner We attempted the second way too, using the below configuration in elasticsearch.yml:

http:
  max_content_length: 500KB
  compression: false

And we ran echo "[DEBUG] $(date) $(./ras.sh 512100)" >> C:\file.log and observed one new log entry in the file.log file.

(screenshot attached)

  • We still didn't get any data under the Discover tab.

We even re-attempted the whole test to avoid any gaps; however, the results were still the same.
Please let us know if we are missing anything.

Thanks

@rdner
Member Author

rdner commented Feb 24, 2022

@amolnater-qasource I tried to follow the steps described in the description of this PR using 8.1 and it worked fine, as I initially described. However, I never tried to follow these steps on Windows.

There are several issues with your setup:

  1. I just checked your file.log and there appears to be no message after the timestamp on each line, where one was expected. The ras.sh bash script is meant to be used on unix-like systems; it will not work in PowerShell. You would need to use WSL or Cygwin (not guaranteed to work) in order to use this script. I propose you find a different way to generate extremely long string values (more than 100*1024 characters); one possible approach is sketched after this list.

  2. According to the errors in your logs, you ran filebeat setup -e -c your_filebeat.yml AFTER you set the http.max_content_length: 100KB limit in Elasticsearch and restarted it with the new configuration. The testing steps made it clear to run filebeat setup -e -c your_filebeat.yml BEFORE restarting Elasticsearch with the new configuration that contains http.max_content_length: 100KB. Please follow the testing steps as they are and in the same order.
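
As one possible cross-platform alternative for generating such long lines (a hypothetical helper, not something shipped with beats; the file path and length below are placeholders), a short Go program can append a long random line to the log file:

package main

import (
	"fmt"
	"math/rand"
	"os"
	"time"
)

const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789"

// Appends one long random line to the given log file, for example:
//   go run longline.go C:\file.log 105000
func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: go run longline.go <logfile> [length]")
		return
	}
	path, size := os.Args[1], 105000
	if len(os.Args) > 2 {
		fmt.Sscan(os.Args[2], &size)
	}

	// Build a random string of the requested length.
	buf := make([]byte, size)
	for i := range buf {
		buf[i] = letters[rand.Intn(len(letters))]
	}

	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// One long line terminated by a single newline, similar to the ras.sh output.
	fmt.Fprintf(f, "[DEBUG] %s %s\n", time.Now().Format(time.RFC1123), buf)
}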

NOTE

You're not supposed to see any data in Kibana after filebeat setup -e -c your_filebeat.yml -- this is just the initial setup that should emit the Loaded Ingest pipelines message and exit with 0. As the testing steps say, you need to start filebeat -e -c your_filebeat.yml (no setup) in order to start ingesting data from the file into Elasticsearch.

Also, Kibana MUST be available, not degraded, when you run filebeat setup -e -c your_filebeat.yml (you should see the Kibana is now available message on the console).

If you are still not able to make it work with your current setup, you might want to try a different approach using Docker; here is how it's done:

  1. Make sure you have Docker installed
  2. Make sure you're using Cygwin
  3. Make sure you have git installed
  4. Clone the repository: git clone https://github.com/elastic/beats
  5. Open the repository folder (beats) with Cygwin
  6. Execute git checkout 8.1 to switch the version to 8.1
  7. Execute cd testing/environments
  8. Execute make start ENV=snapshot.yml to start the testing environment with Docker
     You should see something like this (it can take a while):

(screenshot of the expected output)

  9. Use the following config for filebeat, your_filebeat.yml:
filebeat.inputs:
  - type: log
    paths:
      - "path/to/your/input.log"

output:
  elasticsearch:
    hosts: ["http://localhost:9200"]
    username: "admin"
    password: "testing"

setup.kibana:
  host: "localhost:5601"

you need to change "path/to/your/input.log" to your actual filename

  10. Execute filebeat setup -e -c your_filebeat.yml. It should end with the Loaded Ingest pipelines message and then exit with code 0.
  11. Execute the command make stop to stop the testing environment
  12. Edit snapshot.yml in this folder and add - "http.max_content_length=100KB" right after this line
    - "http.host=0.0.0.0"

    the whole section should look as follows:
environment:
    - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    - "network.host="
    - "transport.host=127.0.0.1"
    - "http.host=0.0.0.0"
    - "http.max_content_length=100KB"
    - "xpack.security.enabled=true"
      # We want something as unlimited compilation rate, but 'unlimited' is not valid.
    - "script.max_compilations_rate=100000/1m"
    - "action.destructive_requires_name=false"
    # Disable geoip updates to prevent golden file test failures when the database
    # changes and prevent race conditions between tests and database updates.
    - "ingest.geoip.downloader.enabled=false"
  13. Execute make start ENV=snapshot.yml to start the testing environment again
  14. Start filebeat with filebeat -e -c your_filebeat.yml (no setup)
  15. Now you can append lines to your input.log file that you configured in step 9.
  16. Try to append a short line like [DEBUG] Thu Feb 24 11:00:42 CET 2022 yo9bRwTw7N; feel free to append multiple lines with different messages
  17. There must be no errors in the filebeat logs on the console
  18. Go to Kibana on http://localhost:5601/app/management/kibana/dataViews (login: admin, password: testing) and create a new data view with the filebeat-* pattern
  19. Now you can go to Discover and, if you choose the date range "last 15 minutes", you should see documents with the messages that you appended in step 16.
  20. Append another line to the file but use a long message instead: [DEBUG] Thu Feb 24 11:00:42 CET 2022 <very long, more than 105000 characters string here>. Feel free to use any way to generate a giant message, but it should not contain a newline character.
  21. You should see the following in the filebeat logs on the console now:
{"log.level":"error","@timestamp":"2022-02-24T13:27:52.997+0100","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":240},"message":"failed to perform any bulk index operations: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-02-24T13:27:54.504+0100","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/client_worker.go","file.line":176},"message":"failed to publish events: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped","service.name":"filebeat","ecs.version":"1.6.0"}
  22. This means the document was too large and it was dropped, which is what these steps are testing.
  23. If you check Discover again, you will not find the new document with the long message.
  24. If you keep appending new short lines to the file like in step 16, they will appear in Discover as in step 19.

Let me know whether you were able to fix your current environment or whether the new approach worked for you.

@amolnater-qasource

Hi @rdner
Thanks for the feedback and detailed steps. All of this information is very helpful.
We have executed this test on an Ubuntu 20.04 host with 8.1 BC-5 self-managed Kibana.

Build details:

VERSION: 8.1.0
BUILD: 50459
COMMIT: 23423b0db7d5ffae1d0578e8d9e2c1afab90cdcf
Artifact link: https://staging.elastic.co/8.1.0-d9cd3df6/summary-8.1.0.html

Steps followed:

  1. Set up Kibana with the default configuration.
  2. We created a test_filebeat.yml and also created a blank file.log file at the /home/zeus/Filebeat/log/ location.
  3. Ran ./filebeat setup -e -c test_filebeat.yml. Waited for it to complete, and we got the filebeat-* index under the Discover tab.
  4. Restarted Elasticsearch using this configuration:
http:
  max_content_length: 500KB
  compression: false
  5. We reran the ./filebeat setup -e -c test_filebeat.yml command and got the expected 413 error.
  6. We created ras.sh and placed it at /usr/local/bin.
  7. We ran this shell script with sudo echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
  8. This time we got some different data in file.log.
    (screenshot attached)

file.log is attached below:
file.log

  9. Now we ran ./filebeat -e -c test_filebeat.yml.
  10. We observed the below data on the Discover tab.
    (screenshot attached)

We even ran the below set of commands and observed the relevant data under the Discover tab.

echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
sleep 11 
echo "[DEBUG] $(date) $(./ras.sh 512100)" >> /home/zeus/Filebeat/log/file.log
sleep 11
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log

We have run the same without sleep 11 too; however, we didn't get any error on the Discover tab.
Could you please confirm if this is the expected outcome?

Thanks

@rdner
Member Author

rdner commented Mar 2, 2022

We have run the same without sleep 11 too; however, we didn't get any error on the Discover tab.
Could you please confirm if this is the expected outcome?

@amolnater-qasource according to step 21 that I wrote earlier, you should have seen the expected error message in the logs (console output), not in Discover. This error is not displayed in any way in Kibana Discover. The messages that are too long are just dropped, and you should see the error message (step 21) accordingly.

@amolnater-qasource

Hi @rdner
Thanks for helping out in testing this ticket.
We have run below commands:

echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 512100)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log
echo "[DEBUG] $(date) $(./ras.sh 10)" >> /home/zeus/Filebeat/log/file.log

We have observed the expected error logs on the filebeat console:
(screenshot attached)

Build details:

VERSION: 8.1.0
BUILD: 50459
COMMIT: 23423b0db7d5ffae1d0578e8d9e2c1afab90cdcf
Artifact link: https://staging.elastic.co/8.1.0-d9cd3df6/summary-8.1.0.html
HOST: Ubuntu 20.04 Desktop version [vSphere]

Please let us know if anything else is required from our end.
cc: @jlind23

Thanks!

Labels
backport-v8.1.0 (Automated backport with mergify) · bug · libbeat · review · Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) · v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Infinite ingestion retry when batches are too large and using GuaranteedSend
5 participants