Optimise ESF #148

Merged
merged 59 commits into from
Aug 17, 2022

aspacca
Contributor

@aspacca aspacca commented Aug 3, 2022

Enhancement

What does this PR do?

  • Add benchmark covering expand_event_list_from_field and json dumper
  • Fix json_content_type not passed as argument to storage factories
  • Set custom json parser and dumper in es client and for storage decorator and event expander
  • Add disabled value for json_content_type in order to totally skip json auto discovery

Why is it important?

We need to optimise JSON content handling as much as possible, since it is the most performance-critical code in the forwarder.
Having benchmarks available for the different use cases will help us decide which JSON library to use and, if necessary, switch between libraries according to the combination of expand_event_list_from_field and json_content_type settings provided by the user.

I've run the new benchmark and identified `ujson` as the best-performing JSON package.
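
For readers who want to reproduce this kind of comparison, here is a minimal sketch of a parse/dump micro-benchmark. It is not the benchmark added in this PR: the sample event, the `bench` helper and the iteration count are assumptions made for illustration, and `ujson` is only measured when it is installed.

```python
import json
import timeit

# Candidate libraries: the stdlib json module is always available, ujson is optional.
candidates = {"json": (json.loads, json.dumps)}
try:
    import ujson

    candidates["ujson"] = (ujson.loads, ujson.dumps)
except ImportError:
    pass  # ujson not installed: only the stdlib is measured

# Hypothetical sample event, not taken from the PR's benchmark fixtures.
sample = {"message": "x" * 256, "fields": {"a": 1, "b": [1, 2, 3]}, "ok": True}
payload = json.dumps(sample)


def bench(loads, dumps, number=2000):
    """Return (seconds spent parsing, seconds spent dumping) over `number` runs."""
    parse_time = timeit.timeit(lambda: loads(payload), number=number)
    dump_time = timeit.timeit(lambda: dumps(sample), number=number)
    return parse_time, dump_time


results = {name: bench(loads, dumps) for name, (loads, dumps) in candidates.items()}
for name, (parse_time, dump_time) in results.items():
    print(f"{name}: loads={parse_time:.4f}s dumps={dump_time:.4f}s")
```

Absolute timings depend on the machine and on the shape of the events, so a realistic comparison should use payloads that resemble the ones the forwarder actually receives.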

Here is the outcome of several iterations of optimisation:

320 MB, 10mins timeout, batch_max_actions: 4000, batch_max_bytes: 104857600
1.3.0-rc20
Sent Events: 618500, Duration: 480secs, Max Memory Used: 97MB
Sent Events: 445500, Duration: 341secs, Max Memory Used: 100MB

1.3.0-rc13
Sent Events: 580000, Duration: 481secs, Max Memory Used: 135MB
Sent Events: 484000, Duration: 404secs, Max Memory Used: 138MB

1.3.0-rc8
Sent Events: 364000, Duration: 482secs, Max Memory Used: 131MB
Sent Events: 364000, Duration: 481secs, Max Memory Used: 131MB
Sent Events: 336000, Duration: 445secs, Max Memory Used: 132MB

1.3.0-rc6
Sent Events: 356000, Duration: 480secs, Max Memory Used: 128MB
Sent Events: 356000, Duration: 483secs, Max Memory Used: 131MB
Sent Events: 352000, Duration: 477secs, Max Memory Used: 131MB

1.2.1
Sent Events: 340000, Duration: 483secs, Max Memory Used: 126MB
Sent Events: 340000, Duration: 483secs, Max Memory Used: 127MB
Sent Events: 340000, Duration: 483secs, Max Memory Used: 127MB
Sent Events:  44000, Duration: 285secs, Max Memory Used: 127MB
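
To make the runs easier to compare, the figures can be converted to events per second. The sketch below takes the first listed run of each release from the numbers above; picking the first run (rather than an average) is an arbitrary choice made here for illustration.

```python
# (sent events, duration in seconds) for the first run listed under each release above.
runs = {
    "1.3.0-rc20": (618500, 480),
    "1.3.0-rc13": (580000, 481),
    "1.3.0-rc8": (364000, 482),
    "1.3.0-rc6": (356000, 480),
    "1.2.1": (340000, 483),
}

# Throughput in events per second for each release.
throughput = {release: events / secs for release, (events, secs) in runs.items()}

# Relative gain of the latest release candidate over the 1.2.1 baseline.
speedup = throughput["1.3.0-rc20"] / throughput["1.2.1"]

for release, rate in throughput.items():
    print(f"{release}: {rate:.0f} events/s")
print(f"rc20 vs 1.2.1: {speedup:.2f}x")
```

On these first runs, 1.3.0-rc20 sends events about 1.8 times as fast as 1.2.1, while also reporting lower peak memory.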

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.md

@aspacca aspacca self-assigned this Aug 3, 2022
@aspacca aspacca changed the base branch from main to handle-offset-in-expand_event_list_from_field August 3, 2022 07:09
@elasticmachine
elasticmachine commented Aug 3, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-08-17T05:46:20.283+0000

  • Duration: 42 min 24 sec

Test stats 🧪

Test Results
Failed 0
Passed 418
Skipped 0
Total 418

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@aspacca aspacca changed the title Benchmark for expand and dumper Optimise ESF Aug 9, 2022
@aspacca aspacca requested a review from a team August 9, 2022 14:58
@aspacca aspacca changed the base branch from handle-offset-in-expand_event_list_from_field to main August 11, 2022 07:47
@tommyers-elastic tommyers-elastic left a comment

nice work ! couple of comments, no blockers.

**Notes:**

`inputs.[].json_content_type` can be defined as a string with on the of the following values:
- *single*: indicates that the content of a single entry in the input payload is a single JSON object. The content can either be on a single line or spanning multiple lines. In this case the whole content of the payload is decoded as JSON object, with no limit on the number of lines the JSON object is spanning on.
- *ndjson*: indicates that the content of a single entry in the input payload is a valid NDJSON format. In NDJSON format multiple single JSON objects formatted on a single line each are separated by a newline delimiter. In this case each line will be decoded as JSON object, improving the parsing performance.
- *disabled*: instructs the Elastic Server Forwarder to not attempt any JSON content automatic discovery and threat the content as plain text, improving the performance.

couple of small typos here: Server = Serverless, threat = treat.
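
For illustration, the three `json_content_type` values described in the note above could be dispatched roughly as follows. This is a sketch, not the forwarder's implementation: `decode_payload` is a hypothetical helper, it uses the stdlib `json` module, and yielding plain text line by line for `disabled` is an assumption.

```python
import json
from typing import Any, Iterator


def decode_payload(content: str, json_content_type: str) -> Iterator[Any]:
    """Yield decoded events from `content` according to the hint (hypothetical helper)."""
    if json_content_type == "single":
        # The whole payload is one JSON object, however many lines it spans.
        yield json.loads(content)
    elif json_content_type == "ndjson":
        # One JSON object per line; blank lines are skipped.
        for line in content.splitlines():
            if line.strip():
                yield json.loads(line)
    elif json_content_type == "disabled":
        # No JSON discovery at all: lines are passed through as plain text
        # (line-by-line splitting is an assumption of this sketch).
        yield from content.splitlines()
    else:
        raise ValueError(f"unsupported json_content_type: {json_content_type!r}")


if __name__ == "__main__":
    print(list(decode_payload('{"a": 1}\n{"b": 2}\n', "ndjson")))
```

The `ndjson` and `disabled` branches avoid any attempt to work out where a multi-line JSON object starts and ends, which is where the performance gain mentioned in the note would come from.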

@@ -470,11 +470,13 @@ In case of JSON objects spanning multiple lines a limit of 1000 lines is applied

Sometimes relaying on the Elastic Serverless Forwarder JSON content auto-discovery feature might have a huge impact on performance, or you have a known payload content of a single JSON object spanning more than 1000 lines. In this case you can provide in the input configuration and hint on the nature of the JSON content: this will change the parsing logic applied and improve performance or overcome the 1000 lines limit.

This setting allows also to disable at all any attempt of JSON content automatic discovery, in case of known plain text content.

i think it would be good to move the reason you might do this up to this section. currently we mention it lower down - "improving performance", but i think we could expand on it a little here.

handlers/aws/cloudwatch_logs_trigger.py (resolved)
handlers/aws/handler.py (resolved)
from typing import Any, Union

import boto3
from botocore.client import BaseClient as BotoBaseClient
from ujson import JSONDecodeError

it would be good if we could swap out the implementation library at will without any additional code changes here (see above comment)
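
One way to get there, sketched under assumptions: a small wrapper module re-exports the parser, the dumper and the decode error, so callers never import the backend directly. The names `json_parser` and `json_dumper` and the fallback to the stdlib are illustrative and not necessarily what share/json.py does in this PR.

```python
# Hypothetical wrapper module: callers import only these names, never the backend.
from typing import Any

try:
    import ujson as _backend  # faster backend, used when installed
except ImportError:
    import json as _backend  # stdlib fallback

# Both backends raise a ValueError subclass on malformed input; fall back to
# ValueError itself if the backend does not expose a dedicated exception class.
JSONDecodeError = getattr(_backend, "JSONDecodeError", ValueError)


def json_parser(payload: str) -> Any:
    """Decode a JSON document with whichever backend was selected above."""
    return _backend.loads(payload)


def json_dumper(obj: Any) -> str:
    """Encode `obj` as a JSON string with whichever backend was selected above."""
    return _backend.dumps(obj)


if __name__ == "__main__":
    try:
        json_parser("{not json")
    except JSONDecodeError as exc:
        print(f"decode failed: {exc}")
```

With this shape, a handler would catch the wrapper's `JSONDecodeError` instead of importing it from `ujson`, and changing the library becomes a one-file change.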

share/json.py (resolved)
storage/payload.py (resolved)
storage/payload.py (outdated, resolved)
storage/storage.py (resolved)
@aspacca
Contributor Author

aspacca commented Aug 17, 2022

fixes #151

@aspacca aspacca merged commit 216bd28 into main Aug 17, 2022