
[Filebeat] Instrument aws-s3 with metrics #25711

Merged
merged 2 commits into elastic:master May 17, 2021

Conversation

andrewkroh
Member

@andrewkroh andrewkroh commented May 13, 2021

What does this PR do?

Diagnosing performance issues with the aws-s3 input is difficult, so this PR instruments it with metrics to make that easier. The following metrics are added:

  • Number of SQS messages received (not necessarily processed fully).
  • Number of SQS visibility timeout extensions.
  • Number of SQS messages inflight (gauge).
  • Number of SQS messages returned to the queue (happens implicitly on errors after the visibility timeout passes).
  • Number of SQS messages deleted.
  • Histogram of the elapsed SQS processing times in nanoseconds (time of receipt to time of delete/return).
  • Number of S3 objects downloaded.
  • Number of S3 bytes processed.
  • Number of events created from processing S3 data.
  • Number of S3 objects inflight (gauge).
  • Histogram of the elapsed S3 object processing times in nanoseconds (start of download to completion of parsing).

The metrics are structured as:

dataset.<input-id>:
    id=<input id>
    input=aws-s3
    sqs_messages_received_total
    sqs_visibility_timeout_extensions_total
    sqs_messages_inflight_gauge
    sqs_messages_returned_total
    sqs_messages_deleted_total
    sqs_message_processing_time.histogram
    s3_objects_requested_total
    s3_bytes_processed_total
    s3_events_created_total
    s3_objects_inflight_gauge
    s3_object_processing_time.histogram

The v2 input logger was updated to include the input ID so that logs can be correlated with metrics even when an explicit `id` is not set in the input config.

Why is it important?

These metrics will make it easier to operate and tune the aws-s3 input.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Logs

curl http://<http.host>:<http.port>/dataset?pretty

{
  "EDAB75D0AA3EDA41": {
    "id": "EDAB75D0AA3EDA41",
    "input": "aws-s3",
    "s3_bytes_processed_total": 59630,
    "s3_events_created_total": 37,
    "s3_object_processing_time": {
      "histogram": {
        "count": 5,
        "max": 226999441,
        "mean": 114148361.4,
        "median": 90656138,
        "min": 70326479,
        "p75": 163751522.5,
        "p95": 226999441,
        "p99": 226999441,
        "p999": 226999441,
        "stddev": 57290315.02034136
      }
    },
    "s3_objects_inflight_gauge": 0,
    "s3_objects_requested_total": 5,
    "sqs_message_processing_time": {
      "histogram": {
        "count": 4,
        "max": 1269096799,
        "mean": 1175219597,
        "median": 1162818646.5,
        "min": 1106144296,
        "p75": 1249543735.75,
        "p95": 1269096799,
        "p99": 1269096799,
        "p999": 1269096799,
        "stddev": 62183765.64603599
      }
    },
    "sqs_messages_deleted_total": 4,
    "sqs_messages_inflight_gauge": 1,
    "sqs_messages_received_total": 5,
    "sqs_messages_returned_total": 0,
    "sqs_visibility_timeout_extensions_total": 0
  }
}
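For the `/dataset` endpoint above to be served at all, the Beat's local HTTP stats endpoint has to be enabled. A minimal filebeat.yml fragment (the host and port values are examples):

```yaml
# Enable the local HTTP endpoint that serves /stats and /dataset.
http.enabled: true
http.host: localhost
http.port: 5066
```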

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 13, 2021
@elasticmachine
Collaborator

elasticmachine commented May 13, 2021

💚 Build Succeeded


Build stats

  • Build Cause: Pull request #25711 updated

  • Start Time: 2021-05-17T18:58:11.310+0000

  • Duration: 111 min 56 sec

  • Commit: c51e2ef

Test stats 🧪

Test Results
Failed 0
Passed 13731
Skipped 2285
Total 16016


💚 Flaky test report

Tests succeeded.


@andrewkroh andrewkroh force-pushed the feature/fb/aws-s3-metrics branch 4 times, most recently from 55d5c0e to ae9ecf1 Compare May 15, 2021 01:14
@andrewkroh andrewkroh added the Team:Integrations Label for the Integrations team label May 15, 2021
@andrewkroh andrewkroh marked this pull request as ready for review May 15, 2021 01:15
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@elasticmachine
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@mergify
Contributor

mergify bot commented May 17, 2021

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature/fb/aws-s3-metrics upstream/feature/fb/aws-s3-metrics
git merge upstream/master
git push upstream feature/fb/aws-s3-metrics

@andrewkroh andrewkroh requested a review from leehinman May 17, 2021 18:58
@andrewkroh andrewkroh added the backport-v7.14.0 Automated backport with mergify label May 17, 2021
Contributor

@leehinman leehinman left a comment


Awesome.

What do you think about adding a histogram for forwardEvent?

I was thinking it might be helpful to see whether increases in s3_object_processing_time are because downloading the S3 object is slow or because forwardEvent is slow.

@andrewkroh
Member Author

andrewkroh commented May 17, 2021

Initially I had only timed the body processing side, but then went back to including full request time because I figured the value correlated more closely to api_timeout and would be more useful for tuning purposes that way. I do think it would be useful to be able to further slice the processing time into buckets for download and forwarding.

I could easily time the download, but one thing I'm not entirely sure of is whether the download is complete when req.Send(ctx) returns, or whether the download continues while processing data from the returned resp.Body.

We can discuss this a little more and add it separately. I'm going to fix a few config issues and I also found a deadlock issue.

@andrewkroh andrewkroh merged commit d3a03b0 into elastic:master May 17, 2021
mergify bot pushed a commit that referenced this pull request May 17, 2021
cachedout pushed a commit that referenced this pull request May 18, 2021
andrewkroh added a commit that referenced this pull request May 18, 2021
Contributor

@kaiyan-sheng kaiyan-sheng left a comment


Hi @andrewkroh, thank you so much for adding these metrics!! This will definitely make debugging the aws-s3 input a lot easier next time!! One nit: what do you think about using a dot . in these metric names? For example:
sqs_messages_received_total -> sqs_messages.received or sqs_messages_received.total?

@kaiyan-sheng
Contributor

Ahh sorry, I didn't see this was merged already. Please ignore my question then. Thank you again!!!

andrewkroh added a commit that referenced this pull request May 18, 2021
@metalshanked

metalshanked commented Mar 4, 2023

Hi @andrewkroh and @kaiyan-sheng
(Sorry for the basic question)
I was looking for some info on how to set up these aws s3/sqs metrics.

I have Metricbeat monitoring set up like below but can't seem to view the S3-related metrics in Kibana.
I only see the standard Filebeat stat and state type metrics in the Kibana Monitoring UI.

# Module: beat

- module: beat
  metricsets:
    - stats
    - state
    - dataset
  period: 10s
  hosts: ["${POD_IP}:5066"]
  #username: "user"
  #password: "secret"
  xpack.enabled: true

Thanks in advance!

Labels
backport-v7.14.0 Automated backport with mergify enhancement Filebeat Filebeat review Team:Integrations Label for the Integrations team