
Investigate, document and propose ways to optimize CloudWatch API calls we make for cost #2913

ravikesarwani opened this issue Mar 29, 2022 · 11 comments

ravikesarwani commented Mar 29, 2022

As part of monitoring various AWS services we use CloudWatch APIs (GetMetricData) to collect metrics data. When users monitor AWS services (metrics) with Elastic Agent, they have to pay for the CloudWatch API calls it makes.

This issue is to investigate the current code, do some testing, document the design, and then propose ways to optimize the CloudWatch API calls we make, so that our solution is as cost-efficient as possible for end users.

Two outcomes we are trying to drive from this issue:

  • Define a use case (say we are collecting data for 100 EC2 instances) and explain our current code, how we use the CloudWatch API calls, and how that relates to cost. This forms the basis for explaining our current design.
  • Propose suggestions for improving the API calls we make that would ultimately result in a reduced cost to users monitoring with the Elastic Agent.

Implementation of specific proposals will be handled as separate new issues.

@ravikesarwani ravikesarwani added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Mar 29, 2022
@ravikesarwani ravikesarwani added the enhancement New feature or request label Apr 5, 2022

tommyers-elastic commented Jun 13, 2022

this seems like a sensible idea


ravikesarwani commented Jun 13, 2022

I think the key here is to understand how we make use of the GetMetricData API call, model it with a few scenarios, and then from there we can talk about the findings and see if there are some low-hanging things we could change.

kaiyan-sheng commented

One more thing to check here:

In AWS, period is used as a parameter to control granularity. For example, we can call the AWS API to get metrics between 10:00 and 10:05 with a granularity of 1 minute, and that will return data points with timestamps 10:00, 10:01, 10:02, 10:03, 10:04. But the period we have in Metricbeat has a different definition: it controls how often Metricbeat runs to collect metrics. To get a granularity of 1 minute, we have to set Metricbeat to run every minute, and each run only collects one data point.

If we can either introduce a granularity parameter or something else that leverages what AWS calls period, then we can make one API call and get 5 data points, instead of 5 API calls for 5 data points.
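As an illustration of this point, here is a minimal boto3 sketch (not the Metricbeat code) of a single GetMetricData call covering a 5-minute window at 1-minute granularity:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)

# One call with a 5-minute window and Period=60 returns ~5 datapoints,
# instead of 5 separate calls each covering one minute.
resp = cw.get_metric_data(
    MetricDataQueries=[{
        "Id": "m1",  # arbitrary query id, required by the API
        "MetricStat": {
            "Metric": {"Namespace": "AWS/EC2", "MetricName": "CPUUtilization"},
            "Period": 60,  # granularity: 1-minute datapoints
            "Stat": "Average",
        },
    }],
    StartTime=end - timedelta(minutes=5),  # collection window: 5 minutes
    EndTime=end,
)
for result in resp["MetricDataResults"]:
    print(result["Timestamps"], result["Values"])
```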


girodav commented Sep 1, 2022

You are spot on @kaiyan-sheng :), I noticed the same thing during my investigations. Even though GetMetricData is billed per number of metrics requested, not per number of calls, we should be able to reduce cost by reducing the frequency of the calls we make.

If we consider the following two scenarios,

  1. 1 API call made every minute, requesting datapoints for 1 metric for the last minute, with 1 minute period
  2. 1 API call made every hour, requesting datapoints for 1 metric for the last hour, with 1 minute period

my understanding is that in the first scenario Cloudwatch counts 60 metrics requested in the last hour, while only 1 in the second. The resulting data would be the same, as the granularity (i.e. the Cloudwatch-side period) would be the same. The only drawback would be a delay in publishing data to ES.

Adding a parameter to control granularity separately from the collection period should help users who can tolerate the additional delay to save money.
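A rough cost sketch of the two scenarios, assuming the $0.01 per 1,000 metrics GetMetricData price quoted later in this thread:

```python
PRICE_PER_METRIC = 0.01 / 1000  # USD per metric requested (GetMetricData)

def monthly_cost(calls_per_hour: float, metrics_per_call: int, hours: int = 730) -> float:
    """Metrics requested are counted per call, so cost scales with call frequency."""
    return calls_per_hour * metrics_per_call * hours * PRICE_PER_METRIC

print(monthly_cost(60, 1))  # scenario 1: one call per minute -> ~0.438 USD/month
print(monthly_cost(1, 1))   # scenario 2: one call per hour   -> ~0.0073 USD/month
```

Same datapoints land in ES either way; only the request frequency, and hence the bill, differs.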

ravikesarwani commented

cc: @pmeresanu85 FYI ... Just for awareness


girodav commented Sep 8, 2022

A bit of data to drive this issue to tangible action items.

Background

Skip this if you are already familiar with how Cloudwatch API calls are billed and how they are used in Metricbeat.

What is a metric in Cloudwatch terms

Metrics are uniquely defined by a name, a namespace, and zero or more dimensions. Each data point in a metric has a time stamp, and (optionally) a unit of measure. You can retrieve statistics from CloudWatch for any metric.
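For illustration, here is how a single metric identity looks when expressed in the parameter shapes the Cloudwatch APIs accept (the instance ID is a made-up example):

```python
# A Cloudwatch metric is uniquely identified by namespace + name + dimensions.
metric = {
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [
        # Zero or more dimensions; here, a hypothetical instance ID.
        {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},
    ],
}
```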

Available APIs to collect metrics from Cloudwatch

Cloudwatch provides two APIs to retrieve datapoints related to specific metrics, GetMetricData and GetMetricStatistics.

GetMetricData is charged $0.01 per 1,000 metrics requested.

GetMetricStatistics could be an alternative, but it is charged per request ($0.01 per 1,000 requests). Considering that you need to perform one API call per metric, it would probably turn out more expensive than what we have already (or cost the same). It is also not recommended by AWS.
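To make the per-request billing concrete, a minimal boto3 sketch (illustrative only, not how Metricbeat calls the API) showing that GetMetricStatistics accepts exactly one metric per request:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)

# One request fetches statistics for exactly one metric; fetching N metrics
# therefore costs N requests, each billed at $0.01 per 1,000 requests.
resp = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical ID
    StartTime=end - timedelta(minutes=5),
    EndTime=end,
    Period=60,
    Statistics=["Average"],
)
print(resp["Datapoints"])
```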

How Cloudwatch APIs are used within Metricbeat

| AWS API Name | AWS API Count | Frequency |
| --- | --- | --- |
| ListMetrics | Total number of results / ListMetrics max page size | Per region, per namespace, per collection period |
| GetMetricData | Total number of results / GetMetricData max page size | Per region, per namespace, per collection period |

Example scenario

Let's consider the following simple scenario to understand which API is the main culprit behind high Cloudwatch API-related bills.

  • 1 EC2 instance in us-east-1.
  • We plan to collect 10 metrics for this instance.
  • Metricbeat period is configured to 5 minutes.

Let's use https://calculator.aws/#/addService/CloudWatch to calculate the monthly cost of the ListMetrics and GetMetricData API calls.

43,200 minutes in a month (i.e. 30 days), 1 collection every 5 minutes = 8,640 collections per month

ListMetrics API monthly usage = 1 call per collection = 8,640 calls

GetMetricData API monthly usage = 10 metrics x 8,640 collections = 86,400 metrics requested

This results in the following monthly cost

86,400 metrics x 0.00001 USD = 0.86 USD (GetMetricData: Number of metrics requested cost)

8,640 requests x 0.00001 USD = 0.0864 USD (ListMetrics, PutMetricData and other request types cost)

0.86 USD + 0.0864 USD = 0.9464 USD (CloudWatch API metrics and requests cost)

CloudWatch API requests cost (monthly): 0.9464 USD
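The same arithmetic as a quick Python sanity check (the calculator's per-line rounding explains the small difference from its 0.9464 USD total):

```python
MINUTES_PER_MONTH = 30 * 24 * 60      # 43,200 minutes in 30 days
collections = MINUTES_PER_MONTH // 5  # 8,640 collections per month

list_metrics_calls = collections * 1  # one ListMetrics call per collection
metrics_requested = collections * 10  # 10 metrics per GetMetricData collection

get_metric_data_cost = metrics_requested * 0.01 / 1000  # 0.864 USD
list_metrics_cost = list_metrics_calls * 0.01 / 1000    # 0.0864 USD
print(round(get_metric_data_cost + list_metrics_cost, 4))  # ~0.95 USD per month
```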

From the example above, it is clear that the GetMetricData API is the main driver of Cloudwatch API costs.

Ideas

In a nutshell, we have 2 ways to reduce cost, in order of priority.

  1. Reduce the number of GetMetricData calls and/or reduce the number of metrics requested per GetMetricData call.
  2. Reduce the number of ListMetrics calls.

For 1, I propose the following improvements:

  1. As mentioned in #2913 (comment), the main issue is that the "period" value that can be configured on the AWS Metricbeat module or on an AWS integration is used both as the underlying collection period and as the period parameter for Cloudwatch API calls (i.e. the granularity of the datapoints returned). Separating the two, keeping the current Metricbeat period parameter as the "collection period" and introducing a new parameter that defines the Cloudwatch period/granularity, would be a possible improvement. This would give more flexibility to users who can tolerate the extra delay but still need high-granularity data stored in Elastic.
  2. Improve the AWS module/integrations documentation with clear pointers on how users can reduce costs. A few examples:
    1. Specifying the AWS regions to collect data from can reduce cost if the user is not interested in collecting data from all regions, as the number of API calls and metrics requested is reduced.
    2. Using the cloudwatch module/integration with a specific namespace (e.g. AWS/EC2) instead of the corresponding AWS module/integration (e.g. EC2 metrics) can lead to considerable savings, if the user is only interested in specific metrics rather than all those provided by the module/integration.

For 2, we could start by avoiding multiple ListMetrics calls per namespace, as I do not see how that is necessary. We could instead make one API call per region and filter the data for the namespaces we are interested in from the API response. If a user retrieves metrics for N AWS services spread over M regions, this would result in M ListMetrics API calls instead of N*M, per collection period; see the sketch below.
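A minimal boto3 sketch of this idea, assuming a hypothetical set of wanted namespaces and regions (note that each page of a paginated ListMetrics response is itself a billable request):

```python
import boto3

WANTED_NAMESPACES = {"AWS/EC2", "AWS/RDS", "AWS/SQS"}  # hypothetical: the N namespaces
REGIONS = ["us-east-1", "eu-west-1"]                   # hypothetical: the M regions

for region in REGIONS:
    cw = boto3.client("cloudwatch", region_name=region)
    # One (paginated) ListMetrics call per region, with no Namespace filter;
    # the namespaces we care about are selected client-side from the response.
    for page in cw.get_paginator("list_metrics").paginate():
        for metric in page["Metrics"]:
            if metric["Namespace"] in WANTED_NAMESPACES:
                pass  # feed this metric into GetMetricData batching
```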


girodav commented Sep 8, 2022

@ravikesarwani @kaiyan-sheng let me know your thoughts. I can elaborate further on the ideas if needed.

ravikesarwani commented

Thanks @girodav. Is this something we can set up a focused discussion on? Maybe include Tom, Kaiyan and me so that we can get on the same page and discuss next steps.

ravikesarwani commented

I was also wondering, with the approach of [keeping the current Metricbeat period parameter as "collection period" and introducing a new one that defines the Cloudwatch period/granularity], how do the cost savings work? So here we are saying that the user selects a high "collection period" but a low granularity, and hence we collect a smaller total number of metrics (because the total number of calls doesn't matter here, it's the number of metrics we are pulling)? Maybe an example would be helpful.


girodav commented Sep 9, 2022

> I was also wondering, with the approach of [keeping the current Metricbeat period parameter as "collection period" and introducing a new one that defines the Cloudwatch period/granularity], how do the cost savings work? So here we are saying that the user selects a high "collection period" but a low granularity, and hence we collect a smaller total number of metrics (because the total number of calls doesn't matter here, it's the number of metrics we are pulling)? Maybe an example would be helpful.

The cost savings derive from the fact that the total number of metrics requested from Cloudwatch in the same billing period would be lower, since we'd be calling GetMetricData for the same metrics less frequently.

Consider, for example, the following two scenarios:

Scenario 1: 1 GetMetricData call made every 5 minutes, with the following parameters:

```json
{
    "MetricDataQueries": [
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization"
                },
                "Period": 300,
                "Stat": "Average"
            }
        }
    ],
    "StartTime": "<NOW - 5 minutes>",
    "EndTime": "<NOW>"
}
```

Scenario 2: 1 GetMetricData call made every 10 minutes, with the following parameters:

```json
{
    "MetricDataQueries": [
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization"
                },
                "Period": 300,
                "Stat": "Average"
            }
        }
    ],
    "StartTime": "<NOW - 10 minutes>",
    "EndTime": "<NOW>"
}
```

The datapoints collected and published to Elasticsearch would be exactly the same, but in Scenario 1 we collect and publish 1 datapoint every 5 minutes, while in Scenario 2 we collect and publish 2 datapoints (5 minutes apart) every 10 minutes. With a change like this, the user would essentially spend 50% less in Scenario 2 on GetMetricData-related costs, as the total number of metrics requested from Cloudwatch per billing period would be halved. The only drawback for the user would be a delay in getting the data in.
