
Investigate, document and propose ways to optimize CloudWatch API calls we make for cost #2913

ravikesarwani opened this issue Mar 29, 2022 · 11 comments

ravikesarwani commented Mar 29, 2022

As part of monitoring various AWS services we use CloudWatch APIs (GetMetricData) to collect metrics data. When users monitor AWS services (metrics) with Elastic Agent, they have to pay for the CloudWatch API calls it makes.

This issue is to investigate the current code, do some testing, document the design, and then propose ways to optimize the CloudWatch API calls we make, so that our solution is as cost-efficient as possible for end users.

Two outcomes we are trying to drive from this issue:

  • Define a use case (say we are collecting data for 100 EC2 instances) and explain our current code, how we use the CloudWatch API calls, and how that relates to cost. This forms the basis for explaining our current design.
  • Propose suggestions for improving the API calls we make that would ultimately result in a reduced cost to users monitoring with the Elastic Agent.

Implementation of specific proposals will be handled as separate new issues.

@ravikesarwani ravikesarwani added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Mar 29, 2022
@ravikesarwani ravikesarwani added the enhancement New feature or request label Apr 5, 2022

tommyers-elastic commented Jun 13, 2022

this seems like a sensible idea


ravikesarwani commented Jun 13, 2022

I think the key here is to understand how we make use of the GetMetricData API call, model it with a few scenarios, and then from there we can talk about the findings and see if there are some low-hanging things we could change.

kaiyan-sheng commented

One more thing to check here:

In AWS, period is used as a parameter to control granularity. For example, we can call the AWS API to get metrics between 10:00 and 10:05 with a granularity of 1 minute, and that will return data points with timestamps 10:00, 10:01, 10:02, 10:03, 10:04. But the period we have in Metricbeat has a different definition: it controls how often Metricbeat runs to collect metrics. To get a granularity of 1 minute, we have to set Metricbeat to run every minute, and each run only collects one data point.

If we can either introduce a granularity parameter or something else that leverages what AWS calls period, then we can make one API call and get 5 data points, instead of 5 API calls for 5 data points.
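As an illustration of this point, here is a minimal boto3 sketch (not the Metricbeat code) of a single GetMetricData call covering a 5-minute window at 1-minute granularity:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)

# One call with a 5-minute window and Period=60 returns ~5 datapoints,
# instead of 5 separate calls each covering one minute.
resp = cw.get_metric_data(
    MetricDataQueries=[{
        "Id": "m1",  # arbitrary query id, required by the API
        "MetricStat": {
            "Metric": {"Namespace": "AWS/EC2", "MetricName": "CPUUtilization"},
            "Period": 60,  # granularity: 1-minute datapoints
            "Stat": "Average",
        },
    }],
    StartTime=end - timedelta(minutes=5),  # collection window: 5 minutes
    EndTime=end,
)
for result in resp["MetricDataResults"]:
    print(result["Timestamps"], result["Values"])
```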


girodav commented Sep 1, 2022

You are spot on @kaiyan-sheng :), I noticed the same thing during my investigations. Even though GetMetricData is billed per number of metrics requested, not per number of calls, we should be able to reduce cost by reducing the frequency of the calls we make.

If we consider the following two scenarios,

  1. 1 API call made every minute, requesting datapoints for 1 metric for the last minute, with 1 minute period
  2. 1 API call made every hour, requesting datapoints for 1 metric for the last hour, with 1 minute period

my understanding is that in the first scenario Cloudwatch counts 60 metrics requested in the last hour, while only 1 in the second. The resulting data would be the same, as the granularity (i.e. the Cloudwatch-side period) would be the same. The only drawback would be a delay in publishing data to ES.

Adding a parameter to control granularity separately from the collection period should help users who can tolerate the additional delay to save money.
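A rough cost sketch of the two scenarios, assuming the $0.01 per 1,000 metrics GetMetricData price quoted later in this thread:

```python
PRICE_PER_METRIC = 0.01 / 1000  # USD per metric requested (GetMetricData)

def monthly_cost(calls_per_hour: float, metrics_per_call: int, hours: int = 730) -> float:
    """Metrics requested are counted per call, so cost scales with call frequency."""
    return calls_per_hour * metrics_per_call * hours * PRICE_PER_METRIC

print(monthly_cost(60, 1))  # scenario 1: one call per minute -> ~0.438 USD/month
print(monthly_cost(1, 1))   # scenario 2: one call per hour   -> ~0.0073 USD/month
```

Same datapoints land in ES either way; only the request frequency, and hence the bill, differs.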

ravikesarwani commented

cc: @pmeresanu85 FYI ... Just for awareness


girodav commented Sep 8, 2022

A bit of data to drive this issue to tangible action items.

Background

Skip this if you are already familiar with how Cloudwatch API calls are billed and how they are used in Metricbeat.

What is a metric in Cloudwatch terms

Metrics are uniquely defined by a name, a namespace, and zero or more dimensions. Each data point in a metric has a time stamp, and (optionally) a unit of measure. You can retrieve statistics from CloudWatch for any metric.
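For illustration, here is how a single metric identity looks when expressed in the parameter shapes the Cloudwatch APIs accept (the instance ID is a made-up example):

```python
# A Cloudwatch metric is uniquely identified by namespace + name + dimensions.
metric = {
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [
        # Zero or more dimensions; here, a hypothetical instance ID.
        {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},
    ],
}
```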

Available APIs to collect metrics from Cloudwatch

Cloudwatch provides two APIs to retrieve datapoints related to specific metrics, GetMetricData and GetMetricStatistics.

GetMetricData is charged $0.01 per 1,000 metrics requested.

GetMetricStatistics could be an alternative, but it is charged per request ($0.01 per 1,000 requests). Considering that you need to perform one API call per metric, it would probably turn out more expensive than what we have already (or cost the same). It is also not recommended by AWS.
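To make the per-request billing concrete, a minimal boto3 sketch (illustrative only, not how Metricbeat calls the API) showing that GetMetricStatistics accepts exactly one metric per request:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)

# One request fetches statistics for exactly one metric; fetching N metrics
# therefore costs N requests, each billed at $0.01 per 1,000 requests.
resp = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical ID
    StartTime=end - timedelta(minutes=5),
    EndTime=end,
    Period=60,
    Statistics=["Average"],
)
print(resp["Datapoints"])
```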

How Cloudwatch APIs are used within Metricbeat

| AWS API Name | AWS API Count | Frequency |
| --- | --- | --- |
| ListMetrics | Total number of results / ListMetrics max page size | Per region, per namespace, per collection period |
| GetMetricData | Total number of results / GetMetricData max page size | Per region, per namespace, per collection period |

Example scenario

Let's consider the following simple scenario to understand which API is the main culprit behind high Cloudwatch API-related bills.

  • 1 EC2 instance in us-east-1.
  • We plan to collect 10 metrics for this instance.
  • Metricbeat period is configured to 5 minutes.

Let's use https://calculator.aws/#/addService/CloudWatch to calculate the monthly cost of the ListMetrics and GetMetricData API calls.

43,200 minutes in a month (i.e. 30 days), 1 collection every 5 minutes = 8,640 collections per month

ListMetrics API monthly usage = 1 call per collection = 8,640 calls

GetMetricData API monthly usage = 10 metrics x 8,640 collections = 86,400 metrics requested

This results in the following monthly cost

86,400 metrics x 0.00001 USD = 0.86 USD (GetMetricData: Number of metrics requested cost)

8,640 requests x 0.00001 USD = 0.0864 USD (ListMetrics, PutMetricData and other request types cost)

0.86 USD + 0.0864 USD = 0.9464 USD (CloudWatch API metrics and requests cost)

CloudWatch API requests cost (monthly): 0.9464 USD
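The same arithmetic as a quick Python sanity check (the calculator's per-line rounding explains the small difference from its 0.9464 USD total):

```python
MINUTES_PER_MONTH = 30 * 24 * 60      # 43,200 minutes in 30 days
collections = MINUTES_PER_MONTH // 5  # 8,640 collections per month

list_metrics_calls = collections * 1  # one ListMetrics call per collection
metrics_requested = collections * 10  # 10 metrics per GetMetricData collection

get_metric_data_cost = metrics_requested * 0.01 / 1000  # 0.864 USD
list_metrics_cost = list_metrics_calls * 0.01 / 1000    # 0.0864 USD
print(round(get_metric_data_cost + list_metrics_cost, 4))  # ~0.95 USD per month
```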

From the example above, it is clear that the GetMetricData API is the main driver of Cloudwatch API costs.

Ideas

In a nutshell, we have 2 ways to reduce cost, in order of priority.

  1. Reduce the number of GetMetricData calls and/or reduce the number of metrics requested per GetMetricData call.
  2. Reduce the number of ListMetrics calls.

For 1, I propose the following improvements:

  1. As mentioned in #2913 (comment), the main issue is that the "period" value that can be configured on the AWS Metricbeat module or on an AWS integration is used both as the underlying collection period and as the period parameter for Cloudwatch API calls (i.e. the granularity of the datapoints returned). Separating the two, keeping the current Metricbeat period parameter as the "collection period" and introducing a new parameter that defines the Cloudwatch period/granularity, would be a possible improvement. This would give more flexibility to users who can tolerate the extra delay but still need high-granularity data stored in Elastic.
  2. Improve the AWS module/integrations documentation with clear pointers on how users can reduce costs. A few examples:
    1. Specifying the AWS regions to collect data from can reduce cost if the user is not interested in collecting data from all regions, as the number of API calls and metrics requested is reduced.
    2. Using the cloudwatch module/integration with a specific namespace (e.g. AWS/EC2) instead of the corresponding AWS module/integration (e.g. EC2 metrics) can lead to considerable savings, if the user is only interested in specific metrics rather than all those provided by the module/integration.

For 2, we could start by avoiding multiple ListMetrics calls per namespace, as I do not see how that is necessary. We could instead make one API call per region and filter the data for the namespaces we are interested in from the API response. If a user retrieves metrics for N AWS services spread over M regions, this would result in M ListMetrics API calls instead of N*M, per collection period; see the sketch below.
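A minimal boto3 sketch of this idea, assuming a hypothetical set of wanted namespaces and regions (note that each page of a paginated ListMetrics response is itself a billable request):

```python
import boto3

WANTED_NAMESPACES = {"AWS/EC2", "AWS/RDS", "AWS/SQS"}  # hypothetical: the N namespaces
REGIONS = ["us-east-1", "eu-west-1"]                   # hypothetical: the M regions

for region in REGIONS:
    cw = boto3.client("cloudwatch", region_name=region)
    # One (paginated) ListMetrics call per region, with no Namespace filter;
    # the namespaces we care about are selected client-side from the response.
    for page in cw.get_paginator("list_metrics").paginate():
        for metric in page["Metrics"]:
            if metric["Namespace"] in WANTED_NAMESPACES:
                pass  # feed this metric into GetMetricData batching
```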


girodav commented Sep 8, 2022

@ravikesarwani @kaiyan-sheng let me know your thoughts. I can elaborate further on the ideas if needed.

ravikesarwani commented

Thanks @girodav. Is this something we can set up a focused discussion on? Maybe include Tom, Kaiyan and me so that we can get on the same page and discuss next steps.

ravikesarwani commented

I was also wondering, with the approach of [keeping the current Metricbeat period parameter as "collection period" and introducing a new one that defines the Cloudwatch period/granularity], how do the cost savings work? So here we are saying that the user selects a high "collection period" but a low granularity, and hence we collect a smaller total number of metrics (because the total number of calls doesn't matter here, it's the number of metrics we are pulling)? Maybe an example would be helpful.


girodav commented Sep 9, 2022

> I was also wondering, with the approach of [keeping the current Metricbeat period parameter as "collection period" and introducing a new one that defines the Cloudwatch period/granularity], how do the cost savings work? So here we are saying that the user selects a high "collection period" but a low granularity, and hence we collect a smaller total number of metrics (because the total number of calls doesn't matter here, it's the number of metrics we are pulling)? Maybe an example would be helpful.

The cost savings derive from the fact that the total number of metrics requested from Cloudwatch in the same billing period would be lower, since we'd be calling GetMetricData for the same metrics less frequently.

Consider, for example, the following two scenarios:

Scenario 1: 1 GetMetricData call made every 5 minutes, with the following parameters:

```json
{
    "MetricDataQueries": [
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization"
                },
                "Period": 300,
                "Stat": "Average"
            }
        }
    ],
    "StartTime": "<NOW - 5 minutes>",
    "EndTime": "<NOW>"
}
```

Scenario 2: 1 GetMetricData call made every 10 minutes, with the following parameters:

```json
{
    "MetricDataQueries": [
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization"
                },
                "Period": 300,
                "Stat": "Average"
            }
        }
    ],
    "StartTime": "<NOW - 10 minutes>",
    "EndTime": "<NOW>"
}
```

The datapoints collected and published to Elasticsearch would be exactly the same, but in Scenario 1 we collect and publish 1 datapoint every 5 minutes, while in Scenario 2 we collect and publish 2 datapoints (5 minutes apart) every 10 minutes. With a change like this, the user would essentially spend 50% less in Scenario 2 on GetMetricData-related costs, as the total number of metrics requested from Cloudwatch per billing period would be halved. The only drawback for the user would be a delay in getting the data in.
