Investigate, document and propose ways to optimize CloudWatch API calls we make for cost #2913
Comments
This seems like a sensible idea.
I think the key here is to understand how we make use of the GetMetricData API call, model it with a few scenarios, and then from there we can talk about the findings and see if there are some low-hanging things we could change.
One more thing to check here: in AWS, period is used as a parameter to control granularity. For example, we can call the AWS API to get metrics between 10:00 and 10:05 with a granularity of 1 minute. That will return data points with timestamps 10:00, 10:01, 10:02, 10:03, 10:04. If we can either introduce a granularity parameter or something else to leverage what AWS has as period, then we can make one API call and get 5 data points instead of 5 API calls for 5 data points.
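As a rough illustration of that point (a boto3 sketch with a made-up EC2 instance and metric, not the Metricbeat code), a single GetMetricData call covering that 5-minute window with a 1-minute period might look like this:

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# One call covering 10:00-10:05 with a 1-minute period returns 5 data points
# (10:00 ... 10:04), instead of making 5 separate calls for 1 data point each.
start = datetime.datetime(2022, 6, 1, 10, 0)
end = datetime.datetime(2022, 6, 1, 10, 5)

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [
                        {"Name": "InstanceId", "Value": "i-0123456789abcdef0"}
                    ],
                },
                "Period": 60,      # granularity: one data point per minute
                "Stat": "Average",
            },
        }
    ],
    StartTime=start,
    EndTime=end,
)

for ts, value in zip(response["MetricDataResults"][0]["Timestamps"],
                     response["MetricDataResults"][0]["Values"]):
    print(ts, value)
```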
You are spot on @kaiyan-sheng :), I noticed the same thing during my investigations. If we consider the following two scenarios:

Scenario 1: 1 GetMetricData call every minute, requesting the last minute of data with a 1-minute period
Scenario 2: 1 GetMetricData call every hour, requesting the last hour of data with a 1-minute period

my understanding is that in the first scenario CloudWatch counts 60 metrics requested in the last hour, while only 1 in the second. The resulting data would be the same, as the granularity (i.e. the CloudWatch-related period) would be the same. The only drawback would be a delay in publishing data to ES. Adding a parameter to control granularity vs collection period should help users that can tolerate the additional delay to save money.
cc: @pmeresanu85 FYI ... Just for awareness
A bit of data to drive this issue to tangible action items.

Background

Skip this section if you are already familiar with how CloudWatch API calls are billed and how they are used in Metricbeat.

What is a metric in CloudWatch terms

Metrics are uniquely defined by a name, a namespace, and zero or more dimensions. Each data point in a metric has a timestamp and, optionally, a unit of measure. You can retrieve statistics from CloudWatch for any metric.

Available APIs to collect metrics from CloudWatch

CloudWatch provides two APIs to retrieve datapoints related to specific metrics: GetMetricData and GetMetricStatistics. GetMetricData is charged $0.01 per 1,000 metrics requested. GetMetricStatistics could be an alternative, but it is charged per request ($0.01 per 1,000 requests); considering that you need to perform one API call per metric, it would probably turn out more expensive than what we have already (or cost the same). It is also not recommended by AWS.
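To make the difference between the two APIs concrete, here is a minimal, illustrative boto3 sketch (the EC2 metrics and instance ID are made-up placeholders, and this is not how Metricbeat is implemented): GetMetricData can batch many metrics into one request and is billed per metric requested, while GetMetricStatistics needs one request per metric and is billed per request.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(minutes=10)

metric_names = ["CPUUtilization", "NetworkIn", "NetworkOut"]
instance = [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}]

# GetMetricData: up to 500 metrics can be batched into a single request;
# billed per metric requested.
batched = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": f"m{i}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": name,
                    "Dimensions": instance,
                },
                "Period": 300,
                "Stat": "Average",
            },
        }
        for i, name in enumerate(metric_names)
    ],
    StartTime=start,
    EndTime=end,
)

# GetMetricStatistics: one request per metric; billed per request.
per_metric = [
    cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=name,
        Dimensions=instance,
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    for name in metric_names
]
```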
How CloudWatch APIs are used within Metricbeat

Example scenario

Let's consider the following simple scenario to understand which API is the main culprit behind high CloudWatch API-related bills:

- 1 collection every 5 minutes
- 1 ListMetrics API call per collection
- 10 metrics requested via GetMetricData per collection
Let's use https://calculator.aws/#/addService/CloudWatch to calculate the cost of the ListMetrics and GetMetricData API calls on a monthly basis.

43,200 minutes in a month (i.e. 30 days), 1 collection every 5 minutes = 8,640 collections per month
ListMetrics API monthly usage = 1 call per collection = 8,640 calls
GetMetricData API monthly usage = 10 metrics * 8,640 collections = 86,400 metrics requested

This results in the following monthly cost:

86,400 metrics x 0.00001 USD = 0.86 USD (GetMetricData: number of metrics requested cost)
8,640 requests x 0.00001 USD = 0.0864 USD (ListMetrics, PutMetricData and other request types cost)
0.86 USD + 0.0864 USD = 0.9464 USD (CloudWatch API metrics and requests cost)
CloudWatch API requests cost (monthly): 0.9464 USD

From the example above, it looks clear that the GetMetricData API is the main driver of CloudWatch API costs.

Ideas

In a nutshell, we have 2 ways to reduce cost, in order of priority:

1. Reduce the cost of the GetMetricData calls, i.e. the number of metrics requested per billing period (the main cost driver).
2. Reduce the number of ListMetrics calls.
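For reference, the same arithmetic as a tiny script (the unit prices are the ones quoted above):

```python
# Reproduces the monthly cost estimate above, using the unit prices quoted
# in this thread ($0.01 per 1,000 metrics requested / per 1,000 requests).
MINUTES_PER_MONTH = 43200            # 30 days
COLLECTION_PERIOD_MINUTES = 5
METRICS_PER_COLLECTION = 10
PRICE_PER_UNIT_USD = 0.00001         # per metric requested, and per request

collections = MINUTES_PER_MONTH // COLLECTION_PERIOD_MINUTES   # 8640
list_metrics_requests = collections                             # 8640
metrics_requested = METRICS_PER_COLLECTION * collections        # 86400

get_metric_data_cost = metrics_requested * PRICE_PER_UNIT_USD   # ~0.86 USD
list_metrics_cost = list_metrics_requests * PRICE_PER_UNIT_USD  # ~0.09 USD

print(f"GetMetricData: {get_metric_data_cost:.4f} USD")
print(f"ListMetrics:   {list_metrics_cost:.4f} USD")
print(f"Total:         {get_metric_data_cost + list_metrics_cost:.4f} USD")
```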
For 1, I propose the following improvements
For 2, we could start by avoiding multiple calls per namespace, as I do not see why that is necessary. We could consider making one API call per region and filtering the data related to the namespaces we are interested in from the API response, as sketched below. If a user retrieves metrics for N AWS services spread over M regions, this would result in M ListMetrics API calls instead of N*M per collection period.
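A rough sketch of that idea, assuming boto3 and a hypothetical, made-up set of configured namespaces (client-side filtering of a single per-region listing):

```python
import boto3

# Hypothetical set of namespaces a user has configured (placeholder values).
WANTED_NAMESPACES = {"AWS/EC2", "AWS/RDS", "AWS/SQS"}

def list_metrics_for_region(region):
    """Run a single (paginated) ListMetrics listing per region and filter the
    namespaces client-side, instead of one ListMetrics call per namespace."""
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    paginator = cloudwatch.get_paginator("list_metrics")

    metrics = []
    for page in paginator.paginate():  # no Namespace filter: one listing for all
        for metric in page["Metrics"]:
            if metric["Namespace"] in WANTED_NAMESPACES:
                metrics.append(metric)
    return metrics

# M regions -> M listings per collection period, instead of N namespaces * M regions.
for region in ["us-east-1", "eu-west-1"]:
    print(region, len(list_metrics_for_region(region)))
```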
@ravikesarwani @kaiyan-sheng let me know your thoughts. I can elaborate further on the ideas if needed.
Thanks @girodav. Is this something we can set up a focused discussion on? Maybe include Tom, Kaiyan and me so that we can get on the same page and discuss next steps.
I was also wondering, with the approach of [keeping the current Metricbeat period parameter as "collection period" and introducing a new one that defines the CloudWatch period/granularity], how do the cost savings work? So here we are saying that the user is selecting a high "collection period" but a low granularity, and hence we are collecting a lower total # of metrics (because the total number of calls doesn't matter here, it's the # of metrics we are pulling)? Maybe an example would be helpful.
The cost savings derive from the fact that the total number of metrics analyzed by CloudWatch would be lower in the same billing period, since we'd be calling GetMetricData for the same metrics less frequently. Consider, for example, the following two scenarios:

Scenario 1: 1 GetMetricData call made every 5 minutes, with the following parameters
- StartTime = now - 5 minutes, EndTime = now
- Period (granularity) = 5 minutes

Scenario 2: 1 GetMetricData call made every 10 minutes, with the following parameters
- StartTime = now - 10 minutes, EndTime = now
- Period (granularity) = 5 minutes

The datapoints collected and published in Elasticsearch would be exactly the same, but in Scenario 1 we collect and publish 1 datapoint every 5 minutes, while in Scenario 2 we collect and publish 2 datapoints (5 minutes apart) every 10 minutes. With a change like this, the user would essentially spend 50% less in Scenario 2 in terms of GetMetricData API call related costs, as the total number of metrics analyzed by CloudWatch per billing period would be half. The only drawback for the user would be a delay in getting the data in.
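A sketch of the two scenarios, again with boto3 and a hypothetical EC2 metric (illustrative only, not the actual Metricbeat implementation):

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

QUERY = [{
    "Id": "cpu",
    "MetricStat": {
        "Metric": {"Namespace": "AWS/EC2", "MetricName": "CPUUtilization"},
        "Period": 300,   # CloudWatch granularity: 5 minutes in both scenarios
        "Stat": "Average",
    },
}]

def collect(window_minutes):
    """One GetMetricData call covering the last `window_minutes` minutes."""
    now = datetime.datetime.utcnow()
    return cloudwatch.get_metric_data(
        MetricDataQueries=QUERY,
        StartTime=now - datetime.timedelta(minutes=window_minutes),
        EndTime=now,
    )

# Scenario 1: collect(5) called every 5 minutes   -> 1 datapoint per call, 12 calls/hour.
# Scenario 2: collect(10) called every 10 minutes -> 2 datapoints per call, 6 calls/hour.
# The same datapoints end up in Elasticsearch, but Scenario 2 requests the metric
# from CloudWatch half as many times per billing period.
```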
As part of monitoring various AWS services we use CloudWatch APIs (GetMetricData) to collect metrics data.
When users are using Elastic Agent to monitor various AWS services (metrics), they need to pay for the CloudWatch API calls.
This issue is to investigate the current code, do some testing, document the design, and then propose various ways to optimize the CloudWatch API calls we make, so that our solution is as cost-efficient as possible for end users in terms of the total cost they need to pay.
Two outcomes we are trying to drive from this issue:

1. Documentation of how we currently use the CloudWatch APIs.
2. Specific proposals on how to optimize those calls to reduce cost for end users.

Implementation of specific proposals will be handled as separate new issues.