Skip to content

Commit

Permalink
Remove hardcoded 300s and add error messages when data collection is …
Browse files Browse the repository at this point in the history
…missing (#10142)

* Remove hardcoded 300s and add error messages

* Add logger instance in metricset instead recreating new ones

* Add check on period to be multiple of 60s

* Change to info msg for period setting

* Update metricbeat/docs/modules/aws.asciidoc
  • Loading branch information
kaiyan-sheng authored Jan 24, 2019
1 parent bb351d7 commit d7d3de2
Show file tree
Hide file tree
Showing 4 changed files with 181 additions and 137 deletions.
21 changes: 13 additions & 8 deletions metricbeat/docs/modules/aws.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,9 @@ This file is generated! See scripts/docs_collector.py

beta[]

This module periodically fetches monitoring metrics from AWS Cloudwatch using GetMetricData API. Note: extra AWS charges on
GetMetricData API requests will be generated by this module.
This module periodically fetches monitoring metrics from AWS Cloudwatch using
https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricData.html[GetMetricData API] for running
EC2 instances. Note: extra AWS charges on GetMetricData API requests will be generated by this module.

The default metricset is `ec2`.

Expand All @@ -30,7 +31,7 @@ see https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html[Te
`sts get-session-token` AWS CLI can be used to generate temporary credentials. For example. with MFA-enabled:

----
aws> sts get-session-token --serial-number arn:aws:iam::1234:mfa/test@elastic.co --token-code 456789 --duration-seconds 129600
aws> sts get-session-token --serial-number arn:aws:iam::1234:mfa/your-email@example.com --token-code 456789 --duration-seconds 129600
----

Specific permissions needs to be added into the IAM user's policy to allow Metricbeat collecting AWS monitoring metrics. Please
Expand All @@ -43,11 +44,12 @@ $0.01/1000 metrics requested using GetMetricData. Please see https://aws.amazon.
for more details. To avoid unnecessary charges, `period` is preferred to be set to `300s` or multiples of `300s`, such as
`600s` and `900s`.

For more granular monitoring data you can enable detailed monitoring on the instance to get metrics every 1 minute. With this
detailed monitoring enabled, `period` in aws module configuration can be any number larger than `60s`. Since AWS sends metric
data to CloudWatch in 1-minute periods, setting metricbeat module `period` less than `60s` will cause extra API requests which
means extra charges on AWS. To avoid unnecessary charges, `period` is preferred to be set to `60s` or multiples of `60s`,
such as `120s` and `180s`.
For more granular monitoring data you can enable detailed monitoring on the instance to get metrics every 1 minute. Please see
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html[Enabling Detailed Monitoring] for instructions
on how to enable detailed monitoring. With detailed monitoring enabled, `period` in aws module configuration can be any number
larger than `60s`. Since AWS sends metric data to CloudWatch in 1-minute periods, setting metricbeat module `period` less
than `60s` will cause extra API requests which means extra charges on AWS. To avoid unnecessary charges, `period` is
preferred to be set to `60s` or multiples of `60s`, such as `120s` and `180s`.

Here is an example of aws metricbeat module configuration:

Expand All @@ -63,6 +65,9 @@ metricbeat.modules:
default_region: '${AWS_REGION:us-west-1}'
----

This module only collects metrics for EC2 instances that are in `running` state and exist more than 10 minutes to make sure
there are monitoring metrics exist in Cloudwatch already.

The AWS module comes with a predefined dashboard. For example:

image::./images/metricbeat-aws-ec2-overview.png[]
Expand Down
21 changes: 13 additions & 8 deletions x-pack/metricbeat/module/aws/_meta/docs.asciidoc
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
This module periodically fetches monitoring metrics from AWS Cloudwatch using GetMetricData API. Note: extra AWS charges on
GetMetricData API requests will be generated by this module.
This module periodically fetches monitoring metrics from AWS Cloudwatch using
https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricData.html[GetMetricData API] for running
EC2 instances. Note: extra AWS charges on GetMetricData API requests will be generated by this module.

The default metricset is `ec2`.

Expand All @@ -21,7 +22,7 @@ see https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html[Te
`sts get-session-token` AWS CLI can be used to generate temporary credentials. For example. with MFA-enabled:

----
aws> sts get-session-token --serial-number arn:aws:iam::1234:mfa/test@elastic.co --token-code 456789 --duration-seconds 129600
aws> sts get-session-token --serial-number arn:aws:iam::1234:mfa/your-email@example.com --token-code 456789 --duration-seconds 129600
----

Specific permissions needs to be added into the IAM user's policy to allow Metricbeat collecting AWS monitoring metrics. Please
Expand All @@ -34,11 +35,12 @@ $0.01/1000 metrics requested using GetMetricData. Please see https://aws.amazon.
for more details. To avoid unnecessary charges, `period` is preferred to be set to `300s` or multiples of `300s`, such as
`600s` and `900s`.

For more granular monitoring data you can enable detailed monitoring on the instance to get metrics every 1 minute. With this
detailed monitoring enabled, `period` in aws module configuration can be any number larger than `60s`. Since AWS sends metric
data to CloudWatch in 1-minute periods, setting metricbeat module `period` less than `60s` will cause extra API requests which
means extra charges on AWS. To avoid unnecessary charges, `period` is preferred to be set to `60s` or multiples of `60s`,
such as `120s` and `180s`.
For more granular monitoring data you can enable detailed monitoring on the instance to get metrics every 1 minute. Please see
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html[Enabling Detailed Monitoring] for instructions
on how to enable detailed monitoring. With detailed monitoring enabled, `period` in aws module configuration can be any number
larger than `60s`. Since AWS sends metric data to CloudWatch in 1-minute periods, setting metricbeat module `period` less
than `60s` will cause extra API requests which means extra charges on AWS. To avoid unnecessary charges, `period` is
preferred to be set to `60s` or multiples of `60s`, such as `120s` and `180s`.

Here is an example of aws metricbeat module configuration:

Expand All @@ -54,6 +56,9 @@ metricbeat.modules:
default_region: '${AWS_REGION:us-west-1}'
----

This module only collects metrics for EC2 instances that are in `running` state and exist more than 10 minutes to make sure
there are monitoring metrics exist in Cloudwatch already.

The AWS module comes with a predefined dashboard. For example:

image::./images/metricbeat-aws-ec2-overview.png[]
148 changes: 79 additions & 69 deletions x-pack/metricbeat/module/aws/ec2/ec2.go
Original file line number Diff line number Diff line change
Expand Up @@ -42,9 +42,12 @@ func init() {
// interface methods except for Fetch.
type MetricSet struct {
*aws.MetricSet
moduleConfig *aws.Config
awsConfig *awssdk.Config
regionsList []string
moduleConfig *aws.Config
awsConfig *awssdk.Config
regionsList []string
durationString string
periodInSec int
logger *logp.Logger
}

// metricIDNameMap is a translating map between createMetricDataQuery id
Expand Down Expand Up @@ -72,6 +75,7 @@ var metricIDNameMap = map[string][]string{
// any MetricSet specific configuration options if there are any.
func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
cfgwarn.Beta("The aws ec2 metricset is beta.")
ec2Logger := logp.NewLogger(aws.ModuleName)

moduleConfig := aws.Config{}
if err := base.Module().UnpackConfig(&moduleConfig); err != nil {
Expand All @@ -80,7 +84,7 @@ func New(base mb.BaseMetricSet) (mb.MetricSet, error) {

if moduleConfig.Period == "" {
err := errors.New("period is not set in AWS module config")
logp.Error(err)
ec2Logger.Error(err)
}

metricSet, err := aws.NewMetricSet(base)
Expand All @@ -103,19 +107,38 @@ func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
}

awsConfig.Region = moduleConfig.DefaultRegion

svcEC2 := ec2.New(awsConfig)
regionsList, err := getRegions(svcEC2)
if err != nil {
err = errors.Wrap(err, "getRegions failed")
logp.Error(err)
ec2Logger.Error(err.Error())
}

// Calculate duration based on period
durationString, periodSec, err := convertPeriodToDuration(moduleConfig.Period)
if err != nil {
ec2Logger.Error(err.Error())
return nil, err
}

// Check if period is set to be multiple of 60s or 300s
remainder300 := periodSec % 300
remainder60 := periodSec % 60
if remainder300 != 0 || remainder60 != 0 {
err := errors.New("period needs to be set to 60s (or a multiple of 60s) if detailed monitoring is " +
"enabled for EC2 instances or set to 300s (or a multiple of 300s) if EC2 instances has basic monitoring. " +
"To avoid data missing or extra costs, please make sure period is set correctly in config.yml")
ec2Logger.Info(err)
}

return &MetricSet{
MetricSet: metricSet,
moduleConfig: &moduleConfig,
awsConfig: &awsConfig,
regionsList: regionsList,
MetricSet: metricSet,
moduleConfig: &moduleConfig,
awsConfig: &awsConfig,
regionsList: regionsList,
durationString: durationString,
periodInSec: periodSec,
logger: ec2Logger,
}, nil
}

Expand All @@ -129,32 +152,37 @@ func (m *MetricSet) Fetch(report mb.ReporterV2) {
instanceIDs, instancesOutputs, err := getInstancesPerRegion(svcEC2)
if err != nil {
err = errors.Wrap(err, "getInstancesPerRegion failed, skipping region "+regionName)
logp.Error(err)
m.logger.Errorf(err.Error())
report.Error(err)
continue
}

svcCloudwatch := cloudwatch.New(*m.awsConfig)
for _, instanceID := range instanceIDs {
//Calculate duration based on period
detailedMonitoring := instancesOutputs[instanceID].Monitoring.State
durationString, periodSec := convertPeriodToDuration(m.moduleConfig.Period, detailedMonitoring)
init := true
getMetricDataOutput := &cloudwatch.GetMetricDataOutput{NextToken: nil}
for init || getMetricDataOutput.NextToken != nil {
init = false
output, err := getMetricDataPerRegion(durationString, periodSec, instanceID, getMetricDataOutput.NextToken, svcCloudwatch)
output, err := getMetricDataPerRegion(m.durationString, m.periodInSec, instanceID, getMetricDataOutput.NextToken, svcCloudwatch)
if err != nil {
err = errors.Wrap(err, "getMetricDataPerRegion failed, skipping region "+regionName+" for instance "+instanceID)
logp.Error(err)
m.logger.Error(err.Error())
report.Error(err)
continue
}
getMetricDataOutput.MetricDataResults = append(getMetricDataOutput.MetricDataResults, output.MetricDataResults...)
}
event, err := createCloudWatchEvents(getMetricDataOutput, instanceID, instancesOutputs[instanceID], regionName)

event, info, err := createCloudWatchEvents(getMetricDataOutput, instanceID, instancesOutputs[instanceID], regionName)
if info != "" {
m.logger.Info(info)
}

if err != nil {
report.Error(err)
m.logger.Error(err.Error())
event.Error = err
report.Event(event)
continue
}
report.Event(event)
}
Expand All @@ -166,7 +194,7 @@ func getRegions(svc ec2iface.EC2API) (regionsList []string, err error) {
req := svc.DescribeRegionsRequest(input)
output, err := req.Send()
if err != nil {
logp.Error(errors.Wrap(err, "Failed DescribeRegions"))
err = errors.Wrap(err, "Failed DescribeRegions")
return
}
for _, region := range output.Regions {
Expand All @@ -175,29 +203,31 @@ func getRegions(svc ec2iface.EC2API) (regionsList []string, err error) {
return
}

func createCloudWatchEvents(getMetricDataOutput *cloudwatch.GetMetricDataOutput, instanceID string, instanceOutput ec2.Instance, regionName string) (event mb.Event, err error) {
machineType, err := instanceOutput.InstanceType.MarshalValue()
if err != nil {
err = errors.Wrap(err, "instance.InstanceType.MarshalValue failed")
logp.Error(err)
}

func createCloudWatchEvents(getMetricDataOutput *cloudwatch.GetMetricDataOutput, instanceID string, instanceOutput ec2.Instance, regionName string) (event mb.Event, info string, err error) {
event.Service = metricsetName
event.RootFields = common.MapStr{}
mapOfRootFieldsResults := make(map[string]interface{})
mapOfRootFieldsResults["service.name"] = metricsetName
mapOfRootFieldsResults["cloud.provider"] = metricsetName
mapOfRootFieldsResults["cloud.instance.id"] = instanceID

machineType, err := instanceOutput.InstanceType.MarshalValue()
if err != nil {
err = errors.Wrap(err, "instance.InstanceType.MarshalValue failed")
return
}

mapOfRootFieldsResults["cloud.machine.type"] = machineType
mapOfRootFieldsResults["cloud.availability_zone"] = *instanceOutput.Placement.AvailabilityZone
mapOfRootFieldsResults["cloud.image.id"] = *instanceOutput.ImageId
mapOfRootFieldsResults["cloud.region"] = regionName

resultRootFields, err := eventMapping(mapOfRootFieldsResults, schemaRootFields)
if err != nil {
err = errors.Wrap(err, "Error trying to apply schema in AWS EC2 metricbeat module.")
logp.Error(err)
err = errors.Wrap(err, "Error trying to apply schema schemaRootFields in AWS EC2 metricbeat module.")
return
}
event.RootFields = resultRootFields

mapOfMetricSetFieldResults := make(map[string]interface{})
for _, output := range getMetricDataOutput.MetricDataResults {
Expand All @@ -208,12 +238,17 @@ func createCloudWatchEvents(getMetricDataOutput *cloudwatch.GetMetricDataOutput,
mapOfMetricSetFieldResults[metricKey[0]] = fmt.Sprint(output.Values[0])
}

if len(mapOfMetricSetFieldResults) <= 3 {
info = "Missing Cloudwatch data for instance " + instanceID + ". This is expected for a new instance during the " +
"first data collection. If this shows up multiple times, please recheck the period setting in config."
return
}

resultMetricSetFields, err := eventMapping(mapOfMetricSetFieldResults, schemaMetricSetFields)
if err != nil {
err = errors.Wrap(err, "Error trying to apply schema in AWS EC2 metricbeat module.")
logp.Error(err)
err = errors.Wrap(err, "Error trying to apply schema schemaMetricSetFields in AWS EC2 metricbeat module.")
return
}
event.RootFields = resultRootFields
event.MetricSetFields = resultMetricSetFields
return
}
Expand All @@ -236,69 +271,44 @@ func getInstancesPerRegion(svc ec2iface.EC2API) (instanceIDs []string, instances
req := svc.DescribeInstancesRequest(describeInstanceInput)
output, err := req.Send()
if err != nil {
logp.Error(errors.Wrap(err, "Error DescribeInstances"))
err = errors.Wrap(err, "Error DescribeInstances")
return nil, nil, err
}

for _, reservation := range output.Reservations {
for _, instance := range reservation.Instances {
instanceID := *instance.InstanceId
instanceIDs = append(instanceIDs, instanceID)
instancesOutputs[instanceID] = instance
instanceIDs = append(instanceIDs, *instance.InstanceId)
instancesOutputs[*instance.InstanceId] = instance
}
}
}
return
}

func convertPeriodToDuration(period string, detailedMonitoring ec2.MonitoringState) (duration string, periodInSeconds int) {
func convertPeriodToDuration(period string) (string, int, error) {
// Amazon EC2 sends metrics to Amazon CloudWatch with 5-minute default frequency.
// If detailed monitoring is enabled, then data will be available in 1-minute period.
// Set starttime double the default frequency earlier than the endtime in order to make sure
// GetMetricDataRequest gets the latest data point for each metric.
numberPeriod, err := strconv.Atoi(period[0 : len(period)-1])
if err != nil {
logp.Error(errors.Wrap(err, "Error converting string to int. Use default duration instead."))
// If failed converting string to int, then set default duration to "-600s" with basic monitoring and "-120s" with
// detailed monitoring.
numberPeriod = 300
if detailedMonitoring == ec2.MonitoringStateEnabled {
numberPeriod = 60
}
duration = "-" + strconv.Itoa(numberPeriod*2) + "s"
periodInSeconds = numberPeriod
return duration, periodInSeconds
return "", 0, err
}

unitPeriod := period[len(period)-1:]
// if detailed monitoring is enabled, then period can be larger or equal than 1min.
// if detailed monitoring is disabled, then period can be larger or equal than 5min.
switch unitPeriod {
case "s":
if detailedMonitoring == ec2.MonitoringStateDisabled && numberPeriod < 300 {
numberPeriod = 300
} else if detailedMonitoring == ec2.MonitoringStateEnabled && numberPeriod < 60 {
numberPeriod = 60
}
duration = "-" + strconv.Itoa(numberPeriod*2) + unitPeriod
periodInSeconds = numberPeriod
duration := "-" + strconv.Itoa(numberPeriod*2) + unitPeriod
return duration, numberPeriod, nil
case "m":
if detailedMonitoring == ec2.MonitoringStateDisabled && numberPeriod < 5 {
numberPeriod = 5
} else if detailedMonitoring == ec2.MonitoringStateEnabled && numberPeriod < 1 {
numberPeriod = 1
}
duration = "-" + strconv.Itoa(numberPeriod*2) + unitPeriod
periodInSeconds = numberPeriod * 60
duration := "-" + strconv.Itoa(numberPeriod*2) + unitPeriod
periodInSec := numberPeriod * 60
return duration, periodInSec, nil
default:
numberPeriod = 300
if detailedMonitoring == ec2.MonitoringStateEnabled {
numberPeriod = 60
}
duration = "-" + strconv.Itoa(numberPeriod*2) + "s"
periodInSeconds = numberPeriod
err = errors.New("invalid period in config. Please reset period in config")
duration := "-" + strconv.Itoa(numberPeriod*2) + "s"
return duration, numberPeriod, err
}
return
}

func getMetricDataPerRegion(durationString string, periodInSec int, instanceID string, nextToken *string, svc cloudwatchiface.CloudWatchAPI) (*cloudwatch.GetMetricDataOutput, error) {
Expand Down
Loading

0 comments on commit d7d3de2

Please sign in to comment.