Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] feat(aws-cloudwatch) design and implement metric math #1396

Closed
wants to merge 10 commits into from
202 changes: 202 additions & 0 deletions design/metric-math.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,202 @@
# Overview
Our L2 `Alarm` construct currently only supports specifying a single metric to monitor. This design discusses options for extending this construct to support metric math, AWS's domain-specific language (DSL) which defines data types, operators and functions for specifying mathematical aggregations of CloudWatch metric data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this file should go under the cloudwatch module directory and not at the global cdk level...


See the [official documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html) for a complete explanation of metric-math.

## Data Types
There are three data types in the language:
* Scalar (`S`) - represents a scalar number such as `2`, `-5`, `50.25`
* Time-Series (`TS`) - represents a single time-series metric such as CPU Utilization (single line on a graph). Our `Metric` L2 construct is an example of a `TS` value.
* Time-Series array (`TS[]`) - a collection of time-series metrics (multiple lines on a graph).
```
// '::' denotes 'is of type'
100 :: S
m1 :: TS
[m1, m2] :: TS[]
```

## Arithmetic Operations and Functions
Arithmetic expressions may then be applied to these data types to compute aggregations. There are: basic arithmetic operations such as `+`, `-`, `*`, `/` and `^`, and functions such as `SUM`, `AVG` and `STDDEV`.
```
m1 * 100 :: TS
AVG(m1) :: S
AVG([m1, m2]) :: TS
AVG([m1, m2]) + 1 :: TS
SUM([m1, m2]) :: TS
METRICS() :: TS[]
```

Both operators and functions are polymorphic with respect to input and output types - the result type depends on the input type, like an overloaded function. This is important, because if we aim to reproduce a representation of this mathematical system in code, then we should also capture and enforce those rules as best we can.

For example, the `AVG` function returns a `S` if passed a single `TS`, while it returns a `TS` if passed a `TS[]`:

```
AVG(m1) :: S // average of all data points
AVG([m1, m2]) :: TS // average of the series' points at each interval
```

A consequence of this is you can not alarm on `AVG(TS)` because its type is `S`.

## Intermediate Computation
Metric definitions and math expressions co-exist in a list of `MetricDataQuery` objects. Only one time-series (single line) result may be compared against the alarm condition. That is to say, the result type of a query must be `TS` to be valid as an Alarm.

This is achieved by setting the `ReturnData` flag. When `true`, the result of that expression - whether it be a `TS` or `TS[]` - materializes as lines on a graph; setting it to `false` means it doesn't. This mechanic then enables two applications:
* Reduction of a complex expression into smaller, simpler (and potentially re-usable) parts;
* Identify the `TS` result you want to monitor in the alarm.

Let's look at an example CloudFormation YAML definition of an alarm using metric-math:

```yaml
MyAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
Metrics:
- Id: errors
MetricStat:
Metric:
MetricName: Errors
Namespace: Test
Period: 300
Stat: Sum
ReturnData: false
- Id: requests
MetricStat:
Metric:
MetricName: Requests
Namespace: Test
Period: 300
Stat: Sum
ReturnData: false
- Id: error_rate
Expression: errors / requests * 100
ReturnData: true
```

We have requested two specific metrics, `errors` and `requests`. This brings those metrics into the scope of the query with their respective IDs. Think of it like assigning local variables for use later on. Also note how the value of `ReturnData` is `false` for these metrics - as previously mentioned, only one entry in an alarm definition is permitted to return data. By specifying `false`, we are indicating that this value is an 'intermediate' value used only as part of the larger computation. The expression, `error_rate`, then computes the error rate as a percentage using a simple expression, `errors / requests * 100`. `ReturnData` is set to true because we want to alarm on this `TS` value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the overview section. It gave me enough context to be able to reason about the design without needing to go look up all the material!

# Design Options

We will discuss three options:
1. The simplest (or 'lowest') approach which simply enhances the raw CloudFormation API with our `Metric` L2.
2. A specialization of Option 1 designed to reduce boiler-plate and better express the computation of a single `TS` value
3. A type-safe DSL which makes use of functions and classes to model the metric-math system in code

## Option 1 - Simple

The simplest approach is to stay true to the CloudFormation format, but leverage the `Metric` L2 for importing specific metrics into the query scope:

```typescript
const errors = new cloudwatch.Metric({
metricName: 'Errors',
namespace: 'Test'
});
const requests = new cloudwatch.Metric({
metricName: 'Requests',
namespace: 'Test'
});

const alarm = new cloudwatch.Alarm(this, 'woof', {
// etc.
metrics: [{
id: 'errors',
metric: errors,
returnData: false
}, {
id: 'requests',
metric: requests,
returnData: false
}, {
id: 'result',
expression: 'errors / requests * 100',
returnData: true
}]
});
```

### Pros
* Reflects the official CloudFormation API
* Compatible with the existing `Metric` L2 construct
* There is no hidden 'magic' (see Option 3)
* Understanding how and what you are doing is intuitive (if you already understand the API)
* Risk of 'us' introducing bugs is minimized

### Cons
* More verbose - the developer has to write a lot of boiler-plate such as allocating ids, returnData, etc.
* The developer must learn how to express valid metric expressions according to the somewhat complicated API and documentation. The `returnData` nuances were particulary difficult to tease out from the docs, for example.
* Not type-safe - there is no static checking of the expression to ensure it is valid. Failures are caught at deployment time which is very slow, unless we build our own parser ... ?

## Option 2 - Slight Enhancement

Straying from the original CFN format a little, we can specialize the API in a way that better reflects the usual case:
1. Fetch and assign metrics to IDs
2. Use an expression to compute and return a `TS` result

```typescript
const errors = new cloudwatch.Metric({
metricName: 'Errors',
namespace: 'Test'
});
const requests = new cloudwatch.Metric({
metricName: 'Requests',
namespace: 'Test'
});

const alarm = new cloudwatch.Alarm(this, 'woof', {
// etc.
metrics: {
ids: {
errors,
requests
},
expression: 'errors / requests * 100'
}
});
```

### Pros
* Compact
* It takes less code to express the math
* There is no need to manage `returnData` which is not as useful for alarms as it is for the `GetMetricData` API.

### Cons
* Different to the original CFN and SDK API.
* It's opinionated - are there metrics that can not be expressed in this way?
* Also not statically checked - errors in the expression will be found at deployment time.

## Option 3 - Type-safe DSL

Model the data types, operators and functions as classes and methods in code.

```typescript
const errors = new cloudwatch.Metric({
metricName: 'Errors',
namespace: 'Test'
});
const requests = new cloudwatch.Metric({
metricName: 'Requests',
namespace: 'Test'
});

const alarm = new cloudwatch.Alarm(this, 'woof', {
// etc.
metrics: errors.divide(requests).multiply(100)
});
```

### Pros
* Type-safe - the compiler and implementation checks the validity of expressions. For example, it ensures the final result is a `TS` and the arguments to functions or operators are used correctly.
* IDE discoverability, auto-complete, in-line documentation.
* Programmatic variation - like how we build constructs, the expression tree can be assembled incrementally with standard programming techniques (`if-else`, `functions`, `classes`, etc.)

### Cons
* Highly opinionated and different to the original API
* We must maintain backwards compatibility with any future features released by the CloudWatch team
* Risk of us introducing bugs is higher - especially for large complicated expressions
* Expressing paranthesis to control mathematical precedence rules may be ugly or unintuitive
* Could be problematic for JSII, although it does pass its checks :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be also interesting to explore something between option 2 and option 3:

const errors = new cloudwatch.Metric(...);
const requests = new cloudwatch.Metric(...);

const alarm = new cloudwatch.Alarm(this, 'woof', {
  expression: `${errors} / ${requests} * 100`
});

Pros:

  • Utilizes the CloudWatch syntax, which makes it easier for people to look up help, docs, examples, etc.
  • Concise

Cons:

  • Not type-safe
  • Magic all around

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea. Are you thinking that toString() of Metric would encode its information as a string so that Alarm can parse the expression and construct the JSON payload? Similar to how Tokens work now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. We can probably even use the existing token stringification mechanism to achieve this. It's designed to be quite general purpose (albeit a bit complex).

### Implementation

This design comes with a prototype which will be briefly explained in this section. See the code and tests for more depth.

TODO: Explain the implementation (look at the code and tests in the meantime - it is functional).
4 changes: 4 additions & 0 deletions packages/@aws-cdk/aws-cloudwatch/lib/Untitled-1
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
brazil-runtime-exec get_mcm_template.py \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops

--odin-creds com.amazon.credentials.isengard.785049305830.user/mcm_signing \
-f TM-21349 \
-d templates/cdk/release-cdk
67 changes: 34 additions & 33 deletions packages/@aws-cdk/aws-cloudwatch/lib/alarm.ts
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
import { Construct, Token } from '@aws-cdk/cdk';
import { CfnAlarm } from './cloudwatch.generated';
import { HorizontalAnnotation } from './graph';
import { Dimension, Metric, Statistic, Unit } from './metric';
import { parseStatistic } from './util.statistic';
import { compileExpression, ITimeSeries } from './math';
import { Dimension, Statistic, Unit } from './metric';

/**
* Properties for Alarms
Expand All @@ -14,7 +14,7 @@ export interface AlarmProps {
* Metric objects can be obtained from most resources, or you can construct
* custom Metric objects by instantiating one.
*/
metric: Metric;
metric: ITimeSeries;

/**
* Name of the alarm
Expand Down Expand Up @@ -80,12 +80,12 @@ export enum ComparisonOperator {
LessThanOrEqualToThreshold = 'LessThanOrEqualToThreshold',
}

const OPERATOR_SYMBOLS: {[key: string]: string} = {
GreaterThanOrEqualToThreshold: '>=',
GreaterThanThreshold: '>',
LessThanThreshold: '<',
LessThanOrEqualToThreshold: '>=',
};
// const OPERATOR_SYMBOLS: {[key: string]: string} = {
// GreaterThanOrEqualToThreshold: '>=',
// GreaterThanThreshold: '>',
// LessThanThreshold: '<',
// LessThanOrEqualToThreshold: '>=',
// };

/**
* Specify how missing data points are treated during alarm evaluation
Expand Down Expand Up @@ -129,7 +129,7 @@ export class Alarm extends Construct {
/**
* The metric object this alarm was based on
*/
public readonly metric: Metric;
public readonly metric: ITimeSeries;

private alarmActionArns?: string[];
private insufficientDataActionArns?: string[];
Expand Down Expand Up @@ -164,15 +164,16 @@ export class Alarm extends Construct {
okActions: new Token(() => this.okActionArns),

// Metric
...metricJson(props.metric)
// ...metricJson(props.metric)
});
alarm.addOverride('metrics', compileExpression(props.metric));

this.alarmArn = alarm.alarmArn;
this.alarmName = alarm.alarmName;
this.metric = props.metric;
this.annotation = {
// tslint:disable-next-line:max-line-length
label: `${this.metric.label || this.metric.metricName} ${OPERATOR_SYMBOLS[comparisonOperator]} ${props.threshold} for ${props.evaluationPeriods} datapoints within ${describePeriod(props.evaluationPeriods * props.metric.periodSec)}`,
label: 'TODO', // `${this.metric.label || this.metric.metricName} ${OPERATOR_SYMBOLS[comparisonOperator]} ${props.threshold} for ${props.evaluationPeriods} datapoints within ${describePeriod(props.evaluationPeriods * props.metric.periodSec)}`,
value: props.threshold,
};
}
Expand Down Expand Up @@ -242,12 +243,12 @@ export class Alarm extends Construct {
*
* We know the seconds are always one of a handful of allowed values.
*/
function describePeriod(seconds: number) {
if (seconds === 60) { return '1 minute'; }
if (seconds === 1) { return '1 second'; }
if (seconds > 60) { return (seconds / 60) + ' minutes'; }
return seconds + ' seconds';
}
// function describePeriod(seconds: number) {
// if (seconds === 60) { return '1 minute'; }
// if (seconds === 1) { return '1 second'; }
// if (seconds > 60) { return (seconds / 60) + ' minutes'; }
// return seconds + ' seconds';
// }

/**
* Interface for objects that can be the targets of CloudWatch alarm actions
Expand All @@ -262,21 +263,21 @@ export interface IAlarmAction {
/**
* Return the JSON structure which represents the given metric in an alarm.
*/
function metricJson(metric: Metric): AlarmMetricJson {
const stat = parseStatistic(metric.statistic);

const dims = metric.dimensionsAsList();

return {
dimensions: dims.length > 0 ? dims : undefined,
namespace: metric.namespace,
metricName: metric.metricName,
period: metric.periodSec,
statistic: stat.type === 'simple' ? stat.statistic : undefined,
extendedStatistic: stat.type === 'percentile' ? 'p' + stat.percentile : undefined,
unit: metric.unit
};
}
// function metricJson(metric: Metric): AlarmMetricJson {
// const stat = parseStatistic(metric.statistic);

// const dims = metric.dimensionsAsList();

// return {
// dimensions: dims.length > 0 ? dims : undefined,
// namespace: metric.namespace,
// metricName: metric.metricName,
// period: metric.periodSec,
// statistic: stat.type === 'simple' ? stat.statistic : undefined,
// extendedStatistic: stat.type === 'percentile' ? 'p' + stat.percentile : undefined,
// unit: metric.unit
// };
// }

/**
* Properties used to construct the Metric identifying part of an Alarm
Expand Down
1 change: 1 addition & 0 deletions packages/@aws-cdk/aws-cloudwatch/lib/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ export * from './alarm';
export * from './dashboard';
export * from './graph';
export * from './layout';
export * from './math';
export * from './metric';
export * from './text';
export * from './widget';
Expand Down
Loading