Enhancement Proposal: API to Forward Logs to CloudWatch
alanconway committed Jan 14, 2021
1 parent 0689995 commit da133da
Showing 1 changed file with 302 additions and 0 deletions.
302 changes: 302 additions & 0 deletions enhancements/cluster-logging/forward_to_cloudwatch.md
---
title: forward_to_cloudwatch
authors:
- "@alanconway"
reviewers:
- "@jcantrill"
- "@jeremyeder"
approvers:
creation-date: 2020-12-17
last-updated: 2020-12-17
status: implementable
see-also:
superseded-by:
---

# Forward to CloudWatch

## Release Signoff Checklist

- [X] Enhancement is `implementable`
- [X] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

[Amazon CloudWatch][aws-cw] is a hosted monitoring and log storage service.
This proposal extends the `ClusterLogForwarder` API with an output type for CloudWatch.

## Motivation

Amazon CloudWatch is a popular log store.
We have requests from external and Red Hat-internal customers to support it.

### Goals

Enable log forwarding to CloudWatch.

### Non-Goals

Enable CloudWatch metric collection.

## Proposal

### CloudWatch streams and groups

[CloudWatch][concepts] defines *log groups* and *log streams*. To paraphrase the documentation:

> A log stream is a sequence of log events that share the same source ... For example, an Apache access log on a specific host.
> Log groups define groups of log streams that share the same retention, monitoring, and access control settings ... For example, if you have a separate log stream for the Apache access logs from each host, you could group those log streams into a single log group called MyWebsite.com/Apache/access_log.

In other words, a *log stream* corresponds to the smallest distinct source of logs.
A *log group* is a collection of related *log streams*.

#### Log streams

The collector automatically creates a unique *log stream* for each log file it collects.

- Stream names are globally unique.
- Stream names are constructed without API calls.
- Each stream corresponds to a single tailed log file.

**Note**: The log stream name is *opaque* to the end user for the first release.
It should *not* be used for indexing, searching or as a reliable source of meta-data.
The end user can retrieve all meta-data as JSON fields in the log record.
See "Open Questions" for more detail.

See "Implementation Details" for more.

#### Log groups

*Log groups* are named after a well-known identifier that the user already knows.
Log groups can be named after:

- **Log type**: "application", "infrastructure", "audit".\
  A single group for each log type.
- **Namespace name**: Group per namespace *name*.
  Use when successive namespace objects with the same name are considered "equivalent".
  This is a common case; many core k8s tools and APIs work this way.
- **Namespace UUID**: Group per namespace *object*.
  Destroying then creating a namespace object with the same name results in a *new log group*.
  Use when it is important to distinguish logs from successive namespace instances with the same name.
  For example, when namespace re-creation is considered a security risk.

### API fields

New API fields in the `output.cloudwatch` section:

- `region`: (string) AWS region name, required to connect.
- `groupBy`: (string, default "logType") Take group name from logging meta-data. Values:
  - `logType`: one of "application", "infrastructure", or "audit"\
    Note that *infrastructure* and *audit* logs are always grouped by `logType`.
  - `namespaceName`: *application* logs are grouped by namespace name.
  - `namespaceUUID`: *application* logs are grouped by namespace UUID.
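
The `groupBy` rules above can be sketched as a small function. This is an illustration only, not the collector's implementation: the record layout and the `log_type` key are assumptions made for the example, while `namespace_name` and `namespace_uuid` follow the meta-data keys named under "Implementation Details".

```python
# Sketch only: the record layout is an assumption, not the collector's schema.
VALID_GROUP_BY = {"logType", "namespaceName", "namespaceUUID"}


def group_name(record: dict, group_by: str = "logType") -> str:
    """Return the CloudWatch log group name for one log record."""
    if group_by not in VALID_GROUP_BY:
        raise ValueError(f"unsupported groupBy value: {group_by!r}")
    log_type = record["log_type"]  # hypothetical key holding the log type
    # Infrastructure and audit logs are always grouped by log type.
    if log_type != "application" or group_by == "logType":
        return log_type
    if group_by == "namespaceName":
        return record["namespace_name"]
    return record["namespace_uuid"]
```

For example, an audit record stays in the "audit" group even when `groupBy` is `namespaceName`, which matches the note on `logType` above.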

Existing fields:

- `url`: Not used in production. Sets the `endpoint` parameter in fluentd for use in testing.
- `secret`: AWS credentials, the secret must contain keys `aws_access_key_id` and `aws_secret_access_key`.

**Note**: The installer UI (Addon or OLM) can get AWS credentials from a `cloudcredential.openshift.io/v1` resource.
The user only has to provide a `region` to enable CloudWatch forwarding for a cluster.
Details are out of scope for this proposal.

### User Stories

**Note**: In all cases the CloudWatch *log stream names* are opaque values generated by the collector.
The CloudWatch *log group names* are different depending on the use case.

#### I want to forward logs to CloudWatch instead of a local store

```
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogForwarder"
spec:
  outputs:
  - name: CloudWatchOut
    type: cloudwatch
    cloudwatch:
      region: myregion
    secret:
      name: mysecret
  pipelines:
  - inputRefs: [application, infrastructure, audit]
    outputRefs: [CloudWatchOut]
```

CloudWatch group names are: "application", "infrastructure", "audit"

#### I want to group application logs by namespace

To group by namespace name:

```
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogForwarder"
spec:
  outputs:
  - name: CloudWatchOut
    type: cloudwatch
    cloudwatch:
      region: myregion
      groupBy: namespaceName
    secret:
      name: mysecret
  pipelines:
  - inputRefs: [application, infrastructure, audit]
    outputRefs: [CloudWatchOut]
```

CloudWatch group names for *application* logs are the namespaces from which logs are collected.
Group names for *infrastructure* and *audit* logs are still "infrastructure" and "audit".

To group by namespace UUID instead, replace `namespaceName` with `namespaceUUID`.

### Implementation Details

Use the [fluentd CloudWatch plugin][plugin] to connect to CloudWatch.
Plugin configuration settings:

- `auto_create_stream`: true to create streams and groups on the fly.
- `log_stream_name`: set to `<hostname>.<routing-key>` for all log types. Guaranteed to be globally unique.
- `log_group_name`: Always set to "infrastructure" or "audit" for logs of those types.\
  Set to "application" for application logs if `groupBy=logType`.
- `log_group_name_key`: set to meta-data key:
  - `namespace_name` if `groupBy=namespaceName`.
  - `namespace_uuid` if `groupBy=namespaceUUID`.
- `region`: Set from `cloudwatch.region`
- `aws_access_key_id`, `aws_secret_access_key`: Set from `secret`
- `endpoint`: set from optional `url`, for testing and debugging.
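
As a rough sketch of the mapping above, the function below builds the settings for *application* logs from the API fields. The function name, the dictionary shapes and the `GROUP_KEYS` table are hypothetical, and the setting names are copied from the list above rather than checked against the plugin. The stream name is left out because it is opaque in the first release.

```python
from typing import Optional

# Hypothetical table translating groupBy values to record meta-data keys.
GROUP_KEYS = {"namespaceName": "namespace_name", "namespaceUUID": "namespace_uuid"}


def application_settings(cloudwatch: dict, secret: dict, url: Optional[str] = None) -> dict:
    """Build plugin settings for application logs from a cloudwatch output."""
    group_by = cloudwatch.get("groupBy", "logType")
    settings = {
        "auto_create_stream": True,
        "region": cloudwatch["region"],
        "aws_access_key_id": secret["aws_access_key_id"],
        "aws_secret_access_key": secret["aws_secret_access_key"],
    }
    if group_by == "logType":
        # Fixed group name for every application log.
        settings["log_group_name"] = "application"
    elif group_by in GROUP_KEYS:
        # Group name is read from each record's meta-data.
        settings["log_group_name_key"] = GROUP_KEYS[group_by]
    else:
        raise ValueError(f"unsupported groupBy value: {group_by!r}")
    if url:
        # Only intended for testing and debugging.
        settings["endpoint"] = url
    return settings
```

Infrastructure and audit logs would always take the `log_group_name` branch, with the group set to the log type.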

### Nice To Have: more options for log groups

_NOT REQUIRED for initial implementation, noted here for possible extensions._

The `groupBy` value translates to a meta-data key in the message.
There is no implementation cost to allowing arbitrary meta-data to be used as a group name.
However, the choices should be restricted for safety and simplicity.

A "safe" key must have values that:

1. are valid CloudWatch group name strings.
2. will not generate an excessive number of groups.
3. are constant for messages in the same *log stream* (a stream belongs to only one group).
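
The first two criteria can be checked mechanically against a sample of values. The sketch below is a hedged illustration: the character set and the 512-character limit are taken from the CloudWatch Logs documentation for log group names and should be verified against the current AWS documentation, and `max_groups` is a hypothetical threshold. The third criterion depends on how streams are built and cannot be checked from values alone.

```python
import re

# Assumed from the CloudWatch Logs documentation for log group names:
# 1 to 512 characters from a-z, A-Z, 0-9, '_', '-', '/', '.', '#'.
_GROUP_NAME = re.compile(r"[A-Za-z0-9_\-/.#]{1,512}")


def is_valid_group_name(value: str) -> bool:
    """Criterion 1: the value is a valid CloudWatch group name string."""
    return _GROUP_NAME.fullmatch(value) is not None


def looks_safe(sample_values, max_groups: int) -> bool:
    """Check criteria 1 and 2 against a sample of observed key values."""
    distinct = set(sample_values)
    # Criterion 2: the key must not generate an excessive number of groups.
    if len(distinct) > max_groups:
        return False
    return all(is_valid_group_name(v) for v in distinct)
```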

The following keys are safe and would be useful:

- kubernetes.labels.`<key>`: Use pod label value with key `<key>`
- openshift.labels.`<key>`: Use label added by the openshift log forwarder

Other keys should be considered case by case, for example:

- `message` is definitely *not* safe, fails all safety requirements.
- `ip_addr` is safe (node cardinality), but debatable if it would ever be useful.
- `hostname` is safe (node cardinality), and probably more useful than ip_addr but still debatable.
- etc.

Custom log groups can be created using `openshift.labels`.
To support custom log groups we add:

- `groupByOptional`: (list of string) List of optional metadata keys to use for `groupBy`.
  The first key that is present and non-empty is used instead of `groupBy`.
  If none is found, use the value of `groupBy`.

For example, I want to group most logs by log type, except for logs from
namespaces [magic1, magic2] which should be in log group "magic".

```
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogForwarder"
spec:
  inputs:
  - name: MagicApp
    application:
      namespaces: [ magic1, magic2 ]
  outputs:
  - name: CloudWatchOut
    type: cloudwatch
    cloudwatch:
      region: myregion
      groupBy: logType
      groupByOptional: [ openshift.labels.logGroup ]
    secret:
      name: mysecret
  pipelines:
  - inputRefs: [application, infrastructure, audit]
    outputRefs: [CloudWatchOut]
  - inputRefs: [MagicApp]
    outputRefs: [CloudWatchOut]
    labels: { logGroup: magic }
```
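
The `groupByOptional` fallback could work roughly as follows. This is a sketch under assumptions: records are modelled as nested dictionaries, keys are split on dots (so label names that themselves contain dots are not handled), and the `groupBy` value is assumed to be already translated to a meta-data key such as the hypothetical `log_type`.

```python
def lookup(record: dict, dotted_key: str):
    """Follow a dotted key such as 'openshift.labels.logGroup' into nested dicts."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def effective_group(record: dict, group_by_key: str, group_by_optional=()) -> str:
    """Return the first present, non-empty optional value, else the groupBy value."""
    for key in group_by_optional:
        value = lookup(record, key)
        if value:
            return str(value)
    value = lookup(record, group_by_key)
    if not value:
        raise KeyError(f"record has no value for {group_by_key!r}")
    return str(value)
```

With the configuration above, records from `magic1` and `magic2` carry the label `logGroup: magic` and land in the "magic" group, while all other records fall back to their log type.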

### Open Questions

#### Log stream names and static meta-data

Initial log stream names will use our current fluent tags for uniqueness,
which includes some static meta-data.

We *may* want to advertise this stream name format as a way to access static meta-data,
and reduce the repetition of static data in log records.
It is too early to decide now because:

- We need to clean up the format before making it public
- We need to solve the static meta-data problem consistently for other output types as well.
- There may be other solutions e.g. using [cloudwatch group tags][groups-and-streams]

For now the name will be documented as *opaque* to the user, so we can make changes in future without breaking user assumptions.

#### EKS authentication

Is this a requirement? If so, we need to define appropriate `secret` keys.

#### Additional API fields

- `retentionDays`: (number) Number of days to keep logs.
- [cloudwatch tags][groups-and-streams]

### Risks and Mitigations

[CloudWatch quotas][quota] can be exceeded if streams are not granular enough.
We configure one stream per container, which is the finest granularity we have for logging.
The relevant quotas are:

- 5 requests per second per log stream. Additional requests are throttled. This quota can't be changed.
- The maximum batch size of a PutLogEvents request is 1MB.
- 800 transactions per second per account per Region, except for the following Regions where the quota is 1500 transactions per second per account per Region: US East (N. Virginia), US West (Oregon), and Europe (Ireland). You can request a quota increase.
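
To make the per-stream quotas concrete, the sketch below splits messages into batches that fit the 1MB limit and estimates the minimum time to send them at 5 requests per second. It is an illustration, not the plugin's behaviour: the 26-byte per-event overhead is an assumption taken from the AWS description of how batch size is calculated, and both function names are hypothetical.

```python
MAX_BATCH_BYTES = 1_048_576  # the 1MB PutLogEvents batch limit quoted above
EVENT_OVERHEAD = 26          # assumed per-event overhead in bytes (AWS docs)
REQUESTS_PER_SECOND = 5      # per-stream request quota quoted above


def split_batches(messages, max_bytes: int = MAX_BATCH_BYTES):
    """Group messages, in order, into batches that stay within max_bytes."""
    batches, current, size = [], [], 0
    for msg in messages:
        cost = len(msg.encode("utf-8")) + EVENT_OVERHEAD
        if cost > max_bytes:
            raise ValueError("single event exceeds the batch limit")
        if current and size + cost > max_bytes:
            # Current batch is full: start a new one.
            batches.append(current)
            current, size = [], 0
        current.append(msg)
        size += cost
    if current:
        batches.append(current)
    return batches


def min_send_seconds(batch_count: int, rate: int = REQUESTS_PER_SECOND) -> float:
    """Lower bound on the time to send batch_count requests to one stream."""
    return batch_count / rate
```

A single stream can therefore carry at most about 5MB per second, which is why stream granularity matters for high-volume containers.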

## Design Details

### Test Plan

- E2E tests: Need access to AWS logging accounts.
- Functional tests: can we use the [fluentd plugin's][plugin] `in_cloudwatch_logs` input as a dummy CloudWatch server?

### Graduation Criteria

- Initially release as [beta][maturity-levels] tech-preview to internal customers.
- GA when internal customers are satisfied.

### Version Skew Strategy

Not coupled to other components.

## References

- [Amazon CloudWatch][aws-cw]
- [Amazon CloudWatch Logs Concepts][concepts]
- [CloudWatch Logs Plugin for Fluentd][plugin]
- [Maturity Levels][maturity-levels]
- [CloudWatch Logs quotas][quota]
- [CloudWatch Log Groups and Streams][groups-and-streams]

[aws-cw]: https://docs.aws.amazon.com/cloudwatch/index.html "[Amazon CloudWatch]"
[concepts]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatchLogsConcepts.html "[Amazon CloudWatch Logs Concepts]"
[plugin]: https://github.com/fluent-plugins-nursery/fluent-plugin-cloudwatch-logs "[CloudWatch Logs Plugin for Fluentd]"
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions "[Maturity Levels]"
[quota]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html "[CloudWatch Logs quotas - Amazon CloudWatch Logs]"
[groups-and-streams]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html "Log streams and groups"
[put-logs]: https://docs.aws.amazon.com/cli/latest/reference/logs/put-log-events.html "Put log events API"
