---
title: forward_to_cloudwatch
authors:
- "@alanconway"
reviewers:
- "@jcantrill"
- "@jeremyeder"
approvers:
creation-date: 2020-12-17
last-updated: 2020-12-17
status: implementable
see-also:
superseded-by:
---

# Forward to Cloudwatch

## Release Signoff Checklist

- [X] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

[Amazon Cloudwatch][aws-cw] is a hosted monitoring and log storage service.
This proposal extends the `ClusterLogForwarder` API with an output type for Cloudwatch.

## Motivation

Amazon Cloudwatch is a popular log store.
We have requests from external and Red Hat-internal customers to support it.

### Goals

Enable log forwarding to Cloudwatch.

### Non-Goals

Enable Cloudwatch metric collection.

## Proposal

### Cloudwatch streams and groups

[Cloudwatch][concepts] defines *log groups* and *log streams*. To paraphrase the documentation:

> A log stream is a sequence of log events that share the same source ... For example, an Apache access log on a specific host.
> Log groups define groups of log streams that share the same retention, monitoring, and access control settings ... For example, if you have a separate log stream for the Apache access logs from each host, you could group those log streams into a single log group called MyWebsite.com/Apache/access_log.

In other words, a *log stream* corresponds to the smallest log-producing unit.
A *log group* is a collection of related *log streams*.

We must consider both *container logs* and *node logs*:
- "application" logs are always container logs
- "infrastructure" logs are a mix of container and node logs
- "audit" logs are all node logs.

For *container-scoped* logs, we auto-create a log stream for *each container*, using the container UUID as the log stream name.
- The log stream name is just a unique identifier for the source container.
- Log *entries* include meta-data (container name, namespace, etc.) for the user to search and index.
  The log stream name is *not* needed for indexing or searching logs.
- Log streams are for a *single source*; we cannot send logs from multiple nodes on the same log stream.

For *node-scoped* logs, we will create "audit" and "infrastructure" streams for each node, for example `node-id/audit`.

Log streams can be grouped in one of the following ways (see the example after this list):

- **category**: there are three log groups: "application", "infrastructure", and "audit".
- **namespace name**: group per namespace *name*.
  Used when successive namespace objects with the same name are considered "equivalent".
  This is a common case; many core k8s tools and APIs work this way.
- **namespace uuid**: group per namespace *object*, using the UUID.
  Destroying and then creating a namespace object with the same name results in a *new log group*.
  Use this when it is important to distinguish logs from successive namespace instances with the same name, for example when namespace re-creation is considered a security risk.
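
For example, a container with a hypothetical UUID running in a namespace named `myapp` (illustrative names only) would be placed as follows under each grouping:

```
category        -> log group "application"        log stream "<container-uuid>"
namespace name  -> log group "myapp"              log stream "<container-uuid>"
namespace uuid  -> log group "<namespace-uuid>"   log stream "<container-uuid>"
```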

### API fields

New API fields in the `output.cloudwatch` section (combined in the example following the note below):

- `region`: (string) AWS region name, required to connect.
- `retentionDays`: (number) Number of days to keep logs. Defaults to the Cloudwatch default retention.
- `groupBy`: (string, default "category") Take the group name from logging meta-data. Values:
  - `category`: category of the log entry, one of "application", "infrastructure", or "audit"
  - `namespace_name`: the container's namespace name
  - `namespace_uuid`: the container's namespace UUID

Existing fields:

- `url`: Not used in production. Sets the `endpoint` parameter in fluentd for use in testing.
- `secret`: AWS credentials, keys `aws_access_key_id` and `aws_secret_access_key`.

**Note**: The installer UI (Addon or OLM) can get AWS credentials from a `cloudcredential.openshift.io/v1` resource.
The user only has to provide a `region` to enable cloudwatch forwarding for a cluster.
Details are out of scope for this proposal.
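
A minimal sketch of an output combining these fields (the values `myregion`, `mysecret`, and `30` are illustrative only):

```
outputs:
  - name: CloudwatchOut
    type: cloudwatch
    cloudwatch:
      region: myregion
      retentionDays: 30
      groupBy: namespace_name
    secret:
      name: mysecret
```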

#### Nice To Have: more options for groupBy

_NOT REQUIRED for initial implementation, noted here for possible extensions._

The `groupBy` value is really just the name of a meta-data key in the message.
There is no implementation cost to allowing arbitrary meta-data to be used as a group name.
However, the choices should be restricted for safety and simplicity.

A "safe" key must have values that:

1. are valid cloudwatch group name strings.
2. will not generate an excessive number of groups.
3. are constant for messages in the same *log stream* (streams belong to only one group).

The following keys are safe and would be useful (see the sketch after this list):

- kubernetes.labels.`<key>`: Use pod label value with key `<key>`
- openshift.labels.`<key>`: Use label added by the openshift log forwarder
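
For example, a possible extension (not part of the initial API; the label key `app` is hypothetical) might allow:

```
cloudwatch:
  region: myregion
  groupBy: kubernetes.labels.app
```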

Other keys should be considered case by case, for example:

- `message` is definitely *not* safe; it fails all the safety requirements.
- `ip_addr` is safe (node cardinality), but it is debatable whether it would ever be useful.
- `hostname` is safe (node cardinality), and probably more useful than `ip_addr`, but still debatable.
- etc.

### User Stories

#### I want to forward logs to Cloudwatch instead of a local store

```
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogForwarder"
spec:
  outputs:
    - name: CloudwatchOut
      type: cloudwatch
      cloudwatch:
        region: myregion
      secret:
        name: mysecret
  pipelines:
    - inputRefs: [application, infrastructure, audit]
      outputRefs: [CloudwatchOut]
```

#### I want to group application logs by namespace

To group by namespace name:

```
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogForwarder"
spec:
  outputs:
    - name: CloudwatchOut
      type: cloudwatch
      cloudwatch:
        region: myregion
        groupBy: namespace_name
      secret:
        name: mysecret
  pipelines:
    - inputRefs: [application, infrastructure, audit]
      outputRefs: [CloudwatchOut]
```

To group by namespace UUID, replace `namespace_name` with `namespace_uuid`, as in the snippet below.
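
Only the `cloudwatch` section changes (a sketch; other fields are as in the example above):

```
cloudwatch:
  region: myregion
  groupBy: namespace_uuid
```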

### Implementation Details

The implementation uses the [fluentd cloudwatch plugin][plugin].
Most of the CRD fields have an obvious corresponding fluentd parameter.

Set `auto_create_stream true` to create streams and groups on the fly.

The log stream name is set from the `containerID` of the log record, using the plugin's `log_stream_name_key` parameter.

Log group names are set with `log_group_name` when the group name is static;
`log_group_name_key` can select the log category, namespace, or label as the group name.
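
A rough sketch of the fluentd output configuration the operator might generate; the match tag, record keys (`containerID`, `log_group`), and parameter values are illustrative assumptions, not the final generated configuration:

```
<match kubernetes.**>
  @type cloudwatch_logs
  # from the output's `region` field (illustrative value)
  region myregion
  auto_create_stream true
  # from `retentionDays`, if set (illustrative value)
  retention_in_days 30
  # assumed record key carrying the container UUID
  log_stream_name_key containerID
  # assumed record key populated according to the `groupBy` setting
  log_group_name_key log_group
  # credentials taken from the output's `secret`
  aws_key_id "#{ENV['AWS_ACCESS_KEY_ID']}"
  aws_sec_key "#{ENV['AWS_SECRET_ACCESS_KEY']}"
</match>
```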

### Open Questions

Which (if any) of the nice-to-have features should be included, and when?

Is the cluster logging operator responsible for deleting log streams that are no longer in use, for example when containers are deleted?

Is EKS authentication a requirement? If so, we need to define appropriate `secret` keys.

### Risks and Mitigations

[Cloudwatch quota][quota] can be exceeded if insufficiently granular streams are configured.
We configure a stream per container, which is the finest granularity we have for logging.
The relevant quotas are:

- 5 requests per second per log stream. Additional requests are throttled. This quota can't be changed.
- The maximum batch size of a PutLogEvents request is 1MB.
- 800 transactions per second per account per Region, except for the following Regions where the quota is 1500 transactions per second per account per Region: US East (N. Virginia), US West (Oregon), and Europe (Ireland). You can request a quota increase.

## Design Details

### Test Plan

- E2E tests: need access to AWS logging accounts.
- Functional tests: can we use the fluentd `in_cloudwatch_logs` input plugin as a dummy cloudwatch server?

### Graduation Criteria

- Initially release as [beta][maturity-levels] tech-preview to internal customers.
- GA when internal customers are satisfied.

### Version Skew Strategy

Not coupled to other components.

## References

- [Amazon CloudWatch][aws-cw]
- [Amazon CloudWatch Logs Concepts][concepts]
- [CloudWatch Logs Plugin for Fluentd][plugin]
- [Maturity Levels][maturity-levels]
- [CloudWatch Logs quotas][quota]

[aws-cw]: https://docs.aws.amazon.com/cloudwatch/index.html "[Amazon CloudWatch]"
[concepts]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatchLogsConcepts.html "[Amazon CloudWatch Logs Concepts]"
[plugin]: https://github.com/fluent-plugins-nursery/fluent-plugin-cloudwatch-logs "[CloudWatch Logs Plugin for Fluentd]"
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions "[Maturity Levels]"
[quota]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html "[CloudWatch Logs quotas - Amazon CloudWatch Logs]"
