Enhancement Proposal: API to Forward Logs to CloudWatch

openshift · Jan 7, 2021 · 4098cd5
1 parent 3b1489c
commit 4098cd5
Showing 1 changed file with 226 additions and 0 deletions.
diff --git a/enhancements/cluster-logging/forward_to_cloudwatch.md b/enhancements/cluster-logging/forward_to_cloudwatch.md
@@ -0,0 +1,226 @@
+---
+title: forward_to_cloudwatch
+authors:
+  - "@alanconway"
+reviewers:
+  - "@jcantrill"
+  - "@jeremyeder"
+approvers:
+creation-date: 2020-12-17
+last-updated: 2020-12-17
+status: implementable
+see-also:
+superseded-by:
+---
+
+# Forward to CloudWatch
+
+## Release Signoff Checklist
+
+- [X] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Operational readiness criteria is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+[Amazon CloudWatch][aws-cw] is a hosted monitoring and log storage service.
+This proposal extends the `ClusterLogForwarder` API with an output type for CloudWatch.
+
+## Motivation
+
+Amazon CloudWatch is a popular log store.
+We have requests from external and Red Hat-internal customers to support it.
+
+### Goals
+
+Enable log forwarding to CloudWatch.
+
+### Non-Goals
+
+Enable CloudWatch metric collection.
+
+## Proposal
+
+### CloudWatch streams and groups
+
+[CloudWatch][concepts] defines *log groups* and *log streams*. To paraphrase the documentation:
+
+>  A log stream is a sequence of log events that share the same source ... For example, an Apache access log on a specific host.
+
+>  Log groups define groups of log streams that share the same retention, monitoring, and access control settings ... For example, if you have a separate log stream for the Apache access logs from each host, you could group those log streams into a single log group called MyWebsite.com/Apache/access_log.
+
+In other words, a *log stream* corresponds to the smallest log-producing unit.
+A *log group* is a collection of related *log streams*.
+
+We must consider both *container logs* and *node logs*:
+- "application" logs are always container logs.
+- "infrastructure" logs are a mix of container and node logs.
+- "audit" logs are all node logs.
+
+For *container-scoped* logs we auto-create a log stream for *each container*, using the container UUID as the log stream name.
+- The log stream name is just a unique identifier for the source container.
+- Log *entries* include metadata (container name, namespace, etc.) for the user to search/index.
+  The log stream name is *not* needed for indexing or searching logs.
+- Log streams are for a *single source*; we cannot send logs from multiple nodes on the same log stream.
+
+For *node-scoped* logs we will create "audit" and "infrastructure" streams for each node, for example `node-id/audit`.
+
+Log streams can be grouped by:
+
+- **category**: there are three log groups: "application", "infrastructure", and "audit".
+- **namespace name**: one group per namespace *name*.
+  Use when successive namespace objects with the same name are considered "equivalent".
+  This is the common case; many core k8s tools and APIs work this way.
+- **namespace uuid**: one group per namespace *object*, using the UUID.
+  Destroying and then re-creating a namespace object with the same name results in a *new log group*. Use when it is important to distinguish logs from successive namespace instances with the same name, for example when namespace re-creation is considered a security risk.
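+
+As an illustration of the default `category` grouping, the resulting layout might look like the sketch below. This is a picture of groups and streams, not a configuration file; the angle-bracket names are placeholders, and the `<node-id>/infrastructure` stream name is assumed by analogy with `node-id/audit`.
+
+```
+# log group -> log streams (groupBy: category)
+application:
+  - <container-uuid-1>        # one stream per container
+  - <container-uuid-2>
+infrastructure:
+  - <container-uuid-3>        # infrastructure containers
+  - <node-id>/infrastructure  # node-scoped infrastructure logs (assumed name)
+audit:
+  - <node-id>/audit           # audit logs are all node logs
+```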
+
+### API fields
+
+New API fields in the `output.cloudwatch` section:
+
+- `region`: (string) AWS region name, required to connect.
+- `retentionDays`: (number) Number of days to keep logs. Defaults to the CloudWatch default retention.
+- `groupBy`: (string, default "category") Take the group name from logging metadata. Values:
+   - `category`: category of log entry - one of "application", "infrastructure", or "audit"
+   - `namespace_name`: Container's namespace name
+   - `namespace_uuid`: Container's namespace UUID
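+
+As an illustration, an output that sets all three new fields might look like the following sketch. The output name, region, and retention value are placeholders chosen for this example, not recommendations.
+
+```
+outputs:
+  - name: cw-example            # hypothetical output name
+    type: cloudwatch
+    cloudwatch:
+      region: us-east-1         # any valid AWS region name
+      retentionDays: 30         # omit to keep the CloudWatch default retention
+      groupBy: namespace_name   # or: category (default), namespace_uuid
+```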
+
+Existing fields:
+
+- `url`: Not used in production. Sets the `endpoint` parameter in fluentd for use in testing.
+- `secret`: AWS credentials, keys `aws_access_key_id` and `aws_secret_access_key`.
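+
+A secret carrying these two keys could be created from a manifest like the sketch below. The name `mysecret` matches the user stories in this proposal; the namespace and the placeholder values are assumptions for illustration.
+
+```
+apiVersion: v1
+kind: Secret
+metadata:
+  name: mysecret
+  namespace: openshift-logging   # assumed namespace of the logging operator
+type: Opaque
+stringData:
+  aws_access_key_id: "<access key id>"
+  aws_secret_access_key: "<secret access key>"
+```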
+
+**Note**: The installer UI (Addon or OLM) can get AWS credentials from a `cloudcredential.openshift.io/v1` resource.
+In that case the user only has to provide a `region` to enable CloudWatch forwarding for a cluster.
+Details are out of scope for this proposal.
+
+#### Nice To Have: more options for groupBy
+
+_NOT REQUIRED for initial implementation, noted here for possible extensions._
+
+The `groupBy` value is really just the name of a metadata key in the message.
+There is no implementation cost to allowing arbitrary metadata to be used as a group name.
+However, the choices should be restricted for safety and simplicity.
+
+A "safe" key must have values that:
+
+1. are valid CloudWatch group name strings.
+2. will not generate an excessive number of groups.
+3. are constant for messages in the same *log stream* (a stream belongs to only one group).
+
+The following keys are safe and would be useful:
+
+- `kubernetes.labels.<key>`: Use the value of the pod label with key `<key>`.
+- `openshift.labels.<key>`: Use the label added by the OpenShift log forwarder.
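+
+If label keys were accepted, grouping by a pod label might be written as in this sketch. This syntax is not part of the initial proposal, and the `team` label is a hypothetical example.
+
+```
+cloudwatch:
+  region: myregion
+  groupBy: kubernetes.labels.team   # hypothetical: one log group per value of the pod label "team"
+```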
+
+Other keys should be considered case by case, for example:
+
+- `message` is definitely *not* safe; it fails all three safety requirements.
+- `ip_addr` is safe (node cardinality), but it is debatable whether it would ever be useful.
+- `hostname` is safe (node cardinality) and probably more useful than `ip_addr`, but still debatable.
+- etc.
+
+### User Stories
+
+#### I want to forward logs to CloudWatch instead of a local store
+
+```
+apiVersion: "logging.openshift.io/v1"
+kind: "ClusterLogForwarder"
+spec:
+  outputs:
+    - name: CloudwatchOut
+      type: cloudwatch
+      cloudwatch:
+        region: myregion
+      secret:
+        name: mysecret
+  pipelines:
+    # One pipeline: inputRefs and outputRefs belong to the same list item.
+    - inputRefs: [application, infrastructure, audit]
+      outputRefs: [CloudwatchOut]
+```
+
+#### I want to group application logs by namespace
+
+To group by namespace name:
+
+```
+apiVersion: "logging.openshift.io/v1"
+kind: "ClusterLogForwarder"
+spec:
+  outputs:
+    - name: CloudwatchOut
+      type: cloudwatch
+      cloudwatch:
+        region: myregion
+        groupBy: namespace_name
+      secret:
+        name: mysecret
+  pipelines:
+    # One pipeline: inputRefs and outputRefs belong to the same list item.
+    - inputRefs: [application, infrastructure, audit]
+      outputRefs: [CloudwatchOut]
+```
+
+To group by namespace UUID, replace `namespace_name` with `namespace_uuid`.
+
+### Implementation Details
+
+The implementation uses the [fluentd cloudwatch plugin][plugin].
+Most of the CRD fields have an obvious corresponding fluentd parameter.
+
+Set `auto_create_stream true` to create streams and groups on the fly.
+
+`log_stream_name` is set to the `containerID` of the log record.
+
+Log group names are set using `log_group_name` for a static `group` field.
+The `log_group_name_key` parameter selects the record field (log category, namespace, or label) used as the group name.
+
+### Open Questions
+
+Which (if any) of the nice-to-have features should be included, and when?
+
+Is the cluster logging operator responsible for deleting log streams that are no longer in use, for example when containers are deleted?
+
+Is EKS authentication a requirement? If so, we need to define appropriate `secret` keys.
+
+### Risks and Mitigations
+
+[CloudWatch quotas][quota] can be exceeded if insufficiently granular streams are configured.
+We configure a stream per container, which is the finest granularity we have for logging.
+
+- 5 requests per second per log stream. Additional requests are throttled. This quota can't be changed.
+- The maximum batch size of a PutLogEvents request is 1 MB.
+- 800 transactions per second per account per Region, except for the following Regions where the quota is 1500 transactions per second per account per Region: US East (N. Virginia), US West (Oregon), and Europe (Ireland). You can request a quota increase.
+
+## Design Details
+
+### Test Plan
+
+- E2E tests: Need access to AWS logging accounts.
+- Functional tests: Can we use the fluentd `in_cloudwatch_logs` plugin as a dummy CloudWatch server?
+
+### Graduation Criteria
+
+- Initially release as [beta][maturity-levels] tech-preview to internal customers.
+- GA when internal customers are satisfied.
+
+### Version Skew Strategy
+
+Not coupled to other components.
+
+## References
+
+- [Amazon CloudWatch][aws-cw]
+- [Amazon CloudWatch Logs Concepts][concepts]
+- [CloudWatch Logs Plugin for Fluentd][plugin]
+- [Maturity Levels][maturity-levels]
+- [CloudWatch Logs quotas][quota]
+
+[aws-cw]: https://docs.aws.amazon.com/cloudwatch/index.html "[Amazon CloudWatch]"
+[concepts]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatchLogsConcepts.html "[Amazon CloudWatch Logs Concepts]"
+[plugin]: https://github.com/fluent-plugins-nursery/fluent-plugin-cloudwatch-logs "[CloudWatch Logs Plugin for Fluentd]"
+[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions "[Maturity Levels]"
+[quota]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html "[CloudWatch Logs quotas - Amazon CloudWatch Logs]"