diff --git a/enhancements/cluster-logging/forward_to_cloudwatch.md b/enhancements/cluster-logging/forward_to_cloudwatch.md
new file mode 100644
index 00000000000..4cb6eb23305
--- /dev/null
+++ b/enhancements/cluster-logging/forward_to_cloudwatch.md
@@ -0,0 +1,226 @@
+---
+title: forward_to_cloudwatch
+authors:
+  - "@alanconway"
+reviewers:
+  - "@jcantrill"
+  - "@jeremyeder"
+approvers:
+creation-date: 2020-12-17
+last-updated: 2020-12-17
+status: implementable
+see-also:
+superseded-by:
+---
+
+# Forward to Cloudwatch
+
+## Release Signoff Checklist
+
+- [X] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Operational readiness criteria is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+[Amazon Cloudwatch][aws-cw] is a hosted monitoring and log storage service.
+This proposal extends the `ClusterLogForwarder` API with an output type for Cloudwatch.
+
+## Motivation
+
+Amazon Cloudwatch is a popular log store.
+We have requests from external and Red Hat-internal customers to support it.
+
+### Goals
+
+Enable log forwarding to Cloudwatch.
+
+### Non-Goals
+
+Enable Cloudwatch metric collection.
+
+## Proposal
+
+### Cloudwatch streams and groups
+
+[Cloudwatch][concepts] defines *log groups* and *log streams*. To paraphrase the documentation:
+
+> A log stream is a sequence of log events that share the same source ... For example, an Apache access log on a specific host.
+
+> Log groups define groups of log streams that share the same retention, monitoring, and access control settings ... For example, if you have a separate log stream for the Apache access logs from each host, you could group those log streams into a single log group called MyWebsite.com/Apache/access_log.
+
+In other words, a *log stream* corresponds to the smallest log-producing unit.
+A *log group* is a collection of related *log streams*.
+
+We must consider *container logs* and *node logs*:
+
+- "application" logs are always container logs
+- "infrastructure" logs are a mix of container and node logs
+- "audit" logs are all node logs
+
+For *container-scoped* logs we auto-create a log stream for *each container*, using the container UUID as the log stream name.
+
+- The log stream name is just a unique identifier for the source container.
+- Log *entries* include meta-data (container name, namespace, etc.) for the user to search/index.
+  The log stream name is *not* needed for indexing or searching logs.
+- Log streams are for a *single source*; we cannot send logs from multiple nodes on the same log stream.
+
+For *node-scoped* logs we will create "audit" and "infrastructure" streams for each node, e.g. `node-id/audit`.
+
+Log streams can be grouped by:
+
+- **category**: there are three log groups: "application", "infrastructure" and "audit".
+- **namespace name**: group per namespace *name*.
+  Used when successive namespace objects with the same name are considered "equivalent".
+  This is a common case; many core k8s tools and APIs work this way.
+- **namespace uuid**: group per namespace *object*, using the UUID.
+  Destroying then creating a namespace object with the same name results in a *new log group*.
+  Use this when it is important to distinguish logs from successive namespace instances with the same name, for example when namespace re-creation is considered a security risk.
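+
+For illustration only, the following sketch (all angle-bracket names are placeholders, not real identifiers) shows where a single "application" container log record would land under each of the grouping choices above:
+
+```
+# Placeholder illustration: one "application" container log record, and the
+# log group / log stream it is written to under each grouping choice.
+- grouping: category
+  logGroup: application
+  logStream: <container-uuid>
+- grouping: namespace_name
+  logGroup: <namespace-name>
+  logStream: <container-uuid>
+- grouping: namespace_uuid
+  logGroup: <namespace-uuid>
+  logStream: <container-uuid>
+```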
+
+### API fields
+
+New API fields in the `output.cloudwatch` section:
+
+- `region`: (string) AWS region name, required to connect.
+- `retentionDays`: (number) Number of days to keep logs. Defaults to the Cloudwatch default retention.
+- `groupBy`: (string, default "category") Take the group name from the logging meta-data. Values:
+  - `category`: the category of the log entry - one of "application", "infrastructure", or "audit"
+  - `namespace_name`: the container's namespace name
+  - `namespace_uuid`: the container's namespace UUID
+
+Existing fields:
+
+- `url`: Not used in production. Sets the `endpoint` parameter in fluentd for use in testing.
+- `secret`: AWS credentials, keys `aws_access_key_id` and `aws_secret_access_key`.
+
+**Note**: The installer UI (Addon or OLM) can get AWS credentials from a `cloudcredential.openshift.io/v1` resource.
+The user only has to provide a `region` to enable cloudwatch forwarding for a cluster.
+Details are out of scope for this proposal.
+
+#### Nice To Have: more options for groupBy
+
+_NOT REQUIRED for initial implementation, noted here for possible extensions._
+
+The `groupBy` value is really just the name of a meta-data key in the message.
+There is no implementation cost to allowing arbitrary meta-data to be used as a group name.
+However, the choices should be restricted for safety and simplicity.
+
+A "safe" key must have values that:
+
+1. are valid cloudwatch group name strings.
+2. will not generate an excessive number of groups.
+3. are constant for messages in the same *log stream* (streams belong to only one group).
+
+The following keys are safe and would be useful:
+
+- `kubernetes.labels.<key>`: Use the pod label value with key `<key>`.
+- `openshift.labels.<key>`: Use a label added by the openshift log forwarder.
+
+Other keys should be considered case-by-case, for example:
+
+- `message` is definitely *not* safe; it fails all three safety requirements.
+- `ip_addr` is safe (node cardinality), but it is debatable whether it would ever be useful.
+- `hostname` is safe (node cardinality), and probably more useful than `ip_addr`, but still debatable.
+- etc.
+
+### User Stories
+
+#### I want to forward logs to Cloudwatch instead of a local store
+
+```
+apiVersion: "logging.openshift.io/v1"
+kind: "ClusterLogForwarder"
+spec:
+  outputs:
+    - name: CloudwatchOut
+      type: cloudwatch
+      cloudwatch:
+        region: myregion
+      secret:
+        name: mysecret
+  pipelines:
+    - inputRefs: [application, infrastructure, audit]
+      outputRefs: [CloudwatchOut]
+```
+
+#### I want to group application logs by namespace
+
+To group by namespace name:
+
+```
+apiVersion: "logging.openshift.io/v1"
+kind: "ClusterLogForwarder"
+spec:
+  outputs:
+    - name: CloudwatchOut
+      type: cloudwatch
+      cloudwatch:
+        region: myregion
+        groupBy: namespace_name
+      secret:
+        name: mysecret
+  pipelines:
+    - inputRefs: [application, infrastructure, audit]
+      outputRefs: [CloudwatchOut]
+```
+
+To group by namespace UUID, replace `namespace_name` with `namespace_uuid`.
+
+### Implementation Details
+
+The implementation uses the [fluentd cloudwatch plugin][plugin].
+Most of the CRD fields have an obvious corresponding fluentd parameter.
+
+Set `auto_create_stream true` to create streams and groups on the fly.
+
+The log stream name is set to the `containerID` of the log record.
+
+A static log group name can be set with `log_group_name`; `log_group_name_key` can select the log category, namespace, or label as the group name.
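+
+As a rough sketch only (shown as a YAML listing for readability, not literal fluentd configuration syntax), the CRD fields are expected to correspond to plugin parameters roughly as follows. The record keys assumed for `log_stream_name_key` and `log_group_name_key` are illustrative, not final:
+
+```
+# Assumed mapping from ClusterLogForwarder fields to fluent-plugin-cloudwatch-logs
+# parameters; placeholders are in angle brackets.
+region: <cloudwatch.region>                      # required connection setting
+endpoint: <url>                                  # test-only override, omitted in production
+retention_in_days: <cloudwatch.retentionDays>    # omit to keep the Cloudwatch default
+auto_create_stream: true                         # create groups and streams on the fly
+log_stream_name_key: <record key holding the container UUID or node stream name>
+log_group_name_key: <record key selected by groupBy>
+aws_key_id: <value of secret key aws_access_key_id>
+aws_sec_key: <value of secret key aws_secret_access_key>
+```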
+
+### Open Questions
+
+Which (if any) of the nice-to-have features should be included, and when?
+
+Is the cluster logging operator responsible for deleting log streams that are no longer in use, for example when containers are deleted?
+
+Is EKS authentication a requirement? If so, we need to define appropriate `secret` keys.
+
+### Risks and Mitigations
+
+[Cloudwatch quota][quota] can be exceeded if insufficiently granular streams are configured.
+We configure a stream per container, which is the finest granularity we have for logging.
+The relevant quotas are:
+
+- 5 requests per second per log stream. Additional requests are throttled. This quota can't be changed.
+- The maximum batch size of a PutLogEvents request is 1MB.
+- 800 transactions per second per account per Region, except for the following Regions where the quota is 1500 transactions per second per account per Region: US East (N. Virginia), US West (Oregon), and Europe (Ireland). You can request a quota increase.
+
+## Design Details
+
+### Test Plan
+
+- E2E tests: need access to AWS logging accounts.
+- Functional tests: can we use the fluentd `in_cloudwatch_logs` plugin as a dummy cloudwatch server?
+
+### Graduation Criteria
+
+- Initially release as [beta][maturity-levels] tech-preview to internal customers.
+- GA when internal customers are satisfied.
+
+### Version Skew Strategy
+
+Not coupled to other components.
+
+## References
+
+- [Amazon CloudWatch][aws-cw]
+- [Amazon CloudWatch Logs Concepts][concepts]
+- [CloudWatch Logs Plugin for Fluentd][plugin]
+- [Maturity Levels][maturity-levels]
+- [CloudWatch Logs quotas][quota]
+
+[aws-cw]: https://docs.aws.amazon.com/cloudwatch/index.html "Amazon CloudWatch"
+[concepts]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatchLogsConcepts.html "Amazon CloudWatch Logs Concepts"
+[plugin]: https://github.com/fluent-plugins-nursery/fluent-plugin-cloudwatch-logs "CloudWatch Logs Plugin for Fluentd"
+[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions "Maturity Levels"
+[quota]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html "CloudWatch Logs quotas - Amazon CloudWatch Logs"