Enhancement Proposal: API to Forward Logs to CloudWatch

openshift · Jan 14, 2021 · da133da · da133da
1 parent 0689995
commit da133da
Showing 1 changed file with 302 additions and 0 deletions.
diff --git a/enhancements/cluster-logging/forward_to_cloudwatch.md b/enhancements/cluster-logging/forward_to_cloudwatch.md
@@ -0,0 +1,302 @@
+---
+title: forward_to_cloudwatch
+authors:
+  - "@alanconway"
+reviewers:
+  - "@jcantrill"
+  - "@jeremyeder"
+approvers:
+creation-date: 2020-12-17
+last-updated: 2020-12-17
+status: implementable
+see-also:
+superseded-by:
+---
+
+# Forward to CloudWatch
+
+## Release Signoff Checklist
+
+- [X] Enhancement is `implementable`
+- [X] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Operational readiness criteria is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+[Amazon CloudWatch][aws-cw] is a hosted monitoring and log storage service.
+This proposal extends the `ClusterLogForwarder` API with an output type for CloudWatch.
+
+## Motivation
+
+Amazon CloudWatch is a popular log store.
+We have requests from external and  Red Hat-internal customers to support it.
+
+### Goals
+
+Enable log forwarding to CloudWatch.
+
+### Non-Goals
+
+Enable CloudWatch metric collection.
+
+## Proposal
+
+### CloudWatch streams and groups
+
+[CloudWatch][concepts] defines *log groups* and *log streams*. To paraphrase the documentation:
+
+>  A log stream is a sequence of log events that share the same source ... For example, an Apache access log on a specific host.
+
+>  Log groups define groups of log streams that share the same retention, monitoring, and access control settings ... For example, if you have a separate log stream for the Apache access logs from each host, you could group those log streams into a single log group called MyWebsite.com/Apache/access_log.
+
+In other words a *log stream* corresponds to the smallest distinct source of logs.
+A *log group* is a collection of related *log streams*.
+
+#### Log streams
+
+The collector automatically creates a unique *log stream* for each log file it collects.
+
+- Stream names are globally unique.
+- Constructed without API calls
+- Each stream corresponds to a single tailed log file.
+
+**Note**: The log stream name is *opaque* to the end user for the first release.
+It should *not* be used for indexing, searching or as a reliable source of meta-data.
+The end user can retrieve all meta-data as JSON fields in the log record.
+See "Open Questions" for more detail.
+
+See "Implementation Details" for more.
+
+#### Log groups
+
+*Log groups* are named after some well-known identifier, known to the user.
+Log groups can be named after:
+
+- **Log type**: "application", "infrastructure", "audit".\
+  A single group for each log type.
+- **Namespace name**: Group per namespace *name*.
+  Used when successive namespace objects with the same name are considered "equivalent".
+  This is a common case, many core k8s tools and APIs work this way.
+- **Namespace UUID**: Group per namespace *object*.
+  Destroying then creating a namespace object with the same name results in a *new log group*.
+  Use when it is important to distinguish logs from successive namespaces instances with the same name.
+  For example, when namespace re-creation is considered a security risk.
+
+### API fields
+
+New API fields in the `output.cloudwatch` section:
+
+- `region`: (string) AWS region name, required to connect.
+- `groupBy`: (string, default "logType") Take group name from logging meta-data. Values:
+   - `logType`: one of "application", "infrastructure", or "audit"\
+     Note that *infrastructure* and  *audit* logs are always grouped by `logType`.
+   - `namespaceName`: *application* logs are grouped by namespace name.
+   - `namespaceUUID`: *application* logs are grouped by namespace UUID.
+
+Existing fields:
+
+- `url`: Not used in production. Sets the `endpoint` parameter in fluentd for use in testing.
+- `secret`: AWS credentials, the secret must contain keys `aws_access_key_id` and `aws_secret_access_key`.
+
+**Note**: The installer UI (Addon or OLM) can get AWS credentials from a `cloudcredential.openshift.io/v1`.
+The user only has to provide a `region` to enable CloudWatch forwarding for a cluster.
+Details are out of scope for this proposal.
+
+### User Stories
+
+**Note**: In all cases the CloudWatch *log stream names* are opaque values generated by the collector.
+The CloudWatch *log group names* are different depending on the use case.
+
+#### I want to forward logs to CloudWatch instead of a local store
+
+```
+apiVersion: "logging.openshift.io/v1"
+kind: "ClusterLogForwarder"
+spec:
+  outputs:
+  - name: CloudWatchOut
+	type: cloudwatch
+	cloudwatch:
+	  region: myregion
+	secret:
+	   name: mysecret
+  pipelines:
+  - inputRefs: [application, infrastructure, audit]
+    outputRefs: [CloudWatchOut]
+```
+
+CloudWatch group names are: "application", "infrastructure", "audit"
+
+#### I want to group application logs by namespace
+
+To group by namespace name:
+
+```
+apiVersion: "logging.openshift.io/v1"
+kind: "ClusterLogForwarder"
+spec:
+  outputs:
+  - name: CloudWatchOut
+	type: cloudwatch
+	cloudwatch:
+	  region: myregion
+	  groupBy: namespaceName
+	secret:
+	   name: mysecret
+  pipelines:
+  - inputRefs: [application, infrastructure, audit]
+    outputRefs: [CloudWatchOut]
+```
+
+CloudWatch group names for *application* logs are the namespaces from which logs are collected.
+Group names for *infrastructure* and *audit* logs are still "infrastructure" and "audit".
+
+To group by namespace UUID instead, replace `namespaceName` with `namespaceUUID`.
+
+### Implementation Details
+
+Use the [fluentd CloudWatch plugin][plugin] to connect to CloudWatch.
+Plugin configuration settings:
+
+- `auto_create_stream`: true to create streams and groups on the fly.
+- `log-stream-name`: set to `<hotsname>.<routing-key>` for all log types. Guaranteed to be globally unique.
+- `log_group_name`: Always set to "infrastructure" or "audit" for logs of those types.\
+   Set to "application" for application logs if `groupBy=logType`
+- `log_group_name_key` set to meta-data key:
+  - `namespace_name` if `groupBy=namespaceName`.
+  - `namespace_uuid` if `groupBy=namespaceUUID`.
+- `region`: Set from `cloudwatch.region`
+- `aws_access_key_id`, `aws_secret_access_key`: Set from `secret`
+- `endpoint`: set from optional `url`, for testing and debugging.
+
+### Nice To Have: more options for log groups
+
+_NOT REQUIRED for initial implementation, noted here for possible extensions._
+
+The `groupBy` value translates to a meta-data key in the message.
+There is no implementation cost to allowing arbitrary meta-data to be used as a group name.
+However, the choices should be restricted for safety and simplicity.
+
+A "safe" key must have values that:
+
+1. are valid CloudWatch group name strings.
+2. will not generate an excessive number of groups.
+3. are constant for messages in the same *log stream* (streams belong only one group)
+
+The following keys are safe and would be useful:
+
+- kubernetes.labels.`<key>`: Use pod label value with key `<key>`
+- openshift.labels.`<key>`: Use label added by the openshift log forwarder
+
+Other keys should be considered case-by case, for example:
+
+- `message` is definitely *not* safe, fails all safety requirements.
+- `ip_addr` is safe (node cardinality), but debatable if it would ever be useful.
+- `hostname` is safe (node cardinality), and probably more useful than ip_addr but still debatable.
+- etc.
+
+Custom log groups can be created using `openshift.labels`.
+To support custom logs we add:
+
+- `groupByOptional`: (list of string) List of optional metadata keys to use for `groupBy`.
+  The first key that is present and non-empty is instead of `groupBy`.
+  If none found, use the value of `groupBy`.
+
+For example, I want to group most logs by log type, except for logs from
+namespaces [magic1, magic2] which should be in log group "magic".
+
+```
+apiVersion: "logging.openshift.io/v1"
+kind: "ClusterLogForwarder"
+spec:
+  intputs:
+  - name: MagicApp
+    application:
+	  namespaces: [ magic1, magic2 ]
+  outputs:
+  - name: CloudWatchOut
+	type: cloudwatch
+	cloudwatch:
+	  region: myregion
+	  groupBy: logType
+	  groupByOptional: [ openshift.labels.logGroup ]
+	secret:
+	   name: mysecret
+  pipelines:
+  - inputRefs: [application, infrastructure, audit]
+    outputRefs: [CloudWatchOut]
+  - inputRefs: [MagicApp]
+    outputRefs: [CloudWatchOut]
+	labels: { logGroup: magic }
+```
+
+### Open Questions
+
+#### Log stream names and static meta-data
+
+Initial log stream names will use our current fluent tags for uniqueness,
+which includes some static meta-data.
+
+We *may* want to advertise this stream name format as a way to access static meta-data,
+and reduce the repetition of static data in log records.
+It is too early to decide now because:
+
+- We need to clean up the format before making it public
+- We need to solve the static meta-data problem consistently for other output types as well.
+- There may be other solutions e.g. using [cloudwatch group tags][groups-and-streams]
+
+For now the name will be documented as *opaque* to the user, so we can make changes in future without breaking user assumptions.
+
+#### EKS authentication
+
+Is this a requirement? If so need to define appropriate `secret` keys.
+
+#### Additional API fields
+
+- `retentionDays`: (number) Number of days to keep logs.
+- [cloudwatch tags][groups-and-streams]
+
+### Risks and Mitigations
+
+[CloudWatch quota][quota] can be exceeded if insufficiently granular streams are configured.
+We configure a stream-per-container which is the finest granularity we have for logging.
+
+- 5 requests per second per log stream. Additional requests are throttled. This quota can't be changed.
+- The maximum batch size of a PutLogEvents request is 1MB.
+- 800 transactions per second per account per Region, except for the following Regions where the quota is 1500 transactions per second per account per Region: US East (N. Virginia), US West (Oregon), and Europe (Ireland). You can request a quota increase.
+
+## Design Details
+
+### Test Plan
+
+- E2E tests: Need access to AWS logging accounts.
+- Functional tests: can we use [fluentd] `in_cloudwatch_logs` as a dummy cloudwatch server?
+
+### Graduation Criteria
+
+- Initially release as [beta][maturity-levels] tech-preview to internal customers.
+- GA when internal customers are satisfied.
+
+### Version Skew Strategy
+
+Not coupled to other components.
+
+## References
+
+- [Amazon CloudWatch][aws-cw]
+- [Amazon CloudWatch Logs Concepts][concepts]
+- [CloudWatch Logs Plugin for Fluentd][plugin]
+- [Maturity Levels][maturity-levels]
+- [CloudWatch Logs quotas][quota]
+- [CloudWatch Log Groups and Streams][groups-and-streams]
+
+[aws-cw]: https://docs.aws.amazon.com/cloudwatch/index.html "[Amazon CloudWatch]"
+[concepts]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatchLogsConcepts.html "[Amazon CloudWatch Logs Concepts]"
+[plugin]: https://github.com/fluent-plugins-nursery/fluent-plugin-cloudwatch-logs "[CloudWatch Logs Plugin for Fluentd]"
+[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions "[Maturity Levels]"
+[quota]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html "[CloudWatch Logs quotas - Amazon CloudWatch Logs]"
+[groups-and-streams]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html "Log streams and groups"
+[put-logs]: https://docs.aws.amazon.com/cli/latest/reference/logs/put-log-events.html "Put log events API"