Enhancement Proposal: API to Forward Logs to CloudWatch
1 parent 3b1489c, commit 4098cd5
Showing 1 changed file with 226 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,226 @@ | ||
--- | ||
title: forward_to_cloudwatch | ||
authors: | ||
- "@alanconway" | ||
reviewers: | ||
- "@jcantrill" | ||
- "@jeremyeder" | ||
approvers: | ||
creation-date: 2020-12-17 | ||
last-updated: 2020-12-17 | ||
status: implementable | ||
see-also: | ||
superseded-by: | ||
--- | ||
|
||
# Forward to Cloudwatch | ||
|
||
## Release Signoff Checklist | ||
|
||
- [X] Enhancement is `implementable` | ||
- [ ] Design details are appropriately documented from clear requirements | ||
- [ ] Test plan is defined | ||
- [ ] Operational readiness criteria is defined | ||
- [ ] Graduation criteria for dev preview, tech preview, GA | ||
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) | ||
|
||
## Summary | ||
|
||
[Amazon Cloudwatch][aws-cw] is a hosted monitoring and log storage service. | ||
This proposal extends the `ClusterLogForwarder` API with an output type for Cloudwatch. | ||
|
||
## Motivation | ||
|
||
Amazon Cloudwatch is a popular log store. | ||
We have requests from external and Red Hat-internal customers to support it. | ||
|
||
### Goals | ||
|
||
Enable log forwarding to Cloudwatch. | ||
|
||
### Non-Goals | ||
|
||
Enable Cloudwatch metric collection. | ||
|
||
## Proposal | ||
|
||
### Cloudwatch streams and groups | ||
|
||
[Cloudwatch][concepts] defines *log groups* and *log streams*. To paraphrase the documentation: | ||
|
||
> A log stream is a sequence of log events that share the same source ... For example, an Apache access log on a specific host. | ||
> Log groups define groups of log streams that share the same retention, monitoring, and access control settings ... For example, if you have a separate log stream for the Apache access logs from each host, you could group those log streams into a single log group called MyWebsite.com/Apache/access_log. | ||
|
||
In other words, a *log stream* corresponds to the smallest log-producing unit. | ||
A *log group* is a collection of related *log streams*. | ||
|
||
We must consider both *container logs* and *node logs*: | ||
- "application" logs are always container logs | ||
- "infrastructure" logs are a mix of container and node logs | ||
- "audit" logs are all node logs. | ||
|
||
For *container-scoped* logs we auto-create a log stream for *each container*, using the container UUID as the log stream name. | ||
- The log stream name is just a unique identifier for the source container. | ||
- Log *entries* include meta-data (container name, namespace etc.) for the user to search/index. | ||
The log stream name is *not* needed for indexing or searching logs. | ||
- Log streams are for a *single source*; we cannot send logs from multiple nodes on the same log stream. | ||
|
||
For *node-scoped* logs we will create "audit" and "infrastructure" streams for each node, e.g. `node-id/audit`. | ||
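
The stream naming rules above can be sketched as follows. This is only an illustration: the record field names (`container_id`, `node_id`, `category`) are hypothetical placeholders, not the collector's actual record schema.

```python
def log_stream_name(record: dict) -> str:
    """Return a log stream name for one log record.

    Field names are hypothetical; the real record schema may differ.
    """
    container_id = record.get("container_id")
    if container_id:
        # Container-scoped: one stream per container, named by its unique ID.
        return container_id
    # Node-scoped: one stream per node and category, e.g. "node-id/audit".
    return f"{record['node_id']}/{record['category']}"
```

Because each name is derived from a single container or a single node, no stream ever receives logs from more than one source.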
|
||
Log streams can be grouped by: | ||
|
||
- **category**: there are three log groups "application", "infrastructure" and "audit". | ||
- **namespace name**: group per namespace *name*. | ||
Used when successive namespace objects with the same name are considered "equivalent". | ||
This is a common case; many core k8s tools and APIs work this way. | ||
- **namespace uuid**: group per namespace *object* using the UUID. | ||
Destroying then creating a namespace object with the same name results in a *new log group*. Use when it is important to distinguish logs from successive namespace instances with the same name. For example, when namespace re-creation is considered a security risk. | ||
|
||
### API fields | ||
|
||
New API fields in the `output.cloudwatch` section: | ||
|
||
- `region`: (string) AWS region name, required to connect. | ||
- `retentionDays`: (number) Number of days to keep logs. Defaults to the CloudWatch default retention. | ||
- `groupBy`: (string, default "category") Take the group name from logging meta-data. Values: | ||
  - `category`: category of log entry, one of "application", "infrastructure", or "audit" | ||
  - `namespace_name`: Container's namespace name | ||
  - `namespace_uuid`: Container's namespace UUID | ||
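
As a rough sketch of how these fields might be validated and how `groupBy` could select a group name, consider the following. The class name, the record field names, the "positive number" check on `retentionDays`, and the fallback to `category` for records that carry no namespace are all assumptions made for illustration; the proposal does not specify them.

```python
from dataclasses import dataclass
from typing import Optional

GROUP_BY_VALUES = ("category", "namespace_name", "namespace_uuid")


@dataclass
class CloudwatchOutput:
    """Hypothetical in-memory form of the `output.cloudwatch` section."""
    region: str                            # required to connect
    retention_days: Optional[int] = None   # None: keep the CloudWatch default
    group_by: str = "category"

    def validate(self) -> None:
        if not self.region:
            raise ValueError("region is required")
        if self.group_by not in GROUP_BY_VALUES:
            raise ValueError(f"groupBy must be one of {GROUP_BY_VALUES}")
        # Assumption: any positive number of days is accepted.
        if self.retention_days is not None and self.retention_days <= 0:
            raise ValueError("retentionDays must be positive")


def log_group_name(output: CloudwatchOutput, record: dict) -> str:
    """Pick the log group for a record from the meta-data key named by groupBy."""
    # Assumption: node logs without a namespace fall back to their category.
    return record.get(output.group_by) or record["category"]
```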
|
||
Existing fields: | ||
|
||
- `url`: Not used in production. Sets the `endpoint` parameter in fluentd for use in testing. | ||
- `secret`: AWS credentials, keys `aws_access_key_id` and `aws_secret_access_key`. | ||
|
||
**Note**: The installer UI (Addon or OLM) can get AWS credentials from a `cloudcredential.openshift.io/v1` resource. | ||
The user then only has to provide a `region` to enable CloudWatch forwarding for a cluster. | ||
Details are out of scope for this proposal. | ||
|
||
#### Nice To Have: more options for groupBy | ||
|
||
_NOT REQUIRED for initial implementation, noted here for possible extensions._ | ||
|
||
The `groupBy` value is really just the name of a meta-data key in the message. | ||
There is no implementation cost to allowing arbitrary meta-data to be used as a group name. | ||
However, the choices should be restricted for safety and simplicity. | ||
|
||
A "safe" key must have values that: | ||
|
||
1. are valid CloudWatch group name strings. | ||
2. will not generate an excessive number of groups. | ||
3. are constant for messages in the same *log stream* (a stream belongs to only one group). | ||
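
The three criteria can be checked against a sample of records, as in the sketch below. The `stream` field name and the `max_groups` threshold are hypothetical choices for illustration. The character set reflects the CloudWatch rule that group names are 1 to 512 characters drawn from letters, digits, `_`, `-`, `/`, `.` and `#`.

```python
import re
from collections import defaultdict

# CloudWatch log group names: 1-512 chars from a-z, A-Z, 0-9, '_', '-', '/', '.', '#'.
GROUP_NAME_RE = re.compile(r"[A-Za-z0-9_\-/.#]{1,512}")


def check_group_key(records, key, stream_key="stream", max_groups=100):
    """Return a list of safety problems with using `key` as the group name.

    `stream_key` and `max_groups` are illustrative, not part of the proposal.
    """
    problems = []
    all_groups = set()
    groups_per_stream = defaultdict(set)
    for rec in records:
        value = str(rec.get(key, ""))
        # Criterion 1: the value must be a valid group name.
        if not GROUP_NAME_RE.fullmatch(value):
            problems.append(f"invalid group name: {value!r}")
        all_groups.add(value)
        groups_per_stream[rec[stream_key]].add(value)
    # Criterion 2: the key must not produce too many groups.
    if len(all_groups) > max_groups:
        problems.append(f"too many groups: {len(all_groups)}")
    # Criterion 3: every stream must map to exactly one group.
    for stream, groups in groups_per_stream.items():
        if len(groups) > 1:
            problems.append(f"stream {stream!r} maps to {len(groups)} groups")
    return problems
```

An empty result means the sampled values passed all three checks; it does not prove the key is safe for values outside the sample.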
|
||
The following keys are safe and would be useful: | ||
|
||
- kubernetes.labels.`<key>`: Use pod label value with key `<key>` | ||
- openshift.labels.`<key>`: Use label added by the openshift log forwarder | ||
|
||
Other keys should be considered case by case, for example: | ||
|
||
- `message` is definitely *not* safe, fails all safety requirements. | ||
- `ip_addr` is safe (node cardinality), but debatable if it would ever be useful. | ||
- `hostname` is safe (node cardinality), and probably more useful than ip_addr but still debatable. | ||
- etc. | ||
|
||
### User Stories | ||
|
||
#### I want to forward logs to Cloudwatch instead of a local store | ||
|
||
``` | ||
apiVersion: "logging.openshift.io/v1" | ||
kind: "ClusterLogForwarder" | ||
spec: | ||
  outputs: | ||
  - name: CloudwatchOut | ||
    type: cloudwatch | ||
    cloudwatch: | ||
      region: myregion | ||
    secret: | ||
      name: mysecret | ||
  pipelines: | ||
  # A single pipeline: inputRefs and outputRefs belong to the same list item. | ||
  - inputRefs: [application, infrastructure, audit] | ||
    outputRefs: [CloudwatchOut] | ||
``` | ||
|
||
#### I want to group application logs by namespace | ||
|
||
To group by namespace name: | ||
|
||
``` | ||
apiVersion: "logging.openshift.io/v1" | ||
kind: "ClusterLogForwarder" | ||
spec: | ||
  outputs: | ||
  - name: CloudwatchOut | ||
    type: cloudwatch | ||
    cloudwatch: | ||
      region: myregion | ||
      groupBy: namespace_name | ||
    secret: | ||
      name: mysecret | ||
  pipelines: | ||
  - inputRefs: [application, infrastructure, audit] | ||
    outputRefs: [CloudwatchOut] | ||
``` | ||
|
||
To group by namespace UUID, replace `namespace_name` with `namespace_uuid`. | ||
|
||
### Implementation Details | ||
|
||
The implementation uses the [fluentd cloudwatch plugin][plugin]. | ||
Most of the CRD fields have an obvious corresponding fluentd parameter. | ||
|
||
Set `auto_create_stream true` to create streams and groups on the fly. | ||
|
||
`log_stream_name` is set to the `containerID` of the log record. | ||
|
||
Static log group names are set using `log_group_name`. | ||
To take the group name from the record instead, `log_group_name_key` selects the log category, namespace, or label. | ||
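
A minimal sketch of the kind of fluentd `<match>` block the operator might generate from these settings is shown below. The function name, the tag pattern, and the record keys passed in are hypothetical. `@type cloudwatch_logs` and `log_stream_name_key` are not named in this proposal and should be checked against the plugin README before use.

```python
def render_cloudwatch_match(region: str, group_key: str, stream_key: str,
                            tag_pattern: str = "**") -> str:
    """Render a fluentd <match> block for the CloudWatch output (illustrative)."""
    lines = [
        f"<match {tag_pattern}>",
        "  @type cloudwatch_logs",                 # assumed plugin type name
        f"  region {region}",
        "  auto_create_stream true",               # create streams and groups on the fly
        f"  log_group_name_key {group_key}",       # record key holding the group name
        f"  log_stream_name_key {stream_key}",     # assumed: record key holding the stream name
        "</match>",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical record keys, for demonstration only.
    print(render_cloudwatch_match("us-east-1", "group_name", "stream_name"))
```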
|
||
### Open Questions | ||
|
||
Which (if any) of the nice-to-have features should be included, and when? | ||
|
||
Is the cluster logging operator responsible for deleting log streams that are no longer in use, for example when containers are deleted? | ||
|
||
Is EKS authentication a requirement? If so, we need to define appropriate `secret` keys. | ||
|
||
### Risks and Mitigations | ||
|
||
[Cloudwatch quota][quota] can be exceeded if insufficiently granular streams are configured. | ||
We configure a stream-per-container which is the finest granularity we have for logging. | ||
|
||
- 5 requests per second per log stream. Additional requests are throttled. This quota can't be changed. | ||
- The maximum batch size of a PutLogEvents request is 1MB. | ||
- 800 transactions per second per account per Region, except for the following Regions where the quota is 1500 transactions per second per account per Region: US East (N. Virginia), US West (Oregon), and Europe (Ireland). You can request a quota increase. | ||
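
As a back-of-the-envelope reading of these quotas (not a capacity plan), the arithmetic below shows the maximum ingest rate of a single stream and how many streams writing at the per-stream limit would use up the account-level quota. It takes "1MB" to mean 1,048,576 bytes.

```python
REQUESTS_PER_SEC_PER_STREAM = 5      # fixed quota per log stream
MAX_BATCH_BYTES = 1_048_576          # maximum PutLogEvents batch size (1MB)
ACCOUNT_TPS_DEFAULT = 800            # per account per Region
ACCOUNT_TPS_LARGE_REGIONS = 1500     # N. Virginia, Oregon, Ireland


def max_stream_bytes_per_sec() -> int:
    """Upper bound on bytes per second that one stream can accept."""
    return REQUESTS_PER_SEC_PER_STREAM * MAX_BATCH_BYTES


def streams_to_saturate(account_tps: int) -> int:
    """Streams sending at the per-stream limit needed to reach the account quota."""
    return account_tps // REQUESTS_PER_SEC_PER_STREAM


if __name__ == "__main__":
    print(max_stream_bytes_per_sec())                      # 5242880
    print(streams_to_saturate(ACCOUNT_TPS_DEFAULT))        # 160
    print(streams_to_saturate(ACCOUNT_TPS_LARGE_REGIONS))  # 300
```

A stream per container keeps each stream well below its own limit, but a cluster with many busy containers can still reach the account-level quota.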
|
||
## Design Details | ||
|
||
### Test Plan | ||
|
||
- E2E tests: Need access to AWS logging accounts. | ||
- Functional tests: can we use the fluentd `in_cloudwatch_logs` input as a dummy CloudWatch server? | ||
|
||
### Graduation Criteria | ||
|
||
- Initially release as [beta][maturity-levels] tech-preview to internal customers. | ||
- GA when internal customers are satisfied. | ||
|
||
### Version Skew Strategy | ||
|
||
Not coupled to other components. | ||
|
||
## References | ||
|
||
- [Amazon CloudWatch][aws-cw] | ||
- [Amazon CloudWatch Logs Concepts][concepts] | ||
- [CloudWatch Logs Plugin for Fluentd][plugin] | ||
- [Maturity Levels][maturity-levels] | ||
- [CloudWatch Logs quotas][quota] | ||
|
||
[aws-cw]: https://docs.aws.amazon.com/cloudwatch/index.html "[Amazon CloudWatch]" | ||
[concepts]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatchLogsConcepts.html "[Amazon CloudWatch Logs Concepts]" | ||
[plugin]: https://github.com/fluent-plugins-nursery/fluent-plugin-cloudwatch-logs "[CloudWatch Logs Plugin for Fluentd]" | ||
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions "[Maturity Levels]" | ||
[quota]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html "[CloudWatch Logs quotas - Amazon CloudWatch Logs]" |