From 3f61bf068bff3f2baf6e6a9c6e0a463eff0f980e Mon Sep 17 00:00:00 2001 From: Alan Conway Date: Wed, 12 May 2021 14:10:39 -0400 Subject: [PATCH] Simplify and clarify JSON forwarding proposal, better examples. --- .../forwarding-json-structured-logs.md | 271 ++++++------------ 1 file changed, 89 insertions(+), 182 deletions(-) diff --git a/enhancements/cluster-logging/forwarding-json-structured-logs.md b/enhancements/cluster-logging/forwarding-json-structured-logs.md index 2ef3fe7f8e9..5028a0678a2 100644 --- a/enhancements/cluster-logging/forwarding-json-structured-logs.md +++ b/enhancements/cluster-logging/forwarding-json-structured-logs.md @@ -28,74 +28,48 @@ see-also: - [X] Enhancement is `implementable` - [X] Design details are appropriately documented from clear requirements -- [ ] Test plan is defined -- [ ] Graduation criteria for dev preview, tech preview, GA +- [X] Test plan is defined +- [X] Graduation criteria for dev preview, tech preview, GA - [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) ## Summary -This enhancement will allow structured JSON log entries to be forwarded as JSON objects in JSON output records. +This enhancement will allow structured JSON log entries to be forwarded as JSON objects in JSON output records: -The current logging [data model][data_model] stores the log entry as a JSON *string*, not a JSON *object*. -Consumers can't access the log entry fields without a second JSON parse of this string. +* Add a log field called `structured` to hold JSON log entries as objects. +* Extend `ClusterLogForwarder` input API to enable JSON parsing. +* Extend `ClusterLogForwarder` elasticsearch output API to select an index for structured logs. +* Replaces the [defunct MERGE_JSON_LOG][defunct_merge] feature. -The current implementation also 'flattens' labels to work around Elasticsearch limitations. +For example, given this structured log entry: -To illustrate, given a log entry `{"name":"fred","home":"bedrock"}` from a container with the label `app.kubernetes.io/name="flintstones"`. -The current output record looks like: - -```json -{ - "message":"{\"name\":\"fred\",\"home\":\"bedrock\"}", - "kubernetes":{"flat_labels":["app_kubernetes_io/name=flintstones", ...]}, - ... other metadata -} +``` json +{"level":"info","name":"fred","home":"bedrock"} ``` -This proposal enables an alternate form of output record including a `structured` object field for JSON log entries. -For Elasticsearch outputs, new configuration is added to determine the index to use. +The current forwarded log record looks like this: -```json -{ - "structured":{ - "name":"fred", - "home":"bedrock", - }", - "kubernetes":{"labels":{"app.kubernetes.io/name": "flintstones", ...}}, - ... -} +``` json +{"message":"{\"level\":\"info\",\"name\":\"fred\",\"home\":\"bedrock\"", + "more fields..."} ``` -This proposal describes - -* extensions to the logging [data model][data_model] - `structured` field. -* extensions to the `ClusterLogForwarder` API to configure JSON parsing and forwarding. -* extensions to the `ClusterLogForwarder` API to select an index. -* indexing structured records in current and future Elasticsearch stores. -* replacing the [defunct MERGE_JSON_LOG][defunct_merge] feature. - -**Note**: This proposal focuses on JSON, but the data model and API changes can apply to other structured formats that may be supported in future. +The proposed new record with JSON parsing enabled looks like this: -## Terminology - -Logging terminology can have overlapping meanings, this document uses the following terms with specific meanings: - -- *Consumer*: A destination for logs, identified by a forwarder `output`. - The default consumer is the *Elasticsearch Log Store* -- *Entry*: a single log entry, usually a single line of text. May be JSON or other format. -- *Structured Log*: a log where each *entry* is formatted as a *JSON object*. -- *Record*: A key-value record including the *entry* and meta-data collected by the logging system. -- *Data Model*: [Description][data_model] of the forwarder's own *record* format. - -Elasticsearch *Index*: an index receives a stream of logs with similar JSON format. +``` json +{"structued": { "level": "info", "name": "fred", "home": "bedrock" }, + "more fields..."} +``` ## Motivation ### Goals -* Direct access to JSON log entries as JSON objects for `fluentdForward`, `Elasticsearch` or any other JSON-aware consumer. -* Elasticsearch outputs can associate index names with log records by category, namespace, k8s label, or`input` selection criteria. -* Upgrade path from today's implementation. +* Direct access to JSON log entries as JSON objects + * for `fluentdForward`, `Elasticsearch` or any other JSON-aware consumer. +* Elasticsearch outputs can direct logs with different JSON formats to different indices + * based on category, k8s label, or input selectors. +* Backwards compatible with today's implementation. ### Non-Goals @@ -111,14 +85,13 @@ One new top-level fields added to the logging output record [data model][data_mo Relationship between `structured` and `message` fields: -* `message` field is always present for backwards compatibility (a future proposal will address filtering out unwanted fields in the record) -* Both `message` and `structured` fields may be present. +* Both `message` and `structured` fields *may* be present. + * For first release, `message` is always present for backwards compatibility. + * In future releases `message` may be removed when `structured` is present. * If there is no structured data, the `structured` field will be missing or empty * If `message` and `structured` are both present and non-empty, `message` MUST contain the JSON-quoted string equivalent of the `structured` value. -*Elasticsearch* outputs MAY specify special index names for records, see below. - -#### ClusterLogForwarder configuration +### ClusterLogForwarder configuration New *pipeline* field: @@ -128,154 +101,104 @@ New *pipeline* field: For each log entry; if `parse: json` is set _and_ the entry is valid JSON, the output record will include a `structured` field _equivalent_ to the JSON entry. It may differ in field order and use of white-space. -All record will have a `message` field for backwards compatibility. +### Output type elasticsearch -#### Output configuration for type:elasticsearch +For most output types it is sufficient to enable `parse: json` to forward JSON data. -New fields for output type `elasticsearch` -* `indexKey`: (string, optional) Use value of meta-data key as `index` value, if present. - These keys are supported: - - `kubernetes.namespaceName`: Use the namespace name as the index name. - - `kubernetes.labels.`: Use the string value of kubernetes label with key ``. - - `openshift.labels.`: Use the string value of an openshift label with key `` (see [](forwarder-tagging.md)) - - *other keys may be added in future* -* `indexName`: (string, optional) - * If `indexKey` is not set, or the key is missing, use `indexName` as the `index` value. - * If `indexKey` is set and the key is present, that takes precedence over `indexName` +Elasticsearch is a special case; JSON records with _different formats_ must go to different indices, otherwise type conflicts and cardinality problems can occur. -**Note:** The Elasticsearch output will _delete_ the `structured` field if the rules above do not yield a non-empty index name. -This is to avoid sending randomly structured JSON to shared indices. +**Note**: It is important not to overload elasticsearch with too many indices. +Create new indices for each different log _format_, **not** for each application or namespace. +For example, most programs from Apache log with the same JSON format. +All such logs should go to one index, even if they come from different namespaces or applications. +Once stored, Elasticsearch queries can separate logs by application, namespace and many other criteria. -Indices are created automatically on-demand by co-operation between the forwarder and the managed default store. +The following new sub-fields in the `elasticsearch` output field allow JSON records to be directed to different indices. -#### Default output configuration +* `structuredIndexName`: (string, optional) Elasticsearch index for _structured_ records.\ + **Note**: Only records with a `structured` field are sent to this index. + Other records are indexed as usual in the application, infrastructure or audit indices. -To allow index settings on the default elasticsearch output, there is a new section in the ClusterLogForwarder spec: +* `structuredIndexKey`: (string, optional) Use the value of this meta-data key as the index.\ + Like `structuredIndexName`, but the index name is the value of a meta-data key from the record. These keys are supported: + - `kubernetes.labels.`: Use the string value of kubernetes label with key ``. + - `openshift.labels.`: Use the string value of an openshift label with key `` (see [](forwarder-tagging.md)) + +**Note:** If both `structuredIndexName` and `structuredIndexKey` are set, +the `structuredIndexKey` value is used if it is not empty, `structuredIndexName` otherwise. -* outputDefaults (type map, optional): - Each map under outputDefaults MUST have a `type` field. - All other fields are used as defaults for any outputs of the same `type` _including_ the `default` output +**Note:** If the rules above do not give a non-empty index name, the record will be forwarded _unstructured_: with a `message` field and no `structured` field. +This avoids sending unknown JSON records to shared indices. + +Structured indices are created automatically by the managed default store. +In order to forward to an external Elasticsearch instance, indices must be created in advance. ### User Stories -#### I want the default Elasticsearch to index by k8s label on the source pod +#### I want to forward JSON logs to a remote destination that is not Elasticsearch + +This example shows a remote `fluentd` but the same applies to any output other than Elasticsearch: ```yaml -outputDefaults: -- type: elasticsearch - indexKey: kubernetes.labels.myIndex +outputs: +- name: myFluentd + type: fluentdForward + url: ... pipelines: - inputRefs: [ application ] - outputRefs: default + outputRefs: myFluentd structured: json ``` -For example, this log record will go to the index `flintstones`: +#### I want to forward to default Elasticsearch, using a k8s pod label to determine the index -```json -{ - "structured":{ - "name":"fred", - "home":"bedrock", - }", - "kubernetes":{"labels":{"myIndex": "flintstones", ...}}, - ... -} -``` +For elasticsearch outputs, we must separate logs with different formats into different indices. +Lets assume that: -Logs from pods with no `myIndex` label will go to the `app` index like any unstructured `message` record. - -```json -{ - "message":"{\"name\":\"fred\",\"home\":\"bedrock\"}", - ... other metadata -} -``` - -#### I want input selector to set the elasticsearch index - -Pipelines can set an `openshift` label on records from input selectors. -This is used the same way as the kubernetes label in the previous example: +* Applications log in two structured JSON formats called "apache" and "google". +* User labels pods using those formats with `logFormat=apache` or `logFormat=google` +* Use the following forwarder configuration: ```yaml outputDefaults: -- type: elasticsearch - indexKey: openshift.labels.myIndex - -inputs: -- name: InputItchy - application: - namespaces: [ fred, bob ] -- name: InputScratchy - application: - namespaces: [ jill, jane ] - -pipeline: -- inputRefs: [ InputItchy ] - outputRefs: [ default ] - structured: json - labels: { myIndex: itchy } -- inputRefs: [ InputScratchy ] - outputRefs: [ default ] +- elasticsearch: + structuredIndexKey: kubernetes.labels.logFormat + +pipelines: +- inputRefs: [ application ] + outputRefs: default structured: json - labels: {myIndex: scratchy } ``` -Would produce records like: +This structured log record will go to the index `apache`: ```json { - "structured":{ - "name":"fred", - "home":"bedrock", - }", - "openshift": {"labels":{ "myIndex":"itchy" } } - "kubernetes": {"namespace_name":"fred", ...}}, - ... -} - -{ - "structured":{ - "name":"fred", - "home":"bedrock", - }", - "openshift": {"labels":{"myIndex":"scratchy" } } - "kubernetes":{"namespace_name":"jill", ...}, - ... -} -``` - -#### I want a replacement for the defunct MERGE_JSON_LOG feature - -Setting `parse: json` provides all the information formerly available via [MERGE_JSON_LOG][defunct_merge]. - -```yaml -outputs: -- name: OutputJSON - type: fluentdForward -pipelines: -- inputRefs: [ application ] - outputRefs: [ OutputJSON ] - structured: json + "structured":{"name":"fred","home":"bedrock"}, + "kubernetes":{"labels":{"logFormat": "apache", ...}} +!} ``` -Would produce records like: +This structured log record will go to the index `google`: -```yaml +```json { - "structured":{ - "name":"fred", - "home":"bedrock", - }", - ... + "structured":{"name":"wilma","home":"bedrock"}, + "kubernetes":{"labels":{"logFormat": "google", ...}} } ``` -**Note**: +**Note**: Only _structured_ logs with a `logForward` label go to the `logForward` index. +All others go to the default application index as _unstructured_ records, including: + +* Records with missing or empty "logFormat" label. +* Records that could not be parsed as JSON, _even if_ they have a `logFormat` label. + +#### I want a replacement for the defunct MERGE_JSON_LOG feature -* *Not a drop-in replacement* - log entry field `home` is available `structured.home`, not just `home` -* With the Elasticsearch store you should not forward all JSON logs to the same index, see the other use cases. +Setting `parse: json` provides all the information formerly available via [MERGE_JSON_LOG][defunct_merge], but in a slightly different format. +For example a log entry field `name` is available `structured.name` in the forwarded records. ### Implementation Details @@ -293,19 +216,6 @@ If a single input feeds structured and non-structured pipelines, use duplicate i The Elasticsearch output needs to take extra measures: -##### Flattening the `kubernetes.labels` map - -* Transform key:value object into array of "NAME=VALUE" strings. -* Replace '.' with '_' in label names. -* Use field name `flat_labels` instead of `labels` - -This makes label keys difficult to use for a consumer. -There is no automatic way to reverse the process since '.' and '_' are both legal characters in label names. - -The new forwarder will check the version of the Elasticsearch Operator (via labels or annotations, TBD). -Flattening to `flat_labels` will be enabled if the ELO is identified as a 'legacy' version. -Other output types, and future ELO versions based on the [pipeline proposal][es-pipeline] will receive a field named `labels` with the unmodified labels. - ##### Elasticsearch Indexing The current implementation relies on a fixed set of rollover indices and aliases being set up on Elasticsearch. @@ -347,9 +257,6 @@ Mitigation: * Need sync the [Elasticsearch pipeline proposal][es-pipeline] with this document, update both if needed. * Exact requirements for flattening and de-dottting with existing ES. See [this PR comment](https://github.com/openshift/enhancements/pull/518#issuecomment-749564743) -* Sidecar injection: multi-container pods, especially those with injected sidecar images, - may have different log formats for each container. -* Future enhancement proposals may add more capabilities to edit/filter a structured record for output. #### Examples