From 3f61bf068bff3f2baf6e6a9c6e0a463eff0f980e Mon Sep 17 00:00:00 2001
From: Alan Conway <aconway@redhat.com>
Date: Wed, 12 May 2021 14:10:39 -0400
Subject: [PATCH] Simplify and clarify JSON forwarding proposal, better
 examples.

---
 .../forwarding-json-structured-logs.md        | 271 ++++++------------
 1 file changed, 89 insertions(+), 182 deletions(-)

diff --git a/enhancements/cluster-logging/forwarding-json-structured-logs.md b/enhancements/cluster-logging/forwarding-json-structured-logs.md
index 2ef3fe7f8e9..5028a0678a2 100644
--- a/enhancements/cluster-logging/forwarding-json-structured-logs.md
+++ b/enhancements/cluster-logging/forwarding-json-structured-logs.md
@@ -28,74 +28,48 @@ see-also:
 
 - [X] Enhancement is `implementable`
 - [X] Design details are appropriately documented from clear requirements
-- [ ] Test plan is defined
-- [ ] Graduation criteria for dev preview, tech preview, GA
+- [X] Test plan is defined
+- [X] Graduation criteria for dev preview, tech preview, GA
 - [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
 
 ## Summary
 
-This enhancement will allow structured JSON log entries to be forwarded as JSON objects in JSON output records.
+This enhancement will allow structured JSON log entries to be forwarded as JSON objects in JSON output records:
 
-The current logging [data model][data_model] stores the log entry as a JSON *string*, not a JSON *object*.
-Consumers can't access the log entry fields without a second JSON parse of this string.
+* Add a log field called `structured` to hold JSON log entries as objects.
+* Extend the `ClusterLogForwarder` pipeline API to enable JSON parsing.
+* Extend `ClusterLogForwarder` elasticsearch output API to select an index for structured logs.
+* Replace the [defunct MERGE_JSON_LOG][defunct_merge] feature.
 
-The current implementation also 'flattens' labels to work around Elasticsearch limitations.
+For example, given this structured log entry:
 
-To illustrate, given a log entry `{"name":"fred","home":"bedrock"}` from a container with the label `app.kubernetes.io/name="flintstones"`.
-The current output record looks like:
-
-```json
-{
-  "message":"{\"name\":\"fred\",\"home\":\"bedrock\"}",
-  "kubernetes":{"flat_labels":["app_kubernetes_io/name=flintstones", ...]},
-  ... other metadata
-}
+``` json
+{"level":"info","name":"fred","home":"bedrock"}
 ```
 
-This proposal enables an alternate form of output record including a `structured` object field for JSON log entries.
-For Elasticsearch outputs, new configuration is added to determine the index to use.
+The current forwarded log record looks like this:
 
-```json
-{
-  "structured":{
-    "name":"fred",
-    "home":"bedrock",
-  }",
-  "kubernetes":{"labels":{"app.kubernetes.io/name": "flintstones", ...}},
-  ...
-}
+``` json
+{"message":"{\"level\":\"info\",\"name\":\"fred\",\"home\":\"bedrock\"}",
+ "more fields..."}
 ```
 
-This proposal describes
-
-* extensions to the logging [data model][data_model] - `structured` field.
-* extensions to the `ClusterLogForwarder` API to configure JSON parsing and forwarding.
-* extensions to the `ClusterLogForwarder` API to select an index.
-* indexing structured records in current and future Elasticsearch stores.
-* replacing the [defunct MERGE_JSON_LOG][defunct_merge] feature.
-
-**Note**: This proposal focuses on JSON, but the data model and API changes can apply to other structured formats that may be supported in future.
+The proposed new record with JSON parsing enabled looks like this:
 
-## Terminology
-
-Logging terminology can have overlapping meanings, this document uses the following terms with specific meanings:
-
-- *Consumer*: A destination for logs, identified by a forwarder `output`.
-  The default consumer is the *Elasticsearch Log Store*
-- *Entry*: a single log entry, usually a single line of text. May be JSON or other format.
-- *Structured Log*: a log where each *entry* is formatted as a *JSON object*.
-- *Record*: A key-value record including the *entry* and meta-data collected by the logging system.
-- *Data Model*: [Description][data_model] of the forwarder's own *record* format.
-
-Elasticsearch *Index*: an index receives a stream of logs with similar JSON format.
+``` json
+{"structured": { "level": "info", "name": "fred", "home": "bedrock" },
+ "more fields..."}
+```
 
 ## Motivation
 
 ### Goals
 
-* Direct access to JSON log entries as JSON objects for `fluentdForward`, `Elasticsearch` or any other JSON-aware consumer.
-* Elasticsearch outputs can associate index names with log records by category, namespace, k8s label, or`input` selection criteria.
-* Upgrade path from today's implementation.
+* Direct access to JSON log entries as JSON objects
+  * for `fluentdForward`, `Elasticsearch` or any other JSON-aware consumer.
+* Elasticsearch outputs can direct logs with different JSON formats to different indices
+  * based on category, k8s label, or input selectors.
+* Backwards compatible with today's implementation.
 
 ### Non-Goals
 
@@ -111,14 +85,13 @@ One new top-level fields added to the logging output record [data model][data_mo
 
 Relationship between `structured` and `message` fields:
 
-* `message` field is always present for backwards compatibility (a future proposal will address filtering out unwanted fields in the record)
-* Both `message` and `structured` fields may be present.
+* Both `message` and `structured` fields *may* be present.
+  * For the first release, `message` is always present for backwards compatibility.
+  * In future releases, `message` may be removed when `structured` is present.
 * If there is no structured data, the `structured` field will be missing or empty
 * If `message` and `structured` are both present and non-empty, `message` MUST contain the JSON-quoted string equivalent of the `structured` value.
 
-*Elasticsearch* outputs MAY specify special index names for records, see below.
-
-#### ClusterLogForwarder configuration
+### ClusterLogForwarder configuration
 
 New *pipeline* field:
 
@@ -128,154 +101,104 @@ New *pipeline* field:
 For each log entry; if `parse: json` is set _and_ the entry is valid JSON, the output record will include a `structured` field _equivalent_ to the JSON entry.
 It may differ in field order and use of white-space.
 
-All record will have a `message` field for backwards compatibility.
+### Output type elasticsearch
 
-#### Output  configuration for type:elasticsearch
+For most output types it is sufficient to enable `parse: json` to forward JSON data.
 
-New fields for output type `elasticsearch`
-* `indexKey`: (string, optional) Use value of meta-data key as `index` value, if present.
-  These keys are supported:
-  - `kubernetes.namespaceName`: Use the namespace name as the index name.
-  - `kubernetes.labels.<key>`: Use the string value of kubernetes label with key `<key>`.
-  - `openshift.labels.<key>`: Use the string value of an openshift label with key `<key>` (see [](forwarder-tagging.md))
-  - *other keys may be added in future*
-* `indexName`: (string, optional)
-  * If `indexKey` is not set, or the key is missing, use `indexName` as the `index` value.
-  * If `indexKey` is set and the key is present, that takes precedence over `indexName`
+Elasticsearch is a special case; JSON records with _different formats_ must go to different indices, otherwise type conflicts and cardinality problems can occur.
 
-**Note:** The Elasticsearch output will _delete_ the `structured` field if the rules above do not yield a non-empty index name.
-This is to avoid sending randomly structured JSON to shared indices.
+**Note**: It is important not to overload Elasticsearch with too many indices.
+Create a new index for each different log _format_, **not** for each application or namespace.
+For example, most Apache programs log in the same JSON format.
+All such logs should go to one index, even if they come from different namespaces or applications.
+Once stored, Elasticsearch queries can separate logs by application, namespace and many other criteria.
 
-Indices are created automatically on-demand by co-operation between the forwarder and the managed default store.
+The following new sub-fields in the `elasticsearch` output field allow JSON records to be directed to different indices.
 
-#### Default output configuration
+* `structuredIndexName`: (string, optional) Elasticsearch index for _structured_ records.\
+  **Note**: Only records with a `structured` field are sent to this index.
+  Other records are indexed as usual in the application, infrastructure or audit indices.
 
-To allow index settings on the default elasticsearch output, there is a new section in the ClusterLogForwarder spec:
+* `structuredIndexKey`: (string, optional) Use the value of this meta-data key as the index.\
+  Like `structuredIndexName`, but the index name is the value of a meta-data key from the record. These keys are supported:
+  - `kubernetes.labels.<key>`: Use the string value of kubernetes label with key `<key>`.
+  - `openshift.labels.<key>`: Use the string value of an openshift label with key `<key>` (see [forwarder tagging](forwarder-tagging.md))
+
+**Note:** If both `structuredIndexName` and `structuredIndexKey` are set,
+the `structuredIndexKey` value is used if it is not empty, `structuredIndexName` otherwise.
 
-* outputDefaults (type map, optional):
-  Each map under outputDefaults MUST have a `type` field.
-  All other fields are used as defaults for any outputs of the same `type` _including_ the `default` output
+**Note:** If the rules above do not give a non-empty index name, the record will be forwarded _unstructured_: with a `message` field and no `structured` field.
+This avoids sending unknown JSON records to shared indices.
+
+Structured indices are created automatically by the managed default store.
+In order to forward to an external Elasticsearch instance, indices must be created in advance.
 
 ### User Stories
 
-#### I want the  default Elasticsearch to index by k8s label on the source pod
+#### I want to forward JSON logs to a remote destination that is not Elasticsearch
+
+This example shows a remote `fluentd` but the same applies to any output other than Elasticsearch:
 
 ```yaml
-outputDefaults:
-- type: elasticsearch
-  indexKey: kubernetes.labels.myIndex
+outputs:
+- name: myFluentd
+  type: fluentdForward
+  url: ...
 
 pipelines:
 - inputRefs: [ application ]
-  outputRefs: default
+  outputRefs: [ myFluentd ]
   structured: json
 ```
 
-For example, this log record will go to the index `flintstones`:
+#### I want to forward to default Elasticsearch, using a k8s pod label to determine the index
 
-```json
-{
-  "structured":{
-    "name":"fred",
-    "home":"bedrock",
-  }",
-  "kubernetes":{"labels":{"myIndex": "flintstones", ...}},
-  ...
-}
-```
+For Elasticsearch outputs, we must separate logs with different formats into different indices.
+Let's assume that:
 
-Logs from pods with no `myIndex` label will go to the `app` index like any unstructured `message` record.
-
-```json
-{
-  "message":"{\"name\":\"fred\",\"home\":\"bedrock\"}",
-  ... other metadata
-}
-```
-
-#### I want input selector to set the elasticsearch index
-
-Pipelines can set an `openshift` label on records from input selectors.
-This is used the same way as the kubernetes label in the previous example:
+* Applications log in two structured JSON formats called "apache" and "google".
+* The user labels pods using those formats with `logFormat=apache` or `logFormat=google`.
+* The user applies the following forwarder configuration:
 
 ```yaml
 outputDefaults:
-- type: elasticsearch
-  indexKey: openshift.labels.myIndex
-
-inputs:
-- name: InputItchy
-  application:
-    namespaces: [ fred, bob ]
-- name: InputScratchy
-  application:
-    namespaces: [ jill, jane ]
-
-pipeline:
-- inputRefs: [ InputItchy ]
-  outputRefs: [ default ]
-  structured: json
-  labels: { myIndex: itchy }
-- inputRefs: [ InputScratchy ]
-  outputRefs: [ default ]
+- elasticsearch:
+    structuredIndexKey: kubernetes.labels.logFormat
+
+pipelines:
+- inputRefs: [ application ]
+  outputRefs: [ default ]
   structured: json
-  labels: {myIndex: scratchy }
 ```
 
-Would produce records like:
+This structured log record will go to the index `apache`:
 
 ```json
 {
-  "structured":{
-    "name":"fred",
-    "home":"bedrock",
-  }",
-  "openshift": {"labels":{ "myIndex":"itchy" } }
-  "kubernetes": {"namespace_name":"fred", ...}},
-  ...
-}
-
-{
-  "structured":{
-    "name":"fred",
-    "home":"bedrock",
-  }",
-  "openshift": {"labels":{"myIndex":"scratchy" } }
-  "kubernetes":{"namespace_name":"jill", ...},
-  ...
-}
-```
-
-#### I want a replacement for the defunct MERGE_JSON_LOG feature
-
-Setting `parse: json` provides all the information formerly available via [MERGE_JSON_LOG][defunct_merge].
-
-```yaml
-outputs:
-- name: OutputJSON
-  type: fluentdForward
-pipelines:
-- inputRefs: [ application ]
-  outputRefs: [ OutputJSON ]
-  structured: json
+  "structured":{"name":"fred","home":"bedrock"},
+  "kubernetes":{"labels":{"logFormat": "apache", ...}}
+}
 ```
 
-Would produce records like:
+This structured log record will go to the index `google`:
 
-```yaml
+```json
 {
-  "structured":{
-    "name":"fred",
-    "home":"bedrock",
-  }",
-  ...
+  "structured":{"name":"wilma","home":"bedrock"},
+  "kubernetes":{"labels":{"logFormat": "google", ...}}
 }
 ```
 
-**Note**:
+**Note**: Only _structured_ logs with a `logFormat` label go to the index named by that label.
+All others go to the default application index as _unstructured_ records, including:
+
+* Records with a missing or empty `logFormat` label.
+* Records that could not be parsed as JSON, _even if_ they have a `logFormat` label.
+
+#### I want a replacement for the defunct MERGE_JSON_LOG feature
 
-* *Not a drop-in replacement* - log entry field `home` is available `structured.home`, not just `home`
-* With the Elasticsearch store you should not forward all JSON logs to the same index, see the other use cases.
+Setting `parse: json` provides all the information formerly available via [MERGE_JSON_LOG][defunct_merge], but in a slightly different format.
+For example, a log entry field `name` is available as `structured.name` in the forwarded records.
 
 ### Implementation Details
 
@@ -293,19 +216,6 @@ If a single input feeds structured and non-structured pipelines, use duplicate i
 
 The Elasticsearch output needs to take extra measures:
 
-##### Flattening the `kubernetes.labels` map
-
-* Transform key:value object into array of "NAME=VALUE" strings.
-* Replace '.' with '_' in label names.
-* Use field name `flat_labels` instead of `labels`
-
-This makes label keys difficult to use for a consumer.
-There is no automatic way to reverse the process since '.' and '_' are both legal characters in label names.
-
-The new forwarder will check the version of the Elasticsearch Operator (via labels or annotations, TBD).
-Flattening to `flat_labels` will be enabled if the ELO is identified as a 'legacy' version.
-Other output types, and future ELO versions based on the [pipeline proposal][es-pipeline] will receive a field named `labels` with the unmodified labels.
-
 ##### Elasticsearch Indexing <a name="es-notes"></a>
 
 The current implementation relies on a fixed set of rollover indices and aliases being set up on Elasticsearch.
@@ -347,9 +257,6 @@ Mitigation:
 
 * Need sync the [Elasticsearch pipeline proposal][es-pipeline] with this document, update both if needed.
 * Exact requirements for flattening and de-dottting with existing ES. See [this PR comment](https://github.com/openshift/enhancements/pull/518#issuecomment-749564743)
-* Sidecar injection: multi-container pods, especially those with injected sidecar images,
-  may have different log formats for each container.
-* Future enhancement proposals may add more capabilities to edit/filter a structured record for output.
 
 #### Examples