Simplify and clarify JSON forwarding proposal, better examples.
alanconway committed May 13, 2021
1 parent 7962498 commit 3f61bf0
Showing 1 changed file with 89 additions and 182 deletions.
271 changes: 89 additions & 182 deletions enhancements/cluster-logging/forwarding-json-structured-logs.md
Expand Up @@ -28,74 +28,48 @@ see-also:

- [X] Enhancement is `implementable`
- [X] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [X] Test plan is defined
- [X] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement will allow structured JSON log entries to be forwarded as JSON objects in JSON output records.
This enhancement will allow structured JSON log entries to be forwarded as JSON objects in JSON output records:

The current logging [data model][data_model] stores the log entry as a JSON *string*, not a JSON *object*.
Consumers can't access the log entry fields without a second JSON parse of this string.
* Add a log field called `structured` to hold JSON log entries as objects.
* Extend `ClusterLogForwarder` input API to enable JSON parsing.
* Extend `ClusterLogForwarder` elasticsearch output API to select an index for structured logs.
* Replace the [defunct MERGE_JSON_LOG][defunct_merge] feature.

The current implementation also 'flattens' labels to work around Elasticsearch limitations.
For example, given this structured log entry:

To illustrate, given a log entry `{"name":"fred","home":"bedrock"}` from a container with the label `app.kubernetes.io/name="flintstones"`.
The current output record looks like:

```json
{
"message":"{\"name\":\"fred\",\"home\":\"bedrock\"}",
"kubernetes":{"flat_labels":["app_kubernetes_io/name=flintstones", ...]},
... other metadata
}
``` json
{"level":"info","name":"fred","home":"bedrock"}
```

This proposal enables an alternate form of output record including a `structured` object field for JSON log entries.
For Elasticsearch outputs, new configuration is added to determine the index to use.
The current forwarded log record looks like this:

```json
{
"structured":{
"name":"fred",
"home":"bedrock"
},
"kubernetes":{"labels":{"app.kubernetes.io/name": "flintstones", ...}},
...
}
``` json
{"message":"{\"level\":\"info\",\"name\":\"fred\",\"home\":\"bedrock\"}",
"more fields..."}
```

This proposal describes

* extensions to the logging [data model][data_model] - `structured` field.
* extensions to the `ClusterLogForwarder` API to configure JSON parsing and forwarding.
* extensions to the `ClusterLogForwarder` API to select an index.
* indexing structured records in current and future Elasticsearch stores.
* replacing the [defunct MERGE_JSON_LOG][defunct_merge] feature.

**Note**: This proposal focuses on JSON, but the data model and API changes can apply to other structured formats that may be supported in future.
The proposed new record with JSON parsing enabled looks like this:

## Terminology

Logging terminology can have overlapping meanings; this document uses the following terms with specific meanings:

- *Consumer*: A destination for logs, identified by a forwarder `output`.
The default consumer is the *Elasticsearch Log Store*
- *Entry*: a single log entry, usually a single line of text. May be JSON or other format.
- *Structured Log*: a log where each *entry* is formatted as a *JSON object*.
- *Record*: A key-value record including the *entry* and meta-data collected by the logging system.
- *Data Model*: [Description][data_model] of the forwarder's own *record* format.

Elasticsearch *Index*: an index receives a stream of logs with similar JSON format.
``` json
{"structured": { "level": "info", "name": "fred", "home": "bedrock" },
"more fields..."}
```

## Motivation

### Goals

* Direct access to JSON log entries as JSON objects for `fluentdForward`, `Elasticsearch` or any other JSON-aware consumer.
* Elasticsearch outputs can associate index names with log records by category, namespace, k8s label, or `input` selection criteria.
* Upgrade path from today's implementation.
* Direct access to JSON log entries as JSON objects
* for `fluentdForward`, `Elasticsearch` or any other JSON-aware consumer.
* Elasticsearch outputs can direct logs with different JSON formats to different indices
* based on category, k8s label, or input selectors.
* Backwards compatible with today's implementation.

### Non-Goals

Expand All @@ -111,14 +85,13 @@ One new top-level fields added to the logging output record [data model][data_mo

Relationship between `structured` and `message` fields:

* `message` field is always present for backwards compatibility (a future proposal will address filtering out unwanted fields in the record)
* Both `message` and `structured` fields may be present.
* Both `message` and `structured` fields *may* be present.
* For first release, `message` is always present for backwards compatibility.
* In future releases `message` may be removed when `structured` is present.
* If there is no structured data, the `structured` field will be missing or empty
* If `message` and `structured` are both present and non-empty, `message` MUST contain the JSON-quoted string equivalent of the `structured` value.
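The relationship between the two fields can be written as a check. The Python sketch below is illustrative only: the name `fields_consistent` is hypothetical and not part of the forwarder, and it assumes that "equivalent" means equal after JSON parsing, so field order and white-space are ignored.

```python
import json


def fields_consistent(record: dict) -> bool:
    """Return True if `message` and `structured` obey the rules above.

    Hypothetical helper for illustration; not part of the forwarder.
    """
    structured = record.get("structured")
    message = record.get("message")
    # A missing or empty field places no constraint on the other one.
    if not structured or not message:
        return True
    try:
        parsed = json.loads(message)
    except json.JSONDecodeError:
        return False
    # Compare parsed values so key order and white-space do not matter.
    return parsed == structured
```

A record whose `message` is plain text and which has no `structured` field passes the check, because the rule only applies when both fields are present and non-empty.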

*Elasticsearch* outputs MAY specify special index names for records, see below.

#### ClusterLogForwarder configuration
### ClusterLogForwarder configuration

New *pipeline* field:

Expand All @@ -128,154 +101,104 @@ New *pipeline* field:
For each log entry; if `parse: json` is set _and_ the entry is valid JSON, the output record will include a `structured` field _equivalent_ to the JSON entry.
It may differ in field order and use of white-space.
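As a rough sketch of this behaviour, the Python function below builds an output record from one log entry. The name `build_record` and its parameters are hypothetical. It also assumes that only JSON *objects* count as structured entries, which the text above does not state explicitly.

```python
import json
from typing import Optional


def build_record(entry: str, parse_json: bool, metadata: Optional[dict] = None) -> dict:
    """Build an output record from one log entry (illustrative sketch)."""
    record = dict(metadata or {})
    # `message` is always present for backwards compatibility.
    record["message"] = entry
    if parse_json:
        try:
            parsed = json.loads(entry)
        except json.JSONDecodeError:
            parsed = None
        # Assumption: only JSON objects are treated as structured entries.
        if isinstance(parsed, dict):
            record["structured"] = parsed
    return record
```

Entries that are not valid JSON are still forwarded, with only the `message` field.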

All records will have a `message` field for backwards compatibility.
### Output type elasticsearch

#### Output configuration for type:elasticsearch
For most output types it is sufficient to enable `parse: json` to forward JSON data.

New fields for output type `elasticsearch`
* `indexKey`: (string, optional) Use value of meta-data key as `index` value, if present.
These keys are supported:
- `kubernetes.namespaceName`: Use the namespace name as the index name.
- `kubernetes.labels.<key>`: Use the string value of kubernetes label with key `<key>`.
- `openshift.labels.<key>`: Use the string value of an openshift label with key `<key>` (see [forwarder-tagging.md](forwarder-tagging.md))
- *other keys may be added in future*
* `indexName`: (string, optional)
* If `indexKey` is not set, or the key is missing, use `indexName` as the `index` value.
* If `indexKey` is set and the key is present, that takes precedence over `indexName`
Elasticsearch is a special case; JSON records with _different formats_ must go to different indices, otherwise type conflicts and cardinality problems can occur.

**Note:** The Elasticsearch output will _delete_ the `structured` field if the rules above do not yield a non-empty index name.
This is to avoid sending randomly structured JSON to shared indices.
**Note**: It is important not to overload elasticsearch with too many indices.
Create new indices for each different log _format_, **not** for each application or namespace.
For example, most Apache programs log in the same JSON format.
All such logs should go to one index, even if they come from different namespaces or applications.
Once stored, Elasticsearch queries can separate logs by application, namespace and many other criteria.

Indices are created automatically on-demand by co-operation between the forwarder and the managed default store.
The following new sub-fields in the `elasticsearch` output field allow JSON records to be directed to different indices.

#### Default output configuration
* `structuredIndexName`: (string, optional) Elasticsearch index for _structured_ records.\
**Note**: Only records with a `structured` field are sent to this index.
Other records are indexed as usual in the application, infrastructure or audit indices.

To allow index settings on the default elasticsearch output, there is a new section in the ClusterLogForwarder spec:
* `structuredIndexKey`: (string, optional) Use the value of this meta-data key as the index.\
Like `structuredIndexName`, but the index name is the value of a meta-data key from the record. These keys are supported:
- `kubernetes.labels.<key>`: Use the string value of kubernetes label with key `<key>`.
- `openshift.labels.<key>`: Use the string value of an openshift label with key `<key>` (see [forwarder-tagging.md](forwarder-tagging.md))

**Note:** If both `structuredIndexName` and `structuredIndexKey` are set,
the `structuredIndexKey` value is used if it is not empty, `structuredIndexName` otherwise.

* outputDefaults (type map, optional):
Each map under outputDefaults MUST have a `type` field.
All other fields are used as defaults for any outputs of the same `type` _including_ the `default` output
**Note:** If the rules above do not give a non-empty index name, the record will be forwarded _unstructured_: with a `message` field and no `structured` field.
This avoids sending unknown JSON records to shared indices.
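The index selection rules above can be sketched in Python as follows. This is not the forwarder's implementation: the names `lookup` and `route` are hypothetical, and returning `None` as the index is just this sketch's way of saying "use the usual application, infrastructure or audit index".

```python
from typing import Optional, Tuple


def lookup(record: dict, key: str) -> Optional[str]:
    """Resolve `kubernetes.labels.<key>` or `openshift.labels.<key>`."""
    for prefix in ("kubernetes.labels.", "openshift.labels."):
        if key.startswith(prefix):
            section = prefix.split(".")[0]
            labels = record.get(section, {}).get("labels", {})
            # Label keys may contain dots, so keep the remainder whole.
            return labels.get(key[len(prefix):])
    return None


def route(record: dict, index_key: str = "", index_name: str = "") -> Tuple[Optional[str], dict]:
    """Return (structured index or None, record to forward)."""
    index = ""
    if "structured" in record:
        # A non-empty key value wins; otherwise fall back to the fixed name.
        index = (lookup(record, index_key) if index_key else "") or index_name
    if not index:
        # No index name: forward unstructured, with `message` only.
        unstructured = {k: v for k, v in record.items() if k != "structured"}
        return None, unstructured
    return index, record
```

With `index_key="kubernetes.labels.logFormat"`, a structured record from a pod labelled `logFormat=apache` is routed to the index `apache`. The same record without the label, and with no `index_name` set, loses its `structured` field.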

Structured indices are created automatically by the managed default store.
In order to forward to an external Elasticsearch instance, indices must be created in advance.

### User Stories

#### I want the default Elasticsearch to index by k8s label on the source pod
#### I want to forward JSON logs to a remote destination that is not Elasticsearch

This example shows a remote `fluentd` but the same applies to any output other than Elasticsearch:

```yaml
outputDefaults:
- type: elasticsearch
indexKey: kubernetes.labels.myIndex
outputs:
- name: myFluentd
type: fluentdForward
url: ...

pipelines:
- inputRefs: [ application ]
outputRefs: default
outputRefs: myFluentd
parse: json
```
For example, this log record will go to the index `flintstones`:
#### I want to forward to default Elasticsearch, using a k8s pod label to determine the index
```json
{
"structured":{
"name":"fred",
"home":"bedrock"
},
"kubernetes":{"labels":{"myIndex": "flintstones", ...}},
...
}
```
For elasticsearch outputs, we must separate logs with different formats into different indices.
Let's assume that:
Logs from pods with no `myIndex` label will go to the `app` index like any unstructured `message` record.

```json
{
"message":"{\"name\":\"fred\",\"home\":\"bedrock\"}",
... other metadata
}
```

#### I want input selector to set the elasticsearch index

Pipelines can set an `openshift` label on records from input selectors.
This is used the same way as the kubernetes label in the previous example:
* Applications log in two structured JSON formats called "apache" and "google".
* User labels pods using those formats with `logFormat=apache` or `logFormat=google`
* Use the following forwarder configuration:

```yaml
outputDefaults:
- type: elasticsearch
indexKey: openshift.labels.myIndex
inputs:
- name: InputItchy
application:
namespaces: [ fred, bob ]
- name: InputScratchy
application:
namespaces: [ jill, jane ]
pipeline:
- inputRefs: [ InputItchy ]
outputRefs: [ default ]
parse: json
labels: { myIndex: itchy }
- inputRefs: [ InputScratchy ]
outputRefs: [ default ]
- elasticsearch:
structuredIndexKey: kubernetes.labels.logFormat
pipelines:
- inputRefs: [ application ]
outputRefs: default
parse: json
labels: {myIndex: scratchy }
```

Would produce records like:
This structured log record will go to the index `apache`:

```json
{
"structured":{
"name":"fred",
"home":"bedrock"
},
"openshift": {"labels":{ "myIndex":"itchy" } },
"kubernetes": {"namespace_name":"fred", ...},
...
}
{
"structured":{
"name":"fred",
"home":"bedrock"
},
"openshift": {"labels":{"myIndex":"scratchy" } },
"kubernetes":{"namespace_name":"jill", ...},
...
}
```

#### I want a replacement for the defunct MERGE_JSON_LOG feature

Setting `parse: json` provides all the information formerly available via [MERGE_JSON_LOG][defunct_merge].

```yaml
outputs:
- name: OutputJSON
type: fluentdForward
pipelines:
- inputRefs: [ application ]
outputRefs: [ OutputJSON ]
parse: json
"structured":{"name":"fred","home":"bedrock"},
"kubernetes":{"labels":{"logFormat": "apache", ...}}
}
```

Would produce records like:
This structured log record will go to the index `google`:

```yaml
```json
{
"structured":{
"name":"fred",
"home":"bedrock"
},
...
"structured":{"name":"wilma","home":"bedrock"},
"kubernetes":{"labels":{"logFormat": "google", ...}}
}
```

**Note**:
**Note**: Only _structured_ logs with a non-empty `logFormat` label go to the index named by that label.
All others go to the default application index as _unstructured_ records, including:

* Records with a missing or empty `logFormat` label.
* Records that could not be parsed as JSON, _even if_ they have a `logFormat` label.

#### I want a replacement for the defunct MERGE_JSON_LOG feature

* *Not a drop-in replacement* - log entry field `home` is available as `structured.home`, not just `home`
* With the Elasticsearch store you should not forward all JSON logs to the same index, see the other use cases.
Setting `parse: json` provides all the information formerly available via [MERGE_JSON_LOG][defunct_merge], but in a slightly different format.
For example, a log entry field `name` is available as `structured.name` in the forwarded records.
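From a consumer's point of view, the difference might look like the following Python sketch. The helper `entry_field` is hypothetical: it reads a log entry field from the `structured` object when present, and otherwise falls back to a second JSON parse of the `message` string.

```python
import json
from typing import Any


def entry_field(record: dict, name: str) -> Any:
    """Read a field of the original log entry from a forwarded record."""
    structured = record.get("structured")
    if isinstance(structured, dict):
        # With JSON parsing enabled, fields are nested under `structured`.
        return structured.get(name)
    # Without `structured`, the entry is only available as a JSON string.
    try:
        parsed = json.loads(record.get("message", ""))
    except (json.JSONDecodeError, TypeError):
        return None
    return parsed.get(name) if isinstance(parsed, dict) else None
```

Consumers that relied on MERGE_JSON_LOG placing entry fields at the top level of the record would need a similar adjustment to look under `structured`.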

### Implementation Details

Expand All @@ -293,19 +216,6 @@ If a single input feeds structured and non-structured pipelines, use duplicate i

The Elasticsearch output needs to take extra measures:

##### Flattening the `kubernetes.labels` map

* Transform key:value object into array of "NAME=VALUE" strings.
* Replace '.' with '_' in label names.
* Use field name `flat_labels` instead of `labels`

This makes label keys difficult to use for a consumer.
There is no automatic way to reverse the process since '.' and '_' are both legal characters in label names.
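A minimal Python sketch of the flattening steps listed above shows why the result cannot be reversed. The function name `flatten_labels` is hypothetical, and the real collector configuration may differ in detail.

```python
def flatten_labels(labels: dict) -> list:
    """Flatten a label map into "NAME=VALUE" strings.

    '.' in label names is replaced with '_', as described above.
    """
    return [f"{name.replace('.', '_')}={value}" for name, value in labels.items()]


# Two distinct label names collapse to the same flattened string,
# so the original name cannot be recovered from the output.
assert flatten_labels({"a.b": "x"}) == flatten_labels({"a_b": "x"})
```

For the label `app.kubernetes.io/name=flintstones`, this yields the `flat_labels` entry `app_kubernetes_io/name=flintstones`.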

The new forwarder will check the version of the Elasticsearch Operator (via labels or annotations, TBD).
Flattening to `flat_labels` will be enabled if the ELO is identified as a 'legacy' version.
Other output types, and future ELO versions based on the [pipeline proposal][es-pipeline] will receive a field named `labels` with the unmodified labels.

##### Elasticsearch Indexing <a name="es-notes"></a>

The current implementation relies on a fixed set of rollover indices and aliases being set up on Elasticsearch.
Expand Down Expand Up @@ -347,9 +257,6 @@ Mitigation:

* Need to sync the [Elasticsearch pipeline proposal][es-pipeline] with this document, update both if needed.
* Exact requirements for flattening and de-dotting with existing ES. See [this PR comment](https://github.com/openshift/enhancements/pull/518#issuecomment-749564743)
* Sidecar injection: multi-container pods, especially those with injected sidecar images,
may have different log formats for each container.
* Future enhancement proposals may add more capabilities to edit/filter a structured record for output.

#### Examples
