AbstractEcsLoggingTest should log keys as nested objects, not with dotted key names #51

Closed
andreaturli opened this issue Nov 18, 2019 · 15 comments


@andreaturli

andreaturli commented Nov 18, 2019

I've noticed that AbstractEcsLoggingTest checks for keys like process.thread.name, while I think the key should be a nested structure like `"process": { "thread": { "name": "main" } }` instead of something like

 {"@timestamp":"2019-11-18T13:42:33.333Z","log.level":"DEBUG","message":"test","service.name":"test","process.thread.name":"main","log.logger":"co.elastic.logging.logback.EcsEncoderTest","log.origin":{"file.name":"AbstractEcsEncoderTest.java","function":"debug","file.line":47}}
@webmat

webmat commented Nov 18, 2019

Agreed, the event above should look like this on disk (prettified here for readability):

{ "@timestamp":"2019-11-18T13:42:33.333Z",
  "message": "test",
  "log": {
    "level": "DEBUG",
    "logger": "co.elastic.logging.logback.EcsEncoderTest",
    "origin": {
      "file": {
        "name": "AbstractEcsEncoderTest.java",
        "line": 47
      },
      "function": "debug"
    }
  },
  "service": { "name": "test" }
}

There's no automatic transformation of dotted keys to nesting in Elasticsearch / ingest pipelines. The raw events have to be nested from the start.

The only way to change dots to nesting is Logstash's de_dot filter, which is O(n).

@felixbarny
Member

Elasticsearch will automatically create nested field mappings if a key contains a dot. It's just that the _source will not contain properly nested JSON. But when you do searches/aggregations, ES will treat them as nested fields.
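
For example, both of these documents end up with the same mapping for log.level, and a query on log.level matches either one:

{ "log.level": "DEBUG" }
{ "log": { "level": "DEBUG" } }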

@webmat

webmat commented Nov 18, 2019

The mappings may consider both as equivalent, but the ES mappings aren't the only thing at play here.

Dotted key names mean consumers of the raw documents (e.g. pipelines & apps accessing documents via the API) will not reliably be able to access the nested fields of these documents.

The structure must be nested objects only, no dots in key names. Otherwise these are not valid ECS documents.

@felixbarny
Member

Good points. The problem is that the MDC does not support nested fields. We use that, for example, for the APM/log correlation with the field trace.id. Does that mean we have to transform an MDC value of trace.id=72ace1d into a nested object like "trace": { "id": "72ace1d" }? Also, when there are labels like foo.bar=baz and foo.baz=bar, do those have to be transformed to "foo": {"bar": "baz", "baz": "bar"}? That would significantly increase the overhead.
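
For illustration, a minimal sketch of that transformation on the application side (a hypothetical helper, not part of ecs-logging-java), which also merges labels that share a prefix:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: expand a flat MDC-style map with dotted keys into
// nested maps before serialization. Conflicting keys (e.g. "foo" and
// "foo.bar" both present) are not handled here.
public final class MdcDotExpander {

    @SuppressWarnings("unchecked")
    public static Map<String, Object> expand(Map<String, String> flat) {
        Map<String, Object> root = new HashMap<>();
        for (Map.Entry<String, String> entry : flat.entrySet()) {
            String[] parts = entry.getKey().split("\\.");
            Map<String, Object> current = root;
            for (int i = 0; i < parts.length - 1; i++) {
                // one lookup and a possible map allocation per dot, on every event
                current = (Map<String, Object>) current.computeIfAbsent(
                        parts[i], k -> new HashMap<String, Object>());
            }
            current.put(parts[parts.length - 1], entry.getValue());
        }
        return root;
    }
}

Calling expand(Map.of("trace.id", "72ace1d", "foo.bar", "baz", "foo.baz", "bar")) yields {trace={id=72ace1d}, foo={bar=baz, baz=bar}}; the extra key splitting and per-event map allocations are exactly the overhead in question.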

While using a nested structure for process.thread.name would be possible, "process": { "thread": { "name": "main" } } is significantly less readable.

Is there any Filebeat processor that could normalize that?

Maybe it's another reason why we should go with a custom log format that supports dots to denote nesting and multi-line strings, and does not require keys for timestamp, log.level, and message.

@webmat

webmat commented Nov 20, 2019

In the call yesterday, other good points were made about having dotted key names in logs.

  • If we want the beginning of the JSON log lines to be easy to read, log.level (with dots) would be important. Otherwise we have to bring in the whole log.* sub-object in that place, which means we can't start with a few predictable keys, e.g. timestamp, log.level, message, followed by the rest.
  • Some logging libraries don't give the flexibility to perform a change such as transforming dotted key names (e.g. from the MDC) to nesting. So it's not even a matter of "should we do this", but rather "we can't do this".

So it's starting to look like there isn't actually a choice: we have to accept the dotted key names in the raw JSON logs. This will be confusing to people, as the documents really must be converted to nesting ASAP, to avoid pipelines having to hunt around for the key names that are present. Trying nested and falling back to a dotted key name doesn't work, because when there's more than one level of nesting, it turns into a combinatorial explosion of possibilities (see the lookup sketched after this list):

log.origin.file.name
log.origin.file { name }
log.origin { file.name }
log.origin { file { name } }
log { origin.file.name }
log { origin.file { name } }
log { origin { file.name } }
log { origin { file { name } } }
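
To make this concrete: a hypothetical resolver that tolerates both forms has to try, at every dot, both "the remainder is one dotted key" and "descend one nested level", which is 2^(number of dots) probes in the worst case.

import java.util.Map;

// Hypothetical lookup over a document that may mix dotted keys and nesting.
// Worst case it explores every way of splitting the path into segments.
final class MixedKeyLookup {

    @SuppressWarnings("unchecked")
    static Object find(Map<String, Object> doc, String path) {
        if (doc.containsKey(path)) {
            return doc.get(path); // the whole remainder is one dotted key
        }
        int dot = path.indexOf('.');
        while (dot != -1) {
            // try every prefix ("log", "log.origin", ...) as a nested object
            Object child = doc.get(path.substring(0, dot));
            if (child instanceof Map) {
                Object found = find((Map<String, Object>) child, path.substring(dot + 1));
                if (found != null) {
                    return found;
                }
            }
            dot = path.indexOf('.', dot + 1);
        }
        return null;
    }
}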

On the other hand, if the documentation for our logging libraries mentions these dotted keys and explains how they're transformed into nesting ASAP to become ECS compliant, that can be fine.

Another way to think about this: if we step out of JSON and pretend we had gone with KV instead, the keys would probably have been dotted in the raw logs, and we would have converted them to nesting anyway. So it's fine to do this in JSON too, I guess. We just need to be clear about this conversion in the logging library docs.

There isn't currently a processor that performs this conversion in Filebeat or libbeat, but @urso is open to having this created. In the meantime, we can also explore doing this in JavaScript via the new script processor in Beats.

@pgomulka
Contributor

pgomulka commented Nov 22, 2019

I think the examples from #51 (comment) should be supported by Elasticsearch's dot_expander ingest processor: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/dot-expand-processor.html

In my view, we should only nest when there is a need to make a log line more compact and easier to read. For instance, I don't think there would be a need to nest just one field, i.e. process.thread.name, as per @felixbarny's comment, but fully nesting even more fields would often not be beneficial either.

For example, process.thread.name and process.thread.id would probably be easier to digest if logged as "process.thread": { "id": 1, "name": "thread-0" }.

I also support the comment made by @webmat that the beginning of a log line should be easy to read; otherwise app admins would have trouble investigating their logs visually.

Maybe this should be configurable?

@felixbarny
Member

The problem with the dot expander processor is that you have to explicitly state which field it should expand. Maybe we can extend this processor so that it discovers all dots and converts them to nested objects? I don't think that would be too computationally expensive, especially because we would apply it to documents that have minimal nesting to begin with.

@webmat

webmat commented Nov 27, 2019

Agreed, the dot expander should (perhaps optionally) support finding and converting all dotted key names without having to call them out explicitly. Otherwise the problem isn't resolved for anything under labels, whose keys we can't know in advance.

But inspecting the whole object is O(N), where N is the total number of keys (not just the number of dotted keys).

@Fr33Radical

Fr33Radical commented Dec 4, 2019

I was about to create this topic too.
Someone reported to me that they receive logs from the OS that, I think, contain dots and should not be nested.

I will ask for an example.
I don't think it is a good idea to have hybrid nested/flat JSON logs. At this time, they are creating a parser that does something like this for each field at the entry of the pipeline:

json {
  source => "[event][original]"
}

mutate {
  rename => {
    "message" => "[message]"
    "process.thread.name" => "[process][thread][name]"
  }
}

> Also, when there are labels like foo.bar=baz and foo.baz=bar, do those have to be transformed to "foo": {"bar": "baz", "baz": "bar"}? That would significantly increase the overhead.

For my part, I'd prefer to have the choice between performance and nested JSON.

My issue is about the format when logs are sent from the application to a remote server.

@felixbarny
Member

Ideally, I'd like the performance penalty to be in the processing pipeline and not in the application.

But there's another thing we didn't mention yet: jq-friendliness. When dealing with JSON logs, it can be quite powerful to search for specific log events via jq. So having them properly nested instead of dotted would be a big plus.
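
For example (a minimal illustration), a filter over nested keys reads naturally as jq 'select(.log.level == "ERROR")', whereas a dotted key has to be quoted, as in jq 'select(.["log.level"] == "ERROR")', and can't be navigated as a path.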

I'm wondering if we can just start the log messages with timestamp and message and put the "log": { "level": "...", "logger": "..." } object after them. We'd basically lose the visual alignment of the log level, which would be somewhat sad.

I'm really torn on this matter, but maybe we should not focus too much on readable JSON logs and instead document how to effectively handle JSON logs with jq and tail. And with our stack, of course. But the requirement to be human-readable is more for the potential case where the centralized logging is down for whatever reason and you have to fall back to ssh/less/grep.

But also in development, where you look at logs in your IDE, human-readable logs are just more pleasant.

Do we want to tell users they need to configure dedicated plain-text logging for that, or are we trying to cover both cases with human- and machine-readable logs?

@Fr33Radical

I'd just like to add: could the project/this feature stay backward compatible?

@webmat

webmat commented Jan 3, 2020

@Fr33Radical As long as the project is in major version 0 (0.1.3 as of this comment), backwards compatibility is best effort only. It's too early to guarantee until we hit 1.x.

@felixbarny
Member

Closing this as it's not feasible to guarantee all fields are nested, especially when allowing user-defined custom attributes.

This means that the JSON produced by ecs-logging libraries is, strictly speaking, not ECS compatible. Therefore, there has to be another processing step that takes the log file and converts dots to nested fields. I've created elastic/beats#17021 for this. Please add your 👍 there.

@Fr33Radical

If we can live with having nested JSON with no exceptions, is there a way to achieve this without Filebeat? (While waiting for the fix.)

@felixbarny
Member

Update:
Converting dotted fields to nested objects is now possible both in Beats, via the expand_keys option of the decode_json_fields processor, and, as of 7.15.0, via the dot_expander Elasticsearch ingest node processor.
This issue contains a POC for an ingest node pipeline that auto-detects and properly parses ECS JSON: elastic/integrations#1454
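
For reference, a minimal ingest pipeline sketch, assuming 7.15+, where dot_expander accepts "field": "*" to expand every dotted field in a document:

{
  "description": "Expand all dotted keys into nested objects",
  "processors": [
    { "dot_expander": { "field": "*" } }
  ]
}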
