Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON codec doesn't handle special chars in field names #13606

Closed
garethhumphriesgkc opened this issue Jan 12, 2022 · 7 comments · Fixed by #14044
Closed

JSON codec doesn't handle special chars in field names #13606

garethhumphriesgkc opened this issue Jan 12, 2022 · 7 comments · Fixed by #14044
Assignees
Labels

Comments

@garethhumphriesgkc
Copy link

garethhumphriesgkc commented Jan 12, 2022

Logstash information:

$ docker run --rm docker.elastic.co/logstash/logstash:7.16.2 --version
Using bundled JDK: /usr/share/logstash/jdk
logstash 7.16.2

OS version (uname -a if on a Unix-like system):

$ docker --version
Docker version 20.10.7, build 20.10.7-0ubuntu5.1
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=21.10
DISTRIB_CODENAME=impish
DISTRIB_DESCRIPTION="Ubuntu 21.10"
$ uname -a
Linux workstation 5.13.0-22-generic #22-Ubuntu SMP Fri Nov 5 13:21:36 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behaviour:
When sending json data though the json filter or input codec, if one of the field names contains a specials char (e.g. '['), the parser fails.
Note that '[' is a legal char in a field name according to the json spec, and can be successfully parsed by other parsers (e.g. jq and ruby)

Steps to reproduce:

  1. Start a very simple logstash instance: docker run --rm -p 6996:6996 docker.elastic.co/logstash/logstash:7.16.2 -e 'input {tcp { port => 6996 } }filter { json { source => "message" } }output { stdout {} }'
  2. Send in some sample data - when there are special chars present _jsonparsefailure occurs: echo '{"str[substr]": true}' | nc -w1 127.0.0.1 6996
[2022-01-12T04:45:11,758][WARN ][logstash.filters.json    ][main][a700f2c38ad0ec2c2223fd9c1342d70b4cac73422170d6d5c123552c5bca662b] Exception caught in json filter {:source=>"message", :raw=>"{\"str[substr]\": true}", :exception=>#<RuntimeError: Invalid FieldReference: `str[substr]`>}
{
    "@timestamp" => 2022-01-12T04:45:11.615Z,
       "message" => "{\"str[substr]\": true}",
          "tags" => [
        [0] "_jsonparsefailure"
    ],
          "host" => "gateway",
          "port" => 58704,
      "@version" => "1"
}

Notes:

  • skip_on_invalid_json => true doesn't change this behaviour
  • Identical behaviour when config is: 'input {tcp { port => 6996 codec => json } }output { stdout {} }'
  • When input is '{"str\[substr": true}', we don't get the exception but message isn't parsed. This is true for any normal char with a preceding \ (e.g.: '{"str\a": true}'). While this is invalid according to the spec, most other parsers treat an escaped non-special char as just the char itself.
  • Thus far only [ and ] seem to provoke the exception.
  • When special chars are absent, message is parsed fine
  • When json filter and input codec are both absent, message is parsed fine
@andsel
Copy link
Contributor

andsel commented Apr 11, 2022

Thank you @garethhumphriesgkc for reporting,
this is a long standing issue due to the way Logstash event references the fields.

More info at #11608

There is a workaround suggested in comment that simply translates square brackets to something else

@garethhumphriesgkc
Copy link
Author

Hi Andsel,

The workarounds there seem to be either downgrade to logstash 6 or put in a filter or script with significant performance impact - are these what you're suggesting?

The obvious fix seems to be to escape the [ on ingest, treat it as escaped throughout the pipeline, and unescape it on output once processing is complete, is there a reason that wouldn't work? Is something more permanent in the works? If so, is there any indication of timeline?

@andsel
Copy link
Contributor

andsel commented Apr 12, 2022

The workarounds there seem to be either downgrade to logstash 6 or put in a filter or script with significant performance impact - are these what you're suggesting?

Sorry for the confusion, I suggest to use scripts. About performance impact we don't have any measure about it. The Ruby script uses gsub with regex and we should expect some impact, but depends on how long and complicated are the strings to proces.

The obvious fix seems to be to escape the [ on ingest, treat it as escaped throughout the pipeline, and unescape it on output once processing is complete, is there a reason that wouldn't work?

The problem sits in the way Logstash pipeline parser works, so it's something deep in the code base.
How do you intend to escape in input and unescape in output? Modifying the code in the plugins?

Is something more permanent in the works? If so, is there any indication of timeline?

we have some ideas on it and working on it, but no timeline

@garethhumphriesgkc
Copy link
Author

The follow-up comment on that thread mentions 4k eps down to 150, that's quite dramatic. It also permanently changes the data, so it can't ever be the same before and after logstash.

There are plenty of other json parsers that use square brackets in their language, so I'd look to what they do - as you suggest though "field[string]" seems like the most obvious choice.

It's only when parsing events the issue happens, so logstash can clearly handle the embedded chars under normally operation, it's just a matter of having the json filter/codec handle them and being able to represent them in the DSL.

@garethhumphriesgkc
Copy link
Author

I stand corrected - other filters/codecs don't handle it. I got thrown by the CSV filter persisting detected headers across TCP connections - when you start a brand new CSV input and inlcude [] in the header row, I see "tags" => [[0] "_csvparsefailure"]

yaauie added a commit to yaauie/logstash that referenced this issue Apr 19, 2022
Keys passed to most methods of `ConvertedMap`, based on `IdentityHashMap`
depend on identity and not equivalence, and therefore rely on the keys being
_interned_ strings. In order to avoid hitting the JVM's global String intern
pool (which can have performance problems), operations to normalize a string
to its interned counterpart have traditionally relied on the behaviour of
`FieldReference#from` returning a likely-cached `FieldReference`, that had
an interned `key` and an empty `path`.

This is problematic on two points.

First, when `ConvertedMap` was given data with keys that _were_ valid string
field references representing a nested field (such as `[host][geo][location]`),
the implementation of `ConvertedMap#put` effectively silently discarded the
path components because it assumed them to be empty, and only the key was
kept (`location`).

Second, when `ConvertedMap` was given a map whose keys contained what the
field reference parser considered special characters but _were NOT_
valid field references, the resulting `FieldReference.IllegalSyntaxException`
caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose
`key` and `path`-components are known to have been interned, we introduce an
internment helper on our `ConvertedMap` that is also backed by the global string
intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has
three net effects:

 - Our ConvertedMap operations still use strings
   from the global intern pool
 - We have a new, smaller cache of individual field
   names, improving lookup performance
 - Our FieldReference cache no longer is flooded
   with fragments and therefore is more likely to
   remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require
      a careful audit of the possible code-paths to `ConvertedMap#putInterned`.
      The new cache is limited to 10k strings, and when more are used only
      the FIRST 10k strings will be primed into the cache, leaving the
      remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we alow events to be created whose fields _CANNOT_
      be referenced with the existing FieldReference implementation.

Resolves: elastic#13606
Resolves: elastic#11608
yaauie added a commit to yaauie/logstash that referenced this issue Apr 20, 2022
Keys passed to most methods of `ConvertedMap`, based on `IdentityHashMap`
depend on identity and not equivalence, and therefore rely on the keys being
_interned_ strings. In order to avoid hitting the JVM's global String intern
pool (which can have performance problems), operations to normalize a string
to its interned counterpart have traditionally relied on the behaviour of
`FieldReference#from` returning a likely-cached `FieldReference`, that had
an interned `key` and an empty `path`.

This is problematic on two points.

First, when `ConvertedMap` was given data with keys that _were_ valid string
field references representing a nested field (such as `[host][geo][location]`),
the implementation of `ConvertedMap#put` effectively silently discarded the
path components because it assumed them to be empty, and only the key was
kept (`location`).

Second, when `ConvertedMap` was given a map whose keys contained what the
field reference parser considered special characters but _were NOT_
valid field references, the resulting `FieldReference.IllegalSyntaxException`
caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose
`key` and `path`-components are known to have been interned, we introduce an
internment helper on our `ConvertedMap` that is also backed by the global string
intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has
three net effects:

 - Our ConvertedMap operations still use strings
   from the global intern pool
 - We have a new, smaller cache of individual field
   names, improving lookup performance
 - Our FieldReference cache no longer is flooded
   with fragments and therefore is more likely to
   remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require
      a careful audit of the possible code-paths to `ConvertedMap#putInterned`.
      The new cache is limited to 10k strings, and when more are used only
      the FIRST 10k strings will be primed into the cache, leaving the
      remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we alow events to be created whose fields _CANNOT_
      be referenced with the existing FieldReference implementation.

Resolves: elastic#13606
Resolves: elastic#11608
yaauie added a commit to yaauie/logstash that referenced this issue Apr 25, 2022
Keys passed to most methods of `ConvertedMap`, based on `IdentityHashMap`
depend on identity and not equivalence, and therefore rely on the keys being
_interned_ strings. In order to avoid hitting the JVM's global String intern
pool (which can have performance problems), operations to normalize a string
to its interned counterpart have traditionally relied on the behaviour of
`FieldReference#from` returning a likely-cached `FieldReference`, that had
an interned `key` and an empty `path`.

This is problematic on two points.

First, when `ConvertedMap` was given data with keys that _were_ valid string
field references representing a nested field (such as `[host][geo][location]`),
the implementation of `ConvertedMap#put` effectively silently discarded the
path components because it assumed them to be empty, and only the key was
kept (`location`).

Second, when `ConvertedMap` was given a map whose keys contained what the
field reference parser considered special characters but _were NOT_
valid field references, the resulting `FieldReference.IllegalSyntaxException`
caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose
`key` and `path`-components are known to have been interned, we introduce an
internment helper on our `ConvertedMap` that is also backed by the global string
intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has
three net effects:

 - Our ConvertedMap operations still use strings
   from the global intern pool
 - We have a new, smaller cache of individual field
   names, improving lookup performance
 - Our FieldReference cache no longer is flooded
   with fragments and therefore is more likely to
   remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require
      a careful audit of the possible code-paths to `ConvertedMap#putInterned`.
      The new cache is limited to 10k strings, and when more are used only
      the FIRST 10k strings will be primed into the cache, leaving the
      remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we alow events to be created whose fields _CANNOT_
      be referenced with the existing FieldReference implementation.

Resolves: elastic#13606
Resolves: elastic#11608
yaauie added a commit to yaauie/logstash that referenced this issue Apr 25, 2022
Keys passed to most methods of `ConvertedMap`, based on `IdentityHashMap`
depend on identity and not equivalence, and therefore rely on the keys being
_interned_ strings. In order to avoid hitting the JVM's global String intern
pool (which can have performance problems), operations to normalize a string
to its interned counterpart have traditionally relied on the behaviour of
`FieldReference#from` returning a likely-cached `FieldReference`, that had
an interned `key` and an empty `path`.

This is problematic on two points.

First, when `ConvertedMap` was given data with keys that _were_ valid string
field references representing a nested field (such as `[host][geo][location]`),
the implementation of `ConvertedMap#put` effectively silently discarded the
path components because it assumed them to be empty, and only the key was
kept (`location`).

Second, when `ConvertedMap` was given a map whose keys contained what the
field reference parser considered special characters but _were NOT_
valid field references, the resulting `FieldReference.IllegalSyntaxException`
caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose
`key` and `path`-components are known to have been interned, we introduce an
internment helper on our `ConvertedMap` that is also backed by the global string
intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has
three net effects:

 - Our ConvertedMap operations still use strings
   from the global intern pool
 - We have a new, smaller cache of individual field
   names, improving lookup performance
 - Our FieldReference cache no longer is flooded
   with fragments and therefore is more likely to
   remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require
      a careful audit of the possible code-paths to `ConvertedMap#putInterned`.
      The new cache is limited to 10k strings, and when more are used only
      the FIRST 10k strings will be primed into the cache, leaving the
      remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we alow events to be created whose fields _CANNOT_
      be referenced with the existing FieldReference implementation.

Resolves: elastic#13606
Resolves: elastic#11608
yaauie added a commit that referenced this issue May 24, 2022
* add failing tests for Event.new with field that look like field references

* fix: correctly handle FieldReference-special characters in field names.

Keys passed to most methods of `ConvertedMap`, based on `IdentityHashMap`
depend on identity and not equivalence, and therefore rely on the keys being
_interned_ strings. In order to avoid hitting the JVM's global String intern
pool (which can have performance problems), operations to normalize a string
to its interned counterpart have traditionally relied on the behaviour of
`FieldReference#from` returning a likely-cached `FieldReference`, that had
an interned `key` and an empty `path`.

This is problematic on two points.

First, when `ConvertedMap` was given data with keys that _were_ valid string
field references representing a nested field (such as `[host][geo][location]`),
the implementation of `ConvertedMap#put` effectively silently discarded the
path components because it assumed them to be empty, and only the key was
kept (`location`).

Second, when `ConvertedMap` was given a map whose keys contained what the
field reference parser considered special characters but _were NOT_
valid field references, the resulting `FieldReference.IllegalSyntaxException`
caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose
`key` and `path`-components are known to have been interned, we introduce an
internment helper on our `ConvertedMap` that is also backed by the global string
intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has
three net effects:

 - Our ConvertedMap operations still use strings
   from the global intern pool
 - We have a new, smaller cache of individual field
   names, improving lookup performance
 - Our FieldReference cache no longer is flooded
   with fragments and therefore is more likely to
   remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require
      a careful audit of the possible code-paths to `ConvertedMap#putInterned`.
      The new cache is limited to 10k strings, and when more are used only
      the FIRST 10k strings will be primed into the cache, leaving the
      remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we alow events to be created whose fields _CANNOT_
      be referenced with the existing FieldReference implementation.

Resolves: #13606
Resolves: #11608

* field_reference: support escape sequences

Adds a `config.field_reference.escape_style` option and a companion
command-line flag `--field-reference-escape-style` allowing a user
to opt into one of two proposed escape-sequence implementations for field
reference parsing:

 - `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes
 - `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points

The default is `NONE`, which does _not_ proccess escape sequences.
With this setting a user effectively cannot reference a field whose name
contains FieldReference-reserved characters.

| ESCAPE STYLE | `[`     | `]`     |
| ------------ | ------- | ------- |
| `NONE`       | _N/A_   | _N/A_   |
| `PERCENT`    | `%5B`   | `%5D`   |
| `AMPERSAND`  | `&#91;` | `&#93;` |

* fixup: no need to double-escape HTML-ish escape sequences in docs

* Apply suggestions from code review

Co-authored-by: Karol Bucek <kares@users.noreply.github.com>

* field-reference: load escape style in runner

* docs: sentences over semiciolons

* field-reference: faster shortcut for PERCENT escape mode

* field-reference: escape mode control downcase

* field_reference: more s/experimental/technical preview/

* field_reference: still more s/experimental/technical preview/

Co-authored-by: Karol Bucek <kares@users.noreply.github.com>
@garethhumphriesgkc
Copy link
Author

I just tested this, and the bug does not appear to be fixed. Running 8.5.3, the above repro still results in a _jsonparsefailure:

$ docker run --rm -p 6996:6996 docker.elastic.co/logstash/logstash:8.5.3 -e 'input {tcp { port => 6996 } }filter { json { source => "message" } }output { stdout {} }' --field-reference-escape-style ampersand
.
.
.
[2022-12-19T20:17:12,486][WARN ][logstash.filters.json    ][main][a700f2c38ad0ec2c2223fd9c1342d70b4cac73422170d6d5c123552c5bca662b] Exception caught in json filter {:source=>"message", :raw=>"{\"str[substr]\": true}", :exception=>#<RuntimeError: Invalid FieldReference: `str[substr]`>}
{
      "@version" => "1",
       "message" => "{\"str[substr]\": true}",
    "@timestamp" => 2022-12-19T20:17:12.344138503Z,
          "tags" => [
        [0] "_jsonparsefailure"
    ],
         "event" => {
        "original" => "{\"str[substr]\": true}"
    }
}

@andsel
Copy link
Contributor

andsel commented Dec 21, 2022

I tried with a couple of pipelines and the problem seems to be in using the JSON filter for some reason.

Feeding the data

{"str[substr]": true}

to

logstash-8.3.3/bin/logstash -e 'input {tcp { port => 1234 } }filter { json { source => "message" } }output { stdout {} }' --field-reference-escape-style=percent

fails with

[2022-12-21T15:21:02,919][WARN ][logstash.filters.json    ][main][dacc140c452c0d09a3d6f5a419982229df4afbd6a1ec8113d5c8f800b4df40b9] Exception caught in json filter {:source=>"message", :raw=>"{\"str[substr]\": true}", :exception=>#<RuntimeError: Invalid FieldReference: `str[substr]`>}
{
    "@timestamp" => 2022-12-21T14:21:02.786132Z,
       "message" => "{\"str[substr]\": true}",
      "@version" => "1",
         "event" => {
        "original" => "{\"str[substr]\": true}"
    },
          "tags" => [
        [0] "_jsonparsefailure"
    ]
}

but using the json codec in the input

logstash-8.3.3/bin/logstash -e 'input {tcp { port => 1234 codec => json } }output { stdout {} }' --field-reference-escape-style=percent

it successed with

[2022-12-21T15:21:44,272][INFO ][logstash.codecs.jsonlines][main][609c80422695297123b69074205871c4ed126bca9b52cc36541056e9aaa0a50d] ECS compatibility is enabled but `target` option was not specified. This may cause fields to be set at the top-level of the event where they are likely to clash with the Elastic Common Schema. It is recommended to set the `target` option to avoid potential schema conflicts (if your data is ECS compliant or non-conflicting, feel free to ignore this message)
{
       "@version" => "1",
     "@timestamp" => 2022-12-21T14:21:48.929152Z,
    "str[substr]" => true
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants