Skip to content

Commit

Permalink
Field Reference: handle special characters (#14044)
Browse files Browse the repository at this point in the history
* add failing tests for Event.new with field that look like field references

* fix: correctly handle FieldReference-special characters in field names.

Keys passed to most methods of `ConvertedMap`, based on `IdentityHashMap`
depend on identity and not equivalence, and therefore rely on the keys being
_interned_ strings. In order to avoid hitting the JVM's global String intern
pool (which can have performance problems), operations to normalize a string
to its interned counterpart have traditionally relied on the behaviour of
`FieldReference#from` returning a likely-cached `FieldReference`, that had
an interned `key` and an empty `path`.

This is problematic on two points.

First, when `ConvertedMap` was given data with keys that _were_ valid string
field references representing a nested field (such as `[host][geo][location]`),
the implementation of `ConvertedMap#put` effectively silently discarded the
path components because it assumed them to be empty, and only the key was
kept (`location`).

Second, when `ConvertedMap` was given a map whose keys contained what the
field reference parser considered special characters but _were NOT_
valid field references, the resulting `FieldReference.IllegalSyntaxException`
caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose
`key` and `path`-components are known to have been interned, we introduce an
internment helper on our `ConvertedMap` that is also backed by the global string
intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has
three net effects:

 - Our ConvertedMap operations still use strings
   from the global intern pool
 - We have a new, smaller cache of individual field
   names, improving lookup performance
 - Our FieldReference cache no longer is flooded
   with fragments and therefore is more likely to
   remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require
      a careful audit of the possible code-paths to `ConvertedMap#putInterned`.
      The new cache is limited to 10k strings, and when more are used only
      the FIRST 10k strings will be primed into the cache, leaving the
      remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we alow events to be created whose fields _CANNOT_
      be referenced with the existing FieldReference implementation.

Resolves: #13606
Resolves: #11608

* field_reference: support escape sequences

Adds a `config.field_reference.escape_style` option and a companion
command-line flag `--field-reference-escape-style` allowing a user
to opt into one of two proposed escape-sequence implementations for field
reference parsing:

 - `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes
 - `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points

The default is `NONE`, which does _not_ proccess escape sequences.
With this setting a user effectively cannot reference a field whose name
contains FieldReference-reserved characters.

| ESCAPE STYLE | `[`     | `]`     |
| ------------ | ------- | ------- |
| `NONE`       | _N/A_   | _N/A_   |
| `PERCENT`    | `%5B`   | `%5D`   |
| `AMPERSAND`  | `[` | `]` |

* fixup: no need to double-escape HTML-ish escape sequences in docs

* Apply suggestions from code review

Co-authored-by: Karol Bucek <kares@users.noreply.github.com>

* field-reference: load escape style in runner

* docs: sentences over semiciolons

* field-reference: faster shortcut for PERCENT escape mode

* field-reference: escape mode control downcase

* field_reference: more s/experimental/technical preview/

* field_reference: still more s/experimental/technical preview/

Co-authored-by: Karol Bucek <kares@users.noreply.github.com>
  • Loading branch information
yaauie and kares authored May 24, 2022
1 parent 9e2c87a commit d845411
Show file tree
Hide file tree
Showing 12 changed files with 504 additions and 162 deletions.
12 changes: 12 additions & 0 deletions docs/static/field-reference.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ It is often useful to be able to refer to a field or collection of fields by nam
you can use the Logstash field reference syntax.

The syntax to access a field specifies the entire path to the field, with each fragment wrapped in square brackets.
When a field name contains square brackets, they must be properly <<formal-grammar-escape-sequences, _escaped_>>.

_Field References_ can be expressed literally within <<conditionals,_Conditional_>> statements in your pipeline configurations,
as string arguments to your pipeline plugins, or within sprintf statements that will be used by your pipeline plugins:
Expand Down Expand Up @@ -133,3 +134,14 @@ embeddedFieldReference
;

An _Embedded Field Reference_ is a _Field Reference_ that is itself wrapped in square brackets (`[` and `]`), and can be a component of a _Composite Field Reference_.

[float]
[[formal-grammar-escape-sequences]]
=== Escape Sequences

For {ls} to reference a field whose name contains a character that has special meaning in the field reference grammar, the character must be escaped.
Logstash can be globally configured to use one of two field reference escape modes:

- `none` (default): no escape sequence processing is done. Fields containing literal square brackets cannot be referenced by the Event API.
- `percent`: URI-style percent encoding of UTF-8 bytes. The left square bracket (`[`) is expressed as `%5B`, and the right square bracket (`]`) is expressed as `%5D`.
- `ampersand`: HTML-style ampersand encoding (`&#` + decimal unicode codepoint + `;`). The left square bracket (`[`) is expressed as `&#91;`, and the right square bracket (`]`) is expressed as `&#93;`.
17 changes: 15 additions & 2 deletions docs/static/settings-file.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -150,7 +150,7 @@ Values other than `disabled` are currently considered BETA, and may produce unin
| `config.string`
| A string that contains the pipeline configuration to use for the main pipeline. Use the same syntax as
the config file.
| None
| _N/A_

| `config.test_and_exit`
| When set to `true`, checks that the configuration is valid and then exits. Note that grok patterns are not checked for
Expand Down Expand Up @@ -178,9 +178,22 @@ Values other than `disabled` are currently considered BETA, and may produce unin
| When set to `true`, quoted strings will process the following escape sequences: `\n` becomes a literal newline (ASCII 10). `\r` becomes a literal carriage return (ASCII 13). `\t` becomes a literal tab (ASCII 9). `\\` becomes a literal backslash `\`. `\"` becomes a literal double quotation mark. `\'` becomes a literal quotation mark.
| `false`

| `config.field_reference.escape_style`
a| Provides a way to reference fields that contain <<formal-grammar-escape-sequences,field reference special characters>> `[` and `]`.

NOTE: This feature is in *technical preview* and may change in the future.

Current options are:

* `percent`: URI-style `%`{plus}`HH` hexadecimal encoding of UTF-8 bytes (`[` -> `%5B`; `]` -> `%5D`)
* `ampersand`: HTML-style `&#`{plus}`DD`{plus}`;` encoding of decimal Unicode code-points (`[` -> `&#91;`; `]` -> `&#93;`)
* `none`: field names containing special characters _cannot_ be referenced.

| `none`

| `modules`
| When configured, `modules` must be in the nested YAML structure described above this table.
| None
| _N/A_

| `queue.type`
| The internal queuing model to use for event buffering. Specify `memory` for legacy in-memory based queuing, or `persisted` for disk-based ACKed queueing (<<persistent-queues,persistent queues>>).
Expand Down
1 change: 1 addition & 0 deletions logstash-core/lib/logstash/environment.rb
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ module Environment
Setting::Boolean.new("config.reload.automatic", false),
Setting::TimeValue.new("config.reload.interval", "3s"), # in seconds
Setting::Boolean.new("config.support_escapes", false),
Setting::String.new("config.field_reference.escape_style", "none", true, %w(none percent ampersand)),
Setting::Boolean.new("metric.collect", true),
Setting::String.new("pipeline.id", "main"),
Setting::Boolean.new("pipeline.system", false),
Expand Down
13 changes: 13 additions & 0 deletions logstash-core/lib/logstash/runner.rb
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,11 @@ class LogStash::Runner < Clamp::StrictCommand
:default => LogStash::SETTINGS.get_default("config.string"),
:attribute_name => "config.string"

option ["--field-reference-escape-style"], "STYLE",
I18n.t("logstash.runner.flag.field-reference-escape-style"),
:default => LogStash::SETTINGS.get_default("config.field_reference.escape_style"),
:attribute_name => "config.field_reference.escape_style"

# Module settings
option ["--modules"], "MODULES",
I18n.t("logstash.runner.flag.modules"),
Expand Down Expand Up @@ -324,6 +329,14 @@ def execute
validate_settings! or return 1
@dispatcher.fire(:before_bootstrap_checks)

field_reference_escape_style_setting = settings.get_setting('config.field_reference.escape_style')
if field_reference_escape_style_setting.set?
logger.warn(I18n.t("logstash.settings.technical_preview.set", canonical_name: field_reference_escape_style_setting.name))
end
field_reference_escape_style = field_reference_escape_style_setting.value
logger.debug("Setting global FieldReference escape style: #{field_reference_escape_style}")
org.logstash.FieldReference::set_escape_style(field_reference_escape_style)

return start_shell(setting("interactive"), binding) if setting("interactive")

module_parser = LogStash::Modules::CLIParser.new(setting("modules_list"), setting("modules_variable_list"))
Expand Down
29 changes: 29 additions & 0 deletions logstash-core/locales/en.yml
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,31 @@ en:
"%{default_output}"
If you wish to use both defaults, please use
the empty string for the '-e' flag.
field-reference-escape-style: |+
Use the given STYLE when parsing field
references. This allows you to reference fields
whose name includes characters that are
meaningful in a field reference including square
brackets (`[` and `]`).
This feature is in TECHNICAL PREVIEW, and
implementations are subject to change.
Available escape styles are:
- `none`: escape sequences in field references
are not processed, which means fields that
contain special characters cannot be
referenced.
- `percent`: characters may be encoded with
URI-style percent notation represeting UTF-8
bytes (`[` is `%5B`; `]` is `%5D`).
Unlike URI-encoding, literal percent characters
do not need to be escaped unless followed by a
sequence of 2 capital hexadecimal characters.
- `ampersand`: characters may be encoded with
HTML-style ampersand-hash encoding notation
representing decimal unicode codepoints
(`[` is `&#91;`; `]` is `&#93;`).
modules: |+
Load Logstash modules.
Modules can be defined using multiple instances
Expand Down Expand Up @@ -431,4 +456,8 @@ en:
ambiguous: >-
Both `%{canonical_name}` and its deprecated alias `%{deprecated_alias}` have been set.
Please only set `%{canonical_name}`
technical_preview:
set: >-
The setting `%{canonical_name}` is in TECHNICAL PREVIEW and its implementation
is subject to change in a future release of Logstash
14 changes: 14 additions & 0 deletions logstash-core/spec/logstash/event_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -192,6 +192,15 @@
expect { e.set('[', 'value') }.to raise_exception(::RuntimeError)
end
end

context 'with map value whose keys have FieldReference-special characters' do
let(:event) { LogStash::Event.new }
let(:value) { {"this[field]" => "okay"} }
it 'sets the value correctly' do
event.set('[some][field]', value.dup)
expect(event.get('[some][field]')).to eq(value)
end
end
end

context "timestamp" do
Expand Down Expand Up @@ -351,6 +360,11 @@
expect(e.timestamp.to_iso8601).to eq("2016-05-28T23:02:05.350Z")
end

it 'accepts maps whose keys contain FieldReference-special characters' do
e = LogStash::Event.new({"nested" => {"i][egal" => "okay"}, "n[0]" => "bene"})
expect(e.get('nested')).to eq({"i][egal" => "okay"})
expect(e.to_hash).to include("n[0]" => "bene")
end
end

context "method missing exception messages" do
Expand Down
49 changes: 43 additions & 6 deletions logstash-core/src/main/java/org/logstash/ConvertedMap.java
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.jruby.RubyHash;
import org.jruby.RubyString;
import org.jruby.runtime.ThreadContext;
Expand All @@ -35,15 +36,51 @@
* as key.</p>
* <p>The {@code put} method will work with any {@link String} key but is only intended for use in
* situations where {@link ConvertedMap#putInterned(String, Object)} would require manually
* interning the {@link String} key. This is due to the fact that we use our internal
* {@link FieldReference} cache to get an interned version of the given key instead of JDKs
* {@link String#intern()}, which is faster since it works from a much smaller and hotter cache
* in {@link FieldReference#CACHE} than using String interning directly.</p>
* interning the {@link String} key.
* This is due to the fact that it is based on {@link IdentityHashMap}, and we rely on the String
* intern pool to ensure identity match of equivalent strings.
* For performance, we keep a global cache of strings that have been interned for use with {@link ConvertedMap},
* and encourage interning through {@link ConvertedMap#internStringForUseAsKey(String)} to avoid
* the performance pentalty of the global string intern pool.
*/
public final class ConvertedMap extends IdentityHashMap<String, Object> {

private static final long serialVersionUID = 1L;

private static final ConcurrentHashMap<String,String> KEY_CACHE = new ConcurrentHashMap<>(100, 0.2F, 16);

/**
* Returns an equivalent interned string, possibly avoiding the
* global intern pool.
*
* @param candidate the candidate {@link String}
* @return an interned string from the global String intern pool
*/
static String internStringForUseAsKey(final String candidate) {
// TODO: replace with LRU cache and/or isolated intern pool
final String cached = KEY_CACHE.get(candidate);
if (cached != null) { return cached; }

final String interned = candidate.intern();
if (KEY_CACHE.size() <= 10_000 ) {
KEY_CACHE.put(interned, interned);
}
return interned;
}

/**
* Ensures that the provided {@code String[]} contains only
* instances that have been {@link ConvertedMap::internStringForUseAsKey},
* possibly replacing entries with equivalent interned strings.
*
* @param candidates an array of non-null strings
*/
static void internStringsForUseAsKeys(final String[] candidates) {
for (int i = 0; i < candidates.length; i++) {
candidates[i] = internStringForUseAsKey(candidates[i]);
}
}

private static final RubyHash.VisitorWithState<ConvertedMap> RUBY_HASH_VISITOR =
new RubyHash.VisitorWithState<ConvertedMap>() {
@Override
Expand Down Expand Up @@ -91,7 +128,7 @@ public static ConvertedMap newFromRubyHash(final ThreadContext context, final Ru

@Override
public Object put(final String key, final Object value) {
return super.put(FieldReference.from(key).getKey(), value);
return super.put(internStringForUseAsKey(key), value);
}

/**
Expand All @@ -118,6 +155,6 @@ public Object unconvert() {
* @return Interned String
*/
private static String convertKey(final RubyString key) {
return FieldReference.from(key).getKey();
return internStringForUseAsKey(key.asJavaString());
}
}
67 changes: 48 additions & 19 deletions logstash-core/src/main/java/org/logstash/FieldReference.java
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,10 @@
import java.util.Map;

import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

import org.jruby.RubyString;
import org.logstash.util.EscapeHandler;

/**
* Represents a reference to another field of the event {@link Event}
Expand All @@ -44,19 +47,41 @@ public static class IllegalSyntaxException extends RuntimeException {
}
}

private static EscapeHandler ESCAPE_HANDLER = EscapeHandler.NONE;

public static void setEscapeStyle(final String escapeStyleSpec) {
final EscapeHandler newEscapeHandler;
switch(escapeStyleSpec) {
case "none":
newEscapeHandler = EscapeHandler.NONE;
break;
case "percent":
newEscapeHandler = EscapeHandler.PERCENT;
break;
case "ampersand":
newEscapeHandler = EscapeHandler.AMPERSAND;
break;
default:
throw new IllegalArgumentException(String.format("Invalid escape style: `%s`", escapeStyleSpec));
}
ESCAPE_HANDLER = newEscapeHandler;
CACHE.clear();
RUBY_CACHE.clear();
}

/**
* This type indicates that the referenced that is the metadata of an {@link Event} found in
* {@link Event#metadata}.
* {@link Event#getMetadata()}.
*/
public static final int META_PARENT = 0;

/**
* This type indicates that the referenced data must be looked up from {@link Event#metadata}.
* This type indicates that the referenced data must be looked up from {@link Event#getMetadata()}.
*/
public static final int META_CHILD = 1;

/**
* This type indicates that the referenced data must be looked up from {@link Event#data}.
* This type indicates that the referenced data must be looked up from {@link Event#getData()}.
*/
private static final int DATA_CHILD = -1;

Expand All @@ -72,15 +97,6 @@ public static class IllegalSyntaxException extends RuntimeException {
*/
private static final StrictTokenizer TOKENIZER = new StrictTokenizer();

/**
* Unique {@link FieldReference} pointing at the timestamp field in a {@link Event}.
*/
public static final FieldReference TIMESTAMP_REFERENCE =
deduplicate(new FieldReference(EMPTY_STRING_ARRAY, Event.TIMESTAMP, DATA_CHILD));

private static final FieldReference METADATA_PARENT_REFERENCE =
new FieldReference(EMPTY_STRING_ARRAY, Event.METADATA, META_PARENT);

/**
* Cache of all existing {@link FieldReference} by their {@link RubyString} source.
*/
Expand All @@ -93,6 +109,11 @@ public static class IllegalSyntaxException extends RuntimeException {
private static final Map<String, FieldReference> CACHE =
new ConcurrentHashMap<>(64, 0.2F, 1);

/**
* Unique {@link FieldReference} pointing at the timestamp field in a {@link Event}.
*/
public static final FieldReference TIMESTAMP_REFERENCE = FieldReference.from(Event.TIMESTAMP);

private final String[] path;

private final String key;
Expand All @@ -106,9 +127,10 @@ public static class IllegalSyntaxException extends RuntimeException {
private final int type;

private FieldReference(final String[] path, final String key, final int type) {
this.key = key;
this.type = type;
this.key = ConvertedMap.internStringForUseAsKey(key);
ConvertedMap.internStringsForUseAsKeys(path);
this.path = path;
this.type = type;
hash = calculateHash(this.key, this.path, this.type);
}

Expand Down Expand Up @@ -219,12 +241,14 @@ private static FieldReference parseToCache(final String reference) {
}

private static FieldReference parse(final CharSequence reference) {
final List<String> path = TOKENIZER.tokenize(reference);
final List<String> path = TOKENIZER.tokenize(reference).stream()
.map(ESCAPE_HANDLER::unescape)
.collect(Collectors.toList());

final String key = path.remove(path.size() - 1).intern();
final String key = path.remove(path.size() - 1);
final boolean empty = path.isEmpty();
if (empty && key.equals(Event.METADATA)) {
return METADATA_PARENT_REFERENCE;
return new FieldReference(EMPTY_STRING_ARRAY, key, META_PARENT);
} else if (!empty && path.get(0).equals(Event.METADATA)) {
return new FieldReference(
path.subList(1, path.size()).toArray(EMPTY_STRING_ARRAY), key, META_CHILD
Expand All @@ -240,7 +264,11 @@ private static FieldReference parse(final CharSequence reference) {
**/
private static class StrictTokenizer {

public List<String> tokenize(CharSequence reference) {
/**
* @param reference a sequence of characters representing a reference to a field
* @return a list of string path fragments.
*/
public List<String> tokenize(final CharSequence reference) {
ArrayList<String> path = new ArrayList<>();
final int length = reference.length();

Expand Down Expand Up @@ -286,7 +314,7 @@ public List<String> tokenize(CharSequence reference) {

if (splitPoint < i) {
// if we have something to add, add it.
path.add(reference.subSequence(splitPoint, i).toString().intern());
path.add(reference.subSequence(splitPoint, i).toString());
}

depth--;
Expand All @@ -310,6 +338,7 @@ public List<String> tokenize(CharSequence reference) {
// further processing
path.add(reference.toString());
return path;

} else if (depth > 0) {
// when we hit the end-of-input while still in an open bracket, we have an invalid field reference
potentiallyAmbiguousSyntaxDetected = true;
Expand Down
Loading

0 comments on commit d845411

Please sign in to comment.