Overview
Currently, PPL can query fields defined at index time (schema-on-write), and it also supports defining fields at query time (schema-on-read) through eval. It would also be helpful to extract fields from a text field using a regular expression. For example, instead of extracting fields from the raw log message at index time, users may choose to index the entire raw log message as a text field. At query time, they can then use a regex on the fly to extract any field they want and process it further.
Requirements
Implement the parse command
E | parse <text-field> <regex-expression>
regex-expression: a regex string with named capture groups, '(?<namedGroup>regex)'
Grammar
E: the input event stream.
expression: An expression that evaluates to a string; this is usually the raw text field.
namedGroup: The name of a field to assign a value to, extracted from the string expression.
Functionality: Fields extracted by the parse command at search time can be used in the same way as fields defined at index time.
Security: Access to the string expression should follow the FGAC field access policy. Fields extracted at search time do not follow the FGAC field access policy: if a client is authorized to access the string expression, it should also be authorized to access the fields extracted from it.
Performance: The parse command should have performance comparable to that of a runtime search request.
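As a rough illustration of the requirements above, the following Java sketch extracts named capture groups from a text value with java.util.regex. The class name `ParseSketch`, the helper `parse`, the whole-string match, and the empty-string fallback for non-matching input are assumptions made for this example, not the plugin's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: turn named capture groups into derived string fields.
class ParseSketch {
    // Finds group names such as (?<host>...). Lookbehinds like (?<= or (?<! are
    // skipped because a group name must start with a letter.
    private static final Pattern GROUP_NAME =
            Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    static Map<String, String> parse(String text, String regex) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        // Assumption: the regex must match the whole text value.
        boolean matched = m.matches();
        Matcher names = GROUP_NAME.matcher(regex);
        while (names.find()) {
            String name = names.group(1);
            String value = matched ? m.group(name) : null;
            // Assumption: a missing value becomes an empty string.
            fields.put(name, value == null ? "" : value);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse("alice@example.com", ".+@(?<host>.+)");
        System.out.println(fields.get("host")); // prints "example.com"
    }
}
```

Every extracted value is a plain string, which matches the default type of derived fields described in this proposal.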
Field type conversion
Derived fields are strings by default. They can be cast to other types using the cast function.
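To see why the conversion matters, the minimal Java sketch below compares two extracted values first as text and then as integers. The class and method names are hypothetical and only illustrate the difference between string and numeric comparison; they do not reflect PPL's cast syntax.

```java
// Hypothetical sketch: the same extracted values compare differently as text and as numbers.
class CastSketch {
    // Lexicographic comparison, which is what a string-typed field gets.
    static boolean greaterAsText(String a, String b) {
        return a.compareTo(b) > 0;
    }

    // Numeric comparison after converting the extracted strings to integers.
    static boolean greaterAsInt(String a, String b) {
        return Integer.parseInt(a) > Integer.parseInt(b);
    }

    public static void main(String[] args) {
        System.out.println(greaterAsText("9", "10")); // prints "true"
        System.out.println(greaterAsInt("9", "10"));  // prints "false"
    }
}
```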
Performance
Currently, regex evaluation happens on the coordinating node. In the first phase we will use circuit breakers to limit query execution time; in the future we will work to support distributed queries.
Additionally, the parse command uses java.util.regex. We can evaluate whether it helps to switch to an optimized third-party library such as re2j. re2j runs in linear time and is better at handling untrusted regex expressions. It is not Perl regex compatible and doesn't support features such as lookarounds, but those might not be as important for PPL regex.
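One way to bound the cost of a backtracking java.util.regex match is to wrap the input in a CharSequence that checks a deadline on every character read. The sketch below is only an illustration of that idea under stated assumptions: the names `BoundedRegexSketch`, `DeadlineCharSequence`, and `matchesWithin` are hypothetical, and this is not how the plugin's circuit breakers are implemented.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: abort a regex match once a time budget is exhausted.
class BoundedRegexSketch {
    // The matcher reads input through charAt, so checking the clock there
    // interrupts long-running backtracking.
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new IllegalStateException("regex evaluation exceeded time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    // Returns whether the regex matches the whole text, or throws
    // IllegalStateException if the budget runs out first.
    static boolean matchesWithin(String regex, String text, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        return Pattern.compile(regex)
                .matcher(new DeadlineCharSequence(text, deadline))
                .matches();
    }
}
```

Reading the clock on every character adds overhead, so a real implementation would likely check less often. A linear-time engine such as re2j would avoid the problem at the source.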
Limitations on current implementation
See https://github.com/opensearch-project/sql/blob/578c6b27bf3eb6707f3c5927783bd4f6a0de71c0/docs/user/ppl/cmd/parse.rst#limitation
Demo (type conversion syntax has changed)
Screen.Recording.2022-01-07.at.11.31.56.AM.mov