[Discuss] [EQL] Index selection as part of the rule vs the request #49462

Closed
stacey-gammon opened this issue Nov 21, 2019 · 9 comments
Labels
:Analytics/EQL EQL querying

Comments

@stacey-gammon
Contributor

Background

We are working on adding a new commercially licensed EQL search API, so that users can search their data stored in ES using the Event Query Language. One of the biggest open questions we have right now is whether index selection should be represented in the language or sent separately as part of the request.

Option 1: Index selection as part of the language

process WHERE process_name == "explorer.exe"
would become
myIndex-* WHERE event.type == "process" and process_name == "explorer.exe"

Depending on how you ingested your data, you could also potentially form your query like:

endpoint-process-* WHERE process_name == "explorer.exe"

However, structuring your data in any particular way would not be required; it would be up to the user.

Example request

GET _eql/search
{
  "rule": "process-* WHERE process_name == \"explorer.exe\""
}

Option 2: Index selection as part of the request

process WHERE process_name == "explorer.exe"

remains the same. The request would need to include the field that the pre-WHERE term is matched against. The pre-WHERE term would simply be a shortcut for writing any WHERE event.type == 'process' and process_name == "explorer.exe", assuming you chose event.type as your filtering field in the request. For example:

Example request

GET myIndex-*/_eql/search
{
  "rule": "process WHERE process_name == \"explorer.exe\"",
  "type_field": "event.type"
}

Considerations

When deciding which path to take, there are many things we should keep in mind:

  1. The flexibility of the language. Option 1 allows for greater flexibility. Option 2 could be just as flexible but would require additional steps/setup.

  2. How important is it to keep the language as close as possible to what it is today? Option 2 keeps the language more familiar.
    2a. How will this decision affect EQL queries that exist today? How difficult will it be to convert them to the ES variety?

  3. Can we use the same queries for searching historical data vs consecutively running queries?

  4. Can we use the same queries for searching data directly on an endpoint vs data stored in ES? If not, how easy would it be to convert from one to the other?

  5. How might this decision affect the response and how the response is consumed by Kibana?

  6. Are there any search performance implications?

  7. How will this affect Kibana (autocomplete)?

1. The flexibility and generality of the language

Going with option one would allow someone to create an ad hoc join query on any data set. For example:

join by ip
  [packetbeat-* where true]
  [firewall-data-* where true]
  [endpoint-network-* where true]

If you wanted to achieve the same thing, with the index pattern coming from outside the language, you would either have to search on * and add an indices query, e.g.:

GET */_eql/search
{
  "rule": """
    join by ip
      [any where _index == "packetbeat-*"]
      [any where _index == "firewall-data-*"]
      [any where _index == "endpoint-network-*"]
  """
}

or you'd have to create an alias ahead of time to point to those three indices (at which point... I suppose the query is irrelevant. just search for ip).
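
For reference, the alias workaround could be set up with the Elasticsearch `_aliases` API. The Python sketch below only builds the request body for `POST /_aliases` and does not send it; the alias name `ip-sources` and the helper `build_alias_request` are made up for illustration.

```python
import json


def build_alias_request(alias, index_patterns):
    """Build a body for `POST /_aliases` that points one alias at several index patterns."""
    return {
        "actions": [
            # One "add" action per pattern; wildcards are allowed in "index".
            {"add": {"index": pattern, "alias": alias}}
            for pattern in index_patterns
        ]
    }


body = build_alias_request(
    "ip-sources",  # hypothetical alias name, not from this thread
    ["packetbeat-*", "firewall-data-*", "endpoint-network-*"],
)
print(json.dumps(body, indent=2))
```

Note that an alias only resolves to the indices that exist when the action runs, so indices created later by a rolling pattern would have to be added again.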

If we went this route, another question is whether the user types in the string * or has to create an index pattern object. From the ES API standpoint it's just a string, but from the perspective of Kibana this can be awkward: is the user going to have to create a * index pattern object to run this query? In either case, we would hit some major issues with autocomplete:

[Screenshot: Screen Shot 2019-11-21 at 2 24 14 PM]

Winner: Option 1 gives the user more dynamic flexibility

2. Keeping the language as familiar as possible

Option 1 changes the language quite a bit from what users already expect. It looks like endpoint data will all be indexed into a single index, which means almost every query constructed today would change from {eventType} WHERE xyz to someIndex-* WHERE xyz and event.type == {eventType}.

However, we could probably create an automatic converter to switch from one version to the other.
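
As a rough idea of what such a converter might look like, here is a minimal Python sketch. It handles only a single `{eventType} WHERE ...` query (no sequence or join), and the function name `to_index_form` and the default `event.type` field are assumptions made for illustration, not an existing tool.

```python
import re

# Matches "<event_type> where <condition>"; the keyword is matched case-insensitively.
_SIMPLE_QUERY = re.compile(r"^\s*(\w+)\s+where\s+(.*)$", re.IGNORECASE | re.DOTALL)


def to_index_form(query, index_pattern, type_field="event.type"):
    """Rewrite '<type> WHERE <cond>' as '<index> WHERE <type_field> == "<type>" and (<cond>)'."""
    match = _SIMPLE_QUERY.match(query)
    if match is None:
        raise ValueError(f"not a simple '<type> where <condition>' query: {query!r}")
    event_type, condition = match.group(1), match.group(2).strip()
    if event_type.lower() == "any":
        # "any" carries no event type, so there is nothing to narrow on.
        return f"{index_pattern} WHERE {condition}"
    # Parenthesize the original condition so an "or" inside it keeps its meaning.
    return f'{index_pattern} WHERE {type_field} == "{event_type}" and ({condition})'


converted = to_index_form('process WHERE process_name == "explorer.exe"', "myIndex-*")
print(converted)
```

A real converter would need the EQL grammar rather than a regular expression, since sequences, joins and pipes each contain several such clauses.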

Winner: Option 2 keeps the language more familiar

3. Use the same queries for searching historical data vs consecutively running queries?

I think Option 2 is probably the winner here, but if we have the automatic converter from one version to the other, maybe it's not as important. However it does mean that people need to understand two varieties of queries.

Winner: Probably option 2

4. Can we use the same queries for searching data directly on an endpoint vs data stored in ES? If not, how easy would it be to convert from one to the other?

This might actually be the same question as 3 but I'm not certain. I haven't given this one too much thought, so please anyone feel free to add more deets here (or anywhere)!

5. How might this decision affect the response and how the response is consumed by Kibana?

This was an interesting consideration @costin brought up. Allowing queries to be run across really disparate indices means the result set could be a single table but with a huge number of fields. Consider the query:

join by ip
  [packetbeat-* where true]
  [firewall-data-* where true]
  [endpoint-network-* where true]

One row would have a column for every field in every one of these indices.

Winner: Option 2 will make it more difficult for a user to create these giant table results. That being said, it's a restriction. Do we want that restriction?

6. Are there any search performance implications?

Well, if we go with option 2 and support this query:

GET */_eql/search
{
  "rule": """
    join by ip
      [any where _index == "packetbeat-*"]
      [any where _index == "firewall-data-*"]
      [any where _index == "endpoint-network-*"]
  """
}

we'd probably want the indices query to be just as efficient as specifying the indices in the URL.

Winner: Option 1, but there are probably workarounds in either case (e.g. use an index alias instead of *).

7. How will this affect Kibana? (autocomplete)

I mentioned autocomplete above. It's fine from a Kibana standpoint if we ask the user for an index separately, or we expect it as part of the language. The one caveat is autocomplete and the example given above. Autocomplete in the case of:

sequence [myIndex-* WHERE fieldHere..]

would be a much better experience. KQL already has performance issues when trying to query for a field set when there is a high number of fields.

Summary

I'm personally a fan of Option 1 because of the flexibility and the improved auto-complete experience (thanks @rw-access for this suggestion!).

cc @colings86 @rw-access @costin @aleksmaus @jpountz @scunningham

@stacey-gammon stacey-gammon added the :Analytics/EQL EQL querying label Nov 21, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/EQL)

@costin
Member

costin commented Nov 22, 2019

👍 for Option 1.

Two main reasons come to mind:

a. Redundancy
The language already encapsulates the target in its declaration - in fact it's the first item: target WHERE filter.
Supporting another way to override the target (URL or otherwise) makes the language redundant (which goes against the whole premise of a declarative language) and confusing: which one wins, the language or the URL?

Furthermore, this means a query is non-deterministic since, depending on the URL (in your proposal), the target it declares might or might not be considered. This is a significant step backwards in terms of readability.

b. Scoping
target WHERE filter is a building block: it can be used at top level but also embedded inside sequence and join, which by definition touch multiple targets.

In the following declaration:

join by ip
  [file where ...]
  [process where ...]
  [network where ...]

there are 3 targets.
In the language they are clearly defined per filter/rule; re-declaring them for the whole query/request, however, means that the rules will be applied across all of them, essentially overriding the language scoping itself. Not good.

Side-note: Mapping

Language aside, I believe the conundrum of this issue relates to the way data is mapped.
Elasticsearch (and to this effect Kibana) uses indices in the URL since these are decoupled from the actual query (which has the nice effect of re-usability including aliasing).

EQL is the opposite: by default it touches multiple targets, which raises the question of what the target is: an index or a field? That is, how fine is the target granularity?

Namely what does target WHERE filter mean - is target an index or a field inside the index (which in effect means the whole request runs against the same index):

x) FROM index WHERE targetField == target AND filter
or
y) FROM target WHERE filter.

There are pros and cons to each:

target as fields ( x ) means:
➖ the target can be used inside filtering, which will have (unexpected) side-effects:
target WHERE targetField == target1 which would translate to FROM index WHERE targetField == target1 AND targetField == target which matches nothing.
We can potentially address that through dedicated validation.
⭕️ using the field type for differentiation encourages different document types to be mapped under the same index. I can see both positives and negatives to this (depending on how different certain documents are).
⭕️ indices will require some sort of targetField - not a big deal but not ideal either.
⭕️ a user using x) can move to one index per field-type but vice-versa, moving from index per type to one index with multiple types would not be possible.
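
To illustrate the "dedicated validation" mentioned for approach x), here is a minimal Python sketch that flags a filter which itself references the target field. The field name `event.type` and the helper `references_target_field` are assumptions for the example; a real check would work on the parsed query rather than on raw text, and this version would also flag the field name appearing inside a string literal.

```python
import re


def references_target_field(filter_text, target_field="event.type"):
    """Return True if the filter mentions the field reserved for target lookup."""
    # The lookarounds stop "event.type" from matching inside longer names
    # such as "event.type_id" or "my.event.type".
    pattern = r"(?<![\w.])" + re.escape(target_field) + r"(?![\w.])"
    return re.search(pattern, filter_text) is not None


conflicting = references_target_field('event.type == "file" and user.name == "root"')
unrelated = references_target_field("event.type_id == 3")
print(conflicting, unrelated)
```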

Overall I find y) better - there's no redundancy or special mapping and different indices or the same one can be reused across rules depending on the mapping:

join by ip
  [index where someField == value]
  [index where someOtherField == another]
  [anotherIndex where ...]

This avoids the need to use indices in the URL and can even allow multiple indices to be specified in the same rule:

join by ip
  [indexA, indexB where someField == value]
  ...

@colings86
Contributor

Thanks for raising this @stacey-gammon. I have a few thoughts here.

Firstly I think that question 4 is a really important one. The EQL language is being used in the wild on endpoints and users already have rules written in EQL. If we come up with a solution which requires users to rewrite those rules and/or change the way they think about writing rules this will create friction for those users to use EQL against Elasticsearch.

> or you'd have to create an alias ahead of time to point to those three indices (at which point... I suppose the query is irrelevant. just search for ip).

Note that it's possible to reference multiple indices as a comma-delimited list, e.g. index1,index2,index3, so using * would not be necessary.

For question 6, I'm not sure I understand why using _index is necessary. To select the indices we would use a comma-delimited list of the indices we want to apply the rule to, and then the rule would be written in terms of the event types:

GET packetbeat-*,firewall-data-*,endpoint-network-*/_eql/search
{
  "rule": """
    join by ip
      [packet where true]
      [firewall where true]
      [network where true]
  """
}

(note this looks a little weird since the example is contrived)

It also gives the user more granularity into the event types. For example, the event types that packetbeat uses may be stored in the same index, so the user can easily select the event type they are interested in using the existing EQL approach, e.g. apache_packet where....

> Supporting another way to override the target (URL or otherwise) makes the language redundant (which goes against the whole premise of a declarative language) and confusing: which one wins, the language or the URL?

I'm not sure I agree with this. This is not what option 2 does IMO. If we think about how EQL works today on the endpoint, there is a single store of events. This is analogous to an index. So the URL would be pointing to which store of events to run the rule on. The first word of the rule (or sequence element) indicates the type of event to use as context (process, file, network etc.). These do not overlap. Some users may decide to store data for different event types in different indices and some may not, but this does not change the fact that the first part of the rule (or sequence element) is the event type, not the index.

Another factor to consider here is that users will need to store their data in different ways (especially when we look past EQL only being used with endpoint), and rules are shipped with the product and shared between users. This means that having the index pattern in the rule itself ties the rule to particular index topologies and makes it harder to share rules. It also makes it hard to adapt the index topology over time, since splitting event types into different indices or combining them into the same index in the future would mean rewriting the rules.

Personally I think option 2 will create a similar UX to the existing EQL implementation, and allows users to share rules without worrying about how the data is spread across indexes.

@rw-access
Contributor

rw-access commented Nov 25, 2019

> If we come up with a solution which requires users to rewrite those rules and/or change the way they think about writing rules this will create friction for those users to use EQL against Elasticsearch.

We don't have a choice. None of the existing rules will work as-is, so all rules must be rewritten. Every field looks different now that we use ECS as the schema instead of the format used by the Endgame platform. We can write one-off scripts, like eqllib convert-query, to help with this process.

I think it's important that the language is completely independent from how the data is stored. Coupling a specific key-value pair, like eventType=X, sounds like it's trying to be flexible but makes the language and API awkwardly restrictive.

I've mentioned this in other mediums, but I think it's worth bringing up again: it sounds like we're trying to achieve shareability while maintaining full flexibility in how the data is stored. These are at odds with each other, but I think there's a (somewhat) simple solution to reconcile the two. We can add a shorthand "target" that can be defined separately for each organization, depending on their indexing strategy. Assuming ECS is widely adopted and universal, I think this is reasonable:

One thing that I think could work is adding another concept, similar to index patterns. We could allow users to create their own short-hand, like a narrowing query to find process events. For instance, a user could specify (outside of EQL) process means endgame-process-* where event.category="process". Then if you have an EQL query in the form process where process.name == "net.exe", you know the index pattern to use, and the narrowing query to add to it. It would essentially expand to endgame-process-* where event.category="process" and process.name == "net.exe".

For example, a search request

GET index-pattern-*/_eql/search?sync_search_threshold=5s
{
  "event_mapping": {
    "file": {
      "index": "endgame-file-*"
    },
    "process": {
      "index": ["endgame-*"],
      "filter": "event.category == 'process'"
    }
  },
  "rule": """
            sequence with maxspan=5h 
              [file where user.name != 'SYSTEM' by file.path]
              [process where user.name = 'SYSTEM' by process.path]
          """
}
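
The expansion described above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the `event_mapping` structure mirrors the request body in this comment, while the `resolve` helper and its regex-based parsing are hypothetical and handle a single `<target> where <condition>` clause rather than a whole sequence.

```python
import re

# Mirrors the "event_mapping" object from the example request.
EVENT_MAPPING = {
    "file": {"index": "endgame-file-*"},
    "process": {"index": ["endgame-*"], "filter": "event.category == 'process'"},
}

_CLAUSE = re.compile(r"^\s*(\w+)\s+where\s+(.*)$", re.IGNORECASE | re.DOTALL)


def resolve(clause, event_mapping):
    """Resolve one '<target> where <condition>' clause into (indices, condition)."""
    match = _CLAUSE.match(clause)
    if match is None:
        raise ValueError(f"cannot parse clause: {clause!r}")
    target, condition = match.group(1), match.group(2).strip()
    entry = event_mapping[target]  # raises KeyError for an undefined shorthand
    index = entry["index"]
    # "index" may be a single pattern or a list of patterns.
    indices = [index] if isinstance(index, str) else list(index)
    narrowing = entry.get("filter")
    if narrowing:
        # Prepend the narrowing query; parentheses preserve the user's own logic.
        condition = f"{narrowing} and ({condition})"
    return indices, condition


indices, condition = resolve('process where process.name == "net.exe"', EVENT_MAPPING)
print(indices, condition)
```

Because the mapping lives outside the rule text, the same rule could be shared between organizations that index their data differently.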

@clintongormley
Contributor

We had a Zoom meeting and came to the following conclusions:

  • The index pattern should be specified in the URL of the request
  • EQL should default to expecting data in ECS format (i.e. the object type to the left of the where clause will be translated into a query on event.type), but this mapping can be overridden with a parameter in the request body
  • The request body should also accept a filter (like SQL) for applying e.g. a time range filter

For example:

GET endpoint-*/_eql
{
  "query": "process where foo...",
  "filter": {
    "range": {
      "@timestamp": {
        "gte": "2019-01-01",
        "lt": "2020-01-01"
      }
    }
  },
  "event_field_lookup": "event.type" # default
}
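
One way to picture how these conclusions might combine on the Elasticsearch side is the Python sketch below. It builds a query DSL `bool` filter from the event type and the optional request filter. The function `build_search_filter` is hypothetical, and the thread does not specify how the translation would actually be implemented.

```python
def build_search_filter(event_type, request_filter=None, event_field_lookup="event.type"):
    """Combine the event-type lookup with the optional request-level filter."""
    # The object type left of "where" becomes a term query on the lookup field.
    clauses = [{"term": {event_field_lookup: event_type}}]
    if request_filter is not None:
        # The request body's "filter" (e.g. a time range) is applied alongside it.
        clauses.append(request_filter)
    return {"bool": {"filter": clauses}}


time_range = {"range": {"@timestamp": {"gte": "2019-01-01", "lt": "2020-01-01"}}}
search_filter = build_search_filter("process", time_range)
print(search_filter)
```

Overriding `event_field_lookup` would cover data that is not in ECS format, as described in the second bullet.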

@jpountz
Contributor

jpountz commented Nov 25, 2019

@clintongormley Do you have a recording of the meeting by any chance?

@clintongormley
Contributor

Sorry, no, but happy to answer questions @jpountz

@jpountz
Contributor

jpountz commented Nov 26, 2019

No questions in particular, I was just curious about any details that might have been discussed in this meeting.

Good arguments have been made for both options, but I think that the argument that convinced me the most is the one by Colin about decoupling rules and index topologies so that one can change topologies without having to rewrite rules. That said, I expect that targets will often match entire index patterns, which resonated with an idea that I've been thinking about for the past days about having some special fields that have the same value for all documents in an index and could get similar optimizations to term/wildcard queries on _index or range queries on @timestamp. It almost makes option 1 and option 2 meet in the sense that using that special field type for event.type and later filtering on event.type would be mostly the same as providing an index pattern as a target.

@colings86
Contributor

colings86 commented Nov 28, 2019

Closing this issue in favour of #49634, which includes the outcome of using event.type discussed here but covers the EQL search REST API design as a whole.
