[Discuss] [EQL] Index selection as part of the rule vs the request #49462
Pinging @elastic/es-search (:Search/EQL) |
👍 for Option 1. Two main reasons come to mind:
a. Redundancy
Furthermore, this means a query is non-deterministic, since depending on the URL (in your proposal) the target it declares might or might not be considered. This is a significant step backwards in terms of readability.
b. Scoping
In the following declaration:
there are 3 targets. Side-note: MappingLanguage aside, I believe the conundrum of this issue relates to the way data is mapped. EQL is the opposite: by default it touches multiple targets, which raises the question of what the target is: an index or a field? That is, how fine is the target granularity? Namely what does x) There are pros and cons to each: target as fields (x) means: Overall I find y) better: there's no redundancy or special mapping, and different indices or the same one can be reused across rules depending on the mapping:
This avoids the need to use indices in the URL and can even allow multiple indices to be specified in
|
Thanks for raising this @stacey-gammon. I have a few thoughts here. Firstly I think that question 4 is a really important one. The EQL language is being used in the wild on endpoints and users already have rules written in EQL. If we come up with a solution which requires users to rewrite those rules and/or change the way they think about writing rules this will create friction for those users to use EQL against Elasticsearch.
Note that it's possible to reference multiple indices as a comma-delimited list e.g. For question 6 I'm not sure I understand why using _index is necessary. To select the indices we would use a comma-delimited list of the indices we want to apply the rule to, and then the rule would be written in terms of the event types:
(note this looks a little weird since the example is contrived) It also gives the user more granularity into the event types. For example, the event types that packetbeat uses may be stored in the same index, so the user can easily select the event type they are interested in using the existing EQL approach e.g.
I'm not sure I agree with this. This is not what option 2 does IMO. If we think about how EQL works today on the endpoint, there is a single store of events. This is analogous to an index. So the URL would be pointing to which store of events to run the rule on. The first word of the rule (or sequence element) indicates the type of event to use as context (process, file, network etc.). These do not overlap. Some users may decide to store data for different event types in different indexes and some may not, but this does not change the fact that the first part of the rule (or sequence element) is the event type, not the index.
Another factor to consider here is that users will need to store their data in different ways (especially when we look past EQL only being used with endpoint), and rules are shipped with the product and shared between users. This means that having the index pattern in the rule itself ties the rule to particular index topologies and makes it harder to share rules. It also makes it hard to adapt the index topology over time, for example if we later split event types into different indices or combine event types into the same index.
Personally I think option 2 will create a similar UX to the existing EQL implementation, and allows users to share rules without worrying about how the data is spread across indexes. |
We don't have a choice. None of the existing rules will work as-is, so all rules must be rewritten. Every field looks different now that we use ECS as the schema instead of the format represented by the Endgame platform. We can write one-off scripts, like I think it's important that the language is completely independent from how the data is stored. Coupling a specific key-value pair, like eventType=X, sounds like it's trying to be flexible but makes the language and API awkwardly restrictive.
I've mentioned this in other mediums, but I think it's worth bringing up again: it sounds like we're trying to achieve shareability while maintaining full flexibility in how the data is stored. These are at odds with each other, but I think there's a (somewhat) simple solution to reconcile the two. We can add a shorthand "target" that can be defined separately for each organization, depending on their indexing strategy. Assuming ECS is widely adopted and universal, I think this is reasonable:
For example, a search request
GET index-pattern-*/_eql/search?sync_search_threshold=5s
{
  "event_mapping": {
    "file": {
      "index": "endgame-file-*"
    },
    "process": {
      "index": ["endgame-*"],
      "filter": "event.category == 'process'"
    }
  },
  "rule": """
    sequence with maxspan=5h
      [file where user.name != 'SYSTEM' by file.path]
      [process where user.name = 'SYSTEM' by process.path]
  """
}
|
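To make the proposal above more concrete, here is a minimal Python sketch of how a shorthand target could be resolved against such a mapping. It is only an illustration under stated assumptions: `resolve_targets`, the `EVENT_MAPPING` constant, and the output shape are hypothetical names invented for this example, and the regex handles only simple `[target where condition by key]` elements, not the full EQL grammar (for instance, a `]` inside a string literal would confuse it).

```python
import re
from typing import Dict, List

# Hypothetical mapping mirroring the "event_mapping" object in the request above.
EVENT_MAPPING: Dict[str, dict] = {
    "file": {"index": "endgame-file-*"},
    "process": {"index": ["endgame-*"], "filter": "event.category == 'process'"},
}

# One sequence element: "[<target> where <condition> by <join key>]"; "by" is optional.
_ELEMENT = re.compile(
    r"\[\s*(\w+)\s+where\s+(.+?)(?:\s+by\s+([\w.]+))?\s*\]",
    re.IGNORECASE | re.DOTALL,
)


def resolve_targets(rule: str, mapping: Dict[str, dict]) -> List[dict]:
    """Replace each shorthand target with the indices and filter it maps to."""
    resolved = []
    for target, condition, join_key in _ELEMENT.findall(rule):
        if target not in mapping:
            raise ValueError(f"no event_mapping entry for target {target!r}")
        entry = mapping[target]
        index = entry["index"]
        # "index" may be a single pattern or a list of patterns.
        indices = [index] if isinstance(index, str) else list(index)
        extra = entry.get("filter")
        # AND the mapping's filter (if any) with the rule's own condition.
        where = f"({extra}) and ({condition})" if extra else condition
        resolved.append({"indices": indices, "where": where, "by": join_key or None})
    return resolved
```

With this shape, the rule text stays free of index names, and an organization that stores its data differently would only need to swap the mapping.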
We had a Zoom meeting and came to the following conclusions:
For example:
|
@clintongormley Do you have a recording of the meeting by any chance? |
Sorry, no, but happy to answer questions @jpountz |
No questions in particular, I was just curious about any details that might have been discussed in this meeting. Good arguments have been made for both options, but the one that convinced me the most is Colin's argument about decoupling rules and index topologies so that one can change topologies without having to rewrite rules. That said, I expect that targets will often match entire index patterns, which resonated with an idea that I've been thinking about for the past few days about having some special fields that have the same value for all documents in an index and could get similar optimizations to |
Closing this issue in favour of #49634 which includes the outcome of using |
Background
We are working on adding a new commercially licensed EQL search API, so that users can search their data stored in ES using the Event Query Language. One of the biggest open questions we have right now is whether index selection should be represented in the language, or rather something sent along separately, along with the request.
Option 1: Index selection as part of the language
process WHERE process_name == "explorer.exe"
would become
myIndex-* WHERE event.type == "process" and process_name == "explorer.exe"
Depending on how you ingested your data, you could also potentially form your query like:
endpoint-process-* WHERE process_name == "explorer.exe"
however structuring your data in any given way would not be required, it would be up to the user.
Example request
Option 2: Index selection as part of the request
process WHERE process_name == "explorer.exe"
remains the same. The request would need to include the pre-where field to query on. It would simply be a shortcut to writing
any WHERE event.type == 'process' and process_name == "explorer.exe"
assuming you chose event.type as your filtering field in the request. For example:
Example request
Considerations
When deciding which path to take, there are many things we should keep in mind:
1. The flexibility of the language. Option 1 allows for greater flexibility. Option 2 could be just as flexible but would require additional steps/setup.
2. How important is it to keep the language as close as possible to what it is today? Option 2 keeps the language more familiar.
2a. How will this decision affect EQL queries that exist today? How difficult will it be to convert them to the ES variety?
3. Can we use the same queries for searching historical data vs consecutively running queries?
4. Can we use the same queries for searching data directly on an endpoint vs data stored in ES? If not, how easy would it be to convert from one to the other?
5. How might this decision affect the response and how the response is consumed by Kibana?
6. Are there any search performance implications?
7. How will this affect Kibana? (autocomplete)
1. The flexibility and generality of the language
Going with option one would allow someone to create an ad hoc join query on any data set. For example:
If you wanted to achieve the same thing, with the index pattern coming from outside the language, you would either have to search on * and add an indices query, e.g.:
or you'd have to create an alias ahead of time to point to those three indices (at which point... I suppose the query is irrelevant. just search for ip).
If we went this route, another question is whether the user is typing in the string * or if they have to create an index pattern object (of course, from an ES API standpoint, it's just a string). But from the perspective of Kibana, this can be awkward. Is the user going to have to create a * index pattern object to achieve this query? In either case, we would hit some major issues with autocomplete:
Winner: Option 1 gives the user more dynamic flexibility
2. Keeping the language as familiar as possible
Option 1 changes the language up quite a bit from what users are already expecting. It's looking like endpoint data will index all event data into a single index, which means almost every query already constructed today would change from
{eventType} WHERE xyz
to
someIndex-* WHERE xyz and event.type == {eventType}
However, we could probably create an automatic converter to switch from one version to the other.
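As a rough idea of what such a converter might look like, here is a minimal Python sketch. It is not an actual tool from this project: `to_index_form` and its parameters are hypothetical names, `event.type` as the default filtering field is an assumption taken from the examples in this issue, and the regex covers only simple `{eventType} WHERE xyz` queries (no sequences, joins, or pipes).

```python
import re

# Matches "<event_type> where <condition>"; the keyword is case-insensitive.
_EVENT_QUERY = re.compile(r"^\s*(\w+)\s+where\s+(.+?)\s*$", re.IGNORECASE | re.DOTALL)


def to_index_form(query: str, index_pattern: str, type_field: str = "event.type") -> str:
    """Rewrite '<eventType> WHERE xyz' into the index-first form of Option 1."""
    match = _EVENT_QUERY.match(query)
    if match is None:
        raise ValueError(f"not a simple event query: {query!r}")
    event_type, condition = match.groups()
    if event_type.lower() == "any":
        # 'any' matches every event type, so no extra filter is needed.
        return f"{index_pattern} WHERE {condition}"
    # Parenthesise the original condition so a top-level 'or' keeps its meaning.
    return f'{index_pattern} WHERE ({condition}) and {type_field} == "{event_type}"'
```

The parentheses are the one non-obvious design choice: without them, a condition such as `a or b` would change meaning once `and event.type == ...` is appended.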
Winner: Option 2 keeps the language more familiar
3. Use the same queries for searching historical data vs consecutively running queries?
I think Option 2 is probably the winner here, but if we have the automatic converter from one version to the other, maybe it's not as important. However it does mean that people need to understand two varieties of queries.
Winner: Probably option 2
4. Can we use the same queries for searching data directly on an endpoint vs data stored in ES? If not, how easy would it be to convert from one to the other?
This might actually be the same question as 3 but I'm not certain. I haven't given this one too much thought, so please anyone feel free to add more deets here (or anywhere)!
5. How might this decision affect the response and how the response is consumed by Kibana?
This was an interesting consideration @costin brought up. Allowing queries to be run across really disparate indices means the result set could be a single table but with a huge number of fields. Consider the query:
one row will have a column for every one of these indices.
Winner: Option 2 will make it more difficult for a user to create these giant table results. That being said, it's a restriction. Do we want that restriction?
6. Are there any search performance implications?
Well, if we go with option 2 and support this query:
we'd probably want the indices query to be just as efficient as specifying the query in the URL.
Winner: Option 1, but there are probably workarounds in either case (e.g. use an index alias instead of *).
7. How will this affect Kibana? (autocomplete)
I mentioned autocomplete above. It's fine from a Kibana standpoint if we ask the user for an index separately, or we expect it as part of the language. The one caveat is autocomplete and the example given above. Autocomplete in the case of:
sequence [myIndex-* WHERE fieldHere..]
would be a much better experience. KQL already has performance issues when trying to query for a field set when there is a high number of fields.
Summary
I'm personally a fan of Option 1 because of the flexibility and the improved auto-complete experience (thanks @rw-access for this suggestion!).
cc @colings86 @rw-access @costin @aleksmaus @jpountz @scunningham