
[Logs UI] [R&D] Determine scope of work for migrating to async search in the Logs UI stream view #76677

Closed
jasonrhodes opened this issue Sep 3, 2020 · 10 comments
Assignees
Labels
chore Feature:Logs UI Logs UI feature R&D Research and development ticket (not meant to produce code, but to make a decision) Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services
Milestone

Comments

@jasonrhodes
Member

jasonrhodes commented Sep 3, 2020

We plan to migrate the Logs UI stream view to use the data plugin for ES queries and to make use of ES async search. We need to understand more about what this work is going to entail.

Questions we should answer in this R&D:

  1. Does the Kibana data plugin provide the functions we need to make the ES queries we are currently making?
  2. How can we split this work up into multiple steps to keep the scope small?
  3. What are the benefits/costs of using a server-side "custom search strategy" vs not?
    • Which strategy should we use to start with? The non-async implementation at first, then move to async?
  4. Do we foresee aspects of the existing Log Stream (auto refresh, scrolling refreshes, highlights, etc.) that will cause problems when using the data plugin?
  5. Do we plan to eventually make use of "partial query results" anywhere in the Logs UI?
  6. Do we plan to eventually make use of long-running async queries (notifying a user anywhere in the app when a query has finished) anywhere in the Logs UI?
  7. Do we plan to make use of async query cancellation in the Logs UI, and how? What should the UX be for this, roughly?
  8. Are there any other places in the Logs UI that we make ES queries that we should consider migrating to async search? (Assuming we will not worry about things like ML and Alerting APIs since those will need to be migrated by their associated teams.)

Timebox: 1 week

@jasonrhodes jasonrhodes added chore Feature:Logs UI Logs UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services R&D Research and development ticket (not meant to produce code, but to make a decision) labels Sep 3, 2020
@jasonrhodes jasonrhodes added this to the Logs UI 7.10 milestone Sep 3, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@Kerry350 Kerry350 self-assigned this Sep 7, 2020
@Kerry350
Contributor

Kerry350 commented Sep 10, 2020

Queries

This is an overview of the queries in use on the log stream page (it isn't exhaustive, as queries will differ slightly based on the source configuration). Note that where properties list many fields, I've cut them down to a small sample for brevity.

`getLogEntries` (Without highlights)
{
   "allowNoIndices":true,
   "index":"logs-*,filebeat-*,kibana_sample_data_logs*",
   "ignoreUnavailable":true,
   "body":{
      "size":200,
      "track_total_hits":false,
      "_source":[
         "event.dataset",
         "apache2.access",
         "apache2.access.remote_ip",
         "message"
      ],
      "query":{
         "bool":{
            "filter":[
               {
                  "range":{
                     "@timestamp":{
                        "gte":1599475846836,
                        "lte":1599562246837,
                        "format":"epoch_millis"
                     }
                  }
               }
            ]
         }
      },
      "sort":{
         "@timestamp":"asc",
         "_doc":"asc"
      },
      "search_after":[
         1599562219000,
         24460
      ]
   }
}
`getLogEntries` (With highlights)
{
   "allowNoIndices":true,
   "index":"logs-*,filebeat-*,kibana_sample_data_logs*",
   "ignoreUnavailable":true,
   "body":{
      "size":102,
      "track_total_hits":false,
      "_source":[
         "event.dataset",
         "apache2.access",
         "apache2.access.remote_ip"
      ],
      "query":{
         "bool":{
            "filter":[
               {
                  "multi_match":{
                     "fields":[
                        "event.dataset",
                        "apache2.access",
                        "apache2.access.remote_ip"
                     ],
                     "lenient":true,
                     "query":"Service",
                     "type":"phrase"
                  }
               },
               {
                  "range":{
                     "@timestamp":{
                        "gte":1599476372648,
                        "lte":1599562913677,
                        "format":"epoch_millis"
                     }
                  }
               }
            ]
         }
      },
      "sort":{
         "@timestamp":"asc",
         "_doc":"asc"
      },
      "highlight":{
         "boundary_scanner":"word",
         "fields":{
            "event.dataset":{},
            "apache2.access":{},
            "apache2.access.remote_ip":{}
         },
         "fragment_size":1,
         "number_of_fragments":100,
         "post_tags":[
            ""
         ],
         "pre_tags":[
            ""
         ],
         "highlight_query":{
            "multi_match":{
               "fields":[
                  "event.dataset",
                  "apache2.access",
                  "apache2.access.remote_ip"
               ],
               "lenient":true,
               "query":"Service",
               "type":"phrase"
            }
         }
      },
      "search_after":[
         1599560776000,
         12178
      ]
   }
}
`getContainedLogSummaryBuckets` (Without highlights)
{
   "allowNoIndices":true,
   "index":"logs-*,filebeat-*,kibana_sample_data_logs*",
   "ignoreUnavailable":true,
   "body":{
      "aggregations":{
         "count_by_date":{
            "date_range":{
               "field":"@timestamp",
               "format":"epoch_millis",
               "ranges":[
                  {
                     "from":1599475846836,
                     "to":1599476710847.78
                  },
                  {
                     "from":1599476710847,
                     "to":1599477574858.78
                  },
                  {
                     "from":1599477574858,
                     "to":1599478438869.78
                  }
               ]
            },
            "aggregations":{
               "top_hits_by_key":{
                  "top_hits":{
                     "size":1,
                     "sort":[
                        {
                           "@timestamp":"asc"
                        },
                        {
                           "_doc":"asc"
                        }
                     ],
                     "_source":false
                  }
               }
            }
         }
      },
      "query":{
         "bool":{
            "filter":[
               {
                  "range":{
                     "@timestamp":{
                        "gte":1599475846836,
                        "lte":1599562248014,
                        "format":"epoch_millis"
                     }
                  }
               }
            ]
         }
      },
      "size":0,
      "track_total_hits":false
   }
}
`getContainedLogSummaryBuckets` (With highlights)
{
   "allowNoIndices":true,
   "index":"logs-*,filebeat-*,kibana_sample_data_logs*",
   "ignoreUnavailable":true,
   "body":{
      "aggregations":{
         "count_by_date":{
            "date_range":{
               "field":"@timestamp",
               "format":"epoch_millis",
               "ranges":[
                  {
                     "from":1599476372648,
                     "to":1599477238058.29
                  },
                  {
                     "from":1599477238058,
                     "to":1599478103468.29
                  },
                  {
                     "from":1599478103468,
                     "to":1599478968878.29
                  }
               ]
            },
            "aggregations":{
               "top_hits_by_key":{
                  "top_hits":{
                     "size":1,
                     "sort":[
                        {
                           "@timestamp":"asc"
                        },
                        {
                           "_doc":"asc"
                        }
                     ],
                     "_source":false
                  }
               }
            }
         }
      },
      "query":{
         "bool":{
            "filter":[
               {
                  "multi_match":{
                     "fields":[
                        "event.dataset",
                        "apache2.access",
                        "apache2.access.remote_ip",
                        "apache2.access.user_name"
                     ],
                     "lenient":true,
                     "query":"Service",
                     "type":"phrase"
                  }
               },
               {
                  "range":{
                     "@timestamp":{
                        "gte":1599476372648,
                        "lte":1599562913677,
                        "format":"epoch_millis"
                     }
                  }
               }
            ]
         }
      },
      "size":0,
      "track_total_hits":false
   }
}
`getLogItem`
{
   "index":"logs-*,filebeat-*,kibana_sample_data_logs*",
   "terminate_after":1,
   "body":{
      "size":1,
      "sort":[
         {
            "@timestamp":"desc"
         },
         {
            "_doc":"desc"
         }
      ],
      "query":{
         "ids":{
            "values":[
               "ww5WbXQBHyE_1lJ-g4vn"
            ]
         }
      }
   }
}

To move over to async search we will need to be able to fulfill all of these queries.

Terminology / acronyms

  • SS = Search strategy
  • CSS = Custom search strategy

Q&A

❓ Does the Kibana data plugin provide the functions we need to make the ES queries we are currently making?

📝 Yes, as far as I can tell we can make the queries listed above using the data plugin, facilitated by the search function.


❓ How can we split this work up into multiple steps to keep the scope small?

📝 There are a few ways we could split this up; I believe the most effective would be:

  • Write our custom server side search strategy (this would be a light wrapper around the async search strategy, more below). This would be generic, to be used by all queries.

  • Write a new hook (which might be a wrapper around useTrackedPromise) which knows how to deal with our custom server side strategy, observables etc. This would be generic, to be used by all queries.

  • Choose a query, for example getLogEntries, and move it across:

    • Convert the server side logic
    • Convert the client side logic
  • Repeat the above for all queries on the page


❓ What are the benefits/costs of using a server-side "custom search strategy" vs not?
❓ which strategy should we use to start with? the non-async implementation at first, then move to async?

📝 I don't think there are any costs per se to using a CSS, other than maintenance on our part. A few benefits are:

  • Resilience to upstream changes, e.g. we can lay out our own abstraction layer around the async search strategy, so if anything changes upstream, we only need to change things in one place vs many. This is similar to how we have convenience wrappers around calling ES in our Kibana framework adapter.

In my opinion we should skip straight to using the async search strategy; the sync search strategy uses the async strategy under the hood anyway. This doesn't mean we need to immediately buy into things like partial results; these can be incremental improvements.

We can as a first step (via our CSS) utilise async search, but just wait for all of the results (if we need to). If some features are easier to adapt immediately to fully async (with partial results etc) we can also very easily have two methods on our CSS (or an option / parameter), one would return partial results, one could just be "async as sync" and gather them all before sending them back.

Either way the client side hook should be preemptively built with a fully async nature in mind, e.g. using observables, which can incrementally emit new data etc.
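As a rough sketch of that idea (plain callbacks stand in for observables here; `trackSearch` and the response shape are made up for illustration, not existing APIs):

```typescript
// Minimal sketch of a client-side consumer that renders partial results as
// they arrive while keeping a loading flag until the final emission.
interface StreamResponse {
  entries: string[];
  isRunning: boolean;
}

const trackSearch = (
  search: (onResponse: (response: StreamResponse) => void) => void
) => {
  const state = { entries: [] as string[], loading: true };
  search((response) => {
    state.entries = response.entries; // render partial results immediately
    state.loading = response.isRunning; // spinner stays until the final response
  });
  return state;
};
```

A real hook would subscribe to an RxJS observable from the data plugin and unsubscribe on cleanup, but the accumulate-as-you-go shape stays the same.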

From my perspective the hardest part of supporting partial results will be the UI / UX, as opposed to the direct technical work.

I've added a dedicated section on how this CSS could work below.


❓ Do we foresee problems with the existing Log Stream (auto refresh, scrolling refreshes, highlights, etc) that will cause problems when using the data plugin?

📝 I don't see any problems arising from using the data plugin here; realistically it's just the mechanism through which we receive data. However, how we use the data isn't necessarily easy. Partial results and cancellation raise UI / UX questions, and I think we should get some design input there, especially for cancellation (more below).

We could very easily (famous last words) keep the loading spinner functionality that we have now, but immediately render the partial results we receive. The loading spinner stays in place until we have received everything, but users enjoy immediate rendering of incoming results. This might create janky behaviour with the scroll area, though; I'm not 100% sure.


❓ Do we plan to eventually make use of "partial query results" anywhere in the Logs UI?

📝 I think this is a tentative "yes". The UI / UX might need some input. But it would be easy for us to render items to the stream as we receive them. We should definitely make our CSS and hook partial results aware, even if what that means at first is we wait until we have everything before rendering.

Where partial results seem to shine (right now) is for huge queries, queries that take minutes to execute, and it's imperative to start showing something to the user whilst the query carries on running in the background. The stream page isn't really built around this paradigm. However, I still think showing things quickly is important.

It's important to note that right now this option means we'd not really get anything out of partial results. They'd be reported every 64 shards. We could, of course, always change this upstream ourselves (as a configurable option).


❓ Do we plan to eventually make use of long-running async queries (notifying a user anywhere in the app when a query has finished) anywhere in the Logs UI?

📝 I can't think of an immediate use case for this. However, if we implement our CSS and hook to be partial-results and cancellation aware (which we should anyway) then this doesn't become hard to add later. Maybe a use case emerges for viewing years' worth of stream data?


❓ Do we plan to make use of async query cancellation in the Logs UI, and how? What should the UX be for this, roughly?

📝 We should definitely add first class cancellation support to our CSS and hook. However, how this immediately manifests in the UI is a different question. In the stream the scroll to top / bottom scenario again becomes interesting, if a user scrolls to an extreme, and thus triggers a request, we could allow this to be cancelled, however this request will probably just trigger again after cancellation due to the scroll boundary.

For the UX, it probably makes sense to allow cancellation wherever we currently show loading states. So it would become a loading state with an option to cancel.

Some sort of "banner" could also be used, but it becomes tricky when the page is dispatching multiple requests. Would a banner approach always cancel all requests in progress? On the stream page, this approach would mean hitting cancel and effectively cancelling queries for entries, minimap buckets, and highlights.

I think we need to carefully think about what cancellation means. E.g. I change the time range, and a query is dispatched, I then choose to cancel, does the time range then revert back?

Regardless of whether we surface this in the UI, we need cancellation support internally. If props change and we need to dispatch a new request, we need to be able to cancel the old one.


❓ Are there any other places in the Logs UI that we make ES queries that we should consider migrating to async search? (Assuming we will not worry about things like ML and Alerting APIs since those will need to be migrated by their associated teams.)

📝 As we'd be ignoring alerting and ML, I don't think there are many more locations. We could possibly use async search for the log entry examples shown on the anomalies / categories pages, and for chart preview data on the alerting flyout.


Custom search strategy

Our CSS approach could work a few different ways. We could have a CSS for each query type we want supported through the data plugin (a getLogEntries search strategy, for example). There could be search strategies for areas, e.g. a LogStream search strategy that has functions for getLogEntries, getHighlights etc. However, in my opinion these two approaches are needlessly complex; we don't really have a reason to have many, many different search strategies. The main reasons I see for us having a CSS are A) so we have a central location that interfaces with the data plugin and its core SS and B) to make it easy for other plugins to use our data.

Instead a better approach seems to be creating one CSS (the Infra CSS if you will), and that singular CSS would know how to deal with all of our various query types. This mechanism (the dealing with different query types) would be completely up to us. However, the security solution has already implemented something similar in a factory fashion and I really like the approach they've taken. Ours wouldn't look exactly like this, but it'd be similar.

Security solution PR: #75439

They have two CSS, which handle all of their various query types across the entire app. The provider lives here, providing a search and cancel function. There's very little code or complexity here, the app centric stuff happens in the queryFactory.

If we take a look at their hostOverview query factory we can see that there's a buildDsl and parse function. Every query type factory has these, this is a unified interface.

buildDsl unsurprisingly builds the DSL; .dsl is also added to the filenames (which I really like).

And parse knows exactly how to deal with the results.
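To make that unified interface concrete, a minimal sketch might look like this (QueryFactory and logItemFactory are hypothetical names, not the security solution's actual exports):

```typescript
// Sketch of a per-query-type factory with the two-stage interface described
// above. All names here are illustrative.
interface QueryFactory<Request, RawResponse, Response> {
  // Turn an app-level request into an ES search body.
  buildDsl: (request: Request) => object;
  // Turn the raw ES response back into the app-level shape.
  parse: (request: Request, rawResponse: RawResponse) => Response;
}

interface LogItemRequest {
  id: string;
}
interface RawHitsResponse {
  hits: { hits: unknown[] };
}

// A hypothetical getLogItem factory as a minimal example.
const logItemFactory: QueryFactory<LogItemRequest, RawHitsResponse, unknown[]> = {
  buildDsl: ({ id }) => ({ size: 1, query: { ids: { values: [id] } } }),
  parse: (_request, rawResponse) => rawResponse.hits.hits,
};
```

Every query type implements the same pair of functions, which is what lets a single dispatching strategy route requests to the right factory.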

On the client side we can see they're taking a "wait for everything" approach, by waiting until the response is not partial and is not running. If they hit a partial and not running state, they error out (as opposed to just showing the partial results so far).

I found this all very easy to follow as someone who has never seen the Security Solution before.

I think our ideal solution will look somewhat similar to this. It means we ultimately register one CSS, and can handle all query types in there. If another app wants to make use of this all they need to do is take the types we set up for the query type, and call our CSS.

We also have things (like the source configuration) that many query types will want to access; we can ensure these are passed in at the top level so that all factories can reach them.

Real (but rough) example

Some rough code exists here. This isn't perfect, it's just to show the ideas mentioned here.

Useful resources

Notes

@weltenwort
Member

weltenwort commented Sep 22, 2020

That's a great analysis, thank you. 👍 I concur with most of your conclusions.

I'd like to add some arguments for the "multiple search strategies" architecture over the "multiplexing with a single search strategy" approach, if I may.


We could have a CSS for each query type we want supported through the data plugin (a getLogEntries search strategy for example). There could be search strategies for areas, e.g. a LogStream search strategy that has functions for getLogEntries, getHighlights etc. However, in my opinion these two approaches are needlessly complex, we don't really have a reason to have many, many different search strategies.

About complexity: I'd question whether the request/response types from the security solution or your example would scale well for our number of request types. I imagine we'd end up with quite deeply nested conditional types if a binary branch already looks like this:

type StrategyRequestType<T extends FactoryQueryTypes> = T extends LogEntriesQueries.entries
  ? LogEntriesEntriesRequestOptions
  : T extends LogEntriesQueries.item
  ? LogEntriesItemRequestOptions
  : never;

export type StrategyResponseType<T extends FactoryQueryTypes> = T extends LogEntriesQueries.entries
  ? LogEntriesEntriesStrategyResponse
  : T extends LogEntriesQueries.item
  ? LogEntriesItemStrategyResponse
  : never;

export const infraSearchStrategyProvider = <T extends FactoryQueryTypes>(
  data: PluginStart
): ISearchStrategy<StrategyRequestType<T>, StrategyResponseType<T>> => {
  // ...
}

Not that I'm against conditional types in general, but these seem needlessly complex compared to straight-forward, separate search strategy providers.
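For contrast, here's a sketch of the separate-strategy shape (with a simplified stand-in for the data plugin's ISearchStrategy interface; the request/response types are illustrative). Each provider is fully concrete, so no conditional types are needed:

```typescript
// Simplified stand-in for the data plugin's search strategy interface.
interface ISearchStrategy<Req, Res> {
  search: (request: Req) => Promise<Res>;
}

// Concrete request/response types for one strategy; no conditionals required.
interface LogEntriesRequest {
  startTimestamp: number;
  endTimestamp: number;
}
interface LogEntriesResponse {
  entries: unknown[];
}

const logEntriesSearchStrategyProvider = (): ISearchStrategy<
  LogEntriesRequest,
  LogEntriesResponse
> => ({
  search: async (_request) => {
    // ...build the DSL from `_request`, delegate to the async search
    // strategy, and parse the raw response here...
    return { entries: [] };
  },
});
```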


About aesthetics: On the aesthetic side, I would argue that the data plugin's routes already perform a multiplexing to various search strategies. Nesting another multiplexing operation within that causes the code path to contain deeper, more nested branches which usually doesn't benefit clarity and maintainability. Instead I suspect it furthers the monolithic aspects of our plugin, which could reduce the flexibility for further refactorings.


About suitability: More importantly, I wonder if the multiplexing approach is actually suited for the internal workings of our to-be-written search strategies. The two-stage "create dsl" and "parse response" structure might be appealingly clear, but some of our routes actually perform more complex operations that can't be easily mapped to that. The log entries query, for example, actually consists of two queries with particular filters, search_after clauses and sort options. These will need to be combined both in terms of their async IDs and their in-progress/final responses. We could, of course, come up with a more powerful dispatching mechanism to accommodate that, but I feel we would be better off keeping it simple and "close to the metal" (which relates to the aesthetics argument).


Let me know if these arguments make sense. There might be upsides for grouping some queries together if they are always requested together and require identical parameters. Overall, though, I don't really see the downside of separate search strategies. Could you elaborate where you see the complexity in that approach?

@jasonrhodes
Member Author

Thanks, @weltenwort -- I think I'm okay with either approach but the idea of simplifying those conditional types is pretty appealing. I'll leave this decision up to you and @Kerry350 :D Thanks to both of you for the detailed write-ups!

@Kerry350
Contributor

@weltenwort Thanks for the feedback 👌

About complexity: Regarding the complexity, I'd question whether the request/response types from the security solution or your example would scale well for our number of request types.

but these seem needlessly complex compared to straight-forward, separate search strategy providers.

100%, I can't argue against this. The complex types were the big downside to this approach.

On the aesthetic side, I would argue that the data plugin's routes already perform a multiplexing to various search strategies. Nesting another multiplexing operation within that causes the code path to contain deeper, more nested branches which usually doesn't benefit clarity and maintainability. Instead I suspect it furthers the monolithic aspects of our plugin, which could reduce the flexibility for further refactorings.

Yes, that's a fair comment. Although I guess there's the counter argument we shouldn't necessarily consider how other plugins implement their functionality under the hood. Only the exposed interface.

More importantly, I wonder if the multiplexing approach is actually suited for the internal workings of our to-be-written search strategies. The two-stage "create dsl" and "parse response" structure might be appealingly clear, but some of our routes actually perform more complex operations that can't be easily mapped to that. The log entries query, for example, actually consists of two queries with particular filters, search_after clauses and sort options. These will need to be combined both in terms of their async IDs and their in-progress/final responses. We could, of course, come up with a more powerful dispatching mechanism to accommodate that, but I feel we would be better off keeping it simple and "close to the metal" (which relates the aesthetics argument).

The point here that we have much more complex routes that perform multiple queries is certainly true. Specifically I want to zone in on this bit though:

The log entries query, for example, actually consists of two queries with particular filters, search_after clauses and sort options. These will need to be combined both in terms of their async IDs and their in-progress/final responses. We could, of course, come up with a more powerful dispatching mechanism to accommodate that, but I feel we would be better off keeping it simple and "close to the metal" (which relates the aesthetics argument).

I believe we'll need to handle this regardless? Granted a flat buildDsl / parse wouldn't work "out of the box" and we'd need to piece the multiple queries together somehow. But if we use multiple custom search strategies, my thinking is we'd still want there to be a single getLogEntries custom search strategy (to keep with that example). In that scenario we'd still need a way of managing these multiple queries and piecing them together within the search method of the custom search strategy.

I just want to confirm we're on the same page there with the having multiple custom search strategies, that we'd still need to handle these multiple queries and their relationship within search. CSS -> search would ultimately wrap multiple data plugin -> search calls, and know how to create a sane response from that?

Could you elaborate where you see the complexity in that approach?

Complexity is probably the wrong word, I don't think it introduces complexity, but I just didn't see the upside to necessarily registering many search strategies. Quick aside question: If we use multiple custom search strategies would we namespace these with logs: as a prefix?

I do want to point out though that I'm not super passionate in either direction, I went with this for the super rough POC as it had parity with another plugin. But custom search strategies will work just as well. The implementation only shifts subtly from building one of these factory query type handlers, to instead adding a separate custom search strategy with a search / cancel. Happy to proceed with multiple strategies, the overarching question of "can we swap to the data plugin?" is at least answered via the R&D 😄

@weltenwort
Member

In that scenario we'd still need a way of managing these multiple queries and piecing them together within the search method of the custom search strategy.
I just want to confirm we're on the same page there with the having multiple custom search strategies, that we'd still need to handle these multiple queries and their relationship within search. CSS -> search would ultimately wrap multiple data plugin -> search calls, and know how to create a sane response from that?

Absolutely, the bodies of the search() (and cancel()) functions might be a bit more complex and abstract the multiple queries away, as in

  1. decode set of incoming ids (if polling)
  2. issue/retrieve several async queries (sometimes in parallel, sometimes in sequence building on each other)
  3. create combined response body and encode all required query ids for the next polling cycle
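Sketched in code, the composite-id bookkeeping for steps 1 and 3 might look like this (all names here are hypothetical; the stand-in merely mirrors the id / isRunning / isPartial fields that async search responses carry):

```typescript
// Hypothetical bookkeeping for a search strategy wrapping several async
// sub-searches. One opaque composite id encodes all sub-search ids so the
// client can poll with a single id.
interface SubSearchState {
  id?: string; // ES async search id; absent once the sub-search completed
  isRunning: boolean;
  isPartial: boolean;
}

const encodeCompositeId = (states: Record<string, SubSearchState>): string =>
  Buffer.from(JSON.stringify(states)).toString('base64');

const decodeCompositeId = (compositeId: string): Record<string, SubSearchState> =>
  JSON.parse(Buffer.from(compositeId, 'base64').toString('utf8'));

// The combined response is still running/partial if any sub-search is.
const combineStatus = (states: SubSearchState[]) => ({
  isRunning: states.some((state) => state.isRunning),
  isPartial: states.some((state) => state.isPartial),
});
```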

If we use multiple custom search strategies would we namespace these with logs: as a prefix?

A predictable naming scheme is always a good idea. Would a log prefix as in logEntries be enough?

I do want to point out though that I'm not super passionate in either direction, I went with this for the super rough POC as it had parity with another plugin. But custom search strategies will work just as well. The implementation only shifts subtly from building one of these factory query type handlers, to instead adding a separate custom search strategy with a search / cancel. Happy to proceed with multiple strategies, the overarching question of "can we swap to the data plugin?" is at least answered via the R&D

Agreed, the implementation difference is minor. I didn't want to diminish your work. You answered all the research questions thoroughly, so I figured it's time to start discussing the color of the shed. 🚲 😉

@Kerry350
Contributor

@weltenwort

Absolutely, the bodies of the search() (and cancel()) functions might be a bit more complex and abstract the multiple queries away, as in [...]

Perfect, then we're on the same page 👌

A predictable naming scheme is always a good idea. Would a log prefix as in logEntries be enough?

It would probably be sufficient. I'm just thinking if some other solution does something with logs they might also use logWhatever and then it gets more confusing.

[...] so I figured it's time to start discussing the color of the shed. 🚲 😉

Haha, love that 😂

Great, well I'm happy for us to adjust to multiple custom search strategies. @jasonrhodes has said he is. So as long as @afgomez is happy I think we're all aligned. I can tweak the two implementation tickets that exist at some point.

@jasonrhodes
Member Author

jasonrhodes commented Sep 23, 2020

I like the idea of some kind of prefix distinction, whether it's logs: or logs/ or some other special pattern, rather than just relying on log for the reasons @Kerry350 mentions. We should check with @elastic/kibana-app-arch to see if they have any suggestions for how we can all namespace these kinds of registrations?

I'm liking this shed paint though. Well done! :D

@weltenwort
Member

💭 AFAIK the search strategy name is used in a url, so a / in the name might be problematic.

@sgrodzicki

R&D work is done
