
[Security Solution][Detections] In-memory Rule Management and Monitoring tables #119601

Closed · 3 tasks done

banderror opened this issue Nov 24, 2021 · 3 comments
Labels: Feature:Rule Management, Feature:Rule Monitoring, Team:Detection Rule Management, Team:Detections and Resp, Team: SecuritySolution, v8.1.0

Comments


banderror commented Nov 24, 2021

Summary

Previously, we considered implementing in-memory filtering, sorting, and searching in the browser for our Rule Management and Monitoring tables: #89877

That PR was abandoned because:

  • Our rules/_find and rules/_find_statuses endpoints were extremely slow when the page size was 100+ items. We had an N+1 problem: we were fetching the rule status and actions SOs for each rule in separate requests to Elasticsearch. As a result, the in-browser app, in an attempt to load all 500+ rules, generated 1500+ requests to ES under the hood. Multiply that by the number of simultaneous users, add the fact that some of them have more than 500 rules, and it becomes clear that this wasn't a scalable solution.
  • The saved objects API didn't support aggregations back then. Now it does, so we no longer fetch N status SOs of N rules in N separate queries.

Since the SO API now supports aggregations, and we're in the process of getting rid of legacy SOs, we are going to reconsider the in-memory approach.
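For illustration, here is a minimal sketch of what the aggregation-based status fetching could look like. The sidecar SO type name, its field paths, and the exact client call shape are assumptions for illustration, not the final implementation:

```ts
// Sketch only: one aggregation query instead of N per-rule requests.
// The SO type name and field paths below are assumptions for illustration.
import type { SavedObjectsClientContract } from 'kibana/server';

const STATUS_SO_TYPE = 'siem-detection-engine-rule-status'; // assumed sidecar SO type

export const fetchLatestStatusPerRule = async (
  soClient: SavedObjectsClientContract,
  ruleIds: string[]
) => {
  // A single _find request with aggregations: group the status SOs by rule id
  // and keep only the most recent status document in each bucket.
  return soClient.find({
    type: STATUS_SO_TYPE,
    perPage: 0, // we only need the aggregation buckets, not the hits
    aggs: {
      statusesByRule: {
        terms: {
          field: `${STATUS_SO_TYPE}.attributes.alertId`, // assumed field name
          size: ruleIds.length,
        },
        aggs: {
          mostRecent: {
            top_hits: {
              size: 1,
              sort: [{ [`${STATUS_SO_TYPE}.attributes.statusDate`]: { order: 'desc' } }],
            },
          },
        },
      },
    },
  });
};
```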

To do

  • Implement a POC for the in-memory implementation of the Rule Management and Monitoring tables.
  • Add support for sorting by all the existing columns of the Rule Management and Monitoring tables.
  • Test the performance of the POC on a normal (600) and a large (a few thousand) number of rules.
    • Measure the full page load time and the subsequent table refresh time.
    • Measure event loop blocking on the server, which could be caused by JSON (de)serialization of a large number of rules (see the measurement sketch below).
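For the server-side measurement, Node's built-in `perf_hooks` should be sufficient. A minimal sketch (the sampling resolution and logging interval are arbitrary choices, not part of the plan):

```ts
// Sketch: monitor event loop delay on the Kibana server process while loading
// the table with a few thousand rules. Sustained high p99 values would point
// to JSON (de)serialization blocking the event loop.
import { monitorEventLoopDelay } from 'perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 10 }); // sample every 10 ms
histogram.enable();

setInterval(() => {
  // Histogram values are in nanoseconds; convert to milliseconds for readability.
  const toMs = (ns: number) => (ns / 1e6).toFixed(1);
  console.log(
    `event loop delay: mean=${toMs(histogram.mean)}ms, ` +
      `p99=${toMs(histogram.percentile(99))}ms, max=${toMs(histogram.max)}ms`
  );
  histogram.reset(); // start a fresh window for the next interval
}, 5000);
```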

banderror commented Dec 16, 2021

A few weeks ago, @FrankHassanabad raised valid concerns about this approach, and I finally compiled a summary of our chat and the decisions made so far.

Concerns:

  • Table loading and subsequent refresh polling. A full reload of all rules can generate significant load on the cluster and saturate the network, especially if done often and regularly in the background.
  • Users can have many browser tabs with Kibana and the Rule Management page open at a time and keep them open for a long time. These tabs will keep refreshing the rules in the background.
  • If we load all rules into browser memory, users on low-spec machines with limited memory can experience issues and crashes.
  • @FrankHassanabad: “I would strongly recommend making both code paths work with a switch so if customers no longer work anymore you can turn it off and they don’t get sorting”

Action items:

  • Tables will first load all the data (static rules data + dynamic rule monitoring data), then regularly poll for and refresh only the data that has changed since the last refresh (the LDU approach). This should substantially decrease the amount of data transferred from ES to Kibana and then from Kibana to the end user.
    • Do that in a follow-up PR
    • We will need to handle rule deletions when refreshing the table. There's no easy way to detect that a rule, or an SO in general, was deleted, so once the table is loaded, rules deleted afterwards by other users will continue to be shown in it. Some options:
      • Soft deletion. @FrankHassanabad: "These issues could be alleviated if SO objects allowed soft deletes and then at a later point in time (even if it was minutes, hours, days, weeks, etc...) would eventually remove the objects deleted or if it behind the scenes at least recorded the delete through a copy and a "shadow type". This would allow better UI/UX such as undo. It would probably require an RFC and a lot of education on the merits of it across teams. It might be better to take a "history approach RFC/feature" that includes recording deletes as an easier avenue to get your shadow objects/types."
      • Some kind of deletion log (we could leverage the event log for that as a temporary workaround until soft deletion is implemented for all SOs) that we could use to return the ids of deleted rules to the FE to update the tables.
      • Invalidate the cache of loaded rules every X minutes to force a full reload; probably the easiest option.
  • We’ll add a minimum value for the refresh interval. If the user specifies a smaller value, it will be clamped to the minimum. The minimum could be 30, 60, or 120 seconds; we’ll pick the value based on perf testing results.
    • UPD: 60 seconds is the current minimum. Nothing to do here.
  • (Maybe) We will consider “incremental” data loading for the tables: loading rules in N smaller requests instead of a single request. For example, the 1st request loads only the data needed to render the 1st page of the table as soon as possible, the 2nd request loads data for the next 1000 rules, and so on until all rules are loaded. Until all the data is loaded, the table is visible but not fully functional (e.g. sorting is disabled).
    • Potential reason 1: to reduce Time To Interactive if loading with a single request is too slow.
    • Potential reason 2: to avoid blocking Kibana's event loop on the server, if transferring a large number of rules turns out to cause that.
    • Alternatively, instead of incremental loading, introduce a configurable threshold (say, 2000 rules) and fall back users with more rules than that to the paginated version of the table. Test that approach and reconsider if it doesn't play well.
  • On the FE, state will be stored outside of the component tree and will persist across page navigation. The app won’t reload data that is already in memory and not stale.
  • The app will start polling when the table mounts and stop when it unmounts. If the user navigates to a different page, there won’t be any polling in the background.
  • The app will stop polling when the user leaves the browser tab, so only the currently active tab will affect the network and generate load on the cluster.
  • We will add a feature switch for the in-memory table (sketched below). When disabled, the app will work as it does now, based on the server-side implementation.
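To make the last item concrete, here is a minimal sketch of how the feature switch and the threshold fallback could be combined; the config shape and default values are assumptions, not the final design:

```ts
// Sketch: decide between the in-memory and the server-side (paginated)
// code paths. Both the config shape and the threshold are illustrative.
interface RulesTableConfig {
  inMemoryEnabled: boolean; // the feature switch suggested by @FrankHassanabad
  inMemoryRulesThreshold: number; // e.g. 2000 rules, configurable
}

export const shouldUseInMemoryTable = (
  config: RulesTableConfig,
  totalRulesCount: number
): boolean => {
  if (!config.inMemoryEnabled) {
    return false; // switch is off: keep using the server-side implementation
  }
  // Too many rules to comfortably hold and sort in browser memory: fall back.
  return totalRulesCount <= config.inMemoryRulesThreshold;
};
```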

Other notes:

  • Rules data doesn’t take much space: in most cases, all rules together take less than 1 MB. CPU-time-wise, for 1200 rules, JSON parsing takes 80-90 ms compared to a total page load time of 5000-6000 ms. A quick way to reproduce such a measurement is sketched below.
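A rough sketch of how that parsing time can be reproduced in isolation (the fake rule shape below is made up for illustration and is smaller than real detection rules):

```ts
// Sketch: serialize N rule-sized objects and time JSON.parse on the result.
import { performance } from 'perf_hooks';

const makeFakeRule = (i: number) => ({
  id: `rule-${i}`,
  name: `Rule ${i}`,
  enabled: i % 2 === 0,
  tags: ['test'],
  params: { query: '*:*', riskScore: 50, severity: 'low' },
});

const rules = Array.from({ length: 1200 }, (_, i) => makeFakeRule(i));
const payload = JSON.stringify(rules);
console.log(`payload size: ${(payload.length / 1024).toFixed(0)} KiB`);

const start = performance.now();
JSON.parse(payload);
console.log(`JSON.parse of 1200 rules took ${(performance.now() - start).toFixed(1)} ms`);
```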

vitaliidm pushed a commit that referenced this issue on Jan 25, 2022 (…119611)

**Addresses:** #119601

## Summary

With this implementation, we load detection rules into the in-memory cache on the initial page load. This change has several notable effects visible to users:

- Table pagination is now almost instant, as we don't need an additional HTTP request to render the table.
- Table filters are applied instantly when the cache is hot, as we use the filter parameters as the cache key. If a user has already filtered rules by a specific parameter, we first display the rules from cached data and re-fetch fresh data in the background.
- Table sorting and filtering also happen without visible delays, as we don't need an additional HTTP request for them.
- Navigation between pages is faster once the rules data is loaded; we do not re-fetch all rules if a user leaves the rules management page and then returns.
- Introduced an adjustable threshold representing the maximum number of rules for which the in-memory implementation is enabled (3000 by default).
  <img width="1503" alt="Threshold setting" src="https://user-images.githubusercontent.com/1938181/150350900-4147775e-82ab-4100-8e80-82c49ef193bf.png">
- Added an indication of whether the advanced sorting capabilities are enabled or disabled:
  <img width="200" alt="Advanced sorting enabled" src="https://user-images.githubusercontent.com/1938181/150351879-7974ebcc-487e-4368-90f5-728308c8155c.png"> <img width="200" alt="Advanced sorting disabled" src="https://user-images.githubusercontent.com/1938181/150351929-852d6d18-0c18-4d88-9fce-cf13c05ed80e.png">
- Added sorting by all rules table columns.
- Removed Idle modal and the `idleTimeout` UI setting. Added a saved object migration to remove the unused setting.

Notable changes from a technical standpoint:

- Automatic query cancellation; removed all manual AbortSignal-related logic.
- The manual timer-based re-fetching logic has been removed in favor of react-query's built-in implementation.
- Removed the manual logic for keeping the `lastUpdated` date, as well as the `isLoading` and `isRefreshing` flags, up to date; this has also been delegated to react-query.
- Refetch behavior slightly changed: we re-fetch table data only when the browser tab is active and do not re-fetch in the background. Additionally, we re-fetch when the browser tab becomes active. A sketch of this wiring is below.
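A minimal sketch of this react-query wiring (v3-style API; `fetchRules`, the filter type, and the query key are hypothetical, and the 60-second interval comes from the minimum refresh interval mentioned earlier):

```ts
// Sketch: filter parameters are part of the cache key, so previously used
// filters render instantly from the hot cache while fresh data is re-fetched.
import { useQuery } from 'react-query';

interface RulesFilterOptions {
  searchTerm: string;
  tags: string[];
}

declare const fetchRules: (filters: RulesFilterOptions) => Promise<unknown[]>;

export const useRulesTableData = (filters: RulesFilterOptions) =>
  useQuery(
    ['detectionRules', filters], // cache key includes the filters
    () => fetchRules(filters),
    {
      refetchInterval: 60_000, // poll once a minute (the minimum interval)
      refetchIntervalInBackground: false, // no polling when the tab is inactive
      refetchOnWindowFocus: true, // re-fetch when the tab becomes active again
      keepPreviousData: true, // keep showing cached data while re-fetching
    }
  );
```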

## Scalability Risks

It's worth noting how performance deteriorates as the number of rules per request increases (load generated with 200 requests, 50 in parallel):

**20 rules per request (200 requests, 50 in parallel)**

```
Summary:
  Total:	10.4503 secs
  Slowest:	1.6089 secs
  Fastest:	0.3398 secs
  Average:	1.0381 secs
  Requests/sec:	19.1383
```

**100 rules per request (200 requests, 50 in parallel)**

```
Summary:
  Total:	14.0456 secs
  Slowest:	1.9323 secs
  Fastest:	0.9952 secs
  Average:	1.3991 secs
  Requests/sec:	14.2393
```

**500 rules per request (200 requests, 50 in parallel)**

```
Summary:
  Total:	32.6509 secs
  Slowest:	4.8964 secs
  Fastest:	0.5494 secs
  Average:	3.2379 secs
  Requests/sec:	6.1254
```

**1000 rules per request (200 requests, 50 in parallel)**

```
Summary:
  Total:	47.8000 secs
  Slowest:	6.1776 secs
  Fastest:	1.2028 secs
  Average:	4.7547 secs
  Requests/sec:	4.1841
```

### JSON response parsing time

Time spent by Kibana parsing the Elasticsearch response JSON:

|          | 20 rules | 100 rules | 500 rules | 2500 rules |
| -------- | -------- | --------- | --------- | ---------- |
| mean, ms | 9.92     | 22.54     | 62.38     | 195.26     |
| max, ms  | 11.00    | 29.20     | 76.30     | 232.80     |
banderror commented Feb 2, 2022

UPD on the action items and the things left to do in this ticket.

  • LDU approach. At this point, given the current implementation and our plans, we consider this optimization non-critical. We will revisit the idea later. Reasons:
    • The in-memory table is disabled by default in the UI, so there likely won't be many users working with it. Those who do experience performance issues will be able to switch it off and continue with the server-side implementation.
    • Implementing this approach wouldn't be trivial because we have two sources of data that both have to be checked: rule SOs and the sidecar rule execution SOs.
    • We're planning to start working on proper server-side filtering and sorting in the 8.2 release cycle: [RAM] [META] Make rule params searchable #123982

There are some follow-up items, most of which we're going to start addressing in the next 8.2 dev cycle:

banderror commented
@xcrzx I updated the above comment with follow-up tickets, and I think at this point we can close this one. Please check the new tickets and let me know if you'd like to change anything in them.
