
[Security Solution][Detections] In-memory Rule Management and Monitoring tables #119601

Closed · 3 tasks done

banderror opened this issue Nov 24, 2021 · 3 comments
Labels: Feature:Rule Management, Feature:Rule Monitoring, Team:Detection Rule Management, Team:Detections and Resp, Team: SecuritySolution, v8.1.0

Comments


banderror commented Nov 24, 2021

Summary

Previously, we considered implementing in-memory filtering, sorting, and searching in the browser for our Rule Management and Monitoring tables: #89877

That PR was abandoned because:

  • Our rules/_find and rules/_find_statuses endpoints were extremely slow when the page size was 100+ items. We had an N+1 problem: we were fetching the rule status and actions SOs for each rule in separate requests to Elasticsearch. As a result, the in-browser app, in an attempt to load all 500+ rules, generated 1500+ requests to ES under the hood. Multiply that by the number of simultaneous users, add the fact that some of them have more than 500 rules, and it becomes clear that this wasn't a scalable solution.
  • The saved objects API didn't support aggregations back then. Now it does, so we no longer fetch N status SOs of N rules in N separate queries.

Since the SO API now supports aggregations, and we're in the process of getting rid of legacy SOs, we are going to reconsider the in-memory approach.
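For illustration, here is a minimal sketch of what the aggregation-based status fetching could look like. The sidecar SO type name, its field paths, and the exact client call shape are assumptions for illustration, not the final implementation:

```ts
// Sketch only: one aggregation query instead of N per-rule requests.
// The SO type name and field paths below are assumptions for illustration.
import type { SavedObjectsClientContract } from 'kibana/server';

const STATUS_SO_TYPE = 'siem-detection-engine-rule-status'; // assumed sidecar SO type

export const fetchLatestStatusPerRule = async (
  soClient: SavedObjectsClientContract,
  ruleIds: string[]
) => {
  // A single _find request with aggregations: group the status SOs by rule id
  // and keep only the most recent status document in each bucket.
  return soClient.find({
    type: STATUS_SO_TYPE,
    perPage: 0, // we only need the aggregation buckets, not the hits
    aggs: {
      statusesByRule: {
        terms: {
          field: `${STATUS_SO_TYPE}.attributes.alertId`, // assumed field name
          size: ruleIds.length,
        },
        aggs: {
          mostRecent: {
            top_hits: {
              size: 1,
              sort: [{ [`${STATUS_SO_TYPE}.attributes.statusDate`]: { order: 'desc' } }],
            },
          },
        },
      },
    },
  });
};
```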

To do

  • Implement a POC for the in-memory implementation of the Rule Management and Monitoring tables.
  • Add support for sorting by all the existing columns of the Rule Management and Monitoring tables.
  • Test the performance of the POC on a normal (600) and a large (a few thousand) number of rules.
    • Measure the full page load time and the subsequent table refresh time.
    • Measure event loop blocking on the server, which could be caused by JSON (de)serialization of a large number of rules (see the measurement sketch below).
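For the server-side measurement, Node's built-in `perf_hooks` should be sufficient. A minimal sketch (the sampling resolution and logging interval are arbitrary choices, not part of the plan):

```ts
// Sketch: monitor event loop delay on the Kibana server process while loading
// the table with a few thousand rules. Sustained high p99 values would point
// to JSON (de)serialization blocking the event loop.
import { monitorEventLoopDelay } from 'perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 10 }); // sample every 10 ms
histogram.enable();

setInterval(() => {
  // Histogram values are in nanoseconds; convert to milliseconds for readability.
  const toMs = (ns: number) => (ns / 1e6).toFixed(1);
  console.log(
    `event loop delay: mean=${toMs(histogram.mean)}ms, ` +
      `p99=${toMs(histogram.percentile(99))}ms, max=${toMs(histogram.max)}ms`
  );
  histogram.reset(); // start a fresh window for the next interval
}, 5000);
```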

banderror commented Dec 16, 2021

A few weeks ago, @FrankHassanabad raised valid concerns about this approach, and I finally compiled a summary of our chat and the decisions made so far.

Concerns:

  • Table loading and subsequent refresh polling. A full reload of all rules can generate significant load on the cluster and saturate the network, especially if done often and regularly in the background.
  • Users can have many browser tabs with Kibana and the Rule Management page open at a time and keep them open for a long time. These tabs will keep refreshing the rules in the background.
  • If we load all rules into browser memory, users on low-spec machines with limited memory can experience issues and crashes.
  • @FrankHassanabad: “I would strongly recommend making both code paths work with a switch so if customers no longer work anymore you can turn it off and they don’t get sorting”

Action items:

  • Tables will first load all the data (static rules data + dynamic rule monitoring data), then regularly poll for and refresh only the data that has changed since the last refresh (the LDU approach). This should substantially decrease the amount of data transferred from ES to Kibana and then from Kibana to the end user.
    • Do that in a follow-up PR
    • We will need to handle rule deletions when refreshing the table. There's no easy way to detect that a rule, or an SO in general, was deleted, so once the table is loaded, rules deleted afterwards by other users will continue to be shown in it. Some options:
      • Soft deletion. @FrankHassanabad: "These issues could be alleviated if SO objects allowed soft deletes and then at a later point in time (even if it was minutes, hours, days, weeks, etc...) would eventually remove the objects deleted or if it behind the scenes at least recorded the delete through a copy and a "shadow type". This would allow better UI/UX such as undo. It would probably require an RFC and a lot of education on the merits of it across teams. It might be better to take a "history approach RFC/feature" that includes recording deletes as an easier avenue to get your shadow objects/types."
      • Some kind of deletion log (we could leverage the event log for that as a temporary workaround until soft deletion is implemented for all SOs) that we could use to return the ids of deleted rules to the FE to update the tables.
      • Invalidate the cache of loaded rules every X minutes to force a full reload; probably the easiest option.
  • We’ll add a minimum value for the refresh interval. If the user specifies a smaller value, it will be clamped to the minimum. The minimum could be 30, 60, or 120 seconds; we’ll pick the value based on perf testing results.
    • UPD: 60 seconds is the current minimum. Nothing to do here.
  • (Maybe) We will consider “incremental” data loading for the tables: loading rules in N smaller requests instead of a single request. For example, the 1st request loads only the data needed to render the 1st page of the table as soon as possible, the 2nd request loads data for the next 1000 rules, and so on until all rules are loaded. Until all the data is loaded, the table is visible but not fully functional (e.g. sorting is disabled).
    • Potential reason 1: to reduce Time To Interactive if loading with a single request is too slow.
    • Potential reason 2: to avoid blocking Kibana's event loop on the server, if transferring a large number of rules turns out to cause that.
    • Alternatively, instead of incremental loading, introduce a configurable threshold (say, 2000 rules) and fall back users with more rules than that to the paginated version of the table. Test that approach and reconsider if it doesn't play well.
  • On the FE, state will be stored outside of the component tree and will persist across page navigation. The app won’t reload data that is already in memory and not stale.
  • The app will start polling when the table mounts and stop when it unmounts. If the user navigates to a different page, there won’t be any polling in the background.
  • The app will stop polling when the user leaves the browser tab, so only the currently active tab will affect the network and generate load on the cluster.
  • We will add a feature switch for the in-memory table (sketched below). When disabled, the app will work as it does now, based on the server-side implementation.
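To make the last item concrete, here is a minimal sketch of how the feature switch and the threshold fallback could be combined; the config shape and default values are assumptions, not the final design:

```ts
// Sketch: decide between the in-memory and the server-side (paginated)
// code paths. Both the config shape and the threshold are illustrative.
interface RulesTableConfig {
  inMemoryEnabled: boolean; // the feature switch suggested by @FrankHassanabad
  inMemoryRulesThreshold: number; // e.g. 2000 rules, configurable
}

export const shouldUseInMemoryTable = (
  config: RulesTableConfig,
  totalRulesCount: number
): boolean => {
  if (!config.inMemoryEnabled) {
    return false; // switch is off: keep using the server-side implementation
  }
  // Too many rules to comfortably hold and sort in browser memory: fall back.
  return totalRulesCount <= config.inMemoryRulesThreshold;
};
```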

Other notes:

  • Rules data doesn’t take much space: in most cases, all rules together take less than 1 MB. CPU-time-wise, for 1200 rules, JSON parsing takes 80-90 ms compared to a total page load time of 5000-6000 ms. A quick way to reproduce such a measurement is sketched below.
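A rough sketch of how that parsing time can be reproduced in isolation (the fake rule shape below is made up for illustration and is smaller than real detection rules):

```ts
// Sketch: serialize N rule-sized objects and time JSON.parse on the result.
import { performance } from 'perf_hooks';

const makeFakeRule = (i: number) => ({
  id: `rule-${i}`,
  name: `Rule ${i}`,
  enabled: i % 2 === 0,
  tags: ['test'],
  params: { query: '*:*', riskScore: 50, severity: 'low' },
});

const rules = Array.from({ length: 1200 }, (_, i) => makeFakeRule(i));
const payload = JSON.stringify(rules);
console.log(`payload size: ${(payload.length / 1024).toFixed(0)} KiB`);

const start = performance.now();
JSON.parse(payload);
console.log(`JSON.parse of 1200 rules took ${(performance.now() - start).toFixed(1)} ms`);
```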

vitaliidm pushed a commit that referenced this issue on Jan 25, 2022 (…119611)

**Addresses:** #119601

## Summary

With this implementation, we load detection rules into the in-memory cache on the initial page load. This change has several notable effects visible to users:

- Table pagination is now almost instant, as we don't need an additional HTTP request to render the table.
- Table filters are applied instantly when the cache is hot, as we use the filter parameters as the cache key. If a user has already filtered rules by a specific parameter, we first display the rules from cached data and re-fetch fresh data in the background.
- Table sorting and filtering also happen without visible delays, as we don't need an additional HTTP request for them.
- Navigation between pages is faster once the rules data is loaded; we do not re-fetch all rules if a user leaves the rules management page and then returns.
- Introduced an adjustable threshold representing the maximum number of rules for which the in-memory implementation is enabled (3000 by default).
  <img width="1503" alt="Threshold setting" src="https://user-images.githubusercontent.com/1938181/150350900-4147775e-82ab-4100-8e80-82c49ef193bf.png">
- Added an indication of whether the advanced sorting capabilities are enabled or disabled:
  <img width="200" alt="Advanced sorting enabled" src="https://user-images.githubusercontent.com/1938181/150351879-7974ebcc-487e-4368-90f5-728308c8155c.png"> <img width="200" alt="Advanced sorting disabled" src="https://user-images.githubusercontent.com/1938181/150351929-852d6d18-0c18-4d88-9fce-cf13c05ed80e.png">
- Added sorting by all rules table columns.
- Removed Idle modal and the `idleTimeout` UI setting. Added a saved object migration to remove the unused setting.

Notable changes from a technical standpoint:

- Automatic query cancellation; removed all manual AbortSignal-related logic.
- The manual timer-based re-fetching logic has been removed in favor of react-query's built-in implementation.
- Removed the manual logic for keeping the `lastUpdated` date, as well as the `isLoading` and `isRefreshing` flags, up to date; this has also been delegated to react-query.
- Refetch behavior slightly changed: we re-fetch table data only when the browser tab is active and do not re-fetch in the background. Additionally, we re-fetch when the browser tab becomes active. A sketch of this wiring is below.
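A minimal sketch of this react-query wiring (v3-style API; `fetchRules`, the filter type, and the query key are hypothetical, and the 60-second interval comes from the minimum refresh interval mentioned earlier):

```ts
// Sketch: filter parameters are part of the cache key, so previously used
// filters render instantly from the hot cache while fresh data is re-fetched.
import { useQuery } from 'react-query';

interface RulesFilterOptions {
  searchTerm: string;
  tags: string[];
}

declare const fetchRules: (filters: RulesFilterOptions) => Promise<unknown[]>;

export const useRulesTableData = (filters: RulesFilterOptions) =>
  useQuery(
    ['detectionRules', filters], // cache key includes the filters
    () => fetchRules(filters),
    {
      refetchInterval: 60_000, // poll once a minute (the minimum interval)
      refetchIntervalInBackground: false, // no polling when the tab is inactive
      refetchOnWindowFocus: true, // re-fetch when the tab becomes active again
      keepPreviousData: true, // keep showing cached data while re-fetching
    }
  );
```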

## Scalability Risks

It's worth noting how performance deteriorates as the number of rules per request increases (load generated with 200 requests, 50 in parallel):

**20 rules per request (200 requests, 50 in parallel)**

```
Summary:
  Total:	10.4503 secs
  Slowest:	1.6089 secs
  Fastest:	0.3398 secs
  Average:	1.0381 secs
  Requests/sec:	19.1383
```

**100 rules per request (200 requests, 50 in parallel)**

```
Summary:
  Total:	14.0456 secs
  Slowest:	1.9323 secs
  Fastest:	0.9952 secs
  Average:	1.3991 secs
  Requests/sec:	14.2393
```

**500 rules per request (200 requests, 50 in parallel)**

```
Summary:
  Total:	32.6509 secs
  Slowest:	4.8964 secs
  Fastest:	0.5494 secs
  Average:	3.2379 secs
  Requests/sec:	6.1254
```

**1000 rules per request (200 requests, 50 in parallel)**

```
Summary:
  Total:	47.8000 secs
  Slowest:	6.1776 secs
  Fastest:	1.2028 secs
  Average:	4.7547 secs
  Requests/sec:	4.1841
```

### JSON response parsing time

Time spent by Kibana parsing the Elasticsearch response JSON:

|          | 20 rules | 100 rules | 500 rules | 2500 rules |
| -------- | -------- | --------- | --------- | ---------- |
| mean, ms | 9.92     | 22.54     | 62.38     | 195.26     |
| max, ms  | 11.00    | 29.20     | 76.30     | 232.80     |
banderror commented Feb 2, 2022

UPD on the action items and the things left to do in this ticket.

  • LDU approach. At this point, given the current implementation and our plans, we consider this optimization non-critical. We will revisit the idea later. Reasons:
    • The in-memory table is disabled by default in the UI, so there likely won't be many users working with it. Those who do experience performance issues will be able to switch it off and continue with the server-side implementation.
    • Implementing this approach wouldn't be trivial because we have two sources of data that both have to be checked: rule SOs and the sidecar rule execution SOs.
    • We're planning to start working on proper server-side filtering and sorting in the 8.2 release cycle: [RAM] [META] Make rule params searchable #123982

There are some follow-up items, most of which we're going to start addressing in the next 8.2 dev cycle:

banderror commented
@xcrzx I updated the above comment with follow-up tickets, and I think at this point we can close this one. Please check the new tickets and let me know if you'd like to change anything in them.
