[SecuritySolution][EntityAnalytics] Risk Scoring Preview API #155966

rylnd · 2023-04-27T03:14:17Z

Summary

This PR adds a new Risk Scoring API endpoint. Its functionality is meant to replace the current transform-based solution.

Contents of this PR:

New feature flag: riskScoringRoutesEnabled
A new POST endpoint at /internal/risk_scores/preview
An OpenAPI doc for the endpoint
Unit and integration tests

Current behavior, and short-term plans

The endpoint as specified in this branch is read-only. When the endpoint is hit, it triggers some aggregations in elasticsearch, and a formatted response is returned; there is no persistence at this time. This endpoint was originally written as a POC to demonstrate the new Risk Engine's functionality, but it will now drive the Preview Risk Scoring feature.

The main path for the Risk Engine is going to be a scheduled task that calculates Risk Scores and writes them to a persistent datastream that we own. (https://github.com/elastic/security-team/issues/6450). To accomplish this, we will decompose the full functionality of this endpoint into constituent pieces (i.e. calculate | persist, get)

How to review

I've created a Postman collection that can be used to exercise this endpoint. It was generated by Postman from the OpenAPI spec, and modified by me to contain a valid subset of request parameters; please peruse the spec and/or feel free to generate your own scripts/tools from the spec.

curl -L -H 'Authorization: 10c7f646373aa116' -o 'Risk Scoring API.postman_collection.json' https://upload.elastic.co/d/007a57857fc40c791835629ea6dd692d2a8a290860f2917329d688be78c03b1d

Review against the PR instance

I've created a demo instance containing the code on this branch, along with some realistic(ish) alert data (~200k alerts). While you can use this instance as a convenience, you will need to set up kibana-remote-dev and forward ports in order to be able to access the instance's API from a local machine:

Configure kibana-remote-dev with your SSH key and GitHub token.
Configure kibana-remote-dev to specify GITHUB_USERNAME=rylnd

This allows you to bypass kibana-remote-dev code that assumes projects are owned by you

Forward local ports to my instance: ./ports rd-rylnd-pr-155966-risk-score-api
Use postman to talk to http://localhost:5601, which will be forwarded to the cloud instance via the previous command

Review manually

Check out this branch
Enable the feature flag
Populate some event data and generate some alerts
Navigate to the new endpoint, and observe that the host.names and user.names from those alerts have been aggregated into these "risk scores" in the response
Play with the request options to see how these affect the scores (and see docs/test for more details on how those work)

What to review

Are the scores internally consistent? I.e. do they add up as expected? Does the corresponding "level" make sense?
Do parameters apply as expected? E.g. do weights predictably scale the results?
Are there discrepancies between the spec and the actual implementation?
Does pagination make sense? (i.e. the after_keys stuff)?

TODO (for @rylnd)

Add descriptions to the OpenAPI docs
Remove remaining TODOs from code

Checklist

Unit or functional tests were updated or added to match the most common scenarios
If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list

For maintainers

This was checked for breaking API changes and was labeled appropriately

Related ticket: https://github.com/elastic/security-team/issues/4211

Adds a POST route hidden behind a feature flag. There's no actual logic in the route yet, just: * schemae for the request/response (first pass) * constants, other boilerplate necessary for the route * adds a feature flag to control the route

The API now calculates and returns scores if there's data. The happy path works but I can't guarantee much else. Most parameters are not respected.

Expresses the idea that the response can either be the simple id/index/score tuples generated by the query, or the full ECS alert/input documents depending on the parameters specified in the request.

The new parameter, enrich_inputs, allows the full documents to be returned in the response. This also simplifies the response to be a single array, as opposed to grouped by entity type as before. In order to prevent an otherwise unnecessary pass over the response data, we no longer convert the risk inputs' sort field into a top-level "score" field. We can add this back as a convenience, later.

I wrote this function and intended to extract inputs, call this function, and then enrich the original inputs with the results. I then realized that most of this could be done for us by specifying `_source: true` in the riskiest_inputs subaggregation. For now, this function is unused, but if in the future we need e.g. runtime field support for these docs, we might need to reevaluate/use this.

This is mostly a passthrough, not much to say here

Not the kibana user. Whoops!

* DRYs up some code that is (currently) identical

If specified along with the `filter` parameter, it will be appended as a range filter to the end of the other filters.

This is due to a hard cap on the number of iterations a painless script can perform. See elastic/elasticsearch#28946 for details.

For now, this will only include a single message if the number of inputs found exceeded the number that we can process. In the future, it will include detailed boosting explanations. I think.

It's not the best message for e.g. painless errors, but it's better than nothing.

Multiple filters are expressed as a boolean query object. * Removes "default" filters from route invocation, as an empty filter is not valid within elasticsearch DSL.

If that's not provided, or the dataView isn't found, we instead default to the space-aware alerts index. Previous to this commit, we were using alerts from all spaces.

I have no idea if this will affect performance, but we don't need it.

If 'host' or 'user' is provided, only identifiers of that type will be returned. If unspecified, both will be returned. If an unknown value is provided, no results are currently returned.

If specified, the API will return the full request/response being executed. This also nests the risk scores under a `scores` key in the response.

WafaaNasr · 2023-06-08T11:27:59Z

x-pack/plugins/security_solution/server/lib/risk_engine/schema/risk_score_apis.yml

+      description: "Configuration used to tune risk scoring. Weights can be used to change the score contribution of risk inputs for hosts and users at both a global level and also for Risk Input categories (e.g. 'alerts')."
+      type: object
+      required:
+        - type


Just to understand 😊:

When the weight property is not provided in the request, how is the score calculated?

Additionally, I noticed that when users include the weight property, they are required to specify a type property. By examining the schema, it appears that the type property can be any string. I'm curious about how we handle this type property. Is it unique for each calculation, or do users define it globally and reference it here?

In https://github.com/elastic/kibana/pull/155966/files#diff-13daee434883598231f7bf1e9acb58059d4ba751440ddcefa2c478453e9405c0R12, I observed that we define two types, namely RISK_CATEGORY_WEIGHT_TYPE and GLOBAL_IDENTIFIER_WEIGHT_TYPE. I'm uncertain about the behavior if a user were to provide a different value for the type property. Should we include validation to handle such cases?

When the weight property is not provided in the request, how is the score calculated?

If a weight is not specified, it's assumed to be 1. You can see that behavior in the tests for buildWeightingOfScoreByCategory. In the case of a no global weighting, we just omit it, since multiplying by 1 is a no-op.

Should we include validation to handle such cases?

I've written some custom io-ts for most of these parameters, now; please give this another pass to see if there are any outstanding questions.

Well done 👏🏻 thank you!!

WafaaNasr · 2023-06-08T13:44:20Z

x-pack/plugins/security_solution/server/lib/risk_engine/schema/risk_score_apis.yml

+    RiskScoresRequest:
+      type: object
+      properties:
+        after_keys:


After some investigation 😅, I noticed that the after_key is being used for pagination in the Elasticsearch aggregation query. This led me to question whether this API is intended for internal use only and primarily targeted at users who are knowledgeable about Elasticsearch queries?

Correct, I don't expect users to be consuming this API themselves (although they can). The pagination will be handled by the client.

Update: after more discussion I've decided:

This endpoint is going to remain internal for now, but

In developing the public API I think it makes sense to restrict users to requesting a single type of entity, which in turn will make the pagination much simpler. I'll probably update this endpoint in the same way unless the UI ends up needing to make a single request for some reason.

Once the above is done, we can decide whether this endpoint should be made public, or now.

This adds more realistic validations and accompanying tests to explain the logic.

nkhristinin

LGTM!

These enums were moved to a more central location.

* Updates functions and helpers to accept these generated types instead of our static ones * Updates some helper methods to be type predicates * Exports more generated types

This defines identifierWeights slightly differently to allow arbitrary access by identifierType, e.g. `weight[identifierType]`. This change has no effect on the validation (as evidenced by the unit tests), but causes the generated type to effectively say "you can always access host and user keys, although they may be undefined"

* Renames from category_weights to risk_weights to be more general and more consistent with common/risk_engine * Moves a few helpers specific to after_keys and identifier_types into helpers.ts

As opposed to defaulting to a different data source, which would be unexpected.

WafaaNasr · 2023-06-15T13:16:04Z

x-pack/plugins/security_solution/common/risk_engine/utils.ts

+  enumName: string,
+  theEnum: Record<string, EnumType>
+): t.Type<EnumType, EnumType, unknown> {
+  const isEnumValue = (input: unknown): input is EnumType =>


WafaaNasr

LGTM, Thanks Ryland!!
Appreciate addressing the comments, organising the commits, and the overallPR structure.

kibana-ci · 2023-06-15T16:49:18Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 8b3f52b

Failed CI Steps

Test Failures

[job] [logs] Security Solution Tests #2 / Detection response view Open in timeline "after each" hook for "opens timeline with correct query count for open alerts by rule table"
[job] [logs] Security Solution Tests #2 / Detection response view Open in timeline opens timeline with correct query count for open alerts by rule table
[job] [logs] FTR Configs #20 / graph app feature controls security global graph all privileges allows creating a new workspace

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`infra`	1460	1461	+1
`lists`	259	260	+1
`securitySolution`	4143	4144	+1
total			+3

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id	before	after	diff
`securitySolution`	112	113	+1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`infra`	2.0MB	2.0MB	+176.0B
`lists`	142.1KB	142.3KB	+178.0B
`securitySolution`	10.8MB	10.8MB	+178.0B
total			+532.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id	before	after	diff
`securitySolution`	51.3KB	51.3KB	+28.0B

Unknown metric groups

API count

id	before	after	diff
`@kbn/securitysolution-io-ts-types`	66	67	+1
`securitySolution`	156	157	+1
total			+2

ESLint disabled line counts

id	before	after	diff
`enterpriseSearch`	13	15	+2
`securitySolution`	410	414	+4
total			+6

Total ESLint disabled count

id	before	after	diff
`enterpriseSearch`	14	16	+2
`securitySolution`	493	497	+4
total			+6

History

💛 Build #135027 was flaky 95895f6
💚 Build #134996 succeeded 370540b
💔 Build #134860 failed c53a7ed
💔 Build #134594 failed 302a153
💛 Build #131023 was flaky c367435
💛 Build #130650 was flaky c1b4048

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @rylnd

## Summary * Introduces a new API, POST `/api/risk_scores/calculate`, that triggers the code introduced here * As with the [preview route](#155966), this endpoint is behind the `riskScoringRoutesEnabled` feature flag * We intend to __REMOVE__ this endpoint before 8.10 release; it's mainly a convenience/checkpoint for testing the existing code. The next PR will introduce a scheduled Task Manager task that invokes this code periodically. * Updates to the /preview route: * `data_view_id` is now a required parameter on both endpoints. If a dataview is not found by that ID, the id is used as the general index pattern to the query. * Response has been updated to be more similar to the [ECS risk fields](elastic/ecs#2236) powering this data. * Mappings created by the [Data Client](#158422) have been updated to be aligned to the ECS risk fields (linked above) * Adds/updates the [OpenAPI spec](https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/server/lib/risk_engine/schema/risk_score_apis.yml) for these endpoints; useful starting point if you're trying to get oriented here. ## Things to review * [PR Demo environment](https://rylnd-pr-161503-risk-score-task-api.kbndev.co/app/home) * Preview API and related UI still works as expected * Calculation/Persistence API correctly bootstraps/persists data * correct mappings/ILM are created * things work in non-default spaces ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios ### Risk Matrix Delete this section if it is not applicable to this PR. Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release. When forming the risk matrix, consider some of the following examples and how they may potentially impact the change: | Risk | Probability | Severity | Mitigation/Notes | |---------------------------|-------------|----------|-------------------------| | Multiple Spaces—unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. | | Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. | | Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service operational. | | [See more potential risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) | ### For maintainers - [ ] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

## Summary * Introduces a new API, POST `/api/risk_scores/calculate`, that triggers the code introduced here * As with the [preview route](elastic#155966), this endpoint is behind the `riskScoringRoutesEnabled` feature flag * We intend to __REMOVE__ this endpoint before 8.10 release; it's mainly a convenience/checkpoint for testing the existing code. The next PR will introduce a scheduled Task Manager task that invokes this code periodically. * Updates to the /preview route: * `data_view_id` is now a required parameter on both endpoints. If a dataview is not found by that ID, the id is used as the general index pattern to the query. * Response has been updated to be more similar to the [ECS risk fields](elastic/ecs#2236) powering this data. * Mappings created by the [Data Client](elastic#158422) have been updated to be aligned to the ECS risk fields (linked above) * Adds/updates the [OpenAPI spec](https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/server/lib/risk_engine/schema/risk_score_apis.yml) for these endpoints; useful starting point if you're trying to get oriented here. ## Things to review * [PR Demo environment](https://rylnd-pr-161503-risk-score-task-api.kbndev.co/app/home) * Preview API and related UI still works as expected * Calculation/Persistence API correctly bootstraps/persists data * correct mappings/ILM are created * things work in non-default spaces ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios ### Risk Matrix Delete this section if it is not applicable to this PR. Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release. When forming the risk matrix, consider some of the following examples and how they may potentially impact the change: | Risk | Probability | Severity | Mitigation/Notes | |---------------------------|-------------|----------|-------------------------| | Multiple Spaces—unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. | | Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. | | Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service operational. | | [See more potential risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) | ### For maintainers - [ ] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

rylnd added Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Theme: entity_analytics 8.9 candidate Feature:Entity Analytics Security Solution Entity Analytics features v8.9.0 labels Apr 27, 2023

rylnd self-assigned this Apr 27, 2023

rylnd mentioned this pull request Apr 27, 2023

[Draft][SecuritySolution][EntityAnalytics] Risk Scoring API #155965

Closed

5 tasks

SourinPaul added the AET label Apr 27, 2023

rylnd force-pushed the risk_score_api branch from c149da9 to e5d3106 Compare May 1, 2023 22:22

rylnd added 21 commits May 8, 2023 20:51

Add basic route for requesting risk scores

46ac1b4

Adds a POST route hidden behind a feature flag. There's no actual logic in the route yet, just: * schemae for the request/response (first pass) * constants, other boilerplate necessary for the route * adds a feature flag to control the route

Fleshing out the logic of calculating risk scores

6f27d82

The API now calculates and returns scores if there's data. The happy path works but I can't guarantee much else. Most parameters are not respected.

Add typings and simple version of top (riskiest) inputs

62c660e

Expresses the idea that the response can either be the simple id/index/score tuples generated by the query, or the full ECS alert/input documents depending on the parameters specified in the request.

Camel case on the outgoing response

a089398

Respect filters param in endpoint

16a0d5e

This is mostly a passthrough, not much to say here

Request risk inputs via current user

d2d81ba

Not the kibana user. Whoops!

Adds raw score and risk level to response

cd1e5b1

* DRYs up some code that is (currently) identical

Endpoint now respects the range parameters

cf86293

If specified along with the `filter` parameter, it will be appended as a range filter to the end of the other filters.

Limit the number of risk inputs processed to the first million

96481be

This is due to a hard cap on the number of iterations a painless script can perform. See elastic/elasticsearch#28946 for details.

Add scoring notes to the response

9f0080b

For now, this will only include a single message if the number of inputs found exceeded the number that we can process. In the future, it will include detailed boosting explanations. I think.

Return an error response if our scoring query blows up

ebea202

It's not the best message for e.g. painless errors, but it's better than nothing.

Add note from marshall about potential performance improvement

4223ba2

Filters passed via the API are a single object

b35f3e5

Multiple filters are expressed as a boolean query object. * Removes "default" filters from route invocation, as an empty filter is not valid within elasticsearch DSL.

Rename parameter to reflect (lack of) plurality

229e1af

Retrieve the alerts index via our app client

1b49fc3

Retrieve the index pattern from the provided dataViewId

715d8c8

If that's not provided, or the dataView isn't found, we instead default to the space-aware alerts index. Previous to this commit, we were using alerts from all spaces.

Exclude _source from top-level risk score query

c05497b

I have no idea if this will affect performance, but we don't need it.

Respect identifier_type parameter in UI

bbda87e

If 'host' or 'user' is provided, only identifiers of that type will be returned. If unspecified, both will be returned. If an unknown value is provided, no results are currently returned.

Add "debug" mode to risk score API

5393610

If specified, the API will return the full request/response being executed. This also nests the risk scores under a `scores` key in the response.

Rename request schema for accuracy

7aec3de

WafaaNasr reviewed Jun 8, 2023

View reviewed changes

rylnd added 2 commits June 12, 2023 18:10

More strict schema validations on weighting

7dd7266

This adds more realistic validations and accompanying tests to explain the logic.

Merge branch 'main' into risk_score_api

2448f1c

rylnd requested a review from a team as a code owner June 12, 2023 23:17

Adds a few more tests around allowed 'type's and 'value's

302a153

nkhristinin approved these changes Jun 13, 2023

View reviewed changes

rylnd added 10 commits June 13, 2023 10:33

Fix imports

c53a7ed

These enums were moved to a more central location.

Use our new schema in the API

830edeb

Replace our static type for risk weights with the io-ts type

6dcce07

* Updates functions and helpers to accept these generated types instead of our static ones * Updates some helper methods to be type predicates * Exports more generated types

Move IdentifierType definition to one from io-ts

0d6c252

Move after_keys to io-ts

fc98b91

Some reorganization for consistency

f6671bb

* Renames from category_weights to risk_weights to be more general and more consistent with common/risk_engine * Moves a few helpers specific to after_keys and identifier_types into helpers.ts

Ensure that we're validating our date range

a31d7eb

Merge branch 'main' into risk_score_api

370540b

Return a 404 if specified dataview is not found

95895f6

As opposed to defaulting to a different data source, which would be unexpected.

WafaaNasr reviewed Jun 15, 2023

View reviewed changes

WafaaNasr approved these changes Jun 15, 2023

View reviewed changes

Merge branch 'main' into risk_score_api

8b3f52b

rylnd requested a review from a team as a code owner June 15, 2023 15:03

rylnd merged commit ae068a6 into main Jun 15, 2023

rylnd deleted the risk_score_api branch June 15, 2023 19:16

kibanamachine added the backport:skip This commit does not require backporting label Jun 15, 2023

rylnd mentioned this pull request Jul 24, 2023

Risk Score Persistence API #161503

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SecuritySolution][EntityAnalytics] Risk Scoring Preview API #155966

[SecuritySolution][EntityAnalytics] Risk Scoring Preview API #155966

rylnd commented Apr 27, 2023 •

edited

Loading

WafaaNasr Jun 8, 2023

WafaaNasr Jun 8, 2023

rylnd Jun 14, 2023 •

edited

Loading

WafaaNasr Jun 15, 2023

WafaaNasr Jun 8, 2023

rylnd Jun 14, 2023

rylnd Jun 15, 2023

nkhristinin left a comment

WafaaNasr Jun 15, 2023

WafaaNasr left a comment

kibana-ci commented Jun 15, 2023

API count

ESLint disabled line counts

Total ESLint disabled count

[SecuritySolution][EntityAnalytics] Risk Scoring Preview API #155966

[SecuritySolution][EntityAnalytics] Risk Scoring Preview API #155966

Conversation

rylnd commented Apr 27, 2023 • edited Loading

Summary

Contents of this PR:

Current behavior, and short-term plans

How to review

Review against the PR instance

Review manually

What to review

TODO (for @rylnd)

Checklist

For maintainers

WafaaNasr Jun 8, 2023

Choose a reason for hiding this comment

WafaaNasr Jun 8, 2023

Choose a reason for hiding this comment

rylnd Jun 14, 2023 • edited Loading

Choose a reason for hiding this comment

WafaaNasr Jun 15, 2023

Choose a reason for hiding this comment

WafaaNasr Jun 8, 2023

Choose a reason for hiding this comment

rylnd Jun 14, 2023

Choose a reason for hiding this comment

rylnd Jun 15, 2023

Choose a reason for hiding this comment

nkhristinin left a comment

Choose a reason for hiding this comment

WafaaNasr Jun 15, 2023

Choose a reason for hiding this comment

WafaaNasr left a comment

Choose a reason for hiding this comment

kibana-ci commented Jun 15, 2023

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

Metrics [docs]

Module Count

Public APIs missing comments

Async chunks

Page load bundle

API count

ESLint disabled line counts

Total ESLint disabled count

History

rylnd commented Apr 27, 2023 •

edited

Loading

rylnd Jun 14, 2023 •

edited

Loading