Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SecuritySolution][EntityAnalytics] Risk Scoring Preview API #155966

Merged
merged 103 commits into from
Jun 15, 2023
Merged

Conversation

rylnd
Copy link
Contributor

@rylnd rylnd commented Apr 27, 2023

Summary

This PR adds a new Risk Scoring API endpoint. Its functionality is meant to replace the current transform-based solution.

Contents of this PR:

  • New feature flag: riskScoringRoutesEnabled
  • A new POST endpoint at /internal/risk_scores/preview
  • An OpenAPI doc for the endpoint
  • Unit and integration tests

Current behavior, and short-term plans

The endpoint as specified in this branch is read-only. When the endpoint is hit, it triggers some aggregations in elasticsearch, and a formatted response is returned; there is no persistence at this time. This endpoint was originally written as a POC to demonstrate the new Risk Engine's functionality, but it will now drive the Preview Risk Scoring feature.

The main path for the Risk Engine is going to be a scheduled task that calculates Risk Scores and writes them to a persistent datastream that we own. (https://github.com/elastic/security-team/issues/6450). To accomplish this, we will decompose the full functionality of this endpoint into constituent pieces (i.e. calculate | persist, get)

How to review

I've created a Postman collection that can be used to exercise this endpoint. It was generated by Postman from the OpenAPI spec, and modified by me to contain a valid subset of request parameters; please peruse the spec and/or feel free to generate your own scripts/tools from the spec.

curl -L -H 'Authorization: 10c7f646373aa116' -o 'Risk Scoring API.postman_collection.json' https://upload.elastic.co/d/007a57857fc40c791835629ea6dd692d2a8a290860f2917329d688be78c03b1d

Review against the PR instance

I've created a demo instance containing the code on this branch, along with some realistic(ish) alert data (~200k alerts). While you can use this instance as a convenience, you will need to set up kibana-remote-dev and forward ports in order to be able to access the instance's API from a local machine:

  1. Configure kibana-remote-dev with your SSH key and GitHub token.
  2. Configure kibana-remote-dev to specify GITHUB_USERNAME=rylnd
  • This allows you to bypass kibana-remote-dev code that assumes projects are owned by you
  1. Forward local ports to my instance: ./ports rd-rylnd-pr-155966-risk-score-api
  2. Use postman to talk to http://localhost:5601, which will be forwarded to the cloud instance via the previous command

Review manually

  1. Check out this branch
  2. Enable the feature flag
  3. Populate some event data and generate some alerts
  4. Navigate to the new endpoint, and observe that the host.names and user.names from those alerts have been aggregated into these "risk scores" in the response
  5. Play with the request options to see how these affect the scores (and see docs/test for more details on how those work)

What to review

  • Are the scores internally consistent? I.e. do they add up as expected? Does the corresponding "level" make sense?
  • Do parameters apply as expected? E.g. do weights predictably scale the results?
  • Are there discrepancies between the spec and the actual implementation?
  • Does pagination make sense? (i.e. the after_keys stuff)?

TODO (for @rylnd)

  • Add descriptions to the OpenAPI docs
  • Remove remaining TODOs from code

Checklist

  • Unit or functional tests were updated or added to match the most common scenarios
  • If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list

For maintainers

Related ticket: https://github.com/elastic/security-team/issues/4211

@rylnd rylnd added Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Theme: entity_analytics 8.9 candidate Feature:Entity Analytics Security Solution Entity Analytics features v8.9.0 labels Apr 27, 2023
@rylnd rylnd self-assigned this Apr 27, 2023
@SourinPaul SourinPaul added the AET label Apr 27, 2023
@rylnd rylnd force-pushed the risk_score_api branch from c149da9 to e5d3106 Compare May 1, 2023 22:22
rylnd added 21 commits May 8, 2023 20:51
Adds a POST route hidden behind a feature flag. There's no actual logic
in the route yet, just:

* schemae for the request/response (first pass)
* constants, other boilerplate necessary for the route
* adds a feature flag to control the route
The API now calculates and returns scores if there's data. The happy
path works but I can't guarantee much else. Most parameters are not
respected.
Expresses the idea that the response can either be the simple
id/index/score tuples generated by the query, or the full ECS
alert/input documents depending on the parameters specified in the
request.
The new parameter, enrich_inputs, allows the full documents to be
returned in the response.

This also simplifies the response to be a single array, as opposed to
grouped by entity type as before.

In order to prevent an otherwise unnecessary pass over the response
data, we no longer convert the risk inputs' sort field into a top-level
"score" field. We can add this back as a convenience, later.
I wrote this function and intended to extract inputs, call this
function, and then enrich the original inputs with the results. I then
realized that most of this could be done for us by specifying `_source:
true` in the riskiest_inputs subaggregation.

For now, this function is unused, but if in the future we need e.g.
runtime field support for these docs, we might need to reevaluate/use
this.
This is mostly a passthrough, not much to say here
Not the kibana user. Whoops!
* DRYs up some code that is (currently) identical
If specified along with the `filter` parameter, it will be appended as a
range filter to the end of the other filters.
This is due to a hard cap on the number of iterations a painless script
can perform. See elastic/elasticsearch#28946
for details.
For now, this will only include a single message if the number of inputs
found exceeded the number that we can process. In the future, it will
include detailed boosting explanations. I think.
It's not the best message for e.g. painless errors, but it's better than nothing.
Multiple filters are expressed as a boolean query object.

* Removes "default" filters from route invocation, as an empty filter is
  not valid within elasticsearch DSL.
If that's not provided, or the dataView isn't found, we instead default
to the space-aware alerts index. Previous to this commit, we were using
alerts from all spaces.
I have no idea if this will affect performance, but we don't need it.
If 'host' or 'user' is provided, only identifiers of that type will be
returned. If unspecified, both will be returned. If an unknown value is
provided, no results are currently returned.
If specified, the API will return the full request/response being
executed.

This also nests the risk scores under a `scores` key in the response.
description: "Configuration used to tune risk scoring. Weights can be used to change the score contribution of risk inputs for hosts and users at both a global level and also for Risk Input categories (e.g. 'alerts')."
type: object
required:
- type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to understand 😊:

When the weight property is not provided in the request, how is the score calculated?

Additionally, I noticed that when users include the weight property, they are required to specify a type property. By examining the schema, it appears that the type property can be any string. I'm curious about how we handle this type property. Is it unique for each calculation, or do users define it globally and reference it here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In https://github.com/elastic/kibana/pull/155966/files#diff-13daee434883598231f7bf1e9acb58059d4ba751440ddcefa2c478453e9405c0R12, I observed that we define two types, namely RISK_CATEGORY_WEIGHT_TYPE and GLOBAL_IDENTIFIER_WEIGHT_TYPE. I'm uncertain about the behavior if a user were to provide a different value for the type property. Should we include validation to handle such cases?

Copy link
Contributor Author

@rylnd rylnd Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the weight property is not provided in the request, how is the score calculated?

If a weight is not specified, it's assumed to be 1. You can see that behavior in the tests for buildWeightingOfScoreByCategory. In the case of a no global weighting, we just omit it, since multiplying by 1 is a no-op.

Should we include validation to handle such cases?

I've written some custom io-ts for most of these parameters, now; please give this another pass to see if there are any outstanding questions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done 👏🏻 thank you!!

RiskScoresRequest:
type: object
properties:
after_keys:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some investigation 😅, I noticed that the after_key is being used for pagination in the Elasticsearch aggregation query. This led me to question whether this API is intended for internal use only and primarily targeted at users who are knowledgeable about Elasticsearch queries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I don't expect users to be consuming this API themselves (although they can). The pagination will be handled by the client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: after more discussion I've decided:

  1. This endpoint is going to remain internal for now, but
  2. In developing the public API I think it makes sense to restrict users to requesting a single type of entity, which in turn will make the pagination much simpler. I'll probably update this endpoint in the same way unless the UI ends up needing to make a single request for some reason.
  3. Once the above is done, we can decide whether this endpoint should be made public, or now.

rylnd added 2 commits June 12, 2023 18:10
This adds more realistic validations and accompanying tests to explain
the logic.
@rylnd rylnd requested a review from a team as a code owner June 12, 2023 23:17
Copy link
Contributor

@nkhristinin nkhristinin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

rylnd added 10 commits June 13, 2023 10:33
These enums were moved to a more central location.
* Updates functions and helpers to accept these generated types instead
  of our static ones
* Updates some helper methods to be type predicates
* Exports more generated types
This defines identifierWeights slightly differently to allow arbitrary
access by identifierType, e.g. `weight[identifierType]`. This change has
no effect on the validation (as evidenced by the unit tests), but causes
the generated type to effectively say "you can always access host and
user keys, although they may be undefined"
* Renames from category_weights to risk_weights to be more general and
  more consistent with common/risk_engine
* Moves a few helpers specific to after_keys and identifier_types into
  helpers.ts
As opposed to defaulting to a different data source, which would be
unexpected.
enumName: string,
theEnum: Record<string, EnumType>
): t.Type<EnumType, EnumType, unknown> {
const isEnumValue = (input: unknown): input is EnumType =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Contributor

@WafaaNasr WafaaNasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks Ryland!!
Appreciate addressing the comments, organising the commits, and the overallPR structure.

@rylnd rylnd requested a review from a team as a code owner June 15, 2023 15:03
@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Security Solution Tests #2 / Detection response view Open in timeline "after each" hook for "opens timeline with correct query count for open alerts by rule table"
  • [job] [logs] Security Solution Tests #2 / Detection response view Open in timeline opens timeline with correct query count for open alerts by rule table
  • [job] [logs] FTR Configs #20 / graph app feature controls security global graph all privileges allows creating a new workspace

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
infra 1460 1461 +1
lists 259 260 +1
securitySolution 4143 4144 +1
total +3

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
securitySolution 112 113 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
infra 2.0MB 2.0MB +176.0B
lists 142.1KB 142.3KB +178.0B
securitySolution 10.8MB 10.8MB +178.0B
total +532.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
securitySolution 51.3KB 51.3KB +28.0B
Unknown metric groups

API count

id before after diff
@kbn/securitysolution-io-ts-types 66 67 +1
securitySolution 156 157 +1
total +2

ESLint disabled line counts

id before after diff
enterpriseSearch 13 15 +2
securitySolution 410 414 +4
total +6

Total ESLint disabled count

id before after diff
enterpriseSearch 14 16 +2
securitySolution 493 497 +4
total +6

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @rylnd

@rylnd rylnd merged commit ae068a6 into main Jun 15, 2023
@rylnd rylnd deleted the risk_score_api branch June 15, 2023 19:16
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jun 15, 2023
@rylnd rylnd mentioned this pull request Jul 24, 2023
4 tasks
rylnd added a commit that referenced this pull request Jul 28, 2023
## Summary

* Introduces a new API, POST `/api/risk_scores/calculate`, that triggers
the code introduced here
* As with the [preview
route](#155966), this endpoint is
behind the `riskScoringRoutesEnabled` feature flag
* We intend to __REMOVE__ this endpoint before 8.10 release; it's mainly
a convenience/checkpoint for testing the existing code. The next PR will
introduce a scheduled Task Manager task that invokes this code
periodically.
* Updates to the /preview route:
* `data_view_id` is now a required parameter on both endpoints. If a
dataview is not found by that ID, the id is used as the general index
pattern to the query.
* Response has been updated to be more similar to the [ECS risk
fields](elastic/ecs#2236) powering this data.
* Mappings created by the [Data
Client](#158422) have been updated
to be aligned to the ECS risk fields (linked above)
* Adds/updates the [OpenAPI
spec](https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/server/lib/risk_engine/schema/risk_score_apis.yml)
for these endpoints; useful starting point if you're trying to get
oriented here.


## Things to review
* [PR Demo
environment](https://rylnd-pr-161503-risk-score-task-api.kbndev.co/app/home)
* Preview API and related UI still works as expected
* Calculation/Persistence API correctly bootstraps/persists data
    * correct mappings/ILM are created
    * things work in non-default spaces
   



### Checklist


- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space.
| Low | High | Integration tests will verify that all features are still
supported in non-default Kibana Space and when user switches between
spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions
when multiple Kibana nodes are polling for the same tasks. | High | Low
| Tasks are idempotent, so executing them multiple times will not result
in logical error, but will degrade performance. To test for this case we
add plenty of unit tests around this logic and document manual testing
procedure. |
| Code should gracefully handle cases when feature X or plugin Y are
disabled. | Medium | High | Unit tests will verify that any feature flag
or plugin combination still results in our service operational. |
| [See more potential risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
achyutjhunjhunwala pushed a commit to achyutjhunjhunwala/kibana that referenced this pull request Jul 28, 2023
## Summary

* Introduces a new API, POST `/api/risk_scores/calculate`, that triggers
the code introduced here
* As with the [preview
route](elastic#155966), this endpoint is
behind the `riskScoringRoutesEnabled` feature flag
* We intend to __REMOVE__ this endpoint before 8.10 release; it's mainly
a convenience/checkpoint for testing the existing code. The next PR will
introduce a scheduled Task Manager task that invokes this code
periodically.
* Updates to the /preview route:
* `data_view_id` is now a required parameter on both endpoints. If a
dataview is not found by that ID, the id is used as the general index
pattern to the query.
* Response has been updated to be more similar to the [ECS risk
fields](elastic/ecs#2236) powering this data.
* Mappings created by the [Data
Client](elastic#158422) have been updated
to be aligned to the ECS risk fields (linked above)
* Adds/updates the [OpenAPI
spec](https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/server/lib/risk_engine/schema/risk_score_apis.yml)
for these endpoints; useful starting point if you're trying to get
oriented here.


## Things to review
* [PR Demo
environment](https://rylnd-pr-161503-risk-score-task-api.kbndev.co/app/home)
* Preview API and related UI still works as expected
* Calculation/Persistence API correctly bootstraps/persists data
    * correct mappings/ILM are created
    * things work in non-default spaces
   



### Checklist


- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space.
| Low | High | Integration tests will verify that all features are still
supported in non-default Kibana Space and when user switches between
spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions
when multiple Kibana nodes are polling for the same tasks. | High | Low
| Tasks are idempotent, so executing them multiple times will not result
in logical error, but will degrade performance. To test for this case we
add plenty of unit tests around this logic and document manual testing
procedure. |
| Code should gracefully handle cases when feature X or plugin Y are
disabled. | Medium | High | Unit tests will verify that any feature flag
or plugin combination still results in our service operational. |
| [See more potential risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
ThomThomson pushed a commit to ThomThomson/kibana that referenced this pull request Aug 1, 2023
## Summary

* Introduces a new API, POST `/api/risk_scores/calculate`, that triggers
the code introduced here
* As with the [preview
route](elastic#155966), this endpoint is
behind the `riskScoringRoutesEnabled` feature flag
* We intend to __REMOVE__ this endpoint before 8.10 release; it's mainly
a convenience/checkpoint for testing the existing code. The next PR will
introduce a scheduled Task Manager task that invokes this code
periodically.
* Updates to the /preview route:
* `data_view_id` is now a required parameter on both endpoints. If a
dataview is not found by that ID, the id is used as the general index
pattern to the query.
* Response has been updated to be more similar to the [ECS risk
fields](elastic/ecs#2236) powering this data.
* Mappings created by the [Data
Client](elastic#158422) have been updated
to be aligned to the ECS risk fields (linked above)
* Adds/updates the [OpenAPI
spec](https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/server/lib/risk_engine/schema/risk_score_apis.yml)
for these endpoints; useful starting point if you're trying to get
oriented here.


## Things to review
* [PR Demo
environment](https://rylnd-pr-161503-risk-score-task-api.kbndev.co/app/home)
* Preview API and related UI still works as expected
* Calculation/Persistence API correctly bootstraps/persists data
    * correct mappings/ILM are created
    * things work in non-default spaces
   



### Checklist


- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space.
| Low | High | Integration tests will verify that all features are still
supported in non-default Kibana Space and when user switches between
spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions
when multiple Kibana nodes are polling for the same tasks. | High | Low
| Tasks are idempotent, so executing them multiple times will not result
in logical error, but will degrade performance. To test for this case we
add plenty of unit tests around this logic and document manual testing
procedure. |
| Code should gracefully handle cases when feature X or plugin Y are
disabled. | Medium | High | Unit tests will verify that any feature flag
or plugin combination still results in our service operational. |
| [See more potential risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
8.9 candidate backport:skip This commit does not require backporting Feature:Entity Analytics Security Solution Entity Analytics features release_note:skip Skip the PR/issue when compiling release notes Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Theme: entity_analytics v8.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants