Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RiskScoreEndgineData client + install ds and other resources for risk scoring #158422

Merged
merged 28 commits into from
Jun 21, 2023

Conversation

nkhristinin
Copy link
Contributor

@nkhristinin nkhristinin commented May 24, 2023

Risc score resources bootstrap

Screenshot 2023-06-12 at 14 46 56

ES PR: elastic/elasticsearch#96348

This PR introduces RiskEngineDataClient, which purpose to install resources per namespace, including ilm policy, component template, index template and datastream for risk score.

Some view demo/overview of the steps we do to initialise RiskEngineDataClient and resources

Screen.Recording.2023-05-26.at.15.31.36.mp4

For default space, it installs indexes when the security_soluition plugin is set up.
For other spaces, it initialises the resource when you call getWriter.

This data client was passed to RequestContextFactory

So in any request, it can be called like

 const riskEngineDataClient = (await context.securitySolution).getRiskEngineDataClient();
 const spaceId = (await context.securitySolution).getSpaceId();
 const riskEngineDataClientWriter = riskEngineDataClient.getWriter({ namespace: spaceId });

What is generated

  1. ILM policy

GET _ilm/policy/.risk-score-ilm-policy

{
  ".risk-score-ilm-policy": {
    "version": 1,
    "modified_date": "2023-05-25T10:52:36.592Z",
    "policy": {
      "phases": {
        "hot": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "max_primary_shard_size": "50gb"
            }
          }
        }
      },
      "_meta": {
        "managed": true
      }
    },
    "in_use_by": {
      "indices": [
        ".ds-risk-score.risk-score-default-2023.05.25-000001"
      ],
      "data_streams": [
        "risk-score.risk-score-default"
      ],
      "composable_templates": [
        ".risk-score.risk-score-default-index-template"
      ]
    }
  }
}
  1. Component template for risk score mappings

GET _component_template/risk-score-mappings

{
  "component_templates": [
    {
      "name": "risk-score-mappings",
      "component_template": {
        "template": {
          "settings": {},
          "mappings": {
            "dynamic": "strict",
            "properties": {
              "identifierField": {
                "type": "keyword"
              },
              "otherScore": {
                "type": "float"
              },
              "alertsScore": {
                "type": "float"
              },
              "@timestamp": {
                "type": "date"
              },
              "level": {
                "type": "keyword"
              },
              "riskiestInputs": {
                "type": "nested",
                "properties": {
                  "index": {
                    "type": "keyword"
                  },
                  "id": {
                    "type": "keyword"
                  },
                  "riskScore": {
                    "type": "float"
                  }
                }
              },
              "identifierValue": {
                "type": "keyword"
              },
              "totalScore": {
                "type": "float"
              },
              "totalScoreNormalized": {
                "type": "float"
              }
            }
          }
        },
        "_meta": {
          "managed": true
        }
      }
    }
  ]
}
  1. Index template

GET _index_template/.risk-score.risk-score-default-index-template

{
  "index_templates": [
    {
      "name": ".risk-score.risk-score-default-index-template",
      "index_template": {
        "index_patterns": [
          "risk-score.risk-score-default"
        ],
        "template": {
          "settings": {
            "index": {
              "lifecycle": {
                "name": ".risk-score-ilm-policy"
              },
              "mapping": {
                "total_fields": {
                  "limit": "1000"
                }
              },
              "hidden": "true",
              "auto_expand_replicas": "0-1"
            }
          },
          "mappings": {
            "_meta": {
              "managed": true,
              "namespace": "default",
              "kibana": {
                "version": "8.9.0"
              }
            },
            "dynamic": false
          }
        },
        "composed_of": [
          "risk-score-mappings"
        ],
        "_meta": {
          "managed": true,
          "namespace": "default",
          "kibana": {
            "version": "8.9.0"
          }
        },
        "data_stream": {
          "hidden": true,
          "allow_custom_routing": false
        }
      }
    }
  ]
}
  1. Data stream

GET risk-score.risk-score-default - where is default is space name

return

{
  ".ds-risk-score.risk-score-default-2023.05.25-000001": {
    "aliases": {},
    "mappings": {
      "dynamic": "false",
      "_meta": {
        "namespace": "default",
        "kibana": {
          "version": "8.9.0"
        },
        "managed": true
      },
      "_data_stream_timestamp": {
        "enabled": true
      },
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "alertsScore": {
          "type": "float"
        },
        "identifierField": {
          "type": "keyword"
        },
        "identifierValue": {
          "type": "keyword"
        },
        "level": {
          "type": "keyword"
        },
        "otherScore": {
          "type": "float"
        },
        "riskiestInputs": {
          "type": "nested",
          "properties": {
            "id": {
              "type": "keyword"
            },
            "index": {
              "type": "keyword"
            },
            "riskScore": {
              "type": "float"
            }
          }
        },
        "totalScore": {
          "type": "float"
        },
        "totalScoreNormalized": {
          "type": "float"
        }
      }
    },
    "settings": {
      "index": {
        "lifecycle": {
          "name": ".risk-engine-ilm-policy"
        },
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_hot"
            }
          }
        },
        "mapping": {
          "total_fields": {
            "limit": "1000"
          }
        },
        "hidden": "true",
        "number_of_shards": "1",
        "auto_expand_replicas": "0-1",
        "provided_name": ".ds-risk-score.risk-score-default-2023.05.25-000001",
        "creation_date": "1685009904171",
        "number_of_replicas": "0",
        "uuid": "_5yc7wG4Sxy88zIVqxC7yg",
        "version": {
          "created": "8090099"
        }
      }
    },
    "data_stream": "risk-score.risk-score-default"
  }
}

@nkhristinin nkhristinin changed the title Risc score resources bootstrap RiskScoreEndgineData client + install ds and other resources for risk scoring May 26, 2023

const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export const retryTransientEsErrors = async <T>(
Copy link
Contributor

@vitaliidm vitaliidm May 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you considered using p-retry utility, already available in Kibana repo?
It has rich configuration options within it, for example: https://github.com/elastic/kibana/blob/8.8/src/plugins/content_management/server/event_stream/es/init/es_event_stream_initializer.ts#L48-L60

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This retry_es.ts is copy past from alerting plugin, with the intention just use the creation of DS in the future from this plugin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename this file to be more accurate/similar to its origin: retry_transient_es_errors.ts

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin nkhristinin marked this pull request as ready for review June 6, 2023 14:30
@nkhristinin nkhristinin requested a review from a team as a code owner June 6, 2023 14:30
@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin nkhristinin added release_note:skip Skip the PR/issue when compiling release notes v8.9.0 Team:Detection Engine Security Solution Detection Engine Area labels Jun 9, 2023
@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

namespace,
};

await createOrUpdateIlmPolicy({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can perform some of these tasks in parallel via Promise.all, we should.

* 2.0.
*/

// This file is a copy of x-pack/plugins/alerting/server/alerts_service/lib/create_concrete_write_index.ts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would prefer this file be named less ambiguously: utils/create_datastream.ts

await updateIndexMappings({ logger, esClient, totalFieldsLimit, concreteIndices });
}

// check if a concrete write ds already exists
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// check if a concrete write ds already exists
// check if a datastream write index already exists


const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export const retryTransientEsErrors = async <T>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename this file to be more accurate/similar to its origin: retry_transient_es_errors.ts

const es = getService('es');

describe('install risk engine resources', () => {
it('should install resources on startup', async () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have a test where these resources already exist on startup!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if we already have such tests? I am not sure that we can do that right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of a test that loads a legacy signals index.

We don't have the ability to load index templates or ILM via the es_archiver service, but you could certainly do that manually with the es service. There are lots of integration tests that use esClient.indices.putIndexTemplate etc.

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

Copy link
Contributor

@rylnd rylnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I had a few nits about explicit typings and more tests, but nothing that should block this from getting merged. Nice work!


private async initialiseWriter(namespace: string) {
const writer: Writer = {
bulk: async () => {},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary, just a suggestion. Being explicit is usually the safest approach, but you're correct that this will be implemented momentarily.

const es = getService('es');

describe('install risk engine resources', () => {
it('should install resources on startup', async () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of a test that loads a legacy signals index.

We don't have the ability to load index templates or ILM via the es_archiver service, but you could certainly do that manually with the es service. There are lots of integration tests that use esClient.indices.putIndexTemplate etc.

}
}

const isDataStreamsExist = dataStreams.length > 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super duper nit: omitting the is here makes this more legible; we already have a verb in Exist so the is is redundant.

Suggested change
const isDataStreamsExist = dataStreams.length > 0;
const dataStreamsExist = dataStreams.length > 0;


const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export const retryTransientEsErrors = async <T>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to copy this code, we might as well also copy the tests.

return writer;
}

public async initializeResources({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to explicitly type the return value on this public method.

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Security Solution Tests #6 / Exceptions viewer read only "before each" hook for "Cannot take actions on exception"

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
securitySolution 113 114 +1

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
securitySolution 29 30 +1

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
securitySolution 52.0KB 52.0KB +26.0B
Unknown metric groups

API count

id before after diff
securitySolution 157 158 +1

ESLint disabled line counts

id before after diff
enterpriseSearch 13 15 +2
securitySolution 411 420 +9
total +11

Total ESLint disabled count

id before after diff
enterpriseSearch 14 16 +2
securitySolution 492 501 +9
total +11

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@nkhristinin nkhristinin merged commit dba2e9b into elastic:main Jun 21, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jun 21, 2023
@rylnd rylnd mentioned this pull request Jul 24, 2023
4 tasks
rylnd added a commit that referenced this pull request Jul 28, 2023
## Summary

* Introduces a new API, POST `/api/risk_scores/calculate`, that triggers
the code introduced here
* As with the [preview
route](#155966), this endpoint is
behind the `riskScoringRoutesEnabled` feature flag
* We intend to __REMOVE__ this endpoint before 8.10 release; it's mainly
a convenience/checkpoint for testing the existing code. The next PR will
introduce a scheduled Task Manager task that invokes this code
periodically.
* Updates to the /preview route:
* `data_view_id` is now a required parameter on both endpoints. If a
dataview is not found by that ID, the id is used as the general index
pattern to the query.
* Response has been updated to be more similar to the [ECS risk
fields](elastic/ecs#2236) powering this data.
* Mappings created by the [Data
Client](#158422) have been updated
to be aligned to the ECS risk fields (linked above)
* Adds/updates the [OpenAPI
spec](https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/server/lib/risk_engine/schema/risk_score_apis.yml)
for these endpoints; useful starting point if you're trying to get
oriented here.


## Things to review
* [PR Demo
environment](https://rylnd-pr-161503-risk-score-task-api.kbndev.co/app/home)
* Preview API and related UI still works as expected
* Calculation/Persistence API correctly bootstraps/persists data
    * correct mappings/ILM are created
    * things work in non-default spaces
   



### Checklist


- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space.
| Low | High | Integration tests will verify that all features are still
supported in non-default Kibana Space and when user switches between
spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions
when multiple Kibana nodes are polling for the same tasks. | High | Low
| Tasks are idempotent, so executing them multiple times will not result
in logical error, but will degrade performance. To test for this case we
add plenty of unit tests around this logic and document manual testing
procedure. |
| Code should gracefully handle cases when feature X or plugin Y are
disabled. | Medium | High | Unit tests will verify that any feature flag
or plugin combination still results in our service operational. |
| [See more potential risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
achyutjhunjhunwala pushed a commit to achyutjhunjhunwala/kibana that referenced this pull request Jul 28, 2023
## Summary

* Introduces a new API, POST `/api/risk_scores/calculate`, that triggers
the code introduced here
* As with the [preview
route](elastic#155966), this endpoint is
behind the `riskScoringRoutesEnabled` feature flag
* We intend to __REMOVE__ this endpoint before 8.10 release; it's mainly
a convenience/checkpoint for testing the existing code. The next PR will
introduce a scheduled Task Manager task that invokes this code
periodically.
* Updates to the /preview route:
* `data_view_id` is now a required parameter on both endpoints. If a
dataview is not found by that ID, the id is used as the general index
pattern to the query.
* Response has been updated to be more similar to the [ECS risk
fields](elastic/ecs#2236) powering this data.
* Mappings created by the [Data
Client](elastic#158422) have been updated
to be aligned to the ECS risk fields (linked above)
* Adds/updates the [OpenAPI
spec](https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/server/lib/risk_engine/schema/risk_score_apis.yml)
for these endpoints; useful starting point if you're trying to get
oriented here.


## Things to review
* [PR Demo
environment](https://rylnd-pr-161503-risk-score-task-api.kbndev.co/app/home)
* Preview API and related UI still works as expected
* Calculation/Persistence API correctly bootstraps/persists data
    * correct mappings/ILM are created
    * things work in non-default spaces
   



### Checklist


- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space.
| Low | High | Integration tests will verify that all features are still
supported in non-default Kibana Space and when user switches between
spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions
when multiple Kibana nodes are polling for the same tasks. | High | Low
| Tasks are idempotent, so executing them multiple times will not result
in logical error, but will degrade performance. To test for this case we
add plenty of unit tests around this logic and document manual testing
procedure. |
| Code should gracefully handle cases when feature X or plugin Y are
disabled. | Medium | High | Unit tests will verify that any feature flag
or plugin combination still results in our service operational. |
| [See more potential risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
ThomThomson pushed a commit to ThomThomson/kibana that referenced this pull request Aug 1, 2023
## Summary

* Introduces a new API, POST `/api/risk_scores/calculate`, that triggers
the code introduced here
* As with the [preview
route](elastic#155966), this endpoint is
behind the `riskScoringRoutesEnabled` feature flag
* We intend to __REMOVE__ this endpoint before 8.10 release; it's mainly
a convenience/checkpoint for testing the existing code. The next PR will
introduce a scheduled Task Manager task that invokes this code
periodically.
* Updates to the /preview route:
* `data_view_id` is now a required parameter on both endpoints. If a
dataview is not found by that ID, the id is used as the general index
pattern to the query.
* Response has been updated to be more similar to the [ECS risk
fields](elastic/ecs#2236) powering this data.
* Mappings created by the [Data
Client](elastic#158422) have been updated
to be aligned to the ECS risk fields (linked above)
* Adds/updates the [OpenAPI
spec](https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/server/lib/risk_engine/schema/risk_score_apis.yml)
for these endpoints; useful starting point if you're trying to get
oriented here.


## Things to review
* [PR Demo
environment](https://rylnd-pr-161503-risk-score-task-api.kbndev.co/app/home)
* Preview API and related UI still works as expected
* Calculation/Persistence API correctly bootstraps/persists data
    * correct mappings/ILM are created
    * things work in non-default spaces
   



### Checklist


- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space.
| Low | High | Integration tests will verify that all features are still
supported in non-default Kibana Space and when user switches between
spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions
when multiple Kibana nodes are polling for the same tasks. | High | Low
| Tasks are idempotent, so executing them multiple times will not result
in logical error, but will degrade performance. To test for this case we
add plenty of unit tests around this logic and document manual testing
procedure. |
| Code should gracefully handle cases when feature X or plugin Y are
disabled. | Medium | High | Unit tests will verify that any feature flag
or plugin combination still results in our service operational. |
| [See more potential risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:Detection Engine Security Solution Detection Engine Area v8.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants