
[Security Solution] max_signals validation follow up fixes #182643

Closed

Conversation

jpdjere
Contributor

@jpdjere jpdjere commented May 6, 2024

Follow up to: #179680

Summary

Changes

  • Sets the API contract's maximum value of max_signals to 1000 via Zod validation. Even if a user sets max_signals to a value above 1000, Kibana only generates a maximum of 1000 alerts, so this change aligns the contract with the actual behaviour. (A minimal illustrative sketch follows below this list.)

image

  • Cleans up Zod validation error messages shown in toasts by removing the [request body] prefix.

Before:
image

After:
image
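A minimal sketch of the Zod constraint proposed here, standalone for illustration (the actual change is the one-line MaxSignals schema update shown in the review diff further down):

import { z } from 'zod';

// Proposed constraint: max_signals must be an integer between 1 and 1000.
const MaxSignals = z.number().int().min(1).max(1000);

MaxSignals.safeParse(100);   // { success: true, data: 100 }
MaxSignals.safeParse(10000); // { success: false, error: ZodError ("Number must be less than or equal to 1000") }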

Docs note

FYI @joepeeples

If we end up merging this PR with the API validation for 1000 being the max value for max_signals, we'll have to revisit the docs that are being added in this PR, which explain that values higher than 1000 will be ignored by Kibana, which will only generate a maximum of 1000 alerts.


@jpdjere jpdjere self-assigned this May 6, 2024
@jpdjere jpdjere added release_note:skip Skip the PR/issue when compiling release notes backport:skip This commit does not require backporting Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team Feature:Rule Creation Security Solution Detection Rule Creation workflow Feature:Rule Details Security Solution Detection Rule Details page Feature:Rule Edit Security Solution Detection Rule Editing workflow labels May 6, 2024
@jpdjere jpdjere marked this pull request as ready for review May 6, 2024 12:29
@jpdjere jpdjere requested review from a team as code owners May 6, 2024 12:29
@jpdjere jpdjere requested review from maximpn and dhurley14 May 6, 2024 12:29
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Contributor

Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)

@joepeeples
Contributor

If we end up merging this PR with the API validation for 1000 being the max value for max_signals, we'll have to revisit the docs that are being added in this PR, which explain that values higher than 1000 will be ignored by Kibana, which will only generate a maximum of 1000 alerts.

@jpdjere Does this mean that 1000 would become a hard-coded limit, regardless of the value of the global config setting xpack.alerting.rules.run.alerts.max? The docs PR just uses 1000 as an example, as the actual limit could be whatever the global config setting is. But if your PR adds a new hard-coded limit we'd definitely need to explain that differently in docs.

Also would it be possible to tweak the error messages in the images above? We don't really use the name max_signals or signals in the UI, so users might not immediately connect it with the UI setting Max alerts per run. They might also not immediately get what 1000 (400) is supposed to mean. So maybe something like:

Max alerts per run: Value must be less than or equal to 1000. (Error code 400)

@banderror banderror requested review from dplumlee and removed request for maximpn May 6, 2024 14:46
@dplumlee
Contributor

dplumlee commented May 6, 2024

Re: the 1000 limit via API validation - is it definitely the case that users can't set xpack.alerting.rules.run.alerts.max greater than 1000?

Comment on lines +247 to +250
function postprocessErrorString(str: string): string {
// Remove the `[request body]` prefix added by Zod for request validation errors
return str.replace(/\[request body\]:/g, '');
}
Contributor

Why do we need to remove the [request body] part of the string? How can we distinguish errors in the body from errors in the query part of the request?

@@ -200,7 +200,7 @@ export type AlertsIndexNamespace = z.infer<typeof AlertsIndexNamespace>;
 export const AlertsIndexNamespace = z.string();

 export type MaxSignals = z.infer<typeof MaxSignals>;
-export const MaxSignals = z.number().int().min(1);
+export const MaxSignals = z.number().int().min(1).max(1000);
Member

This will need to be coordinated with a detection-rules release as some default to 10k as well, right?

Or do the pre-built rules flow through a different part of the API?

I'm sure you're tracking this, so don't mind me... just saw these errors with my rules this morning, so that's how I found myself here 😅

Contributor

This is a great question, I didn't know we had detection rules with that high of a max_signals count. That could potentially cause problems when trying to duplicate rules.

Contributor

Thank you @spong for this catch 🙏

Additionally, the way it's implemented is a breaking change in the API. If the value is higher than 1000, attempts to create or update a rule will fail with 400, which wasn't the case before. There shouldn't be a max limit at the schema level, we should calculate the final value of max_signals to use in rule executors.

Contributor

This is a great question, I didn't know we had detection rules with that high of a max_signals count. That could potentially cause problems when trying to duplicate rules.

Yep, I think rule installation and upgrade will be broken as well for those prebuilt rules that have this value higher than 1000.

Contributor

It looks like there are 18 rules in the detection-rules repo that have max_signals set to 10k.

Do I get it right that even if we don't merge this PR, in the default Kibana configuration where xpack.alerting.rules.run.alerts.max == 1000 all these rules, including the Endpoint Security one, will be executing with warnings?

Contributor

Yes, that would be the expected behaviour for rule executions with max_signals > 1000 now, so those 18 rules would run with warnings out of the box.

Contributor

Thank you! Not a strong opinion, but I think this UX might not be optimal. If the expectation from the TRaDE team is that those 18 rules should promote as many external alerts as possible (hence the 10k setting), I would suggest one of the following:

  1. Either override the system's max limit with the rule's max_signals, for these rules. This could be done via an additional rule parameter max_signals_force: boolean and an additional checkbox in the UI. (not sure that it's a good idea).
  2. Or not generate a Warning status if max_signals > xpack.alerting.rules.run.alerts.max AND it's a prebuilt rule. (looks like a good tradeoff to me).
  3. Or maybe set the max_signals value for those 18 rules to 1000, if the TRaDE team would be ok with that. (might not be an option if the goal is to promote as many external alerts as possible and let users increase the xpack.alerting.rules.run.alerts.max setting for that).

Maybe there's something else we could do. But I think we should try to make sure rules work out of the box with default settings. The Endpoint Security rule is the most basic one: it is installed and enabled automatically with Elastic Defend, and it would be weird to see it not succeed unless the user changes a system setting in the Kibana config for all Kibana instances in their cluster.

@jpdjere @dplumlee Could we please sync on this in the Simplified Protections channel with the TRaDE team, Kseniia, and Alex, and see what they have to say?

Contributor

I think options 2 or 3 would be my preferences off the top of my head. Something to consider is that even if we don't display a warning, the rule would still be hard capped by the xpack.alerting.rules.run.alerts.max config value out of the box and, should the rule ever reach that value (however unlikely), would throw an error from the alerting side of things letting us know we hit it. With that in mind, maybe option 3 would be best, but we'd need to check with TRaDE team for the intent behind that 10k setting.

Contributor

@banderror banderror May 8, 2024

Doesn't look like option 3 is viable based on Justin's response:

As an example if an environment had say 100k hosts and an upstream integration, such as elastic endpoint happens to generate a batch of alerts over that time, we want to allow them all to match
As you can see, the larger the deployment, the greater the likelihood of hitting that capacity.

So I guess my vote goes to the 2nd option + documenting this in multiple places + maybe even showing a warning on the Rule Details page (not a warning rule status, but a static callout or something like that with a link to the public docs). I added a topic to discuss this to the Simplified Protections agenda.


Contributor

What's the argument for changing this to a max of 1000? In addition to Garrett's comment, I'm not entirely sure we can't still set xpack.alerting.rules.run.alerts.max to something greater than 1000; I think we just "don't support it", as in we can't guarantee things don't break on your cluster. Is there something else that would limit it outside of that config value? I get the argument that it's weird to give users the ability to enter insanely big numbers if we just use the minimum of the two values, but we could do the same kind of wacky stuff with other form components that wouldn't actually end up having an impact on the rule. We also have the same workflow if the config value was set to 10 and the user entered a value like 50: we would just use the value they didn't enter.

The initial reason we uncapped the limit was to make it easier to still create/edit rules with the default or existing values even after the alerting config value had changed. Now users can change unrelated fields in existing rules without having to go in and also modify max_signals in order to get the rule update to pass API validation. I also understand that setting it that high makes it a very rare edge case so I don't care too much about it (outside of it potentially causing issues with prebuilt rules), but just wanted to explain the genesis of why it was originally set that way!


@jpdjere
Contributor Author

jpdjere commented May 7, 2024

@dplumlee @banderror @joepeeples

I had misunderstood Kibana's behaviour: I thought that 1000 was a hard limit in the rule executors, which overrode both the rule's value and the Kibana config. I understand now that it's just the default Kibana config and that the config can be set to a value higher than that.

Given the feedback here, I'll discard setting the validation on the API schema layer.

Do I get it right that even if we don't merge this PR, in the default Kibana configuration where xpack.alerting.rules.run.alerts.max == 1000 all these rules, including the Endpoint Security one, will be executing with warnings?

I'll give this a try with an actual prebuilt rule.

There shouldn't be a max limit at the schema level, we should calculate the final value of max_signals to use in rule executors.

@banderror
So you think we should limit the max value of max_signals in the rule executors? That is, if we have the scenario where the value for max_signals is:

  • rule value: 50
  • Kibana config: 5000

then the calculated value in the rule executor should be 1000? Knowing now that there are prebuilt rules with a max_signals of 10,000, this doesn't look very clean.

@dplumlee
Contributor

dplumlee commented May 7, 2024

So you think we should limit the max value of max_signals in the rule executors? That is, if we have the scenario where the value for max_signals is:

rule value: 50
Kibana config: 5000
then the calculated value in the rule executor should be 1000? Knowing now that there are prebuilt rules with a max_signals of 10,000, this doesn't look very clean.

@jpdjere I think he's saying we already calculate the effective "max signals" during rule execution from logic added in the previous PR. Right now we take the minimum of max_signals and xpack.alerting.rules.run.alerts.max and write a warning if it doesn't use the rule's own max_signals value. So for this case, we'd write a maximum of 50 alerts with no warning thrown.
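For illustration, a minimal sketch of that min calculation with hypothetical names (not the actual Kibana code):

// Hypothetical sketch of the effective "max signals" calculation described above.
function getEffectiveMaxSignals(ruleMaxSignals: number, alertingConfigMax: number) {
  return {
    effective: Math.min(ruleMaxSignals, alertingConfigMax),
    warn: ruleMaxSignals > alertingConfigMax,
  };
}

getEffectiveMaxSignals(50, 5000);    // { effective: 50, warn: false }  -> rule value wins, no warning
getEffectiveMaxSignals(10000, 1000); // { effective: 1000, warn: true } -> capped at the config value, warning shown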

@jpdjere
Contributor Author

jpdjere commented May 7, 2024

Thanks @dplumlee!

Update on testing a prebuilt rule

Do I get it right that even if we don't merge this PR, in the default Kibana configuration where xpack.alerting.rules.run.alerts.max == 1000 all these rules, including the Endpoint Security one, will be executing with warnings?

First, as soon as I install a rule that has a default max_signals of 10,000 in its security asset (like Ransomware - Detected - Elastic Endgame) into a Kibana instance with the default config (so the limit defaults to 1000), the warning is shown: The rule's max alerts per run setting (10000) is greater than the Kibana alerting limit (1000). The rule will only write a maximum of 1000 alerts per rule run.
image

Second, I indexed 10,000 documents to force the rule to create that many alerts, but I got an unexpected result: it created only 100 alerts (not 1,000 and not 10,000). I did it two times, and each time it created 100 alerts:
image
See the 201 alerts indicator at the top of the table (the extra one is just me testing via Dev Tools)

I'll delve a little deeper to understand what's going on here.

@banderror
Contributor

I'll delve a little deeper to understand what's going on here.

Thank you @jpdjere!

So you think we should limit the max value of max_signals in the rule executors? That is, if we have the scenario where the value for max_signals is:

rule value: 50
Kibana config: 5000
then the calculated value in the rule executor should be 1000? Knowing now that there are prebuilt rules with a max_signals of > 10,000, this doesn't look very clean.

I meant what Davis says in #182643 (comment). For a 10k example it should be:

rule value: 10000
Kibana config: 1000
then calculated value in the rule executor should be 1000

@jpdjere
Contributor Author

jpdjere commented May 8, 2024

Ok, so, an interesting discovery about the number of alerts that are actually created. As I mentioned in this comment, I have been testing the Ransomware - Detected - Elastic Endgame prebuilt rule, which comes with 10,000 as the value for max_signals out of the box.

With the current behaviour of the app, the number of alerts we thought should be created is 1,000, which is the default value of the xpack.alerting.rules.run.alerts.max Kibana config. So I tried indexing 10,000 documents that should cause the rule to generate 10,000 alerts, but I only saw a maximum of 100 alerts generated each time. Both checking the index via Dev Tools and, of course, the UI confirm this.

Digging into the code, I found the following:

1. In createSecurityRuleTypeWrapper we create searchAfterSize, the min between the calculated max_signals value and the constant DEFAULT_SEARCH_AFTER_PAGE_SIZE, which is 100.

const searchAfterSize = Math.min(maxSignals, DEFAULT_SEARCH_AFTER_PAGE_SIZE);

That is passed over to the rule executor:

const runResult = await type.executor({
  ...options,
  services,
  state: runState,
  runOpts: {
    completeRule,
    inputIndex,
    exceptionFilter,
    unprocessedExceptions,
    inputIndexFields,
    runtimeMappings: {
      ...runtimeMappings,
      ...timestampRuntimeMappings,
    },
    searchAfterSize,
    tuple,
    bulkCreate,
    wrapHits,
    wrapSequences,
    listClient,
    ruleDataClient,
    mergeStrategy,
    primaryTimestamp,
    secondaryTimestamp,
    ruleExecutionLogger,
    aggregatableTimestampField,
    alertTimestampOverride,
    alertWithSuppression,
    refreshOnIndexingAlerts: refresh,
    publicBaseUrl,
    experimentalFeatures,
  },
});

2. In the query executor, the searchAfterSize is passed as the pageSize option to the searchAfterAndBulkCreate method, which is called when there is no alert suppression configured:

const result =
  ruleParams.alertSuppression?.groupBy != null && hasPlatinumLicense
    ? await groupAndBulkCreate({
        runOpts,
        services,
        spaceId,
        filter: esFilter,
        buildReasonMessage: buildReasonMessageForQueryAlert,
        bucketHistory,
        groupByFields: ruleParams.alertSuppression.groupBy,
        eventsTelemetry,
        experimentalFeatures,
      })
    : {
        ...(await searchAfterAndBulkCreate({
          tuple: runOpts.tuple,
          exceptionsList: runOpts.unprocessedExceptions,
          services,
          listClient: runOpts.listClient,
          ruleExecutionLogger: runOpts.ruleExecutionLogger,
          eventsTelemetry,
          inputIndexPattern: runOpts.inputIndex,
          pageSize: runOpts.searchAfterSize,
          filter: esFilter,
          buildReasonMessage: buildReasonMessageForQueryAlert,
          bulkCreate: runOpts.bulkCreate,
          wrapHits: runOpts.wrapHits,
          runtimeMappings: runOpts.runtimeMappings,
          primaryTimestamp: runOpts.primaryTimestamp,
          secondaryTimestamp: runOpts.secondaryTimestamp,
        })),

3. The method searchAfterAndBulkCreate calls searchAfterAndBulkCreateFactory, which in turn calls singleSearchAfter passing the pageSize (in our example, still 100):

const { searchResult, searchDuration, searchErrors } = await singleSearchAfter({
  searchAfterSortIds: sortIds,
  index: inputIndexPattern,
  runtimeMappings,
  from: tuple.from.toISOString(),
  to: tuple.to.toISOString(),
  services,
  ruleExecutionLogger,
  filter,
  pageSize: Math.ceil(Math.min(tuple.maxSignals, pageSize)),
  primaryTimestamp,
  secondaryTimestamp,
  trackTotalHits,
  sortOrder,
  additionalFilters,
});
mergedSearchResults = mergeSearchResults([mergedSearchResults, searchResult]);

searchResult is merged into mergedSearchResults, and then used to calculate includedEvents:

const [includedEvents, _] = await filterEventsAgainstList({
  listClient,
  exceptionsList,
  ruleExecutionLogger,
  events: mergedSearchResults.hits.hits,
});
// only bulk create if there are filteredEvents leftover
// if there isn't anything after going through the value list filter
// skip the call to bulk create and proceed to the next search_after,
// if there is a sort id to continue the search_after with.
if (includedEvents.length !== 0) {
  const enrichedEvents = await enrichment(includedEvents);
  const bulkCreateResult = await bulkCreateExecutor({
    enrichedEvents,
    toReturn,
  });

The problem that I see here is that what is passed to filterEventsAgainstList is mergedSearchResults.hits.hits, which has a maximum size of 100, the pageSize. The rest of the events seem to be "discarded":

image

In the screenshot above, you can see that mergedSearchResults.hits.hits has a size of 100, while the total number of events indexed is 148 (the number I was using while debugging).

4. Back in searchAfterAndBulkCreate, the bulkCreateExecutor is created and those 100 docs are passed to the bulkCreate method. The second argument of that method is maxAlerts, which is calculated from maxSignals. But by this point we have fewer documents than the limit, so it doesn't really matter: 100 alerts are created.

const bulkCreateExecutor: SearchAfterAndBulkCreateFactoryParams['bulkCreateExecutor'] = async ({
  enrichedEvents,
  toReturn,
}) => {
  const wrappedDocs = wrapHits(enrichedEvents, buildReasonMessage);
  const bulkCreateResult = await bulkCreate(
    wrappedDocs,
    tuple.maxSignals - toReturn.createdSignalsCount,
    createEnrichEventsFunction({
      services,
      logger: ruleExecutionLogger,
    })
  );
  addToSearchAfterReturn({ current: toReturn, next: bulkCreateResult });
  return bulkCreateResult;
};


I'm trying to understand if this is by design or I bumped into a bug here. Maybe @marshallmain or @vitaliidm or someone else from Detection Engine team can guide us here.

It looks from the code above like any value of maxSignals over 100 doesn't really matter because it gets "overridden" by the page size. It doesn't feel like it should be this way for us.

cc @dplumlee @banderror

@marshallmain
Contributor

marshallmain commented May 8, 2024

@jpdjere We have an API integration test that tries to ensure that max_signals values greater than 100 are respected; can you verify whether that test is passing on this branch?

It's expected that a maximum of 100 alerts would be created on the first iteration of the loop in searchAfterAndBulkCreateFactory, but if the number of alerts created is less than max_signals it should continue searching for more pages. There's a known bug where, if many documents have identical timestamps, the paging does not work correctly and fewer alerts are created than you would expect. How are you creating the source documents for testing?
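A rough sketch of the paging behaviour described here, for illustration (searchPage and createAlerts are hypothetical stand-ins, not the real searchAfterAndBulkCreateFactory internals):

// Illustrative-only shape of the search_after paging loop.
type Page = { hits: unknown[]; lastSortIds?: unknown[] };

async function pageAndCreateAlerts(
  maxSignals: number,
  pageSize: number,
  searchPage: (opts: { size: number; searchAfter?: unknown[] }) => Promise<Page>,
  createAlerts: (hits: unknown[], remaining: number) => Promise<number>
): Promise<number> {
  let createdCount = 0;
  let sortIds: unknown[] | undefined; // search_after cursor from the previous page

  while (createdCount < maxSignals) {
    const page = await searchPage({ size: Math.min(pageSize, maxSignals - createdCount), searchAfter: sortIds });
    if (page.hits.length === 0) break; // no more matching source documents
    createdCount += await createAlerts(page.hits, maxSignals - createdCount);
    if (!page.lastSortIds) break; // nothing to continue the search_after with
    sortIds = page.lastSortIds; // this is where identical timestamps can break paging
  }
  return createdCount;
}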

@jpdjere
Contributor Author

jpdjere commented May 10, 2024

@marshallmain Thanks for your reply. Yes, indeed, that bug was the issue. I was running a script to index my docs and all of them had identical timestamps; randomizing their seconds and milliseconds did the trick.
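For anyone reproducing this, a hypothetical sketch of generating test documents with jittered timestamps (not the actual script used above):

// Generate 10,000 source documents whose @timestamp values differ by seconds/milliseconds,
// so that search_after paging is not confused by identical timestamps.
// The document shape is illustrative only and depends on the rule's query.
const now = Date.now();

const docs = Array.from({ length: 10_000 }, (_, i) => ({
  '@timestamp': new Date(now - i * 1000 - Math.floor(Math.random() * 1000)).toISOString(),
  host: { name: `test-host-${i}` },
}));

// docs can then be bulk-indexed into the rule's source index (e.g. via Dev Tools or the ES client).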

So I guess that leaves us back with the question of how we want to handle this issue. @banderror's three alternatives:

Either override the system's max limit with the rule's max_signals, for these rules. This could be done via an additional rule parameter max_signals_force: boolean and an additional checkbox in the UI. (not sure that it's a good idea).
Or not generate a Warning status if max_signals > xpack.alerting.rules.run.alerts.max AND it's a prebuilt rule. (looks like a good tradeoff to me).
Or maybe set the max_signals value for those 18 rules to 1000, if the TRaDE team would be ok with that. (might not be an option if the goal is to promote as many external alerts as possible and let users increase the xpack.alerting.rules.run.alerts.max setting for that).

are probably a good starting point to resume the conversation with the TRaDE team and Product in the next Simplified Protections meeting next week.

I will revert this PR back to draft since the changes are not what we want, but I'll keep it open because of the discussion here.

@jpdjere jpdjere marked this pull request as draft May 10, 2024 10:50
@banderror
Contributor

Closing this PR, see #173593 (comment)

@banderror banderror closed this May 22, 2024