Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution][Detections] Pull gap detection logic out in preparation for sharing between rule types #91966

Merged
merged 12 commits into from
Feb 25, 2021

Conversation

marshallmain
Copy link
Contributor

@marshallmain marshallmain commented Feb 19, 2021

Summary

The goal here is to move the gap detection and remediation logic out of searchAfterBulkCreate so that in the future it can be shared among all rule types. In the process I refactored some of the logic to avoid calculating values multiple times in different places. Examining the code also exposed some bugs which I will comment on in the diff below and should be fixed in the refactored code.

Follow up work:

  • Establish a now timestamp that is used throughout rule executor to prevent subtle bugs from calculations using different values for now
  • Use gap remediation logic for all rule types

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@@ -64,16 +62,6 @@ export const searchAfterAndBulkCreate = async ({
// to ensure we don't exceed maxSignals
let signalsCreatedCount = 0;

const totalToFromTuples = getSignalTimeTuples({
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than calculate the tuples inside searchAfterBulkCreate, they are calculated in signal_rule_alert_type.ts and passed into searchAfterBulkCreate.

maxSignals,
buildRuleMessage,
});
const remainingGap = getRemainingGap({ tuples, previousStartedAt });
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getRemainingGap calculates the difference between the from date for the earliest tuple and the previousStartedAt date (aka last time the rule ran). The advantage of this approach is we don't have to know about internals of how the tuples are computed to determine if there is a gap.

test('it returns null if the interval is an invalid string such as "invalid"', () => {
const gap = getGapBetweenRuns({
previousStartedAt: nowDate.clone().toDate(),
interval: 'invalid', // if not set to "x" where x is an interval such as 6m
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getGapBetweenRuns now takes an intervalDuration which has already been parsed, so this case is not needed.

const dateMathRuleParamsFrom = dateMath.parse(ruleParamsFrom);
if (dateMathRuleParamsFrom != null && intervalMoment != null) {
const momentUnit = shorthandMap[unit].momentString as moment.DurationInputArg2;
const gapDiffInUnits = dateMathRuleParamsFrom.diff(calculatedFromAsMoment, momentUnit);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gapDiffInUnits was always an integer, so if we compute the difference using units like hours then it can be truncated to 0, even though we should have a decimal gap (e.g. 0.1 hours)

buildRuleMessage('failed to calculate maxCatchup, ratio, or gapDiffInUnits')
);
}
let tempTo = dateMath.parse(ruleParamsFrom);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tempTo is converted from a "now" based timestamp to a concrete moment here. However, time keeps advancing before ruleParamsFrom is converted into a concrete timestamp for the current rule run tuple at the end. Thus we can end up with a small gap between some of the tuples.

}`;
logger.debug(buildRuleMessage(`calculatedFrom: ${calculatedFrom}`));

const intervalMoment = moment.duration(parseInt(interval, 10), unit);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unit is pulled off of the from parameter, but is being used to parse the interval here. If interval and from use different units, interval wouldn't be parsed correctly.

}),
];
try {
const intervalDuration = parseInterval(interval);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to minimize the parsing of strings into dates and instead parse once and pass the parsed value around. I also tried to reduce the number of places we can return null or undefined, and instead do up-front checks so the rest of the code can rely on the values existing.

Here, for example, parseInterval can throw, in which case we know that the rest of code path here that relies on intervalDuration won't be able to complete so we skip straight to the catch rather than catching the error inside parseInterval and returning null.

if (duration !== null) {
return duration.subtract(interval);
} else {
return null;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see a way for duration to be null, however if there is a way I'll revert this change.

@marshallmain
Copy link
Contributor Author

@elasticmachine merge upstream

@marshallmain marshallmain marked this pull request as ready for review February 20, 2021 05:57
@marshallmain marshallmain requested a review from a team as a code owner February 20, 2021 05:57
@marshallmain marshallmain added Feature:Detection Rules Anything related to Security Solution's Detection Rules release_note:skip Skip the PR/issue when compiling release notes Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detections and Resp Security Detection Response Team v7.13.0 v8.0.0 labels Feb 20, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@marshallmain
Copy link
Contributor Author

@elasticmachine merge upstream

@spong spong requested a review from a team February 22, 2021 18:19
if (remainingGap.asMilliseconds() > 0) {
const gapString = remainingGap.humanize();
const gapMessage = buildRuleMessage(
`${gapString} (${remainingGap.asMilliseconds()}ms) has passed since last rule execution, and signals may have been missed.`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be more clear with this wording (it's not the number of milliseconds that has passed... it's the gap).

ratio: null,
gapDiffInUnits: null,
};
const ratio = Math.ceil(gapInMilliseconds / intervalDuration.asMilliseconds());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can intervalDuration.asMilliseconds() be 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope that gets prevented in the alerting framework, but it's a good point. I added a check to prevent divide by 0 here.

try {
const intervalDuration = parseInterval(interval);
const gap = getGapBetweenRuns({ previousStartedAt, intervalDuration, from, to });
const catchup = getGapMaxCatchupRatio({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it more clear that this is the number of intervals to use to catch up?

mergeReturns,
mergeSearchResults,
} from './utils';
import { SearchAfterAndBulkCreateParams, SearchAfterAndBulkCreateReturnType } from './types';

// search_after through documents and re-index using bulk endpoint.
export const searchAfterAndBulkCreate = async ({
gap,
previousStartedAt,
tuples: totalToFromTuples,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe a more descriptive name than tuples? e.g. timeRangeTuples ...

const gapString = remainingGap.humanize();
const gapMessage = buildRuleMessage(
`${gapString} (${remainingGap.asMilliseconds()}ms) were not queried between this rule execution and the last execution, so signals may have been missed.`,
'Consider increasing your look behind time or adding more Kibana instances.'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

now: nowDate.clone(),
});
expect(gap).toBeNull();
expect(gap.asMilliseconds()).toEqual(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

const someTuple = someTuples[1];
expect(moment(someTuple.to).diff(moment(someTuple.from), 's')).toEqual(10);
const someTuple = tuples[1];
expect(moment(someTuple.to).diff(moment(someTuple.from), 's')).toEqual(55);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the tuples were reversed from the previous functionality, essentially?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous implementation had a special case for if the gap was <1 interval which was the scenario in this test. In this special case it would create only one extra tuple which was scaled so it covered only the gap duration rather than a full rule interval duration. The refactored implementation always uses interval + lookback as the tuple duration for consistency, so the difference to - from here changed from 10s (the gap duration) to 55s (interval + lookback)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the appropriate avenue to pursue. If a rule is scheduled to run weekly but only had a gap of an hour, it's going to schedule a second search over a weeks worth of data that it had already searched over instead of just the hours gap? That doesn't seem efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment in the code explaining some of the reasoning behind this change. The gist of it is that for some rule types a consistent query duration is important, and the overlap between consecutive rule runs affects the behavior as well. Since we have to be able to handle up to 4 extra gap-covering queries in a rule execution anyway I think the benefit of having consistent query durations outweighs the cost of extending the partial time duration to 1 full time duration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest re-working it in such a way that the default is the original functionality and the extended lookback is updated to use the overlap if the given rule type requires it. This way the lookback does not default to a full rule interval if the gap is less than that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the dupes be ignored? In the case of threshold, they will. If they're ignored for all rule types, it seems fine to me to leave the consistent query duration... if we're experiencing gaps frequently, then probably something else is wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All rule types have duplicate detection and must have it due to the expected overlap between query time ranges, so dupes will be ignored. @madirey my reasoning is the same, the gaps should be very infrequent so ensuring correctness by making the code easy to reason about was a higher priority to me than optimizing for each case.

const intervalInMilliseconds = intervalDuration.asMilliseconds();
let currentTo = to;
let currentFrom = from;
// This loop will create tuples with overlapping time ranges, the same way rule runs have overlapping time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@marshallmain
Copy link
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💛 Build succeeded, but was flaky


Test Failures

Kibana Pipeline / general / "before all" hook for "should contain notes".Timeline notes tab "before all" hook for "should contain notes"

Link to Jenkins

Stack Trace

Failed Tests Reporter:
  - Test has not failed recently on tracked branches

AssertionError: Timed out retrying after 60000ms: Expected to find element: `[data-test-subj="add-a-note"] textarea`, but never found it.

Because this error occurred during a `before all` hook we are skipping the remaining tests in the current suite: `Timeline notes tab`

Although you have test retries enabled, we do not retry tests when `before all` or `after all` hooks fail
    at Object.addNotesToTimeline (http://localhost:6121/__cypress/tests?p=cypress/integration/timelines/notes_tab.spec.ts:15917:8)
    at Context.eval (http://localhost:6121/__cypress/tests?p=cypress/integration/timelines/notes_tab.spec.ts:15046:28)

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Contributor

@FrankHassanabad FrankHassanabad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM,

There was long lengthy conversations outside of github about a few items I'll add here.

It seems like maybe now that the rules have grown in variety and types we might need individual rules be able to select their backtracking choices rather than a 1 size fits all.

Examples are indicator matches and KQL searches need to only clear the gap and not go back and de-duplicate beyond the gap where thresholds and EQL need to clear the gap + re-look at their last time segment to ensure they didn't miss something within their max spans/aggs because of a time boundary.

As pointed out in the comments, rules which don't need to go back to the last segment and have a long interval of say 30 minutes but a short gap of say < 1 minute will end up incurring a higher cost from querying and de-duplicating if they fall behind momentarily vs before.

However, a good default such as the 4 segment backtracking one where it's gap + last segment is a good default if a rule doesn't have a preference for backtracking as it is the lowest common denominator in that it will work for all rule types with regards to correctness.

As rules and rule types increase we will more than likely want individual rules to choose what's the best for them but have an easy to choose fall back such as the 4 segment back-tracking one. We might always stay with this one strategy but as usual we re-visit decisions and things from time to time as feedback rolls in and adjust as needed.

In the discussions, I think everyone is on the same page that rules shouldn't be clearing gaps other than as rare events because that is usually a sign the rule runs are already behind because the system is over-subscribed. Feedback from most teams and forum posters so far to date is that either the rules are running fine or they go 💥 kaboom rather quickly and then adjustments are made.

Typically only when Kibana is rebooted or brought down for maintenance or unexpected surges in operation that are very short lived happen do we expect the gaps to be showing up. Either that or the system is over-subscribed in which case it should be upgraded/fixed/maintained/tuned if possible to remove the gap messages.

@marshallmain marshallmain merged commit f0838e6 into elastic:master Feb 25, 2021
marshallmain added a commit to marshallmain/kibana that referenced this pull request Feb 25, 2021
…ration for sharing between rule types (elastic#91966)

* Pull gap detection logic out in preparation for sharing between rule types

* Remove comments and unused import

* Remove unncessary function, cleanup, comment

* Update comment

* Address PR comments

* remove unneeded mocks

* Undo change to parseInterval

* Remove another unneeded mock

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
marshallmain added a commit that referenced this pull request Feb 25, 2021
…ration for sharing between rule types (#91966) (#92822)

* Pull gap detection logic out in preparation for sharing between rule types

* Remove comments and unused import

* Remove unncessary function, cleanup, comment

* Update comment

* Address PR comments

* remove unneeded mocks

* Undo change to parseInterval

* Remove another unneeded mock

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
jloleysens added a commit to jloleysens/kibana that referenced this pull request Feb 25, 2021
…tiple-searchable-snapshot-actions

* 'master' of github.com:elastic/kibana:
  [Rollup] Fix use of undefined value in JS import (elastic#92791)
  [ILM] Fix replicas not showing  (elastic#92782)
  [Event Log] Extended README.md with the documentation for a REST API and Start plugin contract. (elastic#92562)
  [XY] Enables page reload toast for the legacyChartsLibrary setting (elastic#92811)
  [Security Solution][Case] Improve hooks (elastic#89580)
  [Security Solution] Update wordings and breadcrumb for timelines page (elastic#90809)
  [Security Solution] Replace EUI theme with mocks in jest suites (elastic#92462)
  docs: ✏️ use correct heading level (elastic#92806)
  [ILM ] Fix logic for showing/hiding recommended allocation on Cloud (elastic#90592)
  [Security Solution][Detections] Pull gap detection logic out in preparation for sharing between rule types (elastic#91966)
  [core.savedObjects] Remove _shard_doc tiebreaker since ES now adds it automatically. (elastic#92295)
  docs: ✏️ fix links in embeddable plugin readme (elastic#92778)

# Conflicts:
#	x-pack/plugins/index_lifecycle_management/public/application/sections/edit_policy/components/phases/shared_fields/searchable_snapshot_field/searchable_snapshot_field.tsx
marshallmain added a commit to marshallmain/kibana that referenced this pull request Mar 5, 2021
…ration for sharing between rule types (elastic#91966)

* Pull gap detection logic out in preparation for sharing between rule types

* Remove comments and unused import

* Remove unncessary function, cleanup, comment

* Update comment

* Address PR comments

* remove unneeded mocks

* Undo change to parseInterval

* Remove another unneeded mock

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
# Conflicts:
#	x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts
marshallmain added a commit that referenced this pull request Mar 5, 2021
…ration for sharing between rule types (#91966) (#93787)

* Pull gap detection logic out in preparation for sharing between rule types

* Remove comments and unused import

* Remove unncessary function, cleanup, comment

* Update comment

* Address PR comments

* remove unneeded mocks

* Undo change to parseInterval

* Remove another unneeded mock

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
# Conflicts:
#	x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Detection Rules Anything related to Security Solution's Detection Rules release_note:skip Skip the PR/issue when compiling release notes Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v7.12.0 v7.13.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants