Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RAC] RAC framework calls to ES should wrap their own errors #106315

Closed
Tracked by #101016
pmuellr opened this issue Jul 20, 2021 · 9 comments · Fixed by #136225
Closed
Tracked by #101016

[RAC] RAC framework calls to ES should wrap their own errors #106315

pmuellr opened this issue Jul 20, 2021 · 9 comments · Fixed by #136225
Assignees
Labels
Feature:Alerting/Alerts-as-Data Issues related to Alerts-as-data and RuleRegistry Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@pmuellr
Copy link
Member

pmuellr commented Jul 20, 2021

From issue #101753, we realized that the RAC framework is making ES calls during the rule executor, which are throwing exceptions, and not caught. Those get bubbled up back to the rule executor, which assumes the error came from the rule executor. In this case, the rule executor was actually RAC, which did some work, called the original rule executor, and then did some more work, before returning control back to the alerting framework.

The problem with this is that it can cause some confusion during diagnosis of errors. It would really be best if the RAC ES calls were "wrapped", to catch exceptions, such that:

  • the RAC code logs an error/warning itself, with it's own plugin id (implicit in the logging tags)
  • the calculated error message provides a prefix describing what it's doing semantically
  • can feel free to throw the exception back to the alerting framework

This will help us when we see these exceptions to narrow down the place it occurred.

As an example, re: #101753, I believe the exception happened when executing the following code:

if (!aliasExists) {
try {
await clusterClient.indices.create({
index: concreteIndexName,
body: {
aliases: {
[alias]: {
is_write_index: true,
},
},
},
});
} catch (err) {
// something might have created the index already, that sounds OK
if (err?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
throw err;
}
}
}

For that case, instead of just the throw err, it probably should log a message indicating the problem (debug is fine if it feels like it will be too noisy, but this one seems worthy of a logged message), and then create a new error with the any details found in the originating error, with an error message describing the error occurred while trying to create an alerting index.

There weren't too many ES calls that I could see, directly using the clusterClient; they seemed to be in the following three files:

@pmuellr pmuellr added the Feature:RAC label obsolete label Jul 20, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Jul 20, 2021
@timroes timroes added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Jul 21, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Jul 21, 2021
@pmuellr pmuellr added the Theme: rac label obsolete label Jul 27, 2021
@gmmorris
Copy link
Contributor

gmmorris commented Sep 3, 2021

Looking at the code I can't see a Rule Registry Logger at all.
Feels like the registry should have its own context, separate from the Alerting plugin and the Rule Type's plugin (which is wrapped in the registry).

@banderror
Copy link
Contributor

@gmmorris Rule Registry has its own logger and it's used to log some info and debug messages:

this.logger = initContext.logger.get();

If you enable it in the dev config, you will see some logs about installation of ES resources like component templates etc:

logging.events:
  {
    log: ['ruleRegistry', 'error', 'fatal'],
    request: ['fatal'],
    error: '*',
    ops: __no-ops__,
  }

Just to make sure I got it correctly, is this exactly what you meant by "the registry should have its own context", or something else?

@banderror
Copy link
Contributor

banderror commented Sep 3, 2021

@pmuellr I'd like to clarify the expected behaviour.

  • the RAC code logs an error/warning itself, with it's own plugin id (implicit in the logging tags)
  • the calculated error message provides a prefix describing what it's doing semantically
  • can feel free to throw the exception back to the alerting framework

What do you mean by the RAC code? I think we have several different citizens:

  • a generic indexing implementation (RuleDataService + RuleDataClient) used by all of the RAC-involved plugins
  • a somewhat-generic base executor implementations (lifecycle executor used by Observability plugins, persistence executor used by Security)
  • a specific rule type implementations in the RAC-involved plugins

RuleDataService and RuleDataClient currently write some logs, but they don't wrap any exceptions and don't have (almost) their own Error classes. I guess what's missing here is they could and should present every particular error with a custom Error exception with clear message describing the case. They will be able to handle some of the cases and not re-throw, while most of the exceptions will be either thrown or re-thrown to the outside world (most likely to the "somewhat-generic base executor implementations").

the calculated error message provides a prefix describing what it's doing semantically

Something like that?

} catch (e) {
  this.options.logger.error(e);

  const reason = e?.message || 'Unknown reason';
  throw new Error(`Failure installing ${resources}. ${reason}`);
}

If we talk about the rest of the "citizens", does the same approach to exceptions and logging apply?

Just to double-check, you're fine with exceptions thrown from whatever executors as long as they can be easily identified as "exceptions from rule registry" or "from security solution rule types" etc, right?

And finally, are you concerned only about the code that runs within the Alerting Framework executors, or any code in general that calls ES APIs from "RAC"? For example, we have RBAC and RBAC-related endpoints that call ES outside of the rule types.

@gmmorris
Copy link
Contributor

gmmorris commented Sep 6, 2021

Rule Registry has its own logger and it's used to log some info and debug messages:

oops, sorry @banderror , I don't know how I missed that!

Just to make sure I got it correctly, is this exactly what you meant by "the registry should have its own context", or something else?

Yeah, I think that should do it - as long as you catch the errors and log them yourself, they should have the context of the RulesRegistry.

@gmmorris
Copy link
Contributor

gmmorris commented Sep 6, 2021

Just to double-check, you're fine with exceptions thrown from whatever executors as long as they can be easily identified as "exceptions from rule registry" or "from security solution rule types" etc, right?

Yes, throwing the error tells the framework that the rule failed, and that's valuable.
What we want to easily identify is whether the exception took place in the rule itself, the rule registry or the framework. This should make it easier to identify the right team to help investigate and reduce our mean time to resolution.

@banderror banderror added the Team:Detections and Resp Security Detection Response Team label Sep 8, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@banderror banderror added the Team:Detection Alerts Security Detection Alerts Area Team label Oct 11, 2021
@banderror
Copy link
Contributor

Hey everyone, FYI ownership of this ticket and other tickets related to rule_registry (like #101016) now goes to the Detection Alerts area (Team:Detection Alerts label). Please ping @peluja1012 and @marshallmain if you have any questions.

@marshallmain
Copy link
Contributor

marshallmain commented Apr 19, 2022

Transferring again to @elastic/response-ops as they now own the rule registry implementation.

@marshallmain marshallmain added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Alerts Security Detection Alerts Area Team labels Apr 19, 2022
@mikecote mikecote added Feature:Alerting/Alerts-as-Data Issues related to Alerts-as-data and RuleRegistry and removed Theme: rac label obsolete Feature:RAC label obsolete labels May 16, 2022
@ymao1 ymao1 self-assigned this Jul 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Alerting/Alerts-as-Data Issues related to Alerts-as-data and RuleRegistry Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

8 participants