Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RAC][Rule Registry] Index bootstrapping: make it more robust #106425

Open
Tracked by #101016
banderror opened this issue Jul 21, 2021 · 4 comments
Open
Tracked by #101016

[RAC][Rule Registry] Index bootstrapping: make it more robust #106425

banderror opened this issue Jul 21, 2021 · 4 comments
Labels
Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Theme: rac label obsolete

Comments

@banderror
Copy link
Contributor

Parent ticket: #101016

Summary

We identified a few weak spots in the current implementation of index bootstrapping, that could represent potential issues and bugs. Let's address them to make the logic more resilient.

1. Index name creation & update logic

Currently, index name creation and update logic is split into two pieces.

rule_registry/server/rule_data_client/index.ts contains index name creation logic:

function createWriteTargetIfNeeded() {
  // ...
  const concreteIndexName = `${alias}-000001`;
  // ...
}

rule_registry/server/rule_data_plugin_service/utils.ts contains index name update logic:

function incrementIndexName(oldIndex: string) {
  const baseIndexString = oldIndex.slice(0, -6);
  const newIndexNumber = Number(oldIndex.slice(-6)) + 1;
  if (isNaN(newIndexNumber)) {
    return undefined;
  }
  return baseIndexString + String(newIndexNumber).padStart(6, '0');
}

incrementIndexName accepts index names created by createWriteTargetIfNeeded and returns undefined if the name doesn't match the pattern. However, the relationship between the two functions is not explicit. And it could lead to bugs if they become unsynchronized. A more robust option would be to have a single function that appends -000001 if the name doesn't contain it or increments the number otherwise.

Also, it should be flexible and not rely on 6 last digits, because there can be edge cases in general (#102586 (comment)).

2. Index mapping update logic

Index mapping update logic has several white spots. Add my concerns as comments to the updateAliasWriteIndexMapping method.

function updateAliasWriteIndexMapping({ index, alias }: { index: string; alias: string }) {
  const clusterClient = await this.getClusterClient();

  const simulatedIndexMapping = await clusterClient.indices.simulateIndexTemplate({
    name: index,
  });
  const simulatedMapping = get(simulatedIndexMapping, ['body', 'template', 'mappings']);
  try {
    await clusterClient.indices.putMapping({
      index,
      body: simulatedMapping,
    });
    return;
  } catch (err) {
    if (err.meta?.body?.error?.type !== 'illegal_argument_exception') {
     /* 
     * This part is unclear. Why do we skip the rollover if we've caught 'illegal_argument_exception'?
     */
      this.options.logger.error(`Failed to PUT mapping for alias ${alias}: ${err.message}`);
      return;
    }
    const newIndexName = incrementIndexName(index);
    if (newIndexName == null) { 
      /* 
       * I think we should not fail here. If the index name had no "-000001" suffix, we should append it.
       */
      this.options.logger.error(`Failed to increment write index name for alias: ${alias}`);
      return;
    }
    try {
     /* 
      * We don't check response here. It could return "{ acknowledged: false }". 
      * Should we consider a rollover to be successful in that case? 
      */
      await clusterClient.indices.rollover({
        alias,
        new_index: newIndexName,
      });
    } catch (e) {
      if (e?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
      /* 
       * This part is also unclear. We log only 'resource_already_exists_exception' but silence all other exceptions. 
       * It looks incorrect.
       */
        this.options.logger.error(`Failed to rollover index for alias ${alias}: ${e.message}`);
      }
      
       /* 
       * There could be many reasons for the first attempt to update mappings to fail (network, etc.). 
       * I think we should retry in case of failure. 
       */
    }
  }
}

3. Error handling during bootstrapping

Currently, indices bootstrapping considered to be successful even if some requests failed. There is no way to programmatically retrieve the outcome of index template initialization as we only log errors to console. See initialisation in ecurity_solution/server/plugin.ts for example:

const initializeRuleDataTemplatesPromise = initializeRuleDataTemplates().catch((err) => {
  this.logger!.error(err);
});

It could lead to a situation when index bootstrapping fails, but RuleDataClient doesn't know anything about it and writes documents into an index that doesn't have proper mappings. In the worst case, dynamic mappings would be applied to the indexed documents, leading to unexpected behavior, like broken aggregations in the application code.

A safer approach would be to disable all write operations if index bootstrap failed. This is probably should be done on the rule_registry library level.

4. Retry logic during bootstrapping

We should also add some retry logic to the bootstrapping logic.

I talked to Elasticsearch team, and they recommended to add retries to component and index template updates, or IMO wherever else we could get { acknowledged: false } from ES during index bootstrapping.

Here's the log of our discussion:


Q: In Kibana (specifically, RAC project), we started to use the new template APIs for bootstrapping our indices (PR for reference #102586). This code is not well tested and not production ready, but it’s going to be used from Security and Observability solutions in the upcoming releases, so I’d like to make sure that we implemented it correctly. Specifically, we started to use:

We make sure to wait for the responses from these endpoints before any other code will be able to write anything to the corresponding indices. So any writing happens only if these calls succeed.

In those responses, we normally get 200 OK with the body:

{ acknowledged: true }

I wasn’t able to find it in the docs or elsewhere, so I’d like to double-check two things.

The first thing is about acknowledged: true:

  • The meaning - does it mean that the request is accepted, but work needs to be done asynchronously in the background?
  • Can we get acknowledged: false in a 200 OK response?
  • What errors can we get there in general, can we refer to some spec with listed errors and response structure?

The second thing is about race conditions between index mgmt and indexing requests. Is it possible to get a race condition between successfully acknowledged template mgmt requests, and the following (potentially, immediately after that) attempt to write a document to an index which name matches the name of the updated index template?

  • Here we have 2 cases in the code:
    • Initial index bootstrapping. This index doesn’t exist yet, and we create it explicitly in the code of Kibana.
    • Subsequent index upgrade (when we changed our templates). The index already exists. We do 2 things: first, we try to update its mappings/settings directly (to get them from the index template, we use POST /_index_template/_simulate API); second, if this fails, we do a rollover of the alias.
  • So basically, when creating initial index OR updating existing one OR rolling over alias, is it possible that the concrete index won’t get the most recent mappings and settings from the (just updated) templates?
  • If yes, is there a way to handle it? E.g. wait for completion of those index mgmt requests?

When developing locally, we’ve got into some weird broken state where the concrete index ended up with dynamic mappings instead of mappings from the index template. Most likely, this was caused by our incorrectly implemented code + constant server reloads during development. However, I decided to double-check if any race conditions are possible in general.


A:

The first thing is about acknowledged: true:

  • The meaning - does it mean that the request is accepted, but work needs to be done asynchronously in the background?

That means that the change has been acknowledged by the master and accepted with a cluster state update

  • Can we get acknowledged: false in a 200 OK response?
  • What errors can we get there in general, can we refer to some spec with listed errors and response structure?

You can get "false" for the acknowledged, if, when publishing the new cluster state, the request times out before enough nodes have acknowledged and accepted the cluster state

there is not any listed errors, so generally a good rule to consider would be to retry in that case

  • So basically, when creating initial index OR updating existing one OR rolling over alias, is it possible that the concrete index won’t get the most recent mappings and settings from the (just updated) templates?

if the template update response have "acknowledged: true", then retrieving those settings through something like the simulate API or creating a new index will use the just updated template


Q: If this question even makes sense, how many nodes is enough though? Does acknowledged: true guarantee that any subsequent write request will be handled by a node that received the most recent cluster state with the new mappings etc?

Just want to make sure there’s no way to get a race condition, leading to some weird state like let’s say a node which doesn’t know about a newly created index, processes write request and creates an index with dynamic mappings instead of mappings specified in the templates.


A: You shouldn't have to worry about this assuming the request is acknowledged: true, because that means the master node has acknowledged it, and all those requests (index creation, mappings updates) go through the master node

Further details

@banderror banderror added Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Theme: rac label obsolete labels Jul 21, 2021
@banderror banderror self-assigned this Jul 21, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@banderror banderror changed the title [RAC] Make index bootstrapping logic more robust [RAC][Rule Registry] Index bootstrapping: make it more robust Sep 1, 2021
@banderror banderror removed their assignment Sep 1, 2021
@banderror banderror removed the v7.16.0 label Sep 1, 2021
@banderror banderror added the Team:Detection Alerts Security Detection Alerts Area Team label Oct 11, 2021
@banderror
Copy link
Contributor Author

Hey everyone, I removed this ticket from the backlog of the Detection Rules area.

We (@elastic/security-detections-response-rules) are not the owners anymore (however feel free to still ping us if you have any tech questions about the ticket). Ownership of this ticket and other tickets related to rule_registry (like #101016) now goes to the Detection Alerts area (Team:Detection Alerts label). Please ping @peluja1012 and @marshallmain if you have any questions.

@marshallmain
Copy link
Contributor

Transferring again to @elastic/response-ops as they now own the rule registry implementation.

@marshallmain marshallmain added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Alerts Security Detection Alerts Area Team labels Apr 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Theme: rac label obsolete
Projects
None yet
Development

No branches or pull requests

3 participants