
[RAC] Add mapping update logic to RuleDataClient #102586

Merged

Conversation


@marshallmain marshallmain commented Jun 17, 2021

Summary

Part of effort for #101016

This PR introduces a function, updateIndexMappingsMatchingPattern, on the RuleDataPluginService that other plugins can call to update all the write indices matching a pattern. It is intended to be called after the plugin finishes creating/updating the component and index templates. As part of the encapsulation work that @banderror is doing, we'll probably want to move the usage of this function from other plugins back into the RuleDataPluginService, so that when a plugin declares the templates it wants created, the RuleDataPluginService takes care of creating the templates and updating any matching indices automatically.

The general strategy behind the mapping updates mirrors Fleet's: use PUT mapping to update the mapping in place if possible, and if that fails, roll over the index to ensure that the new write index has the proper mappings.

Race conditions and mitigations:
It's very likely that multiple Kibana instances will be executing this upgrade code simultaneously. There's a good chance that two instances would make PUT mapping calls at the same time, both get conflicts, and both roll over the index. This approach resolves that problem by specifying the next_index param in the rollover API call: even if N instances executing this upgrade logic all fail simultaneously at the PUT mapping step for an alias, only one will succeed at rolling over the index.

On a particular Kibana instance, the template and mapping update flow will be:

  1. Update component templates
  2. Update the index template that depends on the component templates
  • After this point, a rollover from any source (ILM or another Kibana instance) updates the mapping, and we will not trigger another rollover because the PUT mapping call in step 5 will succeed
  3. Find write indices and associated aliases that match the index template we updated in step 2. This step locks in the current write index name which will be used to generate the next write index name, sort of like a transaction.
  • If an alias is rolled over from another source after this step, the rollover for that alias in step 6 will fail. This prevents multiple rollovers from happening in the case where PUT mapping fails on multiple Kibana instances simultaneously.
  4. (For each write index) Simulate the index mapping to get an updated mapping based on the new templates for the write index
  5. (For each write index) Attempt to replace the write index mapping with the updated mapping using the PUT mapping API
  6. (For each write index) If PUT mapping fails with a conflict, compute the next write index by incrementing the counter at the end of the current write index name and call the rollover API, providing the computed next write index as the next_index param in the API call

The key to avoiding races in which multiple Kibana instances attempting this update simultaneously cause multiple rollovers is that the rollover API returns a resource_already_exists_exception if the provided next_index already exists. There are a number of possible cases to consider in which the index is rolled over from other sources during this update process; however, they are all handled correctly (see the sketch after the list below):

  • If Elasticsearch ILM policy naturally causes a rollover anytime before step 2, then the process proceeds as normal. Nothing prevents us from updating the templates and rolling over again if necessary. If one Kibana instance completes its mapping update process and rolls over before another instance reaches step 2, the second instance will succeed at the PUT mapping step and not rollover.
  • If an index is rolled over between steps 2 and 3, then we will find the new write index in step 3 and PUT mapping will succeed so the function will return. No additional rollover and the mapping is updated.
  • If an index is rolled over after step 3 but before step 6, then the rollover in step 6 will fail with resource_already_exists_exception. No additional rollover and the mapping is updated.
  • If PUT mapping fails and no other process rolls the index between steps 3 and 6, then the index will rollover and the mapping is updated.
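
For illustration, here is a minimal TypeScript sketch of steps 4-6, assuming the legacy Elasticsearch JS client Kibana used at the time (responses wrapped in `body`, rollover target passed as `new_index`); the helper name and import path are ours, not the PR's:

```ts
import type { ElasticsearchClient } from 'src/core/server';

// Hypothetical helper illustrating steps 4-6; not the PR's actual code.
async function updateWriteIndexMapping(clusterClient: ElasticsearchClient, alias: string) {
  // Dry-run rollover: Elasticsearch reports the current write index (old_index)
  // and the name the next write index would get (new_index) without rolling anything over.
  const { body: simulatedRollover } = await clusterClient.indices.rollover({
    alias,
    dry_run: true,
  });

  // Step 4: simulate what mappings the updated templates would produce for this write index.
  const { body: simulatedIndex } = await clusterClient.indices.simulateIndexTemplate({
    name: simulatedRollover.old_index,
  });

  try {
    // Step 5: try to update the current write index mapping in place.
    await clusterClient.indices.putMapping({
      index: simulatedRollover.old_index,
      body: simulatedIndex.template.mappings,
    });
  } catch (putMappingError) {
    // Step 6: PUT mapping conflicted, so roll over to the precomputed index name. If another
    // Kibana instance (or ILM) already created that index, the rollover fails with
    // resource_already_exists_exception, which means someone else already did the work.
    try {
      await clusterClient.indices.rollover({
        alias,
        new_index: simulatedRollover.new_index,
      });
    } catch (rolloverError) {
      if (rolloverError?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
        throw rolloverError;
      }
    }
  }
}
```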

Next consolidation steps:

  • Encapsulate component template and mapping update logic inside RuleDataPluginService
  • Implement static typing for documents read/written by RuleDataClient based on the index template

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

Risk | Probability | Severity | Mitigation/Notes
Multiple Spaces—unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces.
Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure.
Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service operational.
See more potential risk examples

For maintainers

@dgieselaar
Member

Two questions:

  • Why does it need versioning, and what is the advantage over the current approach? Can you compare the two? What other options were considered? (The current one never updates, so obviously things need to be improved, but interested in the why here)
  • Why does the index need to be created when the writer is requested? I'm worried about the fact that it's so implicit. If we can avoid creating the index right up until the last moment, we probably should.

const writeIndexMapping = await clusterClient.indices.get({
index: simulatedRollover.old_index,
});
const currentVersions = get(writeIndexMapping, [
Member

why use get instead of the optional chaining operator?

Contributor Author

I didn't see much difference since currentVersions is any either way. I chose this way to be consistent with how we're getting targetVersions below - simulatedIndex is typed as an empty object so the easiest way to get a field from it was just to use get.

Member

I think we can add (minimal) types for writeIndexMapping and simulatedIndex then. Once the client itself adds them we can remove it again. That way, if the types change, or are incorrect, we are notified via a TypeScript error.

Comment on lines 126 to 137
const targetVersions = get(simulatedIndex, ['template', 'mappings', '_meta', 'versions']);
const componentTemplateRemoved =
Object.keys(currentVersions).length > Object.keys(targetVersions).length;
const componentTemplateUpdated = Object.entries(targetVersions).reduce(
(acc, [templateName, targetTemplateVersion]) => {
const currentTemplateVersion = get(currentVersions, templateName);
const templateUpdated =
currentTemplateVersion == null || currentTemplateVersion < Number(targetTemplateVersion);
return acc || templateUpdated;
},
false
);
Member

I think all you have to do here is check whether there's at least one component template that is newer than the previous version, or wasn't present, and I think we can use a for loop here to improve readability. Is there something we need to do when a component template is removed?
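
A sketch of the suggested for-loop, assuming the two version maps are plain records of component template name to version (the helper name is hypothetical, not the merged code):

```ts
function templatesRequireRollover(
  currentVersions: Record<string, number | string>,
  targetVersions: Record<string, number | string>
): boolean {
  // A component template was removed from the index template entirely.
  if (Object.keys(currentVersions).length > Object.keys(targetVersions).length) {
    return true;
  }
  // A component template is new, or its version is higher than the one stamped on the write index.
  for (const [templateName, targetVersion] of Object.entries(targetVersions)) {
    const currentVersion = currentVersions[templateName];
    if (currentVersion == null || Number(currentVersion) < Number(targetVersion)) {
      return true;
    }
  }
  return false;
}
```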

Contributor Author

I assume if someone removes a template then they intend for it not to be in the mapping anymore, so it would be important to rollover the index to remove those fields from the current write index. This ensures that any changes to the mappings will be propagated to the indices as soon as possible, whereas if we don't rollover the index on template removal then the changes wouldn't be propagated until the index rolls over naturally based on the ILM policy.

Member

It won't break if the fields from the removed component template stick around though. I'm a bit wary of switching between branches and continuously rolling over. Things like that can be hard to repair. That's one very plausible scenario. The other thing to consider is if there is a bug here somewhere the consequences are painful because we've (imho needlessly) chosen to be pretty aggressive in rolling over.

Contributor Author

It will break in some cases. Certain queries like EQL and SQL can fail if mappings for a field are different across different indices, so it can be important to remove bad mappings as quickly as possible.

My issue with not rolling over aggressively is that it effectively leaves the index in kind of a "pending upgrade" state for a while after upgrade if component templates are removed. In this state, the templates have been updated but the write index won't be updated for some indeterminate amount of time - until the alias naturally rolls over due to the ILM policy. This effectively prevents us from removing any component templates and relying on the fields to no longer be mapped and will be confusing from a support perspective since the upgrade won't be truly finished until the natural rollover happens.

What's the use case for switching between branches? Switching back and forth between branches with different mappings but sharing the same ES cluster is going to break in other ways even without this template removal logic. If all component template versions on one branch are <= to the versions on the other branch, then it won't update the mappings and features may not work as expected. If some component templates are higher versioned on each branch, then it will rollover on every branch change, just as with the component removal case - this actually seems desirable to me from a dev perspective since it means the write index will have the proper mappings for the branch. Saved object migrations would have a similar issue with schema changes and switching branches - there's no downgrade capability, so once you upgrade the .kibana index on one branch there's no going back.

Comment on lines 140 to 148
await clusterClient.indices.rollover({
alias,
body: {
conditions: {
max_age: '5s',
},
},
});
}
Member

can we do this via a task? I don't really trust this (but I haven't thought about it long enough to be able to articulate my concerns better)

Contributor Author

We could look into alternatives to make this more robust. One option might be to specify the target index explicitly in the rollover request - I'll have to try it out and see if that makes the rollover request fail in the case where the target index already exists. If it does, then that should solve the race condition here since the rollover would only happen if another codepath hasn't rolled the index in between computing needRollover and actually rolling the index.

Member

There can be race conditions on a single kibana instance, but also by virtue of multiple kibana instances trying to roll over the index. It doesn't seem like a very unlikely scenario. A task will give us the guarantee that it will only happen once.

Contributor Author

Fixed in 19b9eaf - by explicitly specifying the target index name pulled from the simulated rollover, we ensure that the rollover only happens if no other code path has rolled over the alias in the meantime.
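
For reference, the shape of that fix might look roughly like the following sketch, assuming the client exposes the rollover target as `new_index` and surfaces the error type under `err.meta.body.error.type` as elsewhere in this thread:

```ts
try {
  await clusterClient.indices.rollover({
    alias,
    new_index: simulatedRollover.new_index, // the name computed by the earlier dry run
  });
} catch (err) {
  // Another code path (or another Kibana instance) already rolled the alias over;
  // the write index is already up to date, so this is not treated as a failure.
  if (err?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
    throw err;
  }
}
```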

Comment on lines -144 to -158
throw new Error(
'No mappings would be generated for this index, possibly due to failed/misconfigured bootstrapping'
);
}
Member

I'm not sure if this should be removed. I don't see the same check in rolloverAliasIfNeeded, and I think we should be explicit about this, because teams new to RAC will run into it - I did, too, and it is a little annoying to figure out and repair.

Contributor Author

This sort of check seems like it belongs with the template bootstrapping logic IMO. At this point we've already created the index, so simulating it seems redundant. Can we move this check so it happens sometime before the index is created, maybe at index template creation time?

Member

I'm fine with moving it, but I would like to keep it somewhere. But let's first figure out where this all needs to happen.

Contributor

There should always be some common mappings specified (currently they are specified in the templates installed in init() - TECHNICAL_COMPONENT_TEMPLATE_NAME and ECS_COMPONENT_TEMPLATE_NAME). We just have to make sure that the final index templates always reference (composed_of) these common component templates.

I'd say this is a matter of having an API that doesn't provide 100% flexibility in index bootstrapping and makes sure that the index template is always created and always references the common templates - no matter how the client of the library uses it.

Contributor Author

Moved this check to index template creation time in 4ccf63a

@marshallmain
Contributor Author

> Why does it need versioning, and what is the advantage over the current approach? Can you compare the two? What other options were considered? (The current one never updates, so obviously things need to be improved, but interested in the why here)

Versioning is critical so that we can (1) detect when index mappings are outdated and automatically update them to the current version, and (2) offer a migration path similar to the detection alerts migration API for updating existing alerts. (1) is the more common case, but (2) is critical in some cases for fixing mapping mistakes where a field is mapped as the wrong type (we made this mistake before, although we didn't migrate and just accepted a minor feature breakage) or if we add a new feature/visualization that we want to work on historical alerts, not just newly written alerts. In most releases we've added new fields to the .siem-signals mapping, and rolling over the index makes those new fields searchable on all alerts created after the rollover.

I considered a few other implementation strategies for certain parts. One option was to use a single version number for an index template (rather than stamping it with all the component versions as well) and build that version number by combining the version of each template. Losing the individual component versions is not ideal, though, and would potentially complicate the alert migration process in the future because updating a component template version from "1" to "2" might change the index template version for one template from "1" to "2" if that's the first update for that template, while the same component update for an index template that has other changes accrued already might change from version "5" to "6". By keeping all the component versions, we can keep the index versions in sync more easily.
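
For illustration, the per-component version stamps described above could be recorded under the mapping's _meta roughly like this (template names and version numbers are made up):

```ts
const indexTemplateBody = {
  template: {
    mappings: {
      _meta: {
        versions: {
          // one entry per component template the index template is composed of
          '.alerts-technical-mappings': 1,
          '.alerts-ecs-mappings': 1,
          '.alerts-security-solution-mappings': 2,
        },
      },
    },
  },
};
```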

I also considered other places where we could execute the rollover logic, the main alternative being at plugin setup. The primary issue with that is we'd have to find all the .alerts indices (including different namespaces) and check them all, probably blocking any rule data clients from executing reads/writes until all the indices have been checked and rolled over if necessary. We'd also have to somehow wait until plugins that consume the rule registry create their templates before executing the rollover logic, which means more synchronization logic. Finally, we'd potentially be doing extra work on indices that aren't used anymore. Rolling over the indices inside getWriter fits in nicely with the just-in-time architecture that we use for index creation and contains the logic inside the rule registry so other plugins don't have to worry about it.

> Why does the index need to be created when the writer is requested? I'm worried about the fact that it's so implicit. If we can avoid creating the index right up until the last moment, we probably should.

I don't have a strong preference on this either way. Sometimes we've had users ask about why the endpoint logs indices don't exist yet after they install Agent and Endpoint and we have to explain that the indices aren't created until an alert is actually written. This can also cause some issues with rules that search those indices, since the rules return index_not_found instead of 0 hits. In the RAC context, this could cause confusion if users want to set up rules-on-rules (second order rules) before the first order rule generates any alerts - since the first order rules wouldn't create an index until the first alert, the second order rules would get index_not_found exceptions that we'd have to handle properly or they could cause rule failures. That is probably a rare case but my thinking was if a rule gets to the stage where it's creating a writer using getWriter then it's reasonable to think the index would exist even if it's empty. But again, I don't think it makes a big difference to the security solution so I'm happy to change it back.

async rolloverAliasIfNeeded({ namespace }: { namespace?: string }) {
const clusterClient = await this.getClusterClient();
const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
const { body: simulatedRollover } = await clusterClient.indices.rollover({
Member

Why do we need to simulate a rollover? Can't we get the current write index, simulate the index template, and compare the two? (And I think those can happen in parallel)

Contributor Author

We have to make sequential calls either way - one call to retrieve the concrete index name for the alias write index, then a second set of calls to get the mapping for the concrete index and the simulated mapping for the new index. If we do it with the simulated rollover, elasticsearch handles extracting the current write index and the new index name for us and makes them available as old_index and new_index. If we avoid using simulated rollover, then we have to manually fetch the alias and extract the write index and also manually compute the next index name. So we don't necessarily need to simulate a rollover, but it's simpler this way.
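
A simplified sketch of that approach, with the two mapping lookups running in parallel after the dry run (response shapes follow the legacy client; error handling omitted):

```ts
// One dry-run call yields both the current write index (old_index) and the next index name (new_index)...
const { body: simulatedRollover } = await clusterClient.indices.rollover({
  alias,
  dry_run: true,
});

// ...after which the current mapping and the simulated template mapping can be fetched in parallel.
const [{ body: writeIndexMapping }, { body: simulatedIndex }] = await Promise.all([
  clusterClient.indices.get({ index: simulatedRollover.old_index }),
  clusterClient.indices.simulateIndexTemplate({ name: simulatedRollover.old_index }),
]);
```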

Contributor

@marshallmain could you please add this clarification to the code as a comment?

Contributor Author

Added explanatory comment in a45a70a

@marshallmain marshallmain marked this pull request as ready for review June 21, 2021 16:10
@marshallmain marshallmain requested review from a team as code owners June 21, 2021 16:10
@marshallmain marshallmain added the release_note:skip, Team: SecuritySolution, Team:Detections and Resp, and v7.14.0 labels Jun 21, 2021
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@marshallmain marshallmain requested a review from a team June 21, 2021 16:11
@marshallmain
Contributor Author

@elasticmachine merge upstream

@botelastic botelastic bot added the Team:APM label Jun 21, 2021
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

Contributor

@banderror banderror left a comment

👍 I think this is a very interesting implementation overall; it feels better than what I had in my draft PR (#101453) in terms of how the logic of "soft migration" is implemented (updating templates, then rolling over the alias).

However, the way we currently split parts into templates seems a bit random to me, and I'm concerned about divergence/inconsistencies in index bootstrapping between solutions. In my draft I introduced a more "standardized" system of component templates (#101016 (comment)); wondering what everyone thinks of it.

const { namespace } = options;
const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
await this.rolloverAliasIfNeeded({ namespace });
Contributor

rolloverAliasIfNeeded accepts namespace and calls const alias = getNamespacedAlias({ alias: this.options.alias, namespace }); inside. Can we just pass the alias already created above?

const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
await this.rolloverAliasIfNeeded(alias);

Contributor

Why is rolloverAliasIfNeeded called at the point when we get a writer, while index creation (createWriteTargetIfNeeded) is deferred to the point when we actually index something and get an exception saying that the index does not exist?

Can/should we combine the two into a single rolloverAliasOrCreateInitialIndex?

Contributor Author

Yeah we can just pass in the alias directly, good point.

The call to createWriteTargetIfNeeded is deferred at @dgieselaar's request to avoid populating the cluster with more empty indices than necessary. We can defer it since attempting to write to an index that doesn't exist returns an error that we can catch, then create the index and write again. We can't defer the call to rolloverAliasIfNeeded because there most likely won't be an error to catch if we try to write to an index whose mapping hasn't been updated - for most of the mapping updates we make (e.g. adding a new field to the mapping), it will still be able to write the document but the fields newly added to the template simply won't be mapped in the new documents. By rolling over before returning the writer, we can ensure that when we do write documents the write index has the updated mappings.
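
A rough sketch of the deferred index creation described above; the method names follow the PR, but the bodies are simplified and the assumption that the missing index surfaces as a per-item bulk error under require_alias is ours:

```ts
// Sketch only: `clusterClient` is the ES client, `alias` the namespaced write alias, and
// `createWriteTargetIfNeeded` the PR's method that creates the initial index.
async function bulkWrite(body: object[]) {
  const response = await clusterClient.bulk({ index: alias, body, require_alias: true });
  const indexNotFound =
    response.body.errors &&
    response.body.items?.[0]?.index?.error?.type === 'index_not_found_exception';
  if (indexNotFound) {
    // First write for this namespace: create the initial write index, then retry once.
    await createWriteTargetIfNeeded({ namespace });
    return clusterClient.bulk({ index: alias, body, require_alias: true });
  }
  return response;
}
```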

Contributor

> We can't defer the call to rolloverAliasIfNeeded because there most likely won't be an error to catch if we try to write to an index whose mapping hasn't been updated - for most of the mapping updates we make (e.g. adding a new field to the mapping), it will still be able to write the document but the fields newly added to the template simply won't be mapped in the new documents.

👍 yeah, understood that part, agree

As for deferring createWriteTargetIfNeeded until the very last moment, I'm personally not sure if this is super necessary because I suspect the time between the moment of getting a writer and attempting to write something will be very short. Exception would be if we get a writer but don't write anything - in this case we'd create the initial index, but it would stay empty for a while.

Both options work for me, I think this is not super important, just thought that maybe combining them into a rolloverAliasOrCreateInitialIndex would make the code slightly simpler.

@@ -70,9 +70,10 @@ export class RuleDataClient implements IRuleDataClient {
};
}

getWriter(options: { namespace?: string } = {}): RuleDataWriter {
async getWriter(options: { namespace?: string } = {}): Promise<RuleDataWriter> {
const { namespace } = options;
const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
Contributor

Nit: why do we use a parameter object + destructuring in private functions when there are only 2 parameters? Wouldn't getNamespacedAlias(this.options.alias, namespace) be cleaner?

I understand when some public API accepts options via an object (e.g. the async getWriter(options: { namespace?: string } = {}): Promise<RuleDataWriter> { above). You probably want extensibility etc. But for internal or private functions this is unnecessary if the number of parameters is <= 3 imho.

Contributor Author

I don't have a strong opinion either way. Sometimes it's nice to see the parameter names at the function call site as a reminder of what each one is. I just left it the way it is to follow the existing pattern.

Comment on lines 106 to 114
async rolloverAliasIfNeeded({ namespace }: { namespace?: string }) {
const clusterClient = await this.getClusterClient();
const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
let simulatedRollover: IndicesRolloverResponse;
try {
({ body: simulatedRollover } = await clusterClient.indices.rollover({
alias,
dry_run: true,
}));
Contributor

A thought on race conditions between different parts of index bootstrapping in this implementation.

If I understood that correctly, in this implementation there are 3 places where certain parts of the index bootstrapping logic are implemented:

  1. RuleDataPluginService.init(). This guy is called externally from the rule_registry plugin setup phase.

    const ruleDataService = new RuleDataPluginService({
      logger,
      isWriteEnabled: config.write.enabled,
      index: config.index,
      getClusterClient: async () => {
        const deps = await startDependencies;
        return deps.core.elasticsearch.client.asInternalUser;
      },
    });
    
    ruleDataService.init().catch((originalError) => {
      const error = new Error('Failed installing assets');
      // @ts-ignore
      error.stack = originalError.stack;
      logger.error(error);
    });
  2. Bootstrapping specific to a security solution log (e.g. log of detection alerts). Is called from the security_solution setup phase.

      const ready = once(async () => {
        const componentTemplateName = ruleDataService.getFullAssetName(
          'security-solution-mappings'
        );
    
        if (!ruleDataService.isWriteEnabled()) {
          return;
        }
    
        await ruleDataService.createOrUpdateComponentTemplate({
          template: {
            name: componentTemplateName,
            body: {
              template: {
                settings: {
                  number_of_shards: 1,
                },
                mappings: {}, // TODO: Add mappings here via `mappingFromFieldMap()`
              },
            },
          },
          templateVersion: 1,
        });
    
        await ruleDataService.createOrUpdateIndexTemplate({
          template: {
            name: ruleDataService.getFullAssetName('security-solution-index-template'),
            body: {
              index_patterns: [ruleDataService.getFullAssetName('security-solution*')],
              composed_of: [
                ruleDataService.getFullAssetName(TECHNICAL_COMPONENT_TEMPLATE_NAME),
                ruleDataService.getFullAssetName(ECS_COMPONENT_TEMPLATE_NAME),
                componentTemplateName,
              ],
            },
          },
          templateVersion: 1,
        });
      });
    
      ready().catch((err) => {
        this.logger!.error(err);
      });
    
      ruleDataClient = new RuleDataClient({
        alias: plugins.ruleRegistry.ruleDataService.getFullAssetName('security-solution'),
        getClusterClient: async () => {
          const coreStart = await start();
          return coreStart.elasticsearch.client.asInternalUser;
        },
        ready,
      });
  3. Bootstrapping specific to a concrete log within a namespace. This is done here in RuleDataClient: rolloverAliasIfNeeded and createWriteTargetIfNeeded.

We need to guarantee that 1 is executed once and before 2, which in turn is executed once and before 3.

While I think this seems to be implemented correctly, it took a while for me to trace all those parts of bootstrapping, dependencies between them, how/where different ready functions are being passed and called, etc. So imho we have 3 places where bootstrapping happens, but the order is a bit implicit, and it's hard to construct the full picture in the head.

Wondering if we could make it at least slightly more explicit.

Off the top of my head:

  • We could rename getClusterClient() to waitUntilBootstrappingFinished() or smth like that. It would still return cluster client.
  • We could pass cluster client as a parameter to rolloverAliasIfNeeded() and other functions that depend on it.
  • We could explicitly call waitUntilBootstrappingFinished() in public methods when appropriate and only once.

Ideally would be nice to encapsulate the whole bootstrapping (specific to a log) in a single class.

Contributor Author

I definitely agree that there's room for improvement on the handling of ready signals here. To me, the biggest source of confusion/possible future bugs is that the solution plugins must define a ready signal as part of step 2 (Bootstrapping specific to a security solution log) and that ready signal is then passed in to the RuleDataClient. So steps 1 and 3 are handled by the rule registry infrastructure, but solution devs still have to have fairly detailed knowledge of how the bootstrap process works to understand the requirements around "ready signal" they must implement as part of step 2.

To solve this, I envision adding a function to the RuleDataPluginService that would look something like async bootstrapSolutionTemplates(logName, componentTemplates[], indexTemplate): Promise<RuleDataClient>. This function would be responsible for executing the tasks in step 2 and then constructing a RuleDataClient to read and write from indices based on those templates. Then in the solution plugin setup we would call const ruleDataClientPromise = ruleDataService.bootstrapSolutionTemplates(...), purposely not awaiting the call to not block during setup. However, we can pass the ruleDataClientPromise to rule types that need it and they can await the promise inside the executors ensuring that bootstrapSolutionTemplates finishes before they use the RuleDataClient.
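
A very rough sketch of how that could look from the solution plugin's side; every name and signature here is hypothetical, since the encapsulation itself is left to a follow-up PR:

```ts
// All of these names are illustrative only; the sketch just shows the proposed shape.
// In the solution plugin's setup phase: kick off template bootstrapping without awaiting it.
const ruleDataClientPromise = ruleDataService.bootstrapSolutionTemplates({
  logName: 'security-solution',
  componentTemplates: [securitySolutionComponentTemplate],
  indexTemplate: securitySolutionIndexTemplate,
});

// Inside a rule executor: awaiting the promise guarantees the templates exist before the
// client is used, so the RuleDataClient itself no longer needs a separate ready signal.
const ruleDataClient = await ruleDataClientPromise;
const writer = await ruleDataClient.getWriter({ namespace });
await writer.bulk(alertsToWrite);
```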

I think with this approach, we'd still need the ready signal internal to the RuleDataPluginService to ensure that calls to bootstrapSolutionTemplates don't execute until after RuleDataPluginService.init() completes. But, since the RuleDataClient would come from bootstrapSolutionTemplates, we wouldn't need a ready signal inside the RuleDataClient at all - as soon as the RuleDataClient is created by bootstrapSolutionTemplates it's ready to be used.

Thoughts? Do you think this would sufficiently encapsulate and simplify the bootstrap process?

Contributor Author

Also I'm hoping we can defer the actual implementation of this encapsulation to a follow up PR

Contributor

> Thoughts? Do you think this would sufficiently encapsulate and simplify the bootstrap process?

Yeah, this could be an option (if you really don't like the declarative way of resolving log definitions :P ). Injecting a hook. Btw, bootstrapSolutionTemplates() could be sync and return RuleDataClient right away. It would call the bootstrapping internally in parallel and make sure the ready signal is implemented properly.

One thing I'd keep in mind when implementing this is #102586 (comment) - how do we make sure that inside the hook a developer doesn't have to reference common templates manually.

> Also I'm hoping we can defer the actual implementation of this encapsulation to a follow up PR

Yeah, of course, let's merge this PR sooner! 🚢

Contributor Author

I was thinking of bootstrapSolutionTemplates as a way to specify an index mapping in a declarative way rather than as a hook. Can we pair on it soon and discuss? I might be misunderstanding what declarative is in this context.

Comment on lines 163 to 165
if (err?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
throw err;
}
Contributor

Just curious, is there any doc describing exceptions from different methods of ElasticsearchClient?

Contributor Author

I'm not aware of any...it would be amazing if we had one though.

Comment on lines 248 to 249
ruleDataService.getFullAssetName(TECHNICAL_COMPONENT_TEMPLATE_NAME),
ruleDataService.getFullAssetName(ECS_COMPONENT_TEMPLATE_NAME),
Contributor

Probably off-topic, but I'm not sure where better to discuss this.

What's the intent behind this separation into a "technical" template and an "ecs" template? Why do we need to have 2 templates? Imho everything in them is just a bunch of common mappings, and it could be specified in a single template.

I also checked the templates, and it's weird to me:

export const technicalComponentTemplate: ClusterPutComponentTemplateBody = {
  template: {
    settings: {
      number_of_shards: 1,
    },
    mappings: mappingFromFieldMap(technicalRuleFieldMap),
  },
};

export const technicalRuleFieldMap = {
  ...pickWithPatterns(
    ecsFieldMap,
    TIMESTAMP,
    EVENT_KIND,
    EVENT_ACTION,
    RULE_UUID,
    RULE_ID,
    RULE_NAME,
    RULE_CATEGORY,
    TAGS
  ),
  [PRODUCER]: { type: 'keyword' },
  [ALERT_UUID]: { type: 'keyword' },
  [ALERT_ID]: { type: 'keyword' },
  [ALERT_START]: { type: 'date' },
  [ALERT_END]: { type: 'date' },
  [ALERT_DURATION]: { type: 'long' },
  [ALERT_SEVERITY_LEVEL]: { type: 'keyword' },
  [ALERT_SEVERITY_VALUE]: { type: 'long' },
  [ALERT_STATUS]: { type: 'keyword' },
  [ALERT_EVALUATION_THRESHOLD]: { type: 'scaled_float', scaling_factor: 100 },
  [ALERT_EVALUATION_VALUE]: { type: 'scaled_float', scaling_factor: 100 },
} as const;
export const ecsComponentTemplate: ClusterPutComponentTemplateBody = {
  template: {
    settings: {
      number_of_shards: 1,
    },
    mappings: merge(
      {},
      mappingFromFieldMap(ecsFieldMap),
      mappingFromFieldMap(technicalRuleFieldMap)
    ),
  },
};
  • So ecs template contains ALL the ecs fields + technical fields.
  • Technical template contains technical fields.
  • Technical fields are a subset of ecs fields + some extensions to ecs (our custom fields).

And the resulting index template references all of this, thus essentially includes ALL standard ECS fields + our extensions. Is this our plan to include the whole ECS?

Also, wouldn't it be better to have 2 separate field maps (one for a common subset of ECS fields and another for extensions to ECS), and then merge them into a single common .alerts-mappings component template?

This is what I basically did in the draft (#101453) and described in #101016 (comment). Would be happy to get any comments on this.

Contributor Author

@marshallmain marshallmain Jun 23, 2021

Since Observability indices don't want the full ECS mapping, we need to keep the base .alerts component template (the technical fields) as minimal as possible and allow solutions to add various components as needed.

Yeah it seems odd that the ecsComponentTemplate contains the technicalRuleFieldMap - maybe that's an oversight? I think the overall idea is that the technicalComponentTemplate contains the fields that are common across all rules in the RAC initiative, so we may want to enforce that that template is used in all index templates created by the RuleDataPluginService (maybe by automatically dropping it into the composed_of field). The ecsComponentTemplate with all ECS fields is useful for security solution templates, since we include the entire ECS mapping so that when we copy documents into the .alerts index the ECS fields in the original documents will be mapped in the alert documents as well. We'll most likely have security specific fields to map in the .alerts-security indices also, and we could put those in a third component template (first 2 are technical and ECS components).

Re: merging the field maps, I would like to leave the merging of component templates to Elasticsearch, similar to how it's described in #101016 (comment) with a variety of separate component templates, and avoid constructs like

mappings: merge(
  {},
  mappingFromFieldMap(ecsFieldMap),
  mappingFromFieldMap(technicalRuleFieldMap)
),

where we explicitly merge mappings into single component. I like the pattern we use to build the security solution index template right now where we compose the full index template using the base technical template, the ECS template, and our (currently empty) security solution-specific component template:

{
  name: ruleDataService.getFullAssetName('security-solution-index-template'),
  body: {
    index_patterns: [ruleDataService.getFullAssetName('security-solution*')],
    composed_of: [
      ruleDataService.getFullAssetName(TECHNICAL_COMPONENT_TEMPLATE_NAME),
      ruleDataService.getFullAssetName(ECS_COMPONENT_TEMPLATE_NAME),
      componentTemplateName,
    ],
  },
}

Contributor

Got it. Yeah, maybe there's an oversight and some confusion. I see your points. Let's maybe discuss this separately and address in a follow-up PR (I could take over it).

'versions',
]);
const componentTemplateRemoved =
Object.keys(currentVersions).length > Object.keys(targetVersions).length;
Contributor

This line fails with TypeError: Cannot convert undefined or null to object when currentVersions is undefined. Let's add an additional guard here.

Contributor

Interesting, how's that possible. Both current and simulated indices should have this versions object in _meta (I guess?). Maybe this is an indication of a bug here?

Contributor Author

Maybe it happened when switching from another branch to this one while using the same ES cluster? I'm extracting this logic into a function and making it a little more robust with guards around the cases where currentVersions or targetVersions are undefined.
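
The guard can be as small as defaulting the two version maps before comparing them; a sketch, not the merged code:

```ts
// Treat a missing _meta.versions object as an empty map so the length comparison and
// Object.entries() below no longer throw "Cannot convert undefined or null to object".
const current: Record<string, number> = currentVersions ?? {};
const target: Record<string, number> = targetVersions ?? {};
const componentTemplateRemoved = Object.keys(current).length > Object.keys(target).length;
```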

Contributor

In my case, there was no _meta field in the response. Not sure how it happened, tbh.

const { namespace } = options;
const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
await this.rolloverAliasIfNeeded({ namespace });
Contributor

@xcrzx xcrzx Jun 23, 2021

Having await this.rolloverAliasIfNeeded({ namespace }); inside getWriter could carry a significant performance penalty.

Let's consider solution-side code that performs many writes during a request or rule execution. A single log write operation could look like this:

await ruleDataClientInstance.getWriter({ namespace }).bulk(logsToWrite);

And I can imagine a rule that performs tens of writes like that during the execution. From the solution developer's perspective, nothing bad happens; they write logs N times. But under the hood, every write request is accompanied by at least three additional requests (total requests = N*4):

// Expected request
await clusterClient.bulk()
// Additional "hidden" requests
await clusterClient.indices.rollover({ dry_run: true })
await clusterClient.indices.get()
await clusterClient.indices.simulateIndexTemplate()

To avoid making these extra requests on every write, the developer should instantiate the writer and reuse it. But I think it's (a) not always convenient to do, and (b) the developer should know implementation details of getWriter and be aware of the possible pitfall.

We can envisage use cases as above and protect API users from harming performance by having a map of instantiated writers (Map<Namespace, RuleDataWriter>()) and returning a writer for a given namespace before even calling rolloverAliasIfNeeded if it is already in the map.

What do you think?
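
A sketch of the suggested cache (hypothetical, not part of this PR), memoizing the writer promise per namespace so repeated getWriter calls reuse the rollover work:

```ts
type RuleDataWriter = { bulk(body: object[]): Promise<unknown> };

class CachedWriterFactory {
  private readonly writers = new Map<string, Promise<RuleDataWriter>>();

  constructor(private readonly createWriter: (namespace: string) => Promise<RuleDataWriter>) {}

  getWriter({ namespace = 'default' }: { namespace?: string } = {}): Promise<RuleDataWriter> {
    const cached = this.writers.get(namespace);
    if (cached) {
      return cached;
    }
    // First request for this namespace: run rolloverAliasIfNeeded etc. exactly once.
    const writer = this.createWriter(namespace);
    this.writers.set(namespace, writer);
    return writer;
  }
}
```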

Contributor

> To avoid making these extra requests on every write, the developer should instantiate the writer and reuse it.

That's I'd say how it should be done. If you're in an executor function, you will get a namespace from the rule parameters and get the writer. Then you'll pass it down to your classes and functions as a dependency to write events.

Same if you need to log stuff from route handlers.

> having a map of instantiated writers (Map<Namespace, RuleDataWriter>())

Hehe, that's similar to what I had in my draft :) I think this could be handy in general if someone would always use the API like that: await ruleDataClientInstance.getWriter({ namespace }).bulk(logsToWrite);, but maybe YAGNI.

I'd say let's consider that in a follow-up PR.

Contributor Author

Yeah this is an area that is a little bit tricky. I had in mind adding some dev docs to explain the expected usage pattern (create a writer inside the alert executor and reuse it for the duration of the executor).

Typescript does do a little bit more to guide devs towards that pattern naturally - since getWriter is async, you have to do something like

const writer = await ruleDataClientInstance.getWriter({ namespace });
await writer.bulk(logsToWrite);

which might help clue devs in that writers are meant to be reused.

On the flip side, the potential issue with caching the writers and therefore only running the rollover logic once at instantiation time is that if the mappings change while Kibana is running then we won't detect it unless we add a way to manually re-check the mapping versions for an index. Our built-in mappings won't change this way, but we've had multiple cases of advanced users modifying the mappings to add their own custom fields. If we do add the user-defined templates (#101016 (comment)) then those would be changing while Kibana is running - so we'd want to check the template versions frequently to make sure changes to those templates get propagated to the indices.

So in my mind the RuleDataWriter is designed to live for a few seconds, e.g. for the duration of a rule execution, after which it should be discarded and a new one instantiated. My hope was that with docs we can provide simple guidance about creating and using a single writer within the rule executor and devs that way won't have to worry too much about the internals - they just know that you make 1 writer in the executor.

Maybe we could also add an optional parameter to getWriter to disable the rollover check, for use in route handlers that don't want to pay the overhead of the rollover check?

Contributor

> That's I'd say how it should be done. If you're in an executor function, you will get a namespace from the rule parameters and get the writer. Then you'll pass it down to your classes and functions as a dependency to write events.

Yeah, it should be done that way. But my point is that the current API design doesn't enforce that. And in fact, it is much easier to use the API in the wrong way. Currently, all classes/functions depend on ruleDataClient instead of writer, like here, and here. And I'm also doing something similar in other places in my PR, just because it's much easier. I already have ruleDataClient as a dependency; why bother extracting a writer and passing it down as another dependency? And some developers (including me 🙂) prefer to just copy-paste working code from other places instead of reading docs and exploring how things work under the hood. That's why I think adding docs wouldn't make much difference. That said, it's still a good practice to have docs.

I think a well-designed system should make it easy to do the right things and hard to do the wrong things. That's why I'm highlighting the current API as it could potentially be misused. But, as we have plans to improve RuleDataClient in a follow-up PR, let's discuss the API design separately. It is not a blocker for the current changes to be merged.

Contributor

I will try to address this concern in the next PR within #101016 (will be a parallel PR to this one). The idea is to bind RuleDataClient to a namespace, so you could get an instance of it only if you know the namespace you're gonna work with:

// somewhere in the setup phase of security_solution
await ruleDataService.bootstrap(detectionAlertsLogDefinition);

// somewhere in an executor or a route handler
const detectionAlertsClient = await ruleDataService.getClient(detectionAlertsLogDefinition, namespace);

// or we could even make it sync
const detectionAlertsClient = ruleDataService.getClient(detectionAlertsLogDefinition, namespace);

const writer = detectionAlertsClient.getWriter();
const reader = detectionAlertsClient.getReader();

@marshallmain @xcrzx

@xcrzx
Contributor

xcrzx commented Jun 23, 2021

It looks like there is a race condition somewhere in the index bootstrapping logic.

Steps to reproduce

  1. Delete manually all indices, index templates, and component templates managed by the RuleDataService.
  2. Restart dev server

Expected result
Indices get recreated.

Actual result
Indices were recreated, but the server crashed with:

ResponseError: resource_already_exists_exception: [resource_already_exists_exception] Reason: index [.alerts-security.events-default-000001/62v10EIUSxKpty2O_bKFjg] already exists

After another restart, everything works as expected.

Contributor

@dhurley14 dhurley14 left a comment

There seems to be some unpredictability with these changes. API integration tests are needed to ensure consistency. We have some available within the RBAC PR as a reference if you want to copy them into this PR. Or just wait until the RBAC PR is merged and add them in at that point.

Rule registry was merged into master without any automated tests (we added some in the RBAC PR) and I think we are falling into a bad habit of continuing to merge code without test coverage. This is a good entry point to begin undoing those bad habits.

@dhurley14 dhurley14 added the Feature:RAC and Theme: rac labels Jun 23, 2021
@dgieselaar
Member

> Rule registry was merged into master without any automated tests

This is incorrect. There are API integration tests in the APM test suite, and we have added unit tests since as well.

@marshallmain
Contributor Author

@elasticmachine merge upstream

@marshallmain
Contributor Author

@dhurley14 with the implementation changes to do the mapping updates at Kibana startup time instead of each time a RuleDataWriter is fetched, this logic is not very well suited for API integration tests. IMO either manual or automated upgrade tests would be a more effective way to test this logic. Once our upgrade test framework is merged we could consider adding some tests there to ensure that existing index mappings get updated as expected. Would that be an acceptable alternative?

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@ruflin
Member

ruflin commented Jul 7, 2021

I opened elastic/elasticsearch#75031, which would allow us to potentially get rid of this logic in Kibana itself in the long term and let Elasticsearch take care of it. Short term, we need to have it in Kibana as this PR does.

Member

@jasonrhodes jasonrhodes left a comment

LGTM -- thanks so much for being flexible and finding a good solution!

I want to get @leehinman and @danhermann to take a look after this merges and help us confirm that it's okay for us to control the ILM suffix logic and how that will work alongside Elasticsearch's own ILM suffix update logic, but that doesn't need to block this merge.

@danhermann

> I want to get @leehinman and @danhermann to take a look after this merges and help us confirm that it's okay for us to control the ILM suffix logic and how that will work alongside Elasticsearch's own ILM suffix update logic, but that doesn't need to block this merge.

@jasonrhodes, could you elaborate on your question here? I don't have any context on this particular PR.

@jasonrhodes
Member

@danhermann sure, sorry for the context-less ping. We had a good conversation with @jpountz last week about this and he mentioned your names as some folks to check in with on this. Basically, we are manually incrementing what I'm referring to as the "ILM suffix" in order to try to prevent unnecessary rollovers. See:

https://github.com/elastic/kibana/pull/102586/files#diff-95e932cd3b8c7b0abfb362ff46544379888f68ed22b9dc0c86a21b18f2971335R151-R160

I wanted to make sure of mainly two explicit questions:

  1. When we determine the current suffix, we take the last 6 characters. Is that a problem?
  2. Will this logic cause any interference with the Elasticsearch logic for incrementing this value?

Contributor

@smith smith left a comment

APM changes look good

@marshallmain marshallmain added the auto-backport label Jul 7, 2021
@danhermann

> @danhermann sure, sorry for the context-less ping. We had a good conversation with @jpountz last week about this and he mentioned your names as some folks to check in with on this. Basically, we are manually incrementing what I'm referring to as the "ILM suffix" in order to try to prevent unnecessary rollovers. See:
>
> https://github.com/elastic/kibana/pull/102586/files#diff-95e932cd3b8c7b0abfb362ff46544379888f68ed22b9dc0c86a21b18f2971335R151-R160
>
> I wanted to make sure of mainly two explicit questions:
>
>   1. When we determine the current suffix, we take the last 6 characters. Is that a problem?
>   2. Will this logic cause any interference with the Elasticsearch logic for incrementing this value?

Thanks, @jasonrhodes. I know that for (1), there are some edge cases in which the suffix of an index may not be six numeric characters: the first index not rolled over by ILM and the scenario in which indices have been rolled more than 999999 times. It's probably better to look for the last - character and extract the suffix that way. As for (2), I'd prefer to defer to @dakrone or @andreidan as they know those parts of ILM better than I.
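
For illustration, deriving the counter from everything after the last '-' rather than a fixed six characters could look like this; the helper name is ours and edge cases such as a missing numeric suffix are not handled:

```ts
function incrementIndexName(indexName: string): string {
  const separatorIndex = indexName.lastIndexOf('-');
  const prefix = indexName.slice(0, separatorIndex + 1);
  const suffix = indexName.slice(separatorIndex + 1); // e.g. '000001'
  // Increment and zero-pad back to the original width; wider-than-six suffixes still work.
  const next = String(Number(suffix) + 1).padStart(suffix.length, '0');
  return `${prefix}${next}`;
}

// incrementIndexName('.alerts-security.events-default-000001')
//   === '.alerts-security.events-default-000002'
```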

@marshallmain
Contributor Author

Thanks for the feedback @danhermann ! One possible alternative that might be more robust is to use the dry_run parameter on the rollover API to get the current and next write index names without actually rolling over. I was using that logic initially but switched to manually computing the next index name because there could potentially be a lot of aliases that would need to be updated on Kibana startup and I wasn't sure how expensive the rollover dry_run API call would be. Do you think that would be a better alternative?

@marshallmain marshallmain merged commit 2c6801e into elastic:master Jul 8, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Jul 8, 2021
* Add component template versioning to RuleDataClient

* Add versioning for index templates

* Address PR comments, add error handling inside rolloverAliasIfNeeded

* Fix security alerts index name, rollover func param, more robust rollover logic

* Add empty mapping check to createOrUpdateIndexTemplate

* Fix error path in createWriteTargetIfNeeded to suppress resource_already_exists_exception

* Add code comments around rollover logic

* Replace numeric versions with semver

* Use optional chaining operator to fetch current write index mapping

* Fix template version number

* Move mapping updates to plugin startup, remove dependency on component versions

* Undo changes to lifecycle and persistance rule type factories

* Remove test code

* Simplify race mitigation logic

* Remove outdated comment

* Add unit tests, log unexpected errors instead of rethrowing

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@kibanamachine
Contributor

💚 Backport successful

Status Branch Result
7.x

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Jul 8, 2021
* Add component template versioning to RuleDataClient

* Add versioning for index templates

* Address PR comments, add error handling inside rolloverAliasIfNeeded

* Fix security alerts index name, rollover func param, more robust rollover logic

* Add empty mapping check to createOrUpdateIndexTemplate

* Fix error path in createWriteTargetIfNeeded to suppress resource_already_exists_exception

* Add code comments around rollover logic

* Replace numeric versions with semver

* Use optional chaining operator to fetch current write index mapping

* Fix template version number

* Move mapping updates to plugin startup, remove dependency on component versions

* Undo changes to lifecycle and persistance rule type factories

* Remove test code

* Simplify race mitigation logic

* Remove outdated comment

* Add unit tests, log unexpected errors instead of rethrowing

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Marshall Main <55718608+marshallmain@users.noreply.github.com>
jloleysens added a commit to jloleysens/kibana that referenced this pull request Jul 8, 2021
…-of-max-results

* 'master' of github.com:elastic/kibana: (36 commits)
  Lower Kibana app bundle limits (elastic#104688)
  [Security Solutions] Fixes bug with the filter query compatibility for transforms (elastic#104559)
  [RAC] Add mapping update logic to RuleDataClient (elastic#102586)
  Fix import workpad (elastic#104722)
  [canvas] Fix Storybook service decorator (elastic#104750)
  [Detection Rules] Add 7.14 rules (elastic#104772)
  [Enterprise Search] Fix beta notification in sidebar (elastic#104763)
  Fix engine routes that are meta engine or non-meta-engine specific (elastic#104757)
  [Fleet] Fix policy revision number getting bumped for no reason (elastic#104696)
  persistable state migrations (elastic#103680)
  [Fleet] Fix add agent in the package policy table (elastic#104749)
  [DOCS] Creates separate doc for security in production (elastic#103973)
  [SO Migration] fix reindex race on multi-instance mode (elastic#104516)
  [Security Solution] Update text in Endpoint Admin pages (elastic#104649)
  [package testing] Decrease timeout to 2 hours (elastic#104668)
  Fix background styling of waterfall chart sidebar tooltip. (elastic#103997)
  [Fleet + Integrations UI] Integrations UI Cleanup (elastic#104641)
  [Fleet] Link to download page of current stack version on Agent install instructions (elastic#104494)
  [Workplace Search] Fix Media Type field preview is unformatted bug (elastic#104684)
  [ML] add marker body (elastic#104672)
  ...

# Conflicts:
#	x-pack/plugins/fleet/public/search_provider.test.ts
@danhermann

> Thanks for the feedback @danhermann ! One possible alternative that might be more robust is to use the dry_run parameter on the rollover API to get the current and next write index names without actually rolling over. I was using that logic initially but switched to manually computing the next index name because there could potentially be a lot of aliases that would need to be updated on Kibana startup and I wasn't sure how expensive the rollover dry_run API call would be. Do you think that would be a better alternative?

Calling the rollover API with the dry_run parameter is lighter-weight than without, but in order to validate the request, it still has to retrieve stats on the index being rolled which may require calls between nodes.

@jasonrhodes
Member

> It's probably better to look for the last - character and extract the suffix that way.

@marshallmain what do you think of this suggestion to avoid a bug based on using the last 6 characters as the ILM suffix?

@banderror
Contributor

@jasonrhodes regarding the ILM suffix, we also think it would be nice to make it more robust; we'll address it in a follow-up PR.

See 1. Index name creation & update logic in #101016 (comment)
