[RAC] Add mapping update logic to RuleDataClient #102586
Conversation
Two questions:
const writeIndexMapping = await clusterClient.indices.get({
  index: simulatedRollover.old_index,
});
const currentVersions = get(writeIndexMapping, [
why use get instead of the optional chaining operator?
I didn't see much difference since currentVersions is any either way. I chose this way to be consistent with how we're getting targetVersions below - simulatedIndex is typed as an empty object, so the easiest way to get a field from it was just to use get.
I think we can add (minimal) types for writeIndexMapping and simulatedIndex then. Once the client itself adds them we can remove them again. That way, if the types change, or are incorrect, we are notified via a TypeScript error.
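For illustration, a minimal sketch of what such narrow types could look like (the interface names and shapes here are assumptions for this thread, not the client's real response types):

interface MappingsWithVersions {
  _meta?: { versions?: Record<string, number> };
}

interface SimulatedIndexResponse {
  template?: { mappings?: MappingsWithVersions };
}

// indices.get() keys the response by concrete index name.
type GetIndexResponse = Record<string, { mappings?: MappingsWithVersions }>;

// With these in place, optional chaining can replace lodash get():
function readTargetVersions(simulatedIndex: SimulatedIndexResponse): Record<string, number> {
  return simulatedIndex.template?.mappings?._meta?.versions ?? {};
}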
const targetVersions = get(simulatedIndex, ['template', 'mappings', '_meta', 'versions']);
const componentTemplateRemoved =
  Object.keys(currentVersions).length > Object.keys(targetVersions).length;
const componentTemplateUpdated = Object.entries(targetVersions).reduce(
  (acc, [templateName, targetTemplateVersion]) => {
    const currentTemplateVersion = get(currentVersions, templateName);
    const templateUpdated =
      currentTemplateVersion == null || currentTemplateVersion < Number(targetTemplateVersion);
    return acc || templateUpdated;
  },
  false
);
I think all you have to do here is check whether there's at least one component template that is newer than the previous version, or wasn't present, and I think we can use a for loop here to improve readability. Is there something we need to do when a component template is removed?
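For reference, the for-loop shape suggested here might look roughly like this (a sketch; it assumes the versions are plain name-to-number records):

function writeIndexIsOutdated(
  currentVersions: Record<string, number>,
  targetVersions: Record<string, number>
): boolean {
  for (const [templateName, targetVersion] of Object.entries(targetVersions)) {
    const currentVersion = currentVersions[templateName];
    // A component template that is missing from the write index, or that has a
    // newer target version, means the index needs to pick up new mappings.
    if (currentVersion == null || currentVersion < targetVersion) {
      return true;
    }
  }
  return false;
}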
I assume if someone removes a template then they intend for it not to be in the mapping anymore, so it would be important to rollover the index to remove those fields from the current write index. This ensures that any changes to the mappings will be propagated to the indices as soon as possible, whereas if we don't rollover the index on template removal then the changes wouldn't be propagated until the index rolls over naturally based on the ILM policy.
It won't break if the fields from the removed component template stick around though. I'm a bit wary of switching between branches and continuously rolling over. Things like that can be hard to repair. That's one very plausible scenario. The other thing to consider is if there is a bug here somewhere the consequences are painful because we've (imho needlessly) chosen to be pretty aggressive in rolling over.
It will break in some cases. Certain queries like EQL and SQL can fail if mappings for a field are different across different indices, so it can be important to remove bad mappings as quickly as possible.
My issue with not rolling over aggressively is that it effectively leaves the index in kind of a "pending upgrade" state for a while after upgrade if component templates are removed. In this state, the templates have been updated but the write index won't be updated for some indeterminate amount of time - until the alias naturally rolls over due to the ILM policy. This effectively prevents us from removing any component templates and relying on the fields to no longer be mapped and will be confusing from a support perspective since the upgrade won't be truly finished until the natural rollover happens.
What's the use case for switching between branches? Switching back and forth between branches with different mappings but sharing the same ES cluster is going to break in other ways even without this template removal logic. If all component template versions on one branch are <= to the versions on the other branch, then it won't update the mappings and features may not work as expected. If some component templates are higher versioned on each branch, then it will rollover on every branch change, just as with the component removal case - this actually seems desirable to me from a dev perspective since it means the write index will have the proper mappings for the branch. Saved object migrations would have a similar issue with schema changes and switching branches - there's no downgrade capability, so once you upgrade the .kibana index on one branch there's no going back.
  await clusterClient.indices.rollover({
    alias,
    body: {
      conditions: {
        max_age: '5s',
      },
    },
  });
}
can we do this via a task? I don't really trust this (but I haven't thought about it long enough to be able to articulate my concerns better)
We could look into alternatives to make this more robust. One option might be to specify the target index explicitly in the rollover request - I'll have to try it out and see if that makes the rollover request fail in the case where the target index already exists. If it does, then that should solve the race condition here since the rollover would only happen if another codepath hasn't rolled the index in between computing needRollover and actually rolling the index.
There can be race conditions on a single kibana instance, but also by virtue of multiple kibana instances trying to roll over the index. It doesn't seem like a very unlikely scenario. A task will give us the guarantee that it will only happen once.
Fixed in 19b9eaf - by explicitly specifying the target index name pulled from the simulated rollover, we ensure that the rollover only happens if no other code path has rolled over the alias in the meantime.
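Conceptually the fix looks something like this (a sketch of the idea only, not the literal code from 19b9eaf; clusterClient, alias, and simulatedRollover are the values from the snippet above):

// new_index comes from the earlier dry-run rollover, so every Kibana instance
// computes the same target name.
try {
  await clusterClient.indices.rollover({
    alias,
    new_index: simulatedRollover.new_index,
  });
} catch (err) {
  // If another code path (or another Kibana instance) already rolled the alias,
  // the target index exists and Elasticsearch rejects this request - safe to ignore.
  if (err?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
    throw err;
  }
}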
  throw new Error(
    'No mappings would be generated for this index, possibly due to failed/misconfigured bootstrapping'
  );
}
I'm not sure if this should be removed. I don't see the same check in rolloverAliasIfNeeded, and I think we should be explicit about this, because teams new to RAC will run into it - I did, too, and it is a little annoying to figure out and repair.
This sort of check seems like it belongs with the template bootstrapping logic IMO. At this point we've already created the index, so simulating it seems redundant. Can we move this check so it happens sometime before the index is created, maybe at index template creation time?
I'm fine with moving it, but I would like to keep it somewhere. But let's figure out first where this all needs to happen.
There should always be some common mappings specified (currently they are specified in templates installed in init() - TECHNICAL_COMPONENT_TEMPLATE_NAME and ECS_COMPONENT_TEMPLATE_NAME). We just have to make sure that final index templates always reference (composed_of) these common component templates.

I'd say this is a matter of having an API that doesn't provide 100% flexibility in index bootstrapping and makes sure that the index template is always created and always references the common templates - no matter how the client of the library uses it.
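As a sketch of that idea (the helper name here is made up; it only illustrates the service injecting the shared templates into composed_of regardless of what the caller passes):

// Hypothetical helper inside the rule registry: callers only list their own
// solution-specific component templates, the common ones are always prepended.
function buildComposedOf(
  getFullAssetName: (name: string) => string,
  solutionComponentTemplates: string[]
): string[] {
  return [
    getFullAssetName(TECHNICAL_COMPONENT_TEMPLATE_NAME), // existing rule_registry constant
    getFullAssetName(ECS_COMPONENT_TEMPLATE_NAME), // existing rule_registry constant
    ...solutionComponentTemplates,
  ];
}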
Moved this check to index template creation time in 4ccf63a
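In rough terms the relocated check could look like this (a sketch, not the exact code from 4ccf63a; templateName stands in for the index template being installed):

// Simulate the composed template before installing it; an empty mapping means
// the component templates are missing or misconfigured.
const { body: simulatedTemplate } = await clusterClient.indices.simulateTemplate({
  name: templateName,
});
const simulatedMappings = simulatedTemplate?.template?.mappings ?? {};
if (Object.keys(simulatedMappings).length === 0) {
  throw new Error(
    'No mappings would be generated for this index, possibly due to failed/misconfigured bootstrapping'
  );
}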
x-pack/plugins/rule_registry/server/rule_data_plugin_service/index.ts
x-pack/plugins/rule_registry/server/rule_data_plugin_service/index.ts
Versioning is critical so that we can (1) detect when index mappings are outdated and automatically update them to the current version, and (2) offer a migration path similar to the detection alerts migration API for updating existing alerts. (1) is the more common case, but (2) is critical in some cases for fixing mapping mistakes where a field is mapped as the wrong type (we made this mistake before, although we didn't migrate and just accepted a minor feature breakage) or if we add a new feature/visualization that we want to work on historical alerts, not just newly written alerts. In most releases we've added new fields to the

I considered a few other implementation strategies for certain parts. One option was to use a single version number for an index template (rather than stamping it with all the component versions as well) and build that version number by combining the version of each template. Losing the individual component versions is not ideal, though, and would potentially complicate the alert migration process in the future because updating a component template version from "1" to "2" might change the index template version for one template from "1" to "2" if that's the first update for that template, while the same component update for an index template that has other changes accrued already might change from version "5" to "6". By keeping all the component versions, we can keep the index versions in sync more easily.

I also considered other places where we could execute the rollover logic, the main alternative being at plugin setup. The primary issue with that is we'd have to find all the .alerts indices (including different namespaces) and check them all, probably blocking any rule data clients from executing reads/writes until all the indices have been checked and rolled over if necessary. We'd also have to somehow wait until plugins that consume the rule registry create their templates before executing the rollover logic, which means more synchronization logic. Finally, we'd potentially be doing extra work on indices that aren't used anymore. Rolling over the indices inside
I don't have a strong preference on this either way. Sometimes we've had users ask about why the endpoint logs indices don't exist yet after they install Agent and Endpoint and we have to explain that the indices aren't created until an alert is actually written. This can also cause some issues with rules that search those indices, since the rules return index_not_found instead of 0 hits. In the RAC context, this could cause confusion if users want to set up rules-on-rules (second order rules) before the first order rule generates any alerts - since the first order rules wouldn't create an index until the first alert, the second order rules would get index_not_found errors.
async rolloverAliasIfNeeded({ namespace }: { namespace?: string }) {
  const clusterClient = await this.getClusterClient();
  const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
  const { body: simulatedRollover } = await clusterClient.indices.rollover({
Why do we need to simulate a rollover? Can't we get the current write index, simulate the index template, and compare the two? (And I think those can happen in parallel)
We have to make sequential calls either way - one call to retrieve the concrete index name for the alias write index, then a second set of calls to get the mapping for the concrete index and the simulated mapping for the new index. If we do it with the simulated rollover, elasticsearch handles extracting the current write index and the new index name for us and makes them available as old_index and new_index. If we avoid using simulated rollover, then we have to manually fetch the alias and extract the write index and also manually compute the next index name. So we don't necessarily need to simulate a rollover, but it's simpler this way.
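In other words, the dry-run rollover gives us both names in one response, roughly like this (a sketch using the clusterClient and alias from the snippet above):

// dry_run: true means nothing changes in the cluster, but the response still
// reports which concrete index the alias points at and what a real rollover
// would create next.
const { body: simulatedRollover } = await clusterClient.indices.rollover({
  alias,
  dry_run: true,
});
const currentWriteIndex = simulatedRollover.old_index; // concrete write index behind the alias
const nextWriteIndex = simulatedRollover.new_index; // name a real rollover would create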
@marshallmain could you please add this clarification to the code as a comment?
Added explanatory comment in a45a70a
Pinging @elastic/security-solution (Team: SecuritySolution)
Pinging @elastic/security-detections-response (Team:Detections and Resp)
@elasticmachine merge upstream
Pinging @elastic/apm-ui (Team:apm)
👍 I think this is a very interesting implementation overall; it feels better than what I had in my draft PR (#101453) in terms of how the logic of "soft migration" is implemented (updating templates, then rolling over the alias).

However, the way we currently split parts into templates seems a bit random to me, and I'm concerned about divergence/inconsistencies in index bootstrapping between solutions. In my draft I introduced a more "standardized" system of component templates (#101016 (comment)); wondering what everyone thinks about it.
const { namespace } = options;
const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
await this.rolloverAliasIfNeeded({ namespace });
rolloverAliasIfNeeded accepts namespace and calls const alias = getNamespacedAlias({ alias: this.options.alias, namespace }); inside. Can we just pass the alias already created above?

const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
await this.rolloverAliasIfNeeded(alias);
Why is rolloverAliasIfNeeded called at the point when we get a writer, while index creation (createWriteTargetIfNeeded) is deferred to the point when we actually index something and get an exception saying that the index does not exist?

Can/should we combine the two into a single rolloverAliasOrCreateInitialIndex?
Yeah, we can just pass in the alias directly, good point.

The call to createWriteTargetIfNeeded is deferred at @dgieselaar's request to avoid populating the cluster with more empty indices than necessary. We can defer it since attempting to write to an index that doesn't exist returns an error that we can catch, then create the index and write again. We can't defer the call to rolloverAliasIfNeeded because there most likely won't be an error to catch if we try to write to an index whose mapping hasn't been updated - for most of the mapping updates we make (e.g. adding a new field to the mapping), we will still be able to write the document, but the fields newly added to the template simply won't be mapped in the new documents. By rolling over before returning the writer, we can ensure that when we do write documents the write index has the updated mappings.
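For context, the deferred-creation pattern is roughly the following (a simplified sketch of the writer behaviour, not the exact implementation; alias, bulkBody, namespace, and createWriteTargetIfNeeded come from the surrounding code):

// Attempt the write first; only if the alias has no write index yet do we
// create it and retry once.
let response = await clusterClient.bulk({ index: alias, body: bulkBody, require_alias: true });
if (response.body.errors) {
  const firstError = response.body.items?.[0]?.index?.error;
  if (firstError?.type === 'index_not_found_exception') {
    await createWriteTargetIfNeeded({ namespace });
    response = await clusterClient.bulk({ index: alias, body: bulkBody, require_alias: true });
  }
}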
We can't defer the call to rolloverAliasIfNeeded because there most likely won't be an error to catch if we try to write to an index whose mapping hasn't been updated - for most of the mapping updates we make (e.g. adding a new field to the mapping), it will still be able to write the document but the fields newly added to the template simply won't be mapped in the new documents.

👍 yeah, understood that part, agree
As for deferring createWriteTargetIfNeeded until the very last moment, I'm personally not sure if this is super necessary, because I suspect the time between getting a writer and attempting to write something will be very short. The exception would be if we get a writer but don't write anything - in this case we'd create the initial index, but it would stay empty for a while.

Both options work for me; I think this is not super important. I just thought that maybe combining them into a rolloverAliasOrCreateInitialIndex would make the code slightly simpler.
@@ -70,9 +70,10 @@ export class RuleDataClient implements IRuleDataClient {
     };
   }

-  getWriter(options: { namespace?: string } = {}): RuleDataWriter {
+  async getWriter(options: { namespace?: string } = {}): Promise<RuleDataWriter> {
     const { namespace } = options;
     const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
Nit: why do we use a parameter object + destructuring in private functions when there are only 2 parameters? Wouldn't it be cleaner to call getNamespacedAlias(this.options.alias, namespace)?

I understand when a public API accepts options via an object (e.g. the async getWriter(options: { namespace?: string } = {}): Promise<RuleDataWriter> above) - you probably want extensibility etc. But for internal or private functions this is unnecessary if the number of parameters is <= 3 imho.
I don't have a strong opinion either way. Sometimes it's nice to see the parameter names at the function call site as a reminder of what each one is. I just left it the way it is to follow the existing pattern.
async rolloverAliasIfNeeded({ namespace }: { namespace?: string }) {
  const clusterClient = await this.getClusterClient();
  const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
  let simulatedRollover: IndicesRolloverResponse;
  try {
    ({ body: simulatedRollover } = await clusterClient.indices.rollover({
      alias,
      dry_run: true,
    }));
A thought on race conditions between different parts of index bootstrapping in this implementation.
If I'm reading this correctly, in this implementation there are 3 places where certain parts of index bootstrapping logic are implemented:
1. RuleDataPluginService.init(). This guy is called externally from the rule_registry plugin setup phase.

   const ruleDataService = new RuleDataPluginService({
     logger,
     isWriteEnabled: config.write.enabled,
     index: config.index,
     getClusterClient: async () => {
       const deps = await startDependencies;
       return deps.core.elasticsearch.client.asInternalUser;
     },
   });
   ruleDataService.init().catch((originalError) => {
     const error = new Error('Failed installing assets');
     // @ts-ignore
     error.stack = originalError.stack;
     logger.error(error);
   });
2. Bootstrapping specific to a security solution log (e.g. the log of detection alerts). Is called from the security_solution setup phase.

   const ready = once(async () => {
     const componentTemplateName = ruleDataService.getFullAssetName(
       'security-solution-mappings'
     );
     if (!ruleDataService.isWriteEnabled()) {
       return;
     }
     await ruleDataService.createOrUpdateComponentTemplate({
       template: {
         name: componentTemplateName,
         body: {
           template: {
             settings: {
               number_of_shards: 1,
             },
             mappings: {}, // TODO: Add mappings here via `mappingFromFieldMap()`
           },
         },
       },
       templateVersion: 1,
     });
     await ruleDataService.createOrUpdateIndexTemplate({
       template: {
         name: ruleDataService.getFullAssetName('security-solution-index-template'),
         body: {
           index_patterns: [ruleDataService.getFullAssetName('security-solution*')],
           composed_of: [
             ruleDataService.getFullAssetName(TECHNICAL_COMPONENT_TEMPLATE_NAME),
             ruleDataService.getFullAssetName(ECS_COMPONENT_TEMPLATE_NAME),
             componentTemplateName,
           ],
         },
       },
       templateVersion: 1,
     });
   });
   ready().catch((err) => {
     this.logger!.error(err);
   });
   ruleDataClient = new RuleDataClient({
     alias: plugins.ruleRegistry.ruleDataService.getFullAssetName('security-solution'),
     getClusterClient: async () => {
       const coreStart = await start();
       return coreStart.elasticsearch.client.asInternalUser;
     },
     ready,
   });
3. Bootstrapping specific to a concrete log within a namespace. This is done here in RuleDataClient: rolloverAliasIfNeeded and createWriteTargetIfNeeded.
We need to guarantee that 1 needs to be executed once and before 2, which in turn needs to be executed once and before 3.
While I think this seems to be implemented correctly, it took a while for me to trace all those parts of bootstrapping, dependencies between them, how/where different ready functions are being passed and called, etc. So imho we have 3 places where bootstrapping happens, but the order is a bit implicit, and it's hard to construct the full picture in the head.

Wondering if we could make it at least slightly more explicit.
Off the top of my head:
- We could rename getClusterClient() to waitUntilBootstrappingFinished() or smth like that. It would still return cluster client.
- We could pass cluster client as a parameter to rolloverAliasIfNeeded() and other functions that depend on it.
- We could explicitly call waitUntilBootstrappingFinished() in public methods when appropriate and only once.
Ideally, it would be nice to encapsulate the whole bootstrapping (specific to a log) in a single class.
I definitely agree that there's room for improvement on the handling of ready signals here. To me, the biggest source of confusion/possible future bugs is that the solution plugins must define a ready signal as part of step 2 (Bootstrapping specific to a security solution log) and that ready signal is then passed in to the RuleDataClient. So steps 1 and 3 are handled by the rule registry infrastructure, but solution devs still have to have fairly detailed knowledge of how the bootstrap process works to understand the requirements around "ready signal" they must implement as part of step 2.
To solve this, I envision adding a function to the RuleDataPluginService that would look something like async bootstrapSolutionTemplates(logName, componentTemplates[], indexTemplate): Promise<RuleDataClient>. This function would be responsible for executing the tasks in step 2 and then constructing a RuleDataClient to read and write from indices based on those templates. Then in the solution plugin setup we would call const ruleDataClientPromise = ruleDataService.bootstrapSolutionTemplates(...), purposely not awaiting the call so we don't block during setup. However, we can pass the ruleDataClientPromise to rule types that need it and they can await the promise inside the executors, ensuring that bootstrapSolutionTemplates finishes before they use the RuleDataClient.

I think with this approach we'd still need the ready signal internal to the RuleDataPluginService to ensure that calls to bootstrapSolutionTemplates don't execute until after RuleDataPluginService.init() completes. But, since the RuleDataClient would come from bootstrapSolutionTemplates, we wouldn't need a ready signal inside the RuleDataClient at all - as soon as the RuleDataClient is created by bootstrapSolutionTemplates it's ready to be used.
Thoughts? Do you think this would sufficiently encapsulate and simplify the bootstrap process?
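To make the proposal concrete, usage might look roughly like this (bootstrapSolutionTemplates, securityComponentTemplate, and securityIndexTemplate are hypothetical names; ruleDataService is the service instance from the setup snippets above, and the real signature would be worked out in the follow-up):

// In the solution plugin's setup phase - intentionally not awaited so setup isn't blocked.
const ruleDataClientPromise = ruleDataService.bootstrapSolutionTemplates(
  'security-solution',
  [securityComponentTemplate],
  securityIndexTemplate
);

// Inside a rule executor, awaiting the promise guarantees step 2 has finished.
async function executor() {
  const ruleDataClient = await ruleDataClientPromise;
  const writer = await ruleDataClient.getWriter({ namespace: 'default' });
  await writer.bulk({ body: [] });
}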
Also I'm hoping we can defer the actual implementation of this encapsulation to a follow up PR
Thoughts? Do you think this would sufficiently encapsulate and simplify the bootstrap process?

Yeah, this could be an option (if you really don't like the declarative way of resolving log definitions :P ). Injecting a hook. Btw, bootstrapSolutionTemplates() could be sync and return a RuleDataClient right away. It would call the bootstrapping internally in parallel and make sure the ready signal is implemented properly.
One thing I'd keep in mind when implementing this is #102586 (comment) - how do we make sure that inside the hook a developer doesn't have to reference common templates manually.
Also I'm hoping we can defer the actual implementation of this encapsulation to a follow up PR
Yeah, of course, let's merge this PR sooner! 🚢
I was thinking of bootstrapSolutionTemplates as a way to specify an index mapping in a declarative way rather than as a hook. Can we pair on it soon and discuss? I might be misunderstanding what declarative is in this context.
if (err?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
  throw err;
}
Just curious, is there any doc describing exceptions from different methods of ElasticsearchClient?
I'm not aware of any...it would be amazing if we had one though.
ruleDataService.getFullAssetName(TECHNICAL_COMPONENT_TEMPLATE_NAME),
ruleDataService.getFullAssetName(ECS_COMPONENT_TEMPLATE_NAME),
Probably offtopic, but not sure where better to discuss this.

What's the intent behind this separation into a "technical" template and an "ecs" template? Why do we need to have 2 templates? Imho everything in them is just a bunch of common mappings, and it could be specified in a single template.
I also checked the templates, and it's weird to me:
export const technicalComponentTemplate: ClusterPutComponentTemplateBody = {
template: {
settings: {
number_of_shards: 1,
},
mappings: mappingFromFieldMap(technicalRuleFieldMap),
},
};
export const technicalRuleFieldMap = {
...pickWithPatterns(
ecsFieldMap,
TIMESTAMP,
EVENT_KIND,
EVENT_ACTION,
RULE_UUID,
RULE_ID,
RULE_NAME,
RULE_CATEGORY,
TAGS
),
[PRODUCER]: { type: 'keyword' },
[ALERT_UUID]: { type: 'keyword' },
[ALERT_ID]: { type: 'keyword' },
[ALERT_START]: { type: 'date' },
[ALERT_END]: { type: 'date' },
[ALERT_DURATION]: { type: 'long' },
[ALERT_SEVERITY_LEVEL]: { type: 'keyword' },
[ALERT_SEVERITY_VALUE]: { type: 'long' },
[ALERT_STATUS]: { type: 'keyword' },
[ALERT_EVALUATION_THRESHOLD]: { type: 'scaled_float', scaling_factor: 100 },
[ALERT_EVALUATION_VALUE]: { type: 'scaled_float', scaling_factor: 100 },
} as const;
export const ecsComponentTemplate: ClusterPutComponentTemplateBody = {
template: {
settings: {
number_of_shards: 1,
},
mappings: merge(
{},
mappingFromFieldMap(ecsFieldMap),
mappingFromFieldMap(technicalRuleFieldMap)
),
},
};
- So ecs template contains ALL the ecs fields + technical fields.
- Technical template contains technical fields.
- Technical fields are a subset of ecs fields + some extensions to ecs (our custom fields).
And the resulting index template references all of this, thus essentially includes ALL standard ECS fields + our extensions. Is this our plan to include the whole ECS?
Also, wouldn't it be better to have 2 separate field maps (one for a common subset of ECS fields and another for extensions to ECS), and then merge them into a single common .alerts-mappings component template?
This is what I basically did in the draft (#101453) and described in #101016 (comment). Would be happy to get any comments on this.
Since Observability indices don't want the full ECS mapping, we need to keep the base .alerts component template (the technical fields) as minimal as possible and allow solutions to add various components as needed.

Yeah, it seems odd that the ecsComponentTemplate contains the technicalRuleFieldMap - maybe that's an oversight? I think the overall idea is that the technicalComponentTemplate contains the fields that are common across all rules in the RAC initiative, so we may want to enforce that that template is used in all index templates created by the RuleDataPluginService (maybe by automatically dropping it into the composed_of field). The ecsComponentTemplate with all ECS fields is useful for security solution templates, since we include the entire ECS mapping so that when we copy documents into the .alerts index the ECS fields in the original documents will be mapped in the alert documents as well. We'll most likely have security specific fields to map in the .alerts-security indices also, and we could put those in a third component template (the first 2 being the technical and ECS components).

Re: merging the field maps, I would like to leave the merging of component templates to Elasticsearch, similar to how it's described in #101016 (comment) with a variety of separate component templates, and avoid constructs like
mappings: merge(
{},
mappingFromFieldMap(ecsFieldMap),
mappingFromFieldMap(technicalRuleFieldMap)
),
where we explicitly merge mappings into a single component. I like the pattern we use to build the security solution index template right now, where we compose the full index template using the base technical template, the ECS template, and our (currently empty) security solution-specific component template:
{
name: ruleDataService.getFullAssetName('security-solution-index-template'),
body: {
index_patterns: [ruleDataService.getFullAssetName('security-solution*')],
composed_of: [
ruleDataService.getFullAssetName(TECHNICAL_COMPONENT_TEMPLATE_NAME),
ruleDataService.getFullAssetName(ECS_COMPONENT_TEMPLATE_NAME),
componentTemplateName,
],
},
}
Got it. Yeah, maybe there's an oversight and some confusion. I see your points. Let's maybe discuss this separately and address in a follow-up PR (I could take over it).
  'versions',
]);
const componentTemplateRemoved =
  Object.keys(currentVersions).length > Object.keys(targetVersions).length;
This line fails with TypeError: Cannot convert undefined or null to object when currentVersions is undefined. Let's add an additional guard here.
Interesting, how's that possible? Both current and simulated indices should have this versions object in _meta (I guess?). Maybe this is an indication of a bug here?
Maybe it happened when switching from another branch to this one while using the same ES cluster? I'm extracting this logic into a function and making it a little more robust with guards around the cases where currentVersions or targetVersions are undefined.
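The guard itself could be as small as something like this (a sketch; whether a missing _meta should be treated as "needs update" is a judgment call, not necessarily what the final code does):

function mappingsAreOutdated(
  currentVersions: Record<string, number> | undefined,
  targetVersions: Record<string, number> | undefined
): boolean {
  // If either index is missing _meta.versions entirely, treat the write index
  // as outdated rather than crashing on Object.keys(undefined).
  if (currentVersions == null || targetVersions == null) {
    return true;
  }
  // ...the existing component template comparison continues here...
  return Object.keys(currentVersions).length > Object.keys(targetVersions).length;
}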
In my case, there was no _meta field in the response. Not sure how it happened, tbh.
const { namespace } = options;
const alias = getNamespacedAlias({ alias: this.options.alias, namespace });
await this.rolloverAliasIfNeeded({ namespace });
Having await this.rolloverAliasIfNeeded({ namespace }); inside getWriter could carry a significant performance penalty.
Let's consider solution-side code that performs many writes during a request or rule execution. A single log write operation could look like that:
await ruleDataClientInstance.getWriter({ namespace }).bulk(logsToWrite);
And I can imagine a rule that performs tens of writes like that during the execution. From the solution developer's perspective, nothing bad happens; they write logs N times. But under the hood, every write request is accompanied by at least three additional requests (total requests = N*4):
// Expected request
await clusterClient.bulk()
// Additional "hidden" requests
await clusterClient.indices.rollover({ dry_run: true })
await clusterClient.indices.get()
await clusterClient.indices.simulateIndexTemplate()
To avoid making these extra requests on every write, the developer should instantiate the writer and reuse it. But I think it's (a) not always convenient to do, and (b) the developer should know implementation details of getWriter and be aware of the possible pitfall.

We can envisage use cases as above and protect API users from harming performance by having a map of instantiated writers (Map<Namespace, RuleDataWriter>()) and returning a writer for a given namespace, before even calling rolloverAliasIfNeeded, if it is already in the map.
What do you think?
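A minimal sketch of that caching idea (writer construction and the rollover check are stubbed out here; the real class obviously does much more):

interface RuleDataWriter {
  bulk(request: unknown): Promise<unknown>;
}

class CachingRuleDataClient {
  private readonly writerCache = new Map<string, RuleDataWriter>();

  async getWriter({ namespace = 'default' }: { namespace?: string } = {}): Promise<RuleDataWriter> {
    const cached = this.writerCache.get(namespace);
    if (cached) {
      // Namespace already prepared: skip the rollover/simulate requests entirely.
      return cached;
    }
    await this.rolloverAliasIfNeeded({ namespace });
    const writer = this.buildWriter(namespace);
    this.writerCache.set(namespace, writer);
    return writer;
  }

  private async rolloverAliasIfNeeded(_options: { namespace: string }): Promise<void> {
    // The existing mapping-version check and rollover would live here.
  }

  private buildWriter(_namespace: string): RuleDataWriter {
    return { bulk: async () => undefined };
  }
}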
To avoid making these extra requests on every write, the developer should instantiate the writer and reuse it.
That's how it should be done, I'd say. If you're in an executor function, you will get a namespace from the rule parameters and get the writer. Then you'll pass it down to your classes and functions as a dependency to write events.
Same if you need to log stuff from route handlers.
having a map of instantiated writers (Map<Namespace, RuleDataWriter>())
Hehe, that's similar to what I had in my draft :) I think this could be handy in general if someone would always use the API like that (await ruleDataClientInstance.getWriter({ namespace }).bulk(logsToWrite);), but maybe YAGNI.
I'd say let's consider that in a follow-up PR.
Yeah this is an area that is a little bit tricky. I had in mind adding some dev docs to explain the expected usage pattern (create a writer inside the alert executor and reuse it for the duration of the executor).
Typescript does do a little bit more to guide devs towards that pattern naturally - since getWriter is async, you have to do something like

const writer = await ruleDataClientInstance.getWriter({ namespace });
await writer.bulk(logsToWrite);

which might help clue devs in that writers are meant to be reused.
On the flip side, the potential issue with caching the writers and therefore only running the rollover logic once at instantiation time is that if the mappings change while Kibana is running then we won't detect it unless we add a way to manually re-check the mapping versions for an index. Our built-in mappings won't change this way, but we've had multiple cases of advanced users modifying the mappings to add their own custom fields. If we do add the user-defined templates (#101016 (comment)) then those would be changing while Kibana is running - so we'd want to check the template versions frequently to make sure changes to those templates get propagated to the indices.
So in my mind the RuleDataWriter is designed to live for a few seconds, e.g. for the duration of a rule execution, after which it should be discarded and a new one instantiated. My hope was that with docs we can provide simple guidance about creating and using a single writer within the rule executor and devs that way won't have to worry too much about the internals - they just know that you make 1 writer in the executor.
Maybe we could also add an optional parameter to getWriter to disable the rollover check, for use in route handlers that don't want to pay the overhead of the rollover check?
That's I'd say how it should be done. If you're in an executor function, you will get a namespace from the rule parameters and get the writer. Then you'll pass it down to your classes and functions as a dependency to write events.
Yeah, it should be done that way. But my point is that the current API design doesn't enforce that. And in fact, it is much easier to use the API in the wrong way. Currently, all classes/functions depend on ruleDataClient instead of writer, like here, and here. And I'm also doing something similar in other places in my PR, just because it's much easier. I already have ruleDataClient as a dependency; why bother extracting a writer and passing it down as another dependency? And some developers (including me 🙂) prefer to just copy-paste working code from other places instead of reading docs and exploring how things work under the hood. That's why I think adding docs wouldn't make much difference. That said, it's still a good practice to have docs.
I think a well-designed system should make it easy to do the right things and hard to do the wrong things. That's why I'm highlighting the current API as it potentially could be misused. But, as we have plans to improve RuleDataClient in a follow PR, let's discuss the API design separately. It is not a blocker for current changes to be merged.
I will try to address this concern in the next PR within #101016 (it will be a parallel PR to this one). The idea is to bind RuleDataClient to a namespace, so you could get an instance of it only if you know the namespace you're gonna work with:
// somewhere in the setup phase of security_solution
await ruleDataService.bootstrap(detectionAlertsLogDefinition);
// somewhere in an executor or a route handler
const detectionAlertsClient = await ruleDataService.getClient(detectionAlertsLogDefinition, namespace);
// or we could even make it sync
const detectionAlertsClient = ruleDataService.getClient(detectionAlertsLogDefinition, namespace);
const writer = detectionAlertsClient.getWriter();
const reader = detectionAlertsClient.getReader();
It looks like there is a race condition somewhere in the index bootstrapping logic.

Steps to reproduce
Expected result
Actual result

After another restart, everything works as expected.
There seems to be some unpredictability with these changes. API integration tests are needed to ensure consistency. We have some available within the RBAC pr as a reference if you want to copy that into this PR. Or just wait until the RBAC PR is merged and add them in at that point.
Rule registry was merged into master without any automated tests (we added some in the RBAC pr) and I think we are falling into a bad habit of continuing to merge code without test coverage. This is a good entry point to begin undoing those bad habits.
This is incorrect. There are API integration tests in the APM test suite, and we have added unit tests since as well.
@elasticmachine merge upstream
@dhurley14 with the implementation changes to do the mapping updates at Kibana startup time instead of each time a
💚 Build Succeeded - Metrics [docs]
I opened elastic/elasticsearch#75031, which would allow us to potentially get rid of this logic in Kibana itself in the long term; Elasticsearch would take care of it. Short term, we need to have it in Kibana as this PR does it.
LGTM -- thanks so much for being flexible and finding a good solution!
I want to get @leehinman and @danhermann to take a look after this merges and help us confirm that it's okay for us to control the ILM suffix logic and how that will work alongside Elasticsearch's own ILM suffix update logic, but that doesn't need to block this merge.
@jasonrhodes, could you elaborate on your question here? I don't have any context on this particular PR.

@danhermann sure, sorry for the context-less ping. We had a good conversation with @jpountz last week about this and he mentioned your names as some folks to check in with on this. Basically, we are manually incrementing what I'm referring to as the "ILM suffix" in order to try to prevent unnecessary rollovers. See:

I wanted to make sure of mainly two explicit questions:
APM changes look good
Thanks, @jasonrhodes. I know that for (1), there are some edge cases in which the suffix of an index may not be six numeric characters: the first index not rolled over by ILM, and the scenario in which indices have been rolled more than 999999 times. It's probably better to look for the last
Thanks for the feedback @danhermann ! One possible alternative that might be more robust is to use the
* Add component template versioning to RuleDataClient
* Add versioning for index templates
* Address PR comments, add error handling inside rolloverAliasIfNeeded
* Fix security alerts index name, rollover func param, more robust rollover logic
* Add empty mapping check to createOrUpdateIndexTemplate
* Fix error path in createWriteTargetIfNeeded to suppress resource_already_exists_exception
* Add code comments around rollover logic
* Replace numeric versions with semver
* Use optional chaining operator to fetch current write index mapping
* Fix template version number
* Move mapping updates to plugin startup, remove dependency on component versions
* Undo changes to lifecycle and persistance rule type factories
* Remove test code
* Simplify race mitigation logic
* Remove outdated comment
* Add unit tests, log unexpected errors instead of rethrowing

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
💚 Backport successful
This backport PR will be merged automatically after passing CI.
Calling the rollover API with the dry_run parameter is lighter-weight than without, but in order to validate the request it still has to retrieve stats on the index being rolled, which may require calls between nodes.
@marshallmain what do you think of this suggestion to avoid a bug based on using the last 6 characters as the ILM suffix?
@jasonrhodes regarding the ILM suffix, we also think it would be nice to make it more robust; we'll address it in a follow-up PR. See "1. Index name creation & update logic" in #101016 (comment).
Summary
Part of effort for #101016
This PR introduces a function updateIndexMappingsMatchingPattern on the RuleDataPluginService that other plugins can call to update all the write indices matching a pattern. It is intended to be called after the plugin finishes creating/updating the component and index templates. As part of the encapsulation work that @banderror is doing, we'll probably want to move the usage of this function from other plugins back into the RuleDataPluginService. So when a plugin declares templates that it wants created, the RuleDataPluginService will take care of creating the templates and updating any indices that match the template automatically.

The general strategy behind the mapping updates mirrors Fleet's strategy: use PUT mapping to update the mapping in place if possible, and if that fails, roll the index over to ensure that the new write index has the proper mappings.
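At a high level, the per-alias update is something like the following (a sketch of the strategy only; writeIndexName, simulatedMappings, alias, and nextIndexName stand in for values computed earlier in the flow):

try {
  // Try to apply the simulated (target) mappings to the current write index in place.
  await clusterClient.indices.putMapping({
    index: writeIndexName,
    body: simulatedMappings,
  });
} catch {
  // An incompatible change (e.g. a field type changed) can't be applied in place,
  // so roll the alias over; pinning new_index keeps concurrent attempts from
  // rolling it more than once.
  await clusterClient.indices.rollover({
    alias,
    new_index: nextIndexName,
  });
}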
Race conditions and mitigations:
It's very likely that multiple Kibana instances will be executing this upgrade code simultaneously. There's a good chance that 2 instances would make PUT mapping calls at the same time, both get conflicts, and both roll over the index. This approach resolves that problem by specifying the next_index param in the rollover API call. Even if N instances executing this upgrade logic all fail simultaneously at the PUT mapping step for an alias, only one will succeed at rolling the index.

On a particular Kibana instance, the template and mapping update flow will be:
… next_index param in the API call

The key to avoiding races that could cause multiple rollovers due to multiple Kibana instances attempting this update process simultaneously is that the rollover API will return a resource_already_exists_exception if the provided next_index already exists. There are a number of possible cases to consider regarding the index being rolled over from other sources during this update process; however, they should all be handled correctly.
… resource_already_exists_exception. No additional rollover and the mapping is updated.

Next consolidation steps: