[ResponseOps][Alerting] alert updates not recognizing document conflicts #190376
Pinging @elastic/response-ops (Team:ResponseOps)
This appears to go back to at least 8.12 ...
I think there's more to this than getting the code to recognize the conflict. There should not have BEEN a conflict, it seems, for many of them. It's like someone is trying to create a new alert document, but it's already been created ...
I now suspect this bug has actually exposed a different bug, since most/all of these messages trace back to this code: kibana/x-pack/plugins/alerting/server/alerts_client/alerts_client.ts, lines 585 to 611 in 16bded2
So now we'll have to see why that is; maybe the alert was not persisted in task state, so it just assumed it was "new". I did however see an alert id suffering this problem that was eventually in the task state (same uuid), so it may be that it starts in the task state but somehow got missed when we queried for those alert docs. Some other aspects of the deployment where we've seen this:
…0857) ## Summary While investigating some issues with the alertsClient, I realized that we weren't writing out any rule information for the logged messages. This made debugging quite difficult, as I wanted to see the rule and had to search through the alerts indices for the specified alert to get its rule id, rule type, etc. As an example, see #190376. This PR adds that kind of rule info to the logged messages in alertsClient, as well as the typical sort of tags we write out (rule id, rule type, module).
…stic#190857) ## Summary While investigating some issues with the alertsClient, I realized that we weren't writing out any rule information for the logged messages. This made debugging quite difficult, as I wanted to see the rule and had to search through the alerts indices for the specified alert to get its rule id, rule type, etc. As an example, see elastic#190376. This PR adds that kind of rule info to the logged messages in alertsClient, as well as the typical sort of tags we write out (rule id, rule type, module). (cherry picked from commit 07717a4)
…nt (#190857) (#191094) # Backport This will backport the following commits from `main` to `8.15`: - [[ResponseOps][alerting] add rule info to logging in alertsClient (#190857)](#190857) ### Questions? Please refer to the [Backport tool documentation](https://github.com/sqren/backport) Co-authored-by: Patrick Mueller <patrick.mueller@elastic.co>
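For illustration only, here is a minimal sketch of the kind of tagged logging that PR describes; the helper name, message format, and tag values are my assumptions, not the merged code:

```typescript
// Hypothetical sketch (not the PR's actual code): attach rule context to
// alertsClient log lines so an error can be traced back to its rule without
// digging through the alerts indices. `Logger` is a minimal stand-in for
// Kibana's logger interface.
interface Logger {
  error(message: string, meta?: { tags?: string[] }): void;
}

interface RuleLogContext {
  ruleId: string;
  ruleTypeId: string;
}

function logAlertsClientError(
  logger: Logger,
  { ruleId, ruleTypeId }: RuleLogContext,
  message: string
): void {
  logger.error(
    `Error writing alerts for rule (id: ${ruleId}, type: ${ruleTypeId}): ${message}`,
    { tags: [ruleId, ruleTypeId, 'alerts-client'] }
  );
}
```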
Presumably the bulk operations that failed were `create` ops. Here's a repro of running the same bulk request twice:

```
# run this twice; you'll see the error below on the second call
POST /test-2024-09-09-a/_bulk
{ "create": { "_id": "1" } }
{ "x": 1 }
```

The second call returns:

```
{
"errors": true,
"took": 0,
"items": [
{
"create": {
"_index": "test-2024-09-09-a",
"_id": "1",
"status": 409,
"error": {
"type": "version_conflict_engine_exception",
"reason": "[1]: version conflict, document already exists (current version [1])",
"index_uuid": "0CZsPRQXQ_GIKWnEVvKsbw",
"shard": "0",
"index": "test-2024-09-09-a"
}
}
}
]
}
```

What I'm not sure of yet is whether the alert doc may have existed from a previous run, or whether it's this run that is for some reason trying to create it twice. I think it might be best to assume either could happen :-) So, probably fixing this so the conflict is re-tried as an update will keep things running. If it's not clear WHY this is happening by the time I'm done, I'll open another issue to investigate whether something is broken ...
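A rough sketch of that retry idea (the names and response shapes below are assumptions, not the actual alerting client code): scan the bulk response for `create` items that came back 409, and reissue those documents as `index` ops, which overwrite instead of failing.

```typescript
// Hypothetical sketch: turn 409'd `create` ops into `index` retry ops so a
// second bulk attempt overwrites the existing alert doc instead of erroring.
interface BulkItemResult {
  _id?: string;
  _index?: string;
  status?: number;
}

type BulkResponseItem = Partial<Record<'create' | 'index' | 'update' | 'delete', BulkItemResult>>;

function buildRetryOpsForConflictedCreates(
  items: BulkResponseItem[],
  originalDocs: Map<string, unknown> // doc id -> document body from the original request
): unknown[] {
  const retryOps: unknown[] = [];
  for (const item of items) {
    const result = item.create;
    if (!result || result.status !== 409) continue;
    const id = result._id;
    if (id === undefined || !originalDocs.has(id)) continue;
    // conflict on create: the doc already exists, so overwrite it instead
    retryOps.push({ index: { _index: result._index, _id: id } });
    retryOps.push(originalDocs.get(id));
  }
  return retryOps;
}
```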
This seems like the likely PR to have caused the problem: #160572. It's where we started using `create` for new alerts. Problem is, we didn't change the places where we check for `index` in the bulk responses. Here's an example: kibana/x-pack/plugins/alerting/server/alerts_client/lib/alert_conflict_resolver.ts, lines 291 to 308 in 9b13685
When the response comes back under `create`, that check won't see it. Gonna need a helper, I think ... There don't seem to be any problems in the rest of the handling.
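A minimal sketch of the kind of helper that comment hints at (my own naming, not the code that eventually landed): pick whichever bulk op key is actually present on the response item instead of assuming it's always `index`.

```typescript
// Hypothetical helper: an Elasticsearch bulk response item is keyed by the op
// that produced it (`create`, `index`, `update`, or `delete`), so return
// whichever one is set rather than reading only `item.index`.
type BulkOpName = 'create' | 'index' | 'update' | 'delete';

interface BulkOpResult {
  _id?: string;
  status?: number;
  error?: { type?: string; reason?: string };
}

type BulkItem = Partial<Record<BulkOpName, BulkOpResult>>;

function getBulkOpResult(item: BulkItem): { op: BulkOpName; result: BulkOpResult } | undefined {
  for (const op of ['create', 'index', 'update', 'delete'] as const) {
    const result = item[op];
    if (result) return { op, result };
  }
  return undefined;
}

// With this, a 409 can be treated as a conflict regardless of which op produced it:
function isConflict(item: BulkItem): boolean {
  return getBulkOpResult(item)?.result.status === 409;
}
```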
resolves: #190376 In PR #160572, we changed from using just the bulk op `index` to using `create` when new alerts are being created. Unfortunately, the code that handles the bulk responses didn't take into account that the bulk responses for `create` ops need different handling than `index` ops. Specifically, conflicts for `create` were being treated as errors. This PR changes the processing to consider additional ops besides just `index`.
In PR #195948, we are fixing some processing where alerts seem to be getting out of sync when we hit the alert limits. I'm thinking this is going to resolve some of the doc conflicts, so we should check the overview after this is released to see if it actually did; hopefully it at least removes some classes of these conflicts ...
resolves: elastic#190376 In PR elastic#160572, we changed from using just the bulk op `index` to using `create` when new alerts are being created. Unfortunately, the code that handles the bulk responses didn't take into account that the bulk responses for `create` ops need different handling than `index` ops. Specifically, conflicts for `create` were being treated as errors. This PR changes the processing to consider additional ops besides just `index`.
Seen in Kibana logs:
We've seen these messages occur often in clusters under stress. However, note how the message says there are no conflicts but 59 errors, and note that the errors all look like conflict errors: `version conflict`. This was not a one-time event; one cluster had thousands of these, all claiming 0 conflicts, and all the errors seemed to be conflicts. We don't retry errors, but we do retry conflicts. So that means there are a lot of conflicts that we are not resolving, but giving up on.
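To make that mismatch concrete, here is an illustrative sketch (not the real resolver code) of how a counter that only inspects the `index` key would report "0 conflicts" while every `create` 409 lands in the generic error count:

```typescript
// Illustrative sketch of the suspected mismatch: a 409 that comes back under
// `create` is invisible to a conflict counter that only reads `item.index`,
// so it gets lumped in with the generic error count instead.
interface BulkOpResult {
  status?: number;
  error?: { type?: string };
}

type BulkItem = Partial<Record<'create' | 'index' | 'update' | 'delete', BulkOpResult>>;

function countResults(items: BulkItem[]): { conflicts: number; errors: number } {
  let conflicts = 0;
  let errors = 0;
  for (const item of items) {
    if (item.index?.status === 409) {
      conflicts++; // only `index` conflicts are recognized here
    } else {
      const result = item.create ?? item.index ?? item.update ?? item.delete;
      if (result?.error) {
        errors++; // a `version_conflict_engine_exception` under `create` falls through to here
      }
    }
  }
  return { conflicts, errors };
}
```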
So apparently we aren't identifying these as conflicts. It looks like that code is here:
kibana/x-pack/plugins/alerting/server/alerts_client/lib/alert_conflict_resolver.ts, lines 272 to 289 in b037ea7