[ResponseOps][Alerting] alert updates not recognizing document conflicts #190376

Open
pmuellr opened this issue Aug 12, 2024 · 7 comments · May be fixed by #192534
Labels: bug (Fixes for quality problems that affect the customer experience) · Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework) · Feature:Alerting · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

pmuellr commented Aug 12, 2024

Seen in Kibana logs:

Error writing alerts: 71 successful, 0 conflicts, 59 errors:
[f0674156-4d87-4bc4-a073-0395ad6a2454]: version conflict, document already exists (current version [1]);
[9e9246e3-4997-4720-a30b-ef1f8ba56f19]: version conflict, document already exists (current version [1]);
[affee1f4-25f5-4826-9e99-4e18c2540e35]: version conflict, document already exists (current version [1]);

We've seen these messages occur often in clusters under stress. However, note that the message says there are 0 conflicts but 59 errors, and that the errors all look like conflict errors: version conflict. This was not a one-time event; one cluster had thousands of these, all claiming 0 conflicts, and all the errors appeared to be conflicts.

We don't retry errors, but we do retry conflicts. So that means there are a lot of conflicts that we are not resolving, but giving up on.

So apparently we aren't identifying these as conflicts. It looks like that code is here:

function getResponseStats(bulkResponse: BulkResponse): ResponseStatsResult {
  const sanitizedResponse = sanitizeBulkErrorResponse(bulkResponse) as BulkResponse;
  const stats: ResponseStatsResult = { success: 0, conflicts: 0, errors: 0, messages: [] };
  for (const item of sanitizedResponse.items) {
    const op = item.create || item.index || item.update || item.delete;
    if (op?.error) {
      if (op?.status === 409 && op === item.index) {
        stats.conflicts++;
      } else {
        stats.errors++;
        stats.messages.push(op?.error?.reason || 'no bulk reason provided');
      }
    } else {
      stats.success++;
    }
  }
  return stats;
}
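
For reference, here's a rough, hypothetical trace of one of the failed items through that function. Given the code above, the 409s must be arriving under an op property other than item.index (otherwise they would be counted as conflicts), so they fall into the error branch:

type OpResult = { status?: number; error?: { reason?: string } };

// Sketch of one failed item (shape abbreviated, values made up from the log above).
const item: Partial<Record<'create' | 'index' | 'update' | 'delete', OpResult>> = {
  create: {
    status: 409,
    error: { reason: 'version conflict, document already exists (current version [1])' },
  },
};

const op = item.create || item.index || item.update || item.delete;
// op?.error is truthy and op?.status === 409, but `op === item.index` is false,
// so the item is counted under errors rather than conflicts - which matches the
// "0 conflicts, 59 errors" tally and the repeated reasons in the logged message.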

pmuellr added the bug, Feature:Alerting, Team:ResponseOps, and Feature:Alerting/RulesFramework labels Aug 12, 2024
elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr commented Aug 13, 2024

This appears to go back to at least 8.12 ...

pmuellr commented Aug 15, 2024

I think there's more to this than getting the code to recognize the conflict. There should not have BEEN a conflict, it seems, for many of them. It's like someone is trying to create a new alert document, but it's already been created ...

pmuellr self-assigned this Aug 15, 2024

pmuellr commented Aug 15, 2024

I now suspect this bug has actually exposed a different bug.

Since most/all of these messages have been version conflict, document already exists, that means we aren't sending the OCC bits (_index, if_seq_no, if_primary_term) when we try to update/create. That determination is made here:

function getBulkMeta(
  uuid: string,
  index: string | undefined,
  seqNo: number | undefined,
  primaryTerm: number | undefined,
  isUsingDataStreams: boolean
) {
  if (index && seqNo != null && primaryTerm != null) {
    return {
      index: {
        _id: uuid,
        _index: index,
        if_seq_no: seqNo,
        if_primary_term: primaryTerm,
        require_alias: false,
      },
    };
  }
  return {
    create: {
      _id: uuid,
      ...(isUsingDataStreams ? {} : { require_alias: true }),
    },
  };
}

So now we'll have to see why that is - maybe the alert was not persisted in task state, so it just assumed it was "new". I did, however, see an alert id suffering this problem that was eventually in the task state (same uuid), so it may be that it starts in the task state but somehow got missed when we queried for those alert docs.
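
To make that concrete, here's a rough sketch of what each branch produces (the uuid, index name, and OCC values are made up; it assumes the getBulkMeta above is in scope):

// Existing alert: we know the doc's location and OCC metadata, so the write is
// sent as an `index` op that Elasticsearch will reject with a 409 if the doc
// changed since we read it.
const updateMeta = getBulkMeta(
  'some-alert-uuid',
  '.internal.alerts-stack.alerts-default-000001', // hypothetical backing index
  42, // seqNo
  1, // primaryTerm
  true
);
// -> { index: { _id, _index, if_seq_no: 42, if_primary_term: 1, require_alias: false } }

// "New" alert: no index/seqNo/primaryTerm, so the write is sent as a `create` op.
// If a doc with that _id already exists, the bulk item fails with
// "version conflict, document already exists" - the exact error in the logs above.
const createMeta = getBulkMeta('some-alert-uuid', undefined, undefined, undefined, true);
// -> { create: { _id } }   (require_alias is added when not using data streams)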

Some other aspects of the deployment where we've seen this:

  • metric threshold, 5-level grouping, alerting on no data/no groups on
  • rule has 1,000s of alert docs, many still not recovered, probably due to exceeding alert limits (generated > 1,000 alerts during a single run), which may be causing problems - [ResponseOps][alerting] Investigate odd behaviors when max alerts limit reached #190258
  • rule has 10s of alert objects in task state, some active, some recovered
  • deployment has had some performance stress
  • flapping may be disabled in the rule's space

pmuellr added a commit that referenced this issue Aug 22, 2024
[ResponseOps][alerting] add rule info to logging in alertsClient (#190857)

## Summary

While investigating some issues with the alertsClient, I realized that we weren't writing out any rule information for the logged messages. This made debugging quite difficult: I wanted to see the rule, so I had to search through the alerts indices for the specified alert to get its rule id, rule type, etc.

As an example, see #190376

This PR adds that kind of rule info to the logged messages in
alertsClient, as well as the typical sort of tags we write out (rule id,
rule type, module).
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Aug 22, 2024
[ResponseOps][alerting] add rule info to logging in alertsClient (elastic#190857)

(cherry picked from commit 07717a4)
kibanamachine referenced this issue Aug 22, 2024
[ResponseOps][alerting] add rule info to logging in alertsClient (#190857) (#191094)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[ResponseOps][alerting] add rule info to logging in alertsClient (#190857)](#190857)


### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)


Co-authored-by: Patrick Mueller <patrick.mueller@elastic.co>

pmuellr commented Sep 10, 2024

Presumably the bulk operations that failed were create operations, since that's the op we use, as seen in getBulkMeta() above.

Here's a repro of running the same bulk request twice:
# run this twice, you'll see the error below on second call
POST /test-2024-09-09-a/_bulk
{ "create": { "_id": "1" } }
{ "x": 1 }
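
# response from the second call (the create hits a version conflict):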
{
  "errors": true,
  "took": 0,
  "items": [
    {
      "create": {
        "_index": "test-2024-09-09-a",
        "_id": "1",
        "status": 409,
        "error": {
          "type": "version_conflict_engine_exception",
          "reason": "[1]: version conflict, document already exists (current version [1])",
          "index_uuid": "0CZsPRQXQ_GIKWnEVvKsbw",
          "shard": "0",
          "index": "test-2024-09-09-a"
        }
      }
    }
  ]
}

What I'm not sure of yet is whether the alert doc existed from a previous run, or whether it's this run that is for some reason trying to create it twice.

I think it might be best to assume either could happen :-)

So, probably fixing this so the conflict is retried as an update will keep things running. If it's not clear WHY this is happening by the time I'm done, I'll open another issue to investigate whether something is broken ...
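
A rough sketch of that retry direction (my own illustration with simplified stand-in types, not the actual fix): when a `create` comes back with a 409, re-read the existing doc (e.g. via mget) to get its index and OCC metadata, then retry it as an OCC'd `index` op:

interface BulkItemResult {
  _id?: string;
  status?: number;
  error?: { type?: string; reason?: string };
}

interface FetchedDoc {
  _id: string;
  _index: string;
  _seq_no: number;
  _primary_term: number;
}

// Given the bulk response items, the alert docs we tried to write (by id), and
// the current docs re-fetched from Elasticsearch (by id), build retry
// operations for any `create` that failed with a version conflict.
function buildConflictRetries(
  items: Array<{ create?: BulkItemResult }>,
  alertDocsById: Map<string, unknown>,
  fetchedById: Map<string, FetchedDoc>
): unknown[] {
  const operations: unknown[] = [];
  for (const item of items) {
    const create = item.create;
    if (!create?.error || create.status !== 409 || !create._id) continue;

    const fetched = fetchedById.get(create._id);
    const doc = alertDocsById.get(create._id);
    if (!fetched || !doc) continue; // nothing to retry against

    // Retry as an OCC'd index op, so a concurrent writer still surfaces as a 409
    // instead of silently overwriting the other update.
    operations.push({
      index: {
        _id: fetched._id,
        _index: fetched._index,
        if_seq_no: fetched._seq_no,
        if_primary_term: fetched._primary_term,
        require_alias: false,
      },
    });
    operations.push(doc);
  }
  return operations;
}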


pmuellr commented Sep 10, 2024

This PR seems likely to have caused the problem: #160572

It's where we started using the `create` op as well as the `index` op for bulk, since that works with both data streams and aliases/indices (`create` when creating new alerts, `index` when updating).

The problem is, we didn't change the places that check `item.index.blah` to also handle `item.create.blah`, since the bulk responses are placed in an op-specific property.

Here's an example:

function getResponseStats(bulkResponse: BulkResponse): ResponseStatsResult {
  const sanitizedResponse = sanitizeBulkErrorResponse(bulkResponse) as BulkResponse;
  const stats: ResponseStatsResult = { success: 0, conflicts: 0, errors: 0, messages: [] };
  for (const item of sanitizedResponse.items) {
    const op = item.create || item.index || item.update || item.delete;
    if (op?.error) {
      if (op?.status === 409 && op === item.index) {
        stats.conflicts++;
      } else {
        stats.errors++;
        stats.messages.push(op?.error?.reason || 'no bulk reason provided');
      }
    } else {
      stats.success++;
    }
  }
  return stats;
}

When op is item.create, it won't go through the "conflict" branch; it will go through the "error" branch. It should go through the "conflict" branch.

Gonna need a helper, I think ...

There don't seem to be any problems in the alerts_client.ts code, so I'm hoping it's all in alerts_client/lib/alert_conflict_resolver.ts
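
Something along these lines, maybe - a hedged sketch of such a helper (names and types are simplified stand-ins, not necessarily what the fix will use):

interface BulkOpResult {
  status?: number;
  error?: { reason?: string };
}
type BulkItem = Partial<Record<'create' | 'index' | 'update' | 'delete', BulkOpResult>>;

// Classify a bulk response item regardless of which op produced it: a 409 is a
// version conflict whether it came from a `create` (document already exists) or
// from an OCC'd `index` (seq_no/primary_term mismatch).
function classifyBulkItem(item: BulkItem): 'success' | 'conflict' | 'error' {
  const op = item.create || item.index || item.update || item.delete;
  if (!op?.error) return 'success';
  if (op.status === 409 && (item.create || item.index)) return 'conflict';
  return 'error';
}

getResponseStats() (and any similar checks in alert_conflict_resolver.ts) could then branch on that classification instead of comparing op === item.index.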

pmuellr added a commit that referenced this issue Sep 26, 2024
resolves: #190376

In PR #160572, we changed from
using just the bulk op `index` to using `create` when new alerts are
being created.

Unfortunately, the code to handle the bulk responses didn't take into
account that the bulk responses for `create`s need different handling
than `index`s.  Specifically, conflicts for `create` were being treated
as errors.

This PR changes the processing to consider additional ops besides just
`index`.

pmuellr commented Oct 14, 2024

In PR #195948, we are fixing some processing where alerts seem to be getting out of sync when we hit the alert limits. I'm thinking this is going to resolve some of the doc conflicts, so we should check overview after this is released to see if it actually did - hopefully it at least removes some classes of these conflicts ...

pmuellr added a commit to pmuellr/kibana that referenced this issue Oct 15, 2024

pmuellr added a commit to pmuellr/kibana that referenced this issue Dec 10, 2024