[ResponseOps][Alerting] alert updates not recognizing document conflicts #190376

Open
pmuellr opened this issue Aug 12, 2024 · 7 comments · May be fixed by #192534
Labels: bug (Fixes for quality problems that affect the customer experience) · Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework) · Feature:Alerting · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

pmuellr commented Aug 12, 2024

Seen in Kibana logs:

Error writing alerts: 71 successful, 0 conflicts, 59 errors:
[f0674156-4d87-4bc4-a073-0395ad6a2454]: version conflict, document already exists (current version [1]);
[9e9246e3-4997-4720-a30b-ef1f8ba56f19]: version conflict, document already exists (current version [1]);
[affee1f4-25f5-4826-9e99-4e18c2540e35]: version conflict, document already exists (current version [1]);

We've seen these messages occur often in clusters under stress. However, note that the message says there are 0 conflicts but 59 errors, and that the errors all look like conflict errors: version conflict. This was not a one-time event; one cluster had thousands of these, all claiming 0 conflicts, and all the errors appeared to be conflicts.

We don't retry errors, but we do retry conflicts. So that means there are a lot of conflicts that we are not resolving, but giving up on.

So apparently we aren't identifying these as conflicts. It looks like that code is here:

function getResponseStats(bulkResponse: BulkResponse): ResponseStatsResult {
  const sanitizedResponse = sanitizeBulkErrorResponse(bulkResponse) as BulkResponse;
  const stats: ResponseStatsResult = { success: 0, conflicts: 0, errors: 0, messages: [] };
  for (const item of sanitizedResponse.items) {
    const op = item.create || item.index || item.update || item.delete;
    if (op?.error) {
      if (op?.status === 409 && op === item.index) {
        stats.conflicts++;
      } else {
        stats.errors++;
        stats.messages.push(op?.error?.reason || 'no bulk reason provided');
      }
    } else {
      stats.success++;
    }
  }
  return stats;
}
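
For reference, here's a rough, hypothetical trace of one of the failed items through that function. Given the code above, the 409s must be arriving under an op property other than item.index (otherwise they would be counted as conflicts), so they fall into the error branch:

type OpResult = { status?: number; error?: { reason?: string } };

// Sketch of one failed item (shape abbreviated, values made up from the log above).
const item: Partial<Record<'create' | 'index' | 'update' | 'delete', OpResult>> = {
  create: {
    status: 409,
    error: { reason: 'version conflict, document already exists (current version [1])' },
  },
};

const op = item.create || item.index || item.update || item.delete;
// op?.error is truthy and op?.status === 409, but `op === item.index` is false,
// so the item is counted under errors rather than conflicts - which matches the
// "0 conflicts, 59 errors" tally and the repeated reasons in the logged message.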

pmuellr added the bug, Feature:Alerting, Team:ResponseOps, and Feature:Alerting/RulesFramework labels Aug 12, 2024
elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr commented Aug 13, 2024

This appears to go back to at least 8.12 ...

pmuellr commented Aug 15, 2024

I think there's more to this than getting the code to recognize the conflict. There should not have BEEN a conflict, it seems, for many of them. It's like someone is trying to create a new alert document, but it's already been created ...

pmuellr self-assigned this Aug 15, 2024

pmuellr commented Aug 15, 2024

I now suspect this bug has actually exposed a different bug.

Since most/all of these messages have been version conflict, document already exists, that means we aren't sending the OCC bits (_index, if_seq_no, if_primary_term) when we try to update/create. That determination is made here:

function getBulkMeta(
  uuid: string,
  index: string | undefined,
  seqNo: number | undefined,
  primaryTerm: number | undefined,
  isUsingDataStreams: boolean
) {
  if (index && seqNo != null && primaryTerm != null) {
    return {
      index: {
        _id: uuid,
        _index: index,
        if_seq_no: seqNo,
        if_primary_term: primaryTerm,
        require_alias: false,
      },
    };
  }
  return {
    create: {
      _id: uuid,
      ...(isUsingDataStreams ? {} : { require_alias: true }),
    },
  };
}

So now we'll have to see why that is - maybe the alert was not persisted in task state, so it just assumed it was "new". I did, however, see an alert id suffering this problem that was eventually in the task state (same uuid), so it may be that it starts in the task state but somehow got missed when we queried for those alert docs.
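
To make that concrete, here's a rough sketch of what each branch produces (the uuid, index name, and OCC values are made up; it assumes the getBulkMeta above is in scope):

// Existing alert: we know the doc's location and OCC metadata, so the write is
// sent as an `index` op that Elasticsearch will reject with a 409 if the doc
// changed since we read it.
const updateMeta = getBulkMeta(
  'some-alert-uuid',
  '.internal.alerts-stack.alerts-default-000001', // hypothetical backing index
  42, // seqNo
  1, // primaryTerm
  true
);
// -> { index: { _id, _index, if_seq_no: 42, if_primary_term: 1, require_alias: false } }

// "New" alert: no index/seqNo/primaryTerm, so the write is sent as a `create` op.
// If a doc with that _id already exists, the bulk item fails with
// "version conflict, document already exists" - the exact error in the logs above.
const createMeta = getBulkMeta('some-alert-uuid', undefined, undefined, undefined, true);
// -> { create: { _id } }   (require_alias is added when not using data streams)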

Some other aspects of the deployment where we've seen this:

  • metric threshold, 5-level grouping, alerting on no data/no groups on
  • rule has 1,000s of alert docs, many still not recovered, probably due to exceeding alert limits (generated > 1,000 alerts during a single run), which may be causing problems - [ResponseOps][alerting] Investigate odd behaviors when max alerts limit reached #190258
  • rule has 10s of alert objects in task state, some active, some recovered
  • deployment has had some performance stress
  • flapping may be disabled in the rule's space

pmuellr added a commit that referenced this issue Aug 22, 2024
[ResponseOps][alerting] add rule info to logging in alertsClient (#190857)

## Summary

While investigating some issues with the alertsClient, I realized that we weren't writing out any rule information for the logged messages. This made debugging quite difficult: I wanted to see the rule, so I had to search through the alerts indices for the specified alert to get its rule id, rule type, etc.

As an example, see #190376

This PR adds that kind of rule info to the logged messages in
alertsClient, as well as the typical sort of tags we write out (rule id,
rule type, module).
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Aug 22, 2024
[ResponseOps][alerting] add rule info to logging in alertsClient (elastic#190857)

(cherry picked from commit 07717a4)
kibanamachine referenced this issue Aug 22, 2024
[ResponseOps][alerting] add rule info to logging in alertsClient (#190857) (#191094)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[ResponseOps][alerting] add rule info to logging in alertsClient (#190857)](#190857)


### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)


Co-authored-by: Patrick Mueller <patrick.mueller@elastic.co>

pmuellr commented Sep 10, 2024

Presumably the bulk operations that failed were create operations, since that's the op we use, as seen in getBulkMeta() above.

Here's a repro of running the same bulk request twice:
# run this twice, you'll see the error below on second call
POST /test-2024-09-09-a/_bulk
{ "create": { "_id": "1" } }
{ "x": 1 }
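
# response from the second call (the create hits a version conflict):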
{
  "errors": true,
  "took": 0,
  "items": [
    {
      "create": {
        "_index": "test-2024-09-09-a",
        "_id": "1",
        "status": 409,
        "error": {
          "type": "version_conflict_engine_exception",
          "reason": "[1]: version conflict, document already exists (current version [1])",
          "index_uuid": "0CZsPRQXQ_GIKWnEVvKsbw",
          "shard": "0",
          "index": "test-2024-09-09-a"
        }
      }
    }
  ]
}

What I'm not sure of yet is whether the alert doc existed from a previous run, or whether it's this run that is for some reason trying to create it twice.

I think it might be best to assume either could happen :-)

So, probably fixing this so the conflict is retried as an update will keep things running. If it's not clear WHY this is happening by the time I'm done, I'll open another issue to investigate whether something is broken ...
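
A rough sketch of that retry direction (my own illustration with simplified stand-in types, not the actual fix): when a `create` comes back with a 409, re-read the existing doc (e.g. via mget) to get its index and OCC metadata, then retry it as an OCC'd `index` op:

interface BulkItemResult {
  _id?: string;
  status?: number;
  error?: { type?: string; reason?: string };
}

interface FetchedDoc {
  _id: string;
  _index: string;
  _seq_no: number;
  _primary_term: number;
}

// Given the bulk response items, the alert docs we tried to write (by id), and
// the current docs re-fetched from Elasticsearch (by id), build retry
// operations for any `create` that failed with a version conflict.
function buildConflictRetries(
  items: Array<{ create?: BulkItemResult }>,
  alertDocsById: Map<string, unknown>,
  fetchedById: Map<string, FetchedDoc>
): unknown[] {
  const operations: unknown[] = [];
  for (const item of items) {
    const create = item.create;
    if (!create?.error || create.status !== 409 || !create._id) continue;

    const fetched = fetchedById.get(create._id);
    const doc = alertDocsById.get(create._id);
    if (!fetched || !doc) continue; // nothing to retry against

    // Retry as an OCC'd index op, so a concurrent writer still surfaces as a 409
    // instead of silently overwriting the other update.
    operations.push({
      index: {
        _id: fetched._id,
        _index: fetched._index,
        if_seq_no: fetched._seq_no,
        if_primary_term: fetched._primary_term,
        require_alias: false,
      },
    });
    operations.push(doc);
  }
  return operations;
}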


pmuellr commented Sep 10, 2024

This PR seems likely to have caused the problem: #160572

It's where we started using the `create` op as well as the `index` op for bulk, since that works with both data streams and aliases/indices (`create` when creating new alerts, `index` when updating).

The problem is, we didn't change the places that check `item.index.blah` to also handle `item.create.blah`, since the bulk responses are placed in an op-specific property.

Here's an example:

function getResponseStats(bulkResponse: BulkResponse): ResponseStatsResult {
  const sanitizedResponse = sanitizeBulkErrorResponse(bulkResponse) as BulkResponse;
  const stats: ResponseStatsResult = { success: 0, conflicts: 0, errors: 0, messages: [] };
  for (const item of sanitizedResponse.items) {
    const op = item.create || item.index || item.update || item.delete;
    if (op?.error) {
      if (op?.status === 409 && op === item.index) {
        stats.conflicts++;
      } else {
        stats.errors++;
        stats.messages.push(op?.error?.reason || 'no bulk reason provided');
      }
    } else {
      stats.success++;
    }
  }
  return stats;
}

When op is item.create, it won't go through the "conflict" branch; it will go through the "error" branch. It should go through the "conflict" branch.

Gonna need a helper, I think ...

There don't seem to be any problems in the alerts_client.ts code, so I'm hoping it's all in alerts_client/lib/alert_conflict_resolver.ts
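
Something along these lines, maybe - a hedged sketch of such a helper (names and types are simplified stand-ins, not necessarily what the fix will use):

interface BulkOpResult {
  status?: number;
  error?: { reason?: string };
}
type BulkItem = Partial<Record<'create' | 'index' | 'update' | 'delete', BulkOpResult>>;

// Classify a bulk response item regardless of which op produced it: a 409 is a
// version conflict whether it came from a `create` (document already exists) or
// from an OCC'd `index` (seq_no/primary_term mismatch).
function classifyBulkItem(item: BulkItem): 'success' | 'conflict' | 'error' {
  const op = item.create || item.index || item.update || item.delete;
  if (!op?.error) return 'success';
  if (op.status === 409 && (item.create || item.index)) return 'conflict';
  return 'error';
}

getResponseStats() (and any similar checks in alert_conflict_resolver.ts) could then branch on that classification instead of comparing op === item.index.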

pmuellr added a commit that referenced this issue Sep 26, 2024
resolves: #190376

In PR #160572, we changed from
using just the bulk op `index` to using `create` when new alerts are
being created.

Unfortunately, the code to handle the bulk responses didn't take into
account that the bulk responses for `create`s need different handling
than `index`s.  Specifically, conflicts for `create` were being treated
as errors.

This PR changes the processing to consider additional ops besides just
`index`.

pmuellr commented Oct 14, 2024

In PR #195948, we are fixing some processing where alerts seem to be getting out of sync when we hit the alert limits. I'm thinking this is going to resolve some of the doc conflicts, so we should check overview after this is released to see if it actually did - hopefully it at least removes some classes of these conflicts ...

pmuellr added a commit to pmuellr/kibana that referenced this issue Oct 15, 2024

pmuellr added a commit to pmuellr/kibana that referenced this issue Dec 10, 2024