Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[alerting] rule tasks that fail 3 times are never run again, with no indication in the rule #116321

Closed
pmuellr opened this issue Oct 26, 2021 · 20 comments
Assignees
Labels
bug Fixes for quality problems that affect the customer experience estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@pmuellr
Copy link
Member

pmuellr commented Oct 26, 2021

Kibana version: 7.14.1 (reported by user)

User encountered a situation where the task document indicated that it failed with 3 attempts, but the rule doesn't seem to indicate any failure status.

The "three strikes you're out" processing centers around the concept of "max_attempts":

/* The maximum number of times a task will be attempted before being abandoned as failed */
max_attempts: schema.number({
defaultValue: 3,
min: 1,
}),

Some data from the docs:

task manager doc
      {
        "_index" : ".kibana_task_manager_7.14.1_001",
        "_type" : "_doc",
        "_id" : "task:5885e3d0-da5f-11eb-8c0e-9782f1ad0c93",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "[[elided - pmuellr]]",
            "retryAt" : "2021-09-12T21:30:05.288Z",
            "runAt" : "2021-09-12T20:44:02.033Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : "2021-09-12T21:10:05.287Z",
            "state" : """{"alertInstances":{},"previousStartedAt":"2021-09-12T20:43:02.033Z"}""",
            "params" : "[[elided - pmuellr]]",
            "ownerId" : "kibana:1ef3f039-cb28-486b-9046-01a41adb06e8",
            "scheduledAt" : "2021-07-01T11:27:40.684Z",
            "attempts" : 3,
            "status" : "failed"
          },
          "references" : [ ],
          "updated_at" : "2021-09-12T21:10:05.339Z",
          "coreMigrationVersion" : "7.14.1",
          "type" : "task"
        }
  }
rule doc
        {
            "id": "80cb3a73-b149-4780-a290-4acb9cf06ccd",
            "notifyWhen": "onActiveAlert",
            "params": { // elided
            },
            "consumer": "alerts",
            "alertTypeId": "elided",
            "schedule": {
                "interval": "1m"
            },
            "actions": [ // elided
            ],
            "tags": [ // elided
            ],
            "name": "elided",
            "enabled": true,
            "throttle": null,
            "apiKeyOwner": "elided",
            "createdBy": "elided",
            "updatedBy": "elided",
            "muteAll": false,
            "mutedInstanceIds": [],
            "updatedAt": "2021-10-22T10:16:03.999Z",
            "createdAt": "2021-07-01T11:27:40.384Z",
            "scheduledTaskId": "5885e3d0-da5f-11eb-8c0e-9782f1ad0c93",
            "executionStatus": {
                "lastExecutionDate": "2021-09-12T20:43:02.778Z", 
                "status": "ok"
            }
        },

I'm not super familiar with how we handle these task manager errors, but just looking at these docs makes me wonder if we are updating the rule when the task fails like this. I'd guess we aren't. But clearly we should be. Perhaps we can't because the task manager failure is occurring before or after the rule runs, so the rule doesn't have a chance to update it's own status? If so, it seems like gathering the "status" on a rule should be looking at the task manager doc to see if it's "given up" on this task document because of the number of attempts. In the end, if we've marked a task as "will not run again", we need to make sure the resource that created the task is informed of this, so they can reflect this back in the resource status.

Note also the user has updated the rule (hence the updatedAt: 2021-10-22T10:16 property), in hopes of getting the rule to "fire" - they thought the rule stopped recognizing it's alerting conditions, but as near as I can tell, it was just not running at all.

I added an estimate:needs-research label, as I think we'll need to:

  • repro
  • find out if there is an existing way to get these types of failures reflected back in the rule status; of if there is, why didn't that happen
  • if we don't reflect this task status back in the rule status, we'll need to invent a way to do that
@pmuellr pmuellr added bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework estimate:needs-research Estimated as too large and requires research to break down into workable issues labels Oct 26, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Copy link
Member Author

pmuellr commented Oct 26, 2021

Here's a PR for 7.16 regarding actions, not rules, which is very tangentially related - #109655 . In that one we are "cleaning up" these failed action tasks.

You could imagine something similar for rules, but it would be kinda sad since we'd have to do it in a cleanup task on a schedule, vs getting immediate feedback - a task that looks for these failed rule tasks, and presumably sets the last execution status to an error condition with hopefully a relevant message. It could also disable / re-enable the rule, if it was marked enabled. But perhaps it should just be disabled, otherwise if the whatever error is being hit, keeps getting hit, it would essentially be ignore the maxAttempts configuration altogether.

@pmuellr
Copy link
Member Author

pmuellr commented Oct 26, 2021

Upon further research, the user has 12 rules in a similar state; the last execution date for them recorded in the rule (presumably last successful task run) was on 9/12 - 9/13, so it seems like there was some kind of systemic problem around that time ...

@gmmorris
Copy link
Contributor

gmmorris commented Nov 1, 2021

At this stage it isn't clear what led to this - whoever picks this up should use it as a research issue to identify how it might have happened and then bring it back to the team to propose work to remediate.

@mikecote
Copy link
Contributor

mikecote commented Nov 1, 2021

I'm thinking this may be timeout related and occur after timing our 3x in a row.

@ymao1 ymao1 self-assigned this Nov 1, 2021
@ymao1
Copy link
Contributor

ymao1 commented Nov 2, 2021

For non-recurring tasks, task manager makes 3 (or the configured maxAttempts) attempts and then marks the task as failed. However, rules are recurring tasks and, as this test specifically calls out, they are not marked as failed even if maxAttempts is reached:

test(`Doesn't fail recurring tasks when maxAttempts reached`, async () => {
const id = _.random(1, 20).toString();
const initialAttempts = 3;
const intervalSeconds = 10;
const { runner, store } = await readyToRunStageSetup({
instance: {
id,
attempts: initialAttempts,
schedule: { interval: `${intervalSeconds}s` },
startedAt: new Date(),
},
definitions: {
bar: {
title: 'Bar!',
maxAttempts: 3,
createTaskRunner: () => ({
run: async () => {
throw new Error();
},
}),
},
},
});
await runner.run();
expect(store.update).toHaveBeenCalledTimes(1);
const instance = store.update.mock.calls[0][0];
expect(instance.attempts).toEqual(3);
expect(instance.status).toEqual('idle');
expect(instance.runAt.getTime()).toEqual(
new Date(Date.now() + intervalSeconds * 1000).getTime()
);
});

For recurring tasks, task manager only marks as failed if running the task results in an unrecoverable error, otherwise it tries to reschedule the task

private rescheduleFailedRun = (
failureResult: FailedRunResult
): Result<SuccessfulRunResult, FailedTaskResult> => {
const { state, error } = failureResult;
if (this.shouldTryToScheduleRetry() && !isUnrecoverableError(error)) {
// if we're retrying, keep the number of attempts
const { schedule, attempts } = this.instance.task;
const reschedule = failureResult.runAt
? { runAt: failureResult.runAt }
: failureResult.schedule
? { schedule: failureResult.schedule }
: schedule
? { schedule }
: // when result.error is truthy, then we're retrying because it failed
{
runAt: this.getRetryDelay({
attempts,
error,
}),
};
if (reschedule.runAt || reschedule.schedule) {
return asOk({
state,
attempts,
...reschedule,
});
}
}
// scheduling a retry isn't possible,mark task as failed
return asErr({ status: TaskStatus.Failed });
};

Separately, for all tasks, the number of attempts is incremented in the markTaskAsRunning() function, which, if it encounters an error will still increment the number of attempts

}, await this.releaseClaimAndIncrementAttempts());

The only type of unrecoverable error that the alerting task runner throws is the Saved object not found error. Every other type of error that could be thrown as part of the alerting task (whether inside the alerting task runner or inside a rule executor) looks to be caught and returned to the Task Manager as an ok result.

I'm wondering if it's possible that it's a coincidence that the attempts = 3 when the task was marked as failed. If a cluster is experiencing instability, could task manager have had trouble marking the tasks as running, incrementing the attempt count each time until it finally grabbed the task, only to get an unrecoverable error? Or as @mikecote suggested, if rule types are consistently exceeding the timeout, they become up for grabs after the 5m timeout interval and the next task manager to pick up the task will increment the number of attempts. However, the failed status would still need to be triggered by an unrecoverable error

@mikecote
Copy link
Contributor

mikecote commented Nov 2, 2021

Or as @mikecote suggested, if rule types are consistently exceeding the timeout, they become up for grabs after the 5m timeout interval and the next task manager to pick up the task will increment the number of attempts. However, the failed status would still need to be triggered by an unrecoverable error

I haven't verified but I also noticed this script (https://github.com/elastic/kibana/blob/main/x-pack/plugins/task_manager/server/queries/task_claiming.ts#L403-L412) which marks tasks as failed. Is it possible this script applies to recurring tasks and causes them to get into a failed state?

@ymao1
Copy link
Contributor

ymao1 commented Nov 2, 2021

@mikecote That script uses this logic:

if (ctx._source.task.schedule != null || ctx._source.task.attempts < 
  params.taskMaxAttempts[ctx._source.task.taskType] || params.claimTasksById.contains(ctx._id)) {
    ${markAsClaimingScript}
} else {
    ctx._source.task.status = "failed";
}

So recurring tasks with a schedule interval should not be marked as failed

@ymao1
Copy link
Contributor

ymao1 commented Nov 2, 2021

Thanks for bringing up that script @mikecote!

Here's something interesting. When I run 7.14 locally and create a rule, the task associated with the rule has a schedule.interval field that reflects the rule interval.

In the issue summary, @pmuellr pasted a task manager task with attempts:3 and status:failed that has no schedule.interval field. I went back to to the source of this data and it looks like all the task manager tasks (which otherwise are consistent with alerting tasks) for the failed tasks are missing this schedule.interval field in the task.

So I think you are right in that the script is what is marking the task as failed, but how did the schedule interval disappear from the alerting task?!?

@ymao1
Copy link
Contributor

ymao1 commented Nov 3, 2021

It looks like for rules created in older versions of Kibana (7.7 - 7.10), the associated task manager task does not contain a schedule.interval field, so the mark task as failed script (added in 7.11) will mark them as failed if the max number of attempts is reached, even though they are recurring tasks. Starting in 7.11, schedule.interval is written out to the task manager doc for created rules (with this PR)

@ymao1
Copy link
Contributor

ymao1 commented Nov 3, 2021

So I created 2 rules on a 7.10 deployment, 1 that runs frequently (every minute) and 1 that runs less frequently (every day). Then I updated the deployment to 7.11. After the upgrade, the rule that runs frequently ran and the associated task was updated to include the schedule.interval field. The rule that runs infrequently was not scheduled to run, so its associated task remained the same, missing the schedule.interval field.

So it seems like the conditions that could lead to this state (rule tasks with attempts:3 and status:failed) are the following:

  • rules created prior to 7.11
  • rule did not run after the upgrade (or was unable to successfully run) - this means the task manager doc was never updated with the schedule.interval field
    • in the original user issue, they had 12 rules running at 1m interval that ended up in this state, so it's possible that they had connection issues for a short amount of time that led to the number of attempts incrementing rapidly...although without the original logs, it is hard to be certain
  • the markTaskAsFailed script kicked in on the 4th attempt after this and marked the task as failed bc it doesn't realize it's a recurring task

@ymao1
Copy link
Contributor

ymao1 commented Nov 4, 2021

Just a comment that although post 7.11, we are marking the task as failed if it has no schedule and has reached its maxAttempts, prior to this, in 7.10, task manager would still not claim these types of tasks. So if a pre-7.10 alerting rule reached attempts:3, it would stop getting claimed by task manager:

expect(
asUpdateByQuery({
query: mustBeAllOf(
// Either a task with idle status and runAt <= now or
// status running or claiming with a retryAt <= now.
shouldBeOneOf(IdleTaskWithExpiredRunAt, RunningOrClaimingTaskWithExpiredRetryAt),
// Either task has an schedule or the attempts < the maximum configured
shouldBeOneOf<ExistsFilter | TermFilter | RangeFilter>(
TaskWithSchedule,
...Object.entries(definitions).map(([type, { maxAttempts }]) =>
taskWithLessThanMaxAttempts(type, maxAttempts || defaultMaxAttempts)
)
)
),

@mikecote
Copy link
Contributor

mikecote commented Nov 4, 2021

@ymao1 nice find! It makes sense now thinking of it because the task documents get "migrated" / "their interval set" after the rule runs, so if they didn't after an upgrade and encountered maxAttempts, I can see how they would then get marked as failed.

It sounds like a tricky fix to ensure these tasks don't get marked as failed?

@ymao1
Copy link
Contributor

ymao1 commented Nov 4, 2021

@mikecote I think we could possibly update the updateByQuery to take the task type into consideration and never mark alerting tasks as failed?

Do you think this scenario is common enough to add a fix for it? I

@pmuellr
Copy link
Member Author

pmuellr commented Nov 4, 2021

It sounds like a tricky fix to ensure these tasks don't get marked as failed?

For some reason I was thinking it would be as easy as adding a migration to "fix" these failed rules, by clearing the fails and ensuring they have a scheduled interval. But I guess ... what interval? Maybe we need to mark the tasks with a new field "needs an interval!", and then let alerting fix it in a cleanup task or next time it runs the rule executor.

@mikecote
Copy link
Contributor

mikecote commented Nov 4, 2021

@mikecote I think we could possibly update the updateByQuery to take the task type into consideration and never mark alerting tasks as failed?

Do you think this scenario is common enough to add a fix for it? I

It sounds like a tricky fix to ensure these tasks don't get marked as failed?

For some reason I was thinking it would be as easy as adding a migration to "fix" these failed rules, by clearing the fails and ensuring they have a scheduled interval. But I guess ... what interval? Maybe we need to mark the tasks with a new field "needs an interval!", and then let alerting fix it in a cleanup task or next time it runs the rule executor.

Since this issue is a research issue, I think we could come up with a proposal on how we envision this getting fixed and create a follow-up issue to prioritize accordingly.

I think @ymao1 and @pmuellr you're onto something; maybe some per-task-type logic, maybe task types can define themselves as single occurrences (run once) or recurring occurrences (run on interval) and leverage that in the updateByQuery.

We may end up prioritizing the fix soon if ever our focus for 8.1 through 8.3 is to ensure alerting rules operate continually. This would be an edge case, so may warrant lower on the list but worth attention for sure.

@ymao1
Copy link
Contributor

ymao1 commented Nov 4, 2021

I think it's important to note that this scenario shouldn't happen on 7.11+ rules. They should never be marked as failed and they will continue getting claimed because they have a schedule field.

I think if we want to go back and fix pre-7.11 rules, perhaps we could have a task manager migration that looks for alerting task types with no schedule and resets their number of attempts to 0 and resets their status to idle. This would allow task manager to start claiming these rules again and if they run successfully, the document would get updated correctly. The edge case in this scenario would be if task manager starts claiming them but they never run successfully...the task manager doc would never get updated to correctly reflect the schedule....

@pmuellr
Copy link
Member Author

pmuellr commented Nov 4, 2021

perhaps we could have a task manager migration that looks for alerting task types with no schedule and resets their number of attempts to 0 and resets their status to idle

Ya, this sounds better than my idea! They will have a YUUUGE task delay/drift, once they run! My only concern is that if it ends up failing again 3 times, the interval still wouldn't be set, and so it would be in a zombie state again. Maybe add a check when we run the rule, if the schedule isn't in the task, do a one-time update on it to set it?

@ymao1
Copy link
Contributor

ymao1 commented Nov 4, 2021

Maybe add a check when we run the rule, if the schedule isn't in the task, do a one-time update on it to set it?

That makes sense sounds pretty straightforward! I'll create a follow up issue for these items.

@ymao1
Copy link
Contributor

ymao1 commented Nov 4, 2021

Created #117593 to implement the suggestions from this research issue. Closing this issue.

@ymao1 ymao1 closed this as completed Nov 4, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
None yet
Development

No branches or pull requests

7 participants