[ResponseOps][Task Manager] "elastic-product" not present in some task manager requests #189306

Closed
pmuellr opened this issue Jul 26, 2024 · 7 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@pmuellr
Member

pmuellr commented Jul 26, 2024

Trawling the serverless logs, I came across some messages that come mostly, I think, from the new discovery service, plus some coming through running tasks (I assume those may be from some unknown ES requests the task is making):

Deleting current node has failed. error: x-elastic-product not present or not recognized: Saved object [background-task-node/b89846bd-5560-45f6-9a11-1a46df30c279] not found

Task endpoint:user-artifact-packager "endpoint:user-artifact-packager:1.0.0" failed in attempt to run: x-elastic-product not present or not recognized: Not Found

Some telemetry code is also generating error messages with elastic-product not present, but there are windows of time (like 2 days, earlier this week), where the messages were not being generated.

For the "Deleting current node" message, I noticed it is generated here:

  public async deleteCurrentNode() {
    try {
      await this.savedObjectsRepository.delete(BACKGROUND_TASK_NODE_SO_NAME, this.currentNode);
      this.logger.info('Removed this node from the Kibana Discovery Service');
    } catch (e) {
      this.logger.error(`Deleting current node has failed. error: ${e.message}`);
    }
  }

which is only called from here:

  public stop() {
    if (this.kibanaDiscoveryService?.isStarted()) {
      this.kibanaDiscoveryService.deleteCurrentNode().catch(() => {});
    }
  }
}

I'm thinking the problem is that we're running after Kibana has basically shut down. Here's what we should do instead:

  public async stop() {
    if (this.kibanaDiscoveryService?.isStarted()) {
      try {
        await this.kibanaDiscoveryService.deleteCurrentNode();
      } catch (err) {
        this.logger.error(`Error deleting current node from background task manager: ${err}`);
      }
    }
  }

plugin::stop can be async, to make Kibana wait for this to finish before completing shutdown.
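
To illustrate the difference, here is a minimal self-contained sketch. Every name in it (`FakeDiscoveryService`, `FireAndForgetPlugin`, `AwaitingPlugin`, `shutdown`) is a hypothetical stand-in, not Kibana's actual plugin API; it only assumes that the host awaits whatever `stop()` returns.

```typescript
// Hypothetical stand-ins for illustration only; not Kibana's real types.
type Stoppable = { stop(): void | Promise<void> };

class FakeDiscoveryService {
  deleted = false;

  async deleteCurrentNode(): Promise<void> {
    // Simulate an asynchronous saved-object delete.
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
    this.deleted = true;
  }
}

class FireAndForgetPlugin implements Stoppable {
  private readonly discovery: FakeDiscoveryService;

  constructor(discovery: FakeDiscoveryService) {
    this.discovery = discovery;
  }

  stop(): void {
    // Returns immediately; the delete is still in flight.
    this.discovery.deleteCurrentNode().catch(() => {});
  }
}

class AwaitingPlugin implements Stoppable {
  private readonly discovery: FakeDiscoveryService;

  constructor(discovery: FakeDiscoveryService) {
    this.discovery = discovery;
  }

  async stop(): Promise<void> {
    // The returned promise resolves only after the delete has finished.
    await this.discovery.deleteCurrentNode();
  }
}

// Hypothetical host: awaits stop(), then reports whether cleanup had finished.
async function shutdown(plugin: Stoppable, discovery: FakeDiscoveryService): Promise<boolean> {
  await plugin.stop();
  return discovery.deleted;
}
```

With the fire-and-forget variant, `shutdown` resolves before the delete has completed; with the awaiting variant, it resolves only afterwards.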

More importantly, what happens when Kibana crashes and doesn't run this code? Should we have it try to clean up old docs still hanging around?
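
One possible shape for such a cleanup, sketched under assumptions: each node document carries a last-seen timestamp, and any surviving node may delete documents whose timestamp is older than some threshold. The `NodeDoc` interface, its `lastSeen` field, and `findStaleNodeIds` are hypothetical names for illustration, not the discovery service's actual schema or API.

```typescript
// Hypothetical document shape; the real saved object may differ.
interface NodeDoc {
  id: string;
  lastSeen: number; // epoch milliseconds of the node's last heartbeat
}

// Returns the ids of nodes whose last heartbeat is older than maxAgeMs.
function findStaleNodeIds(nodes: NodeDoc[], now: number, maxAgeMs: number): string[] {
  return nodes.filter((node) => now - node.lastSeen > maxAgeMs).map((node) => node.id);
}
```

A surviving node could run this on an interval and delete the returned ids, so a crashed node's document would not linger indefinitely.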

@pmuellr pmuellr added bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jul 26, 2024
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Member Author

pmuellr commented Jul 30, 2024

Noting there are more of these:

Failed to mark Task Fleet-Metrics-Task "Fleet-Metrics-Task:1.1.1" as running: x-elastic-product not present or not recognized: Not Found
[Task Runner] Task Fleet-Metrics-Task:1.1.1 failed to release claim after failure: Error: x-elastic-product not present or not recognized: Not Found

These look like other places in TM where we're doing I/O and the same thing can happen. We're going to need some interesting solution for dealing with running after "shutdown", since it does seem like we're still running tasks, etc. at that time.

@pmuellr
Member Author

pmuellr commented Aug 6, 2024

Another ... just noticed this, not sure how often it happens - maybe just if we exit Kibana during a claim cycle?

[root] SIGINT received - initiating shutdown
[root] Kibana is shutting down
[plugins-system.standard] Stopping all plugins.
[plugins.taskManager] Failed to poll for work: NoLivingConnectionsError: There are no living connections
[root] SIGINT received - initiating shutdown
[plugins.taskManager] Removed this node from the Kibana Discovery Service

There's no need for the log message Failed to poll for work: NoLivingConnectionsError: There are no living connections
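
A minimal sketch of how that message could be suppressed, assuming the poller is told when shutdown starts. `PollerSketch`, `pollOnce`, and the injected `logError` callback are hypothetical names, not Task Manager's actual poller.

```typescript
type LogFn = (message: string) => void;

// Hypothetical poller that stays quiet about failures once it has been stopped.
class PollerSketch {
  private stopped = false;
  private readonly logError: LogFn;

  constructor(logError: LogFn) {
    this.logError = logError;
  }

  stop(): void {
    this.stopped = true;
  }

  // Runs one poll cycle; `work` stands in for the claim request.
  pollOnce(work: () => void): void {
    if (this.stopped) return; // do not start new cycles after stop()
    try {
      work();
    } catch (e) {
      // A failure that races with shutdown is expected, so it is not logged.
      if (this.stopped) return;
      this.logError(`Failed to poll for work: ${String(e)}`);
    }
  }
}
```

The second `stopped` check covers the case in the log above, where the stop arrives while a claim cycle is already in flight.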

ersin-erdal added a commit that referenced this issue Aug 23, 2024
towards: #189306

This PR fixes the `Deleting current node has failed.` errors mentioned in
the above issue.
@ersin-erdal
Contributor

The "Deleting current node" logs have been fixed with #191218

@ersin-erdal ersin-erdal self-assigned this Aug 28, 2024
@mikecote
Contributor

mikecote commented Sep 3, 2024

Moving to backlog given we've fixed the newly introduced problem, but eventually we'll want to investigate the other sources related to stopping tasks that are currently running.

@ersin-erdal ersin-erdal removed their assignment Sep 10, 2024
@pmuellr
Member Author

pmuellr commented Oct 11, 2024

Just a note and linkage to the PR "Stop polling on Kibana shutdown" - I suspect most of the cases of seeing the message about the missing header will go away when this PR is merged. I think the message was caused by some Kibana plugins removing, during their shutdown, the HTTP context bits that add the header.

With the PR we should see task manager itself stop making ES calls, but I'm guessing we will see some stragglers:

  • updates to task docs after a run completes after a shutdown
  • tasks still running after shutdown that make ES calls, which we can't really control

The first - and other cases of task manager making ES calls after shutdown - I'm guessing we can fix, once we see them. The second is harder, but we could probably NOT log errors like this, after shutdown, if we know they are ES errors. Or maybe log in debug. Or just live with it - could be interesting diagnostic info.
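
The "log in debug after shutdown" option could look roughly like this. It is only a sketch: `logLevelFor` and its `shuttingDown` flag are hypothetical, and matching on the message text is a stand-in for however the code would actually recognize an ES error.

```typescript
// Message fragment taken from the log lines quoted earlier in this issue.
const PRODUCT_HEADER_MESSAGE = 'x-elastic-product not present or not recognized';

type LogLevel = 'debug' | 'error';

// Downgrades the missing-header error to debug once shutdown has begun;
// every other case stays at error level.
function logLevelFor(message: string, shuttingDown: boolean): LogLevel {
  return shuttingDown && message.includes(PRODUCT_HEADER_MESSAGE) ? 'debug' : 'error';
}
```

Keeping the message at debug rather than dropping it preserves the diagnostic info while removing the noise from error-level logs.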

So ... suggest we do a comparison of before/after the PR merges, figure out if we want to do some more work and leave this issue open to track that - or we figure out the volume is low enough that it's good enough for now.

@ersin-erdal
Contributor

Closing in favor of: #195817

4 participants