
Eliminate the downtime between tasks completing and the next polling interval #65552

Closed · mikecote opened this issue May 6, 2020 · 13 comments

Labels: Feature:Task Manager, Team:ResponseOps (label for the ResponseOps team, formerly the Cases and Alerting teams)

mikecote (Contributor) commented May 6, 2020

From @kobelb:

If we always kept a certain number of tasks claimed and ready for idle workers, we'd drastically reduce the idle time.

mikecote added the Feature:Task Manager and Team:ResponseOps labels on May 6, 2020
elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

pmuellr (Member) commented Jul 21, 2020

This needs more context.

My initial guess is that rather than waiting for the task manager interval to complete, we would like to read from the queue earlier if worker slots become available.

If so,

  • was there some threshold of empty worker slots at which we'd re-read the queue? Like, over 50%? Or would we just re-read the queue whenever a new worker slot becomes available?
  • does the existing interval basically become an idle timeout? E.g., once we do a queue read, start the interval timeout, so we'd only do a queue read on that interval if no workers became available in that time slice. Presumably if a worker became available (or some % of workers became available), we'd read immediately, cancel the outstanding idle timeout, and then start a new idle timeout once that read was done (see the sketch after this list).
  • were we thinking that maybe we'd claim more tasks than we have workers for, and then run those as existing workers finished, so we could keep the worker slots busy without having to do another read?
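
For the second bullet, a minimal sketch of what "interval as an idle timeout" could look like; the class and `pollForWork()` are hypothetical, not Task Manager's actual API:

```ts
// Hypothetical poller: pollForWork() claims and dispatches tasks.
class IdleTimeoutPoller {
  private idleTimer?: ReturnType<typeof setTimeout>;

  constructor(
    private readonly pollForWork: () => Promise<void>,
    private readonly idleIntervalMs: number
  ) {}

  start() {
    this.scheduleIdlePoll();
  }

  // Called by the worker pool whenever a slot frees up: poll right away
  // instead of waiting out the rest of the interval.
  async onWorkerFreed() {
    await this.pollNow();
  }

  private async pollNow() {
    if (this.idleTimer) clearTimeout(this.idleTimer); // cancel the outstanding idle timeout
    await this.pollForWork();
    this.scheduleIdlePoll(); // restart the idle timeout once the read is done
  }

  private scheduleIdlePoll() {
    this.idleTimer = setTimeout(() => void this.pollNow(), this.idleIntervalMs);
  }
}
```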

gmmorris (Contributor) commented Jul 21, 2020

IIRC the intention was to claim more tasks than we have workers for and work through them if a few workers finish before the next interval - so your last one.
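
A rough sketch of that approach, using hypothetical `claimTasks()` and `runTask()` helpers rather than Task Manager's real internals: excess claimed tasks sit in a local buffer and are started the moment a worker finishes.

```ts
interface ClaimedTask {
  id: string;
}

class OverclaimingPool {
  private buffer: ClaimedTask[] = [];
  private busyWorkers = 0;

  constructor(
    private readonly maxWorkers: number,
    private readonly claimTasks: (count: number) => Promise<ClaimedTask[]>,
    private readonly runTask: (task: ClaimedTask) => Promise<void>
  ) {}

  // Claim more tasks than there are free workers; overclaimFactor = 2 would be
  // the "double it" option discussed a few comments below.
  async poll(overclaimFactor = 2) {
    const freeWorkers = this.maxWorkers - this.busyWorkers;
    const claimed = await this.claimTasks(freeWorkers * overclaimFactor);
    this.buffer.push(...claimed);
    this.drain();
  }

  private drain() {
    while (this.busyWorkers < this.maxWorkers && this.buffer.length > 0) {
      const task = this.buffer.shift()!;
      this.busyWorkers++;
      this.runTask(task)
        .catch(() => {
          // task errors would be handled by the task's own run loop in reality
        })
        .finally(() => {
          this.busyWorkers--;
          this.drain(); // a worker just freed up: start a buffered task immediately
        });
    }
  }
}
```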

pmuellr (Member) commented Jul 21, 2020

Ah cool, that seems pretty straightforward. Any thoughts on how many extra we'd ask for? 50%?

gmmorris (Contributor) commented Jul 21, 2020

I think we were talking about doubling it, but we can make it configurable 🤷

I'm working on this and think there might be some overlapping work here: #71441
As we can't limit how many items of a certain type are returned by the query, we might need some smart queue that takes on more than needed and pulls in work by type... 🤔 I'm still playing around with ideas.
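
A rough sketch of what such a "smart queue" might look like: claim more than needed, then pull work out by type so a type with limited concurrency never exceeds its cap. All names here are hypothetical, for illustration only.

```ts
interface QueuedTask {
  id: string;
  taskType: string;
}

class TypeAwareQueue {
  private readonly byType = new Map<string, QueuedTask[]>();

  constructor(
    private readonly runningByType: Map<string, number>,
    private readonly maxConcurrencyByType: Map<string, number>
  ) {}

  push(tasks: QueuedTask[]) {
    for (const task of tasks) {
      const bucket = this.byType.get(task.taskType) ?? [];
      bucket.push(task);
      this.byType.set(task.taskType, bucket);
    }
  }

  // Hand out the next buffered task whose type still has concurrency headroom.
  pull(): QueuedTask | undefined {
    for (const [type, bucket] of this.byType) {
      const running = this.runningByType.get(type) ?? 0;
      const max = this.maxConcurrencyByType.get(type) ?? Infinity;
      if (bucket.length > 0 && running < max) {
        return bucket.shift();
      }
    }
    return undefined;
  }
}
```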

pmuellr (Member) commented Jul 21, 2020

Guessing it will be painful to work on this and #71441 at the same time with two different people; it does feel like there's going to be some overlap.

pmuellr (Member) commented Jul 21, 2020

As we can't limit how many items of a certain type are returned by the query

One simple thing we could do: if we know we're already "at capacity" for type X, add a filter to the query so it doesn't return ANY X's.
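
A hedged sketch of that filter using the Elasticsearch bool query DSL; the field name `task.taskType` is an assumption about how the task documents store their type, and the example type is made up.

```ts
function excludeTypesAtCapacity(baseQuery: object, typesAtCapacity: string[]) {
  return {
    bool: {
      must: [baseQuery],
      // one must_not clause per type that is already at capacity
      must_not: typesAtCapacity.map((taskType) => ({
        term: { 'task.taskType': taskType },
      })),
    },
  };
}

// e.g. if a hypothetical "reporting" type is at capacity:
// excludeTypesAtCapacity(claimQuery, ['reporting']) keeps claiming everything
// except reporting tasks until a reporting slot frees up.
```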

gmmorris (Contributor) commented Jul 21, 2020

Yeah, I had the same thoughts: #71441 (comment)

pmuellr (Member) commented Jul 21, 2020

I'm curious how this is going to work if tasks get claimed but the claim "times out" because currently running tasks have prevented the claimed tasks from being run. We must have some kind of claiming timeout somewhere, to handle the case of tasks getting claimed by a Kibana instance and then that instance going down. Presumably we'll hit those cases a lot more once we start asking for more tasks than we can run. And I guess we'd need to check some of these, before we run them, to make sure some other Kibana instance hasn't claimed them in the meantime.

gmmorris (Contributor) commented:

When it tries to mark a task as running, it'll either fail because the task has been claimed by someone else or update the expiration, so it should work fine.
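
A minimal sketch of that optimistic-concurrency check, assuming a hypothetical `store.update()` that rejects with a 409 when another instance has modified the document first; this is not Task Manager's actual code.

```ts
interface TaskDoc {
  id: string;
  status: 'claiming' | 'running';
  retryAt: string | null;
}

async function tryMarkRunning(
  store: { update: (doc: TaskDoc) => Promise<TaskDoc> },
  task: TaskDoc,
  timeoutMs: number
): Promise<boolean> {
  try {
    await store.update({
      ...task,
      status: 'running',
      // extend the claim expiration so a long wait in the buffer doesn't let
      // another instance re-claim the task mid-run
      retryAt: new Date(Date.now() + timeoutMs).toISOString(),
    });
    return true;
  } catch (err) {
    if ((err as { statusCode?: number })?.statusCode === 409) {
      // someone else claimed (or re-claimed) it in the meantime: skip the task
      return false;
    }
    throw err;
  }
}
```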

gmmorris self-assigned this Jul 22, 2020
pmuellr self-assigned this Jul 22, 2020
gmmorris removed their assignment Jul 22, 2020
mikecote (Contributor, Author) commented:

Regarding #71441: that issue is purely research, and we plan on supporting limited concurrency with #54916 (~7.11) based on the research / findings.

pmuellr added a commit to pmuellr/kibana that referenced this issue Aug 20, 2020
resolves elastic#65552

Currently TaskManager attempts to claim exactly the number of tasks that
it has capacity for.

As an optimization, we're going to change to have it request more tasks than
it has capacity for.  This should improve latency, as when tasks complete,
there may be some of these excess tasks that can be started, as they are
already claimed (they still need to be marked running).

All the plumbing already handles getting more tasks than we asked for, we
were just never asking for more than we needed previously.
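
An illustrative sketch of the change the commit describes; the names (`claimLimit`, `OVERCLAIM_FACTOR`) are hypothetical, not the PR's actual identifiers.

```ts
const OVERCLAIM_FACTOR = 2; // e.g. double the capacity; could be made configurable

function claimLimit(availableWorkers: number): number {
  // Before: claim exactly `availableWorkers` tasks.
  // After: claim more, relying on the existing plumbing to cope with receiving
  // more tasks than can be started right away.
  return availableWorkers * OVERCLAIM_FACTOR;
}
```
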
pmuellr (Member) commented Oct 21, 2020

I've done a little more thinking on the "fetch more tasks to run than available capacity" PR #75429 - which doesn't actually work right now anyway.

My main concern at this point is that I believe this will actually cause more 409 conflicts when claiming / marking tasks as running when there is more than one Kibana instance running. Presumably each Kibana will be getting even more tasks that conflict with other Kibanas than if it only got its actual capacity. And adding more Kibanas may make things worse.

It's not clear how bad this would be for the system, but you could certainly imagine degenerate cases where some Kibanas consistently get starved by other Kibanas.

I think we'll need a decent set of benchmarks in place before we could make a change like this and ensure it doesn't make things worse :-)
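
A rough sketch of the concern, assuming a hypothetical `claimTasks()` whose result reports how many claim attempts were lost to version conflicts; this is not real Task Manager or Elasticsearch client code.

```ts
interface ClaimResult {
  claimed: number; // tasks this instance actually claimed
  versionConflicts: number; // claim attempts that lost the race to another Kibana
}

async function claimWithOverclaim(
  claimTasks: (limit: number) => Promise<ClaimResult>,
  capacity: number,
  overclaimFactor: number
) {
  const requested = capacity * overclaimFactor;
  const result = await claimTasks(requested);
  // With several instances all over-claiming from the same index, each one
  // requests more overlapping documents, so versionConflicts grows and an
  // unlucky instance can end up with far fewer tasks than it has workers for.
  console.log(
    `requested ${requested}, claimed ${result.claimed}, lost ${result.versionConflicts} to conflicts`
  );
  return result;
}
```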

mikecote (Contributor, Author) commented:

++ It would also add complexity when managing timeouts.

After chatting with @pmuellr and @kobelb, I think we're in a good state with task manager performance without this issue, and we can park it until we see a need for it. Closing for now; we can re-open if that need arises.

kobelb added the needs-team label (issues missing a team label) on Jan 31, 2022
botelastic (bot) removed the needs-team label on Jan 31, 2022