Eliminate the downtime between tasks completing and the next polling interval #65552
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
This needs more context. My initial guess is that rather than waiting for the task manager interval to complete, we would like to read from the queue earlier if worker slots become available. If so,
IIRC the intention was to claim more tasks than we have workers for and work through them if a few workers finish before the next interval - so your last one.
Ah cool, that seems pretty straightforward. Any thoughts on how many extra we'd ask for? 50%?
I think we were talking about doubling it, but we can make it configurable 🤷 I'm working on this and think there might be some overlapping work here: #71441
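A minimal sketch of what a configurable over-claim might look like. The setting name `claimFactor` and the function names here are hypothetical, not actual Task Manager configuration or APIs:

```ts
// Hypothetical sketch: applying a configurable "over-claim" factor when
// deciding how many tasks to request from the task store. A factor of 1
// claims exactly capacity; a factor of 2 doubles it.

interface ClaimConfig {
  maxWorkers: number;  // e.g. the configured worker count
  claimFactor: number; // 1.0 = claim exactly capacity, 2.0 = claim double
}

function tasksToClaim(availableWorkers: number, config: ClaimConfig): number {
  // Ask for more tasks than we can run right now, so that when a worker
  // frees up before the next polling interval it can pick up an
  // already-claimed task immediately.
  return Math.ceil(availableWorkers * config.claimFactor);
}

// Example: 10 free workers with a factor of 2 → request 20 tasks.
console.log(tasksToClaim(10, { maxWorkers: 10, claimFactor: 2 })); // 20
```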
Guessing it will be painful to work on this and #71441 at the same time with two different people; it does feel like there's going to be some overlap.
One simple thing we could do: if we know we're already "at capacity" for type X, we could add a filter on the query to not return ANY X's.
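A rough sketch of that idea, assuming the claim query is an Elasticsearch bool query and that `task.taskType` is the field holding the task type (this is not the actual Task Manager claim query, just an illustration):

```ts
// Sketch only: exclude task types this Kibana instance has no capacity for
// by adding a must_not clause to the claim query.

function excludeTaskTypesAtCapacity(baseQuery: object, typesAtCapacity: string[]) {
  if (typesAtCapacity.length === 0) {
    return baseQuery;
  }
  return {
    bool: {
      must: [baseQuery],
      // Don't return any tasks of a type we can't run right now anyway.
      must_not: [{ terms: { 'task.taskType': typesAtCapacity } }],
    },
  };
}

// Example: a task type already at its concurrency limit is skipped this cycle.
const query = excludeTaskTypesAtCapacity(
  { term: { 'task.status': 'idle' } },
  ['alerting:example.rule-type']
);
console.log(JSON.stringify(query, null, 2));
```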
Yeah, I had the same thoughts: #71441 (comment)
I'm curious how this is going to work if tasks get claimed, but the claim "times out" because the current tasks have prevented claimed tasks from being run. We must have some kind of a claiming timeout somewhere, to handle the case of tasks getting claimed by a Kibana instance and then that instance goes down. Presumably, we'll be hitting those cases a lot more once we start asking for more tasks than we can run. And I guess we'd need to check some of these, before we run them, to make sure some other Kibana instance hasn't claimed them in the meantime.
When it tries to mark as running it'll either fail because it's been claimed by someone else or update the expiration, so it should work fine.
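A hedged sketch of the behavior described above (names and types are illustrative, not Task Manager's real API): marking a claimed task as running is a version-checked update, so if another Kibana instance re-claimed the task after our claim expired, the update fails with a conflict and we skip the task instead of running it twice.

```ts
// Illustrative only: optimistic-concurrency "mark as running".

interface ClaimedTask {
  id: string;
  version: string;      // version/seq_no used for the conditional update
  retryAt: Date | null; // claim "expiration"; others may re-claim after this
}

async function tryMarkAsRunning(
  task: ClaimedTask,
  update: (task: ClaimedTask) => Promise<void> // throws 409 on version conflict
): Promise<boolean> {
  try {
    // Push the expiration forward and flip to running in one version-checked update.
    await update({ ...task, retryAt: new Date(Date.now() + 5 * 60 * 1000) });
    return true;
  } catch (err) {
    if ((err as { statusCode?: number }).statusCode === 409) {
      // Someone else claimed it in the meantime; drop it and move on.
      return false;
    }
    throw err;
  }
}
```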
resolves elastic#65552. Currently TaskManager attempts to claim exactly the number of tasks that it has capacity for. As an optimization, we're going to change it to request more tasks than it has capacity for. This should improve latency: when tasks complete, some of these excess tasks may be able to start immediately, since they are already claimed (they still need to be marked running). All the plumbing already handles getting more tasks than we asked for; we just never asked for more than we needed previously.
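An illustrative-only sketch of the idea in that description (not actual Kibana code): a pool that holds more claimed tasks than it has workers, and starts a buffered task as soon as a running task completes instead of waiting for the next polling interval.

```ts
// Sketch: buffer excess claimed tasks and start them as workers free up.

type Task = { id: string; run: () => Promise<void> };

class OverClaimingPool {
  private running = 0;
  private buffered: Task[] = [];

  constructor(private readonly maxWorkers: number) {}

  // Called with the result of a claim cycle, which may contain more tasks
  // than we have free workers; the excess is buffered rather than dropped.
  addClaimedTasks(tasks: Task[]) {
    this.buffered.push(...tasks);
    this.fill();
  }

  private fill() {
    while (this.running < this.maxWorkers && this.buffered.length > 0) {
      const task = this.buffered.shift()!;
      this.running++;
      task.run().finally(() => {
        this.running--;
        // A worker just freed up: immediately start a buffered task,
        // eliminating the wait for the next polling interval.
        this.fill();
      });
    }
  }
}
```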
I've done a little more thinking on the "fetch more tasks to run than available capacity" PR #75429 - which doesn't actually work anyway, right now.

My main concern at this point is that I believe this will actually cause more 409 conflicts when claiming/marking running, when there are > 1 Kibana instances running. Presumably each Kibana will be getting even more tasks that conflict with other Kibanas than if they only got their actual capacity, and adding more Kibanas may make things worse. It's not clear how bad this would be on the system, but you could certainly imagine degenerate cases where some Kibanas consistently get starved by other Kibanas.

I think we'll need a decent set of benchmarks in place before we could make a change like this and ensure it doesn't make things worse :-)
From @kobelb