Research adding concurrency support to Task Manager #71441
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Just to follow up: having some kind of pre-check hook would also work, and give plugins a way to retry at a later time. This way you all don't have to implement the logic to limit concurrency; instead, each Task could offer a way to limit it via a hook. If a hook is defined, then the retry behavior can be part of the function/hook's returned body:
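A minimal, hypothetical sketch of what such a hook could look like; the `preRun` name, the `PreRunHookResult` shape, the task type, and the `countRunningReports` helper are illustrative, not part of the actual Task Manager API.

```ts
// Hypothetical sketch only; none of these names exist in Task Manager today.
interface PreRunHookResult {
  shouldRun: boolean; // false means "skip me for now"
  retryAt?: Date; // when the task should be offered again
}

// Illustrative stand-in for plugin-owned bookkeeping.
const countRunningReports = async (): Promise<number> => 0;

const exampleTaskDefinition = {
  type: 'report:execute',
  title: 'Execute a report',
  // The hook lets the plugin veto execution and describe its own retry behaviour.
  async preRun(): Promise<PreRunHookResult> {
    if ((await countRunningReports()) >= 1) {
      return { shouldRun: false, retryAt: new Date(Date.now() + 30_000) };
    }
    return { shouldRun: true };
  },
  createTaskRunner: () => ({
    async run() {
      // ...do the actual work...
    },
  }),
};
```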
The hook idea is interesting, being more dynamic than, say, a fixed concurrency setting. I think your example ^^^ (based on a side chat) is a hook run AFTER the task has been read from task manager. My comment in that conversation is that it would be optimal to avoid reading such task types at all, since we'd just read a task that we're not going to run (a wasted task slot) and we'd have to write it back out. ETOOMUCHIO heh. But ... we don't know just yet.
I'd assume so, as sort of a pre-check before the task actually begins execution. Apologies on that, I'm not intimately aware of how the internals work, or even how you serialize/store the tasks themselves and all the sandboxing/processing of them to make sure they're "safe". Anyways, I don't mean to expand the surface area of what we're trying to do here, just thought I'd offer an option that might be cheaper/better/faster to implement. If concurrency is more straightforward, then by all means let's go with that.
I still like the hook idea. I think it would help to have an architecture where each registered task is responsible for determining whether it is available at the time. Could there be a way for task manager to call a hook from each registered task type BEFORE querying the task manager index for pending tasks? I understand there is an update-by-query that claims tasks. The query could be composed of multiple OR filters that include only the "available" task types. Whether the task type is available should be controlled by the plugin that registers the task. That idea helps solve the "Ability to set concurrency to 0" problem: if a hook says the task type is not available, then its concurrency is 0. To solve the ability to set concurrency to 1 or a different number, the information coming out of the hook will have to be able to set the search size.
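To make the shape of that idea concrete, here is a rough sketch (not Task Manager's actual claim query; the `task.taskType`/`task.status` field names and the `idle` status are assumptions) of a claim query restricted to whichever task types reported themselves as available:

```ts
// Sketch: restrict the claim query to the task types whose hooks said "available".
function buildClaimQuery(availableTaskTypes: string[]) {
  return {
    bool: {
      filter: [
        { term: { 'task.status': 'idle' } },
        // effectively an OR across the task types that reported themselves available
        { terms: { 'task.taskType': availableTaskTypes } },
      ],
    },
  };
}

// e.g. Reporting reported "no capacity", so only these types are claimable this cycle
const claimQuery = buildClaimQuery(['alerting:example', 'actions:send-email']);
```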
I've begun looking into this and have a few ideas. It's impossible for us to limit the number of tasks of a specific type returned by the update-by-query, but we can tell the query to only return specific types, so we could omit a type that we do not have capacity for. My current thought process is that if TaskManager knows how many tasks of a certain type are running and what the max concurrency of that type is, we can do a few things:
* This would require checking whether the type can be run before each polling stage, which would be easier and faster with an internal semaphore per type than handing off to an async handler in the TaskTypeDefinition (which also opens us up to potential issues where one TaskType holds up the rest).

** If this is a local semaphore, we can do smart things like query for more items than our max_workers if a certain type is at capacity, which will prevent a build-up of a certain type from clogging up the Task Manager. If we need to keep asking a TaskType then this is harder and, again, more open to potential blocking.

I like the idea of task types being responsible for telling TaskManager whether they can handle any more, but I'm not sure it's worth the added complexity. If we allowed the type to manage this, it would still have to give Task Manager a number of "available slots", rather than true/false (which would always mean 1 or 0), and I think we'd have to make it synchronous. Additionally, my feeling is that if we maintain this semaphore in TaskManager, we can provide better observability over TaskManager's queue and gain a better understanding of why it's doing what it's doing.
Steps 1-3 ^^^ LGTM. A clarifying note on 3: when a task with type X finishes, I think we'd want to look for the next available task to run, independent of type, as long as there is capacity for that type. I don't think, e.g., that when a task with type X finishes we'd only look for the next task of type X. 🤔 ...
Oh yeah, I just meant it would release that type in the semaphore, not that it would take the next item of the same type - just the next one in the queue. 👍
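A minimal sketch of the per-type bookkeeping being discussed, assuming Task Manager holds the counts itself; the names and shape are illustrative:

```ts
// Minimal sketch of per-type bookkeeping held inside Task Manager itself.
// The class name and shape are illustrative, not the eventual implementation.
class TaskTypeCapacity {
  private readonly running = new Map<string, number>();

  constructor(private readonly maxConcurrency: Map<string, number>) {}

  hasCapacity(type: string): boolean {
    const max = this.maxConcurrency.get(type);
    if (max === undefined) return true; // type has no concurrency limit
    return (this.running.get(type) ?? 0) < max;
  }

  // Called when a claimed task of this type starts running.
  acquire(type: string): void {
    this.running.set(type, (this.running.get(type) ?? 0) + 1);
  }

  // Called when a task finishes; this frees up the type for the next poll.
  release(type: string): void {
    this.running.set(type, Math.max(0, (this.running.get(type) ?? 0) - 1));
  }

  // Feeds the claim query: only these types get included this polling cycle.
  availableTypes(allTypes: string[]): string[] {
    return allTypes.filter((type) => this.hasCapacity(type));
  }
}
```

Before each polling stage, `availableTypes()` would feed the claim query; `release()` is the step that frees up the type when a task finishes, after which the poller simply takes the next task in the queue regardless of type.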
A couple of concerns that this approach could introduce. Concern: If a Kibana has no Reporting tasks in its queue (so the type isn't excluded from the claim query) and its query then returns 30 Reporting tasks, the task queue could be clogged up by Reporting tasks and a backlog of Reports will accumulate. Remediation: There are a couple of things we can do that will balance this behaviour out, I believe:
Hi @gmmorris, I want to go back to:
I did some research into Elasticsearch queries to see if there's a way to filter or specify a range for the count of documents in a leaf of a compound query. That's another way of saying what I attempted above with my comment about "set the search size". If Task Manager could access an ES feature like that in the update_by_query, I really think that's the ideal way to go. Second, I have a big question about:
Correct me if I'm wrong, but since the results come from update_by_query, they've already been modified in ES. The update has a script to modify the task status, marking them as claimed.
Ya, I had been thinking of something similar, but more elaborate. Somehow arrange to run multiple task managers, each managing its own set of workers, whose concurrency requirements are similar / the same. We probably don't need to get so elaborate as to actually have multiple task managers, but somehow partition the tasks in some way. One potential downside to this is increased ES i/o, which we are already sensitive to.
I hope we can explore this as the solution. I think there are only 2 partitions needed: unlimited concurrency, and limited or no concurrency. Each partition has its own query. In the unlimited-concurrency partition, there are no size limits set on the search; the task types take whatever gets fired out. In the limited-concurrency partition, we add a specific size param on the search that reflects the capacity of the task types in that partition. It seems like there is an upside to this idea, which is that the different partitions couldn't clog each other.
That is true and could mean this idea is a temporary one. To remove that downside, we would need some enhancement in ES. It could be really great if ES supported something like an msearch-enabled update_by_query.
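A rough sketch of the two-partition idea, assuming one claim search per partition with a `size` only where concurrency is limited (field names and values are illustrative):

```ts
// Illustrative only: one claim search per partition, sized independently.
const maxWorkers = 10;
const unlimitedTypes = ['alerting:example', 'actions:send-email'];
const limitedTypes = ['report:execute'];
const remainingLimitedCapacity = 1; // e.g. Reporting runs one at a time

const unlimitedClaim = {
  size: maxWorkers, // take whatever the workers can absorb
  query: { terms: { 'task.taskType': unlimitedTypes } },
};

const limitedClaim = {
  size: remainingLimitedCapacity, // reflects the capacity of the limited partition
  query: { terms: { 'task.taskType': limitedTypes } },
};
```

Because each partition has its own query and its own size, a pile-up in one partition can't starve the other, which is the upside noted above; the cost is the extra ES round trip per partition, i.e. the increased i/o mentioned earlier.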
Popping into this thread since @tsullivan and I were chatting in slack about it.

An "msearch-enabled update-by-query" would effectively be the same internally as multiple UBQ requests. E.g. it'd be doing the same work, just wrapped up in a single client call instead of multiple client calls. So that is likely a lower-hanging fruit if you wanted to investigate that route. I'm not sure if there will be motivation to enhance UBQ in ES. Would have to check with the search team... generally UBQ and DBQ are not well-liked by the ES team since they have some pointy edge-cases (they don't handle failed shards well, for example).

A more robust method would be getting off UBQ entirely and doing the process yourself: a search request (or msearch), then applying the updates yourself. More work for the client obviously, but UBQ is a relatively simple tool, so I wouldn't expect much time/effort to be put into making it more sophisticated. I'm not the search team though, so they may have different opinions... wouldn't hurt to ping them and see. :)
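For reference, a rough sketch of that "do it yourself" route using optimistic concurrency control: request `seq_no_primary_term` on the search, then send a bulk update with `if_seq_no`/`if_primary_term` per item. The index and field names, the status value, and the loose client typing are assumptions, not Task Manager's actual code:

```ts
// Loosely typed because the exact @elastic/elasticsearch call shapes vary by version.
interface EsClient {
  search(params: unknown): Promise<any>;
  bulk(params: unknown): Promise<any>;
}

async function claimTasks(es: EsClient, index: string, taskTypes: string[], size: number) {
  // Step 1: read candidate tasks, asking ES for _seq_no/_primary_term per hit.
  const searchResult = await es.search({
    index,
    body: {
      size,
      seq_no_primary_term: true,
      query: { bool: { filter: [{ terms: { 'task.taskType': taskTypes } }] } },
    },
  });
  const hits: any[] = searchResult.body?.hits?.hits ?? searchResult.hits?.hits ?? [];
  if (hits.length === 0) return [];

  // Step 2: bulk update, each item conditional on the doc not having changed since the read.
  const body = hits.flatMap((hit) => [
    {
      update: {
        _index: index,
        _id: hit._id,
        if_seq_no: hit._seq_no,
        if_primary_term: hit._primary_term,
      },
    },
    { doc: { task: { status: 'claiming' } } }, // status value is illustrative
  ]);
  const bulkResult = await es.bulk({ body });
  const items: any[] = bulkResult.body?.items ?? bulkResult.items ?? [];

  // Step 3: keep only the tasks we actually won; a 409 means another Kibana claimed it first.
  return items.filter((item) => item.update && !item.update.error && item.update.status !== 409);
}
```

An individual 409 here is simply "another instance won the race" and can be dropped from the claimed set rather than failing the whole batch.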
Thanks for doing that... I, too, haven't found a solution to this.
Sorry, I don't follow this... What do you mean by access an ES feature? 🤔
Correct, but I can't figure out a way to do that in the UpdateByQuery... any idea how this could be achieved?
Yup, I have investigated this too, but it adds a lot of complexity to the timing in Task Manager. We already have a hard time in terms of observability into this, and I'm seriously worried about how we can support this without first addressing the lack of observability into TM. (I suspect @kobelb might have thoughts on this too.) Balancing between these different cycles and ensuring that we understand what's happening in there is going to be difficult once you have multiple different concurrency configurations. One of the problems I encountered when experimenting with this was that the tasks without concurrency limitations would "starve" the ones with concurrency limitations once there were many tasks in TM.
Thanks @polyfractal, I really appreciate you weighing in. The key difference for us has been the difference in time between the query and the update, which is where the version conflicts creep in.
To say it more plainly, this point of view is about requesting a new feature in Elasticsearch that would benefit Task Manager.
I get why using UBQ alleviates the timing problems of version conflicts across multiple instances trying to claim tasks with high throughput. The query and update in a single client call seems more "transactional" because the timing works out better. Based on what @polyfractal said about UBQ being semantic sugar though, I'm trying to get away from thinking of UBQ as a transaction. Maybe (I truly don't know) even with UBQ, 2 instances of Kibana could still attempt to update the same task documents (aka "claim the task") and get a version conflict error. The version conflict problem should be recoverable: if a Kibana can't claim a task because another instance claimed it, then it should ignore it and let the other instance handle it. It is a problem when a subset of bulk documents fail to update and cause the overall set to fail. Maybe that's where the work needs to go: with that fixed, the claiming cycle doesn't need to be "transaction-like."
Ah, yeah, haha, I'm trying to work with what we have, but obviously we'd love it if ES had an idea on their end that could help us.
Yeah, it still happens, but not nearly as often. Which is why we switched to it.
Sorry, I wasn't clear - it is recoverable; it was one of the first things we addressed in TM, but it still hurt performance, as we ended up with lots of wasted cycles that made it hard to scale horizontally. I'm happy to jump on a call and run you through everything we've done in TM over the past year. I think you'll find we've addressed all of the low-hanging fruit and we're now trying to figure out how to fit new functionality within the limitations that we're forced to work with.
I've dragged this back into
Yeah correct, UBQ still technically can run into the same version conflict issues as regular query-then-update... it just tends to happen less because the round-trip time is smaller/local. But that said, if there is sufficient contention in general, this reduced latency is only a bandaid and will run into scaling issues eventually too. E.g. it just kicks the can down the road if there are fundamental contention issues, which need to be solved in a different manner (multiple work-stealing queues or something).
++ Yeah just wanted to mention that I agree, if there's an opportunity here to build something that really solves the issue we shouldn't ignore that. I was just popping in to make a note about UBQ in particular. So don't view my comments as discouraging exploring options, including specialized functionality (or modifying existing functionality) to make things work better. /cc @jpountz in case we need to grab someone from Search or Core to help look into this :)
Thanks Zachary 🙏
I've spiked an approach where we spawn a separate poller for the limited concurrency task type, and it seems to work pretty well. PR is here: #74883
I think I like this approach :-) Just took a quick peek through the code though ... It seems like a very clean way to separate these sorts of things out - from a higher level - rather than make the innards more complicated. My main worry is the additional i/o on ES that this will cause. I think in the case of reporting, we could certainly change the polling interval to be longer, which I think would be the easiest way to cut down on the i/o.
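For illustration only (this is not the code in the linked PR), the general shape would be two independent polling loops, with the limited-concurrency one running on a longer interval to keep the extra ES i/o down:

```ts
// Purely illustrative; not the implementation in the PR.
function startPoller(label: string, intervalMs: number, pollOnce: () => Promise<void>) {
  let stopped = false;
  void (async () => {
    while (!stopped) {
      try {
        await pollOnce();
      } catch (err) {
        console.error(`[${label}] poll failed`, err);
      }
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  })();
  return () => {
    stopped = true; // stop handle
  };
}

// Hypothetical stand-ins for the real claim-and-run cycles.
const pollDefaultTasks = async () => {/* claim + run the unlimited task types */};
const pollReportingTasks = async () => {/* claim + run the limited-concurrency type */};

// The default poller keeps its cadence; the limited-concurrency poller runs less often
// to cut down on the extra ES i/o the second polling loop introduces.
startPoller('default', 3_000, pollDefaultTasks);
startPoller('limited-concurrency', 10_000, pollReportingTasks);
```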
So do I :) (thanks)
Yes, agreed, it's my concern as well.
@tsullivan @joelgriffith How do y'all feel about Reporting polling for work at a lower rate than the normal task polling?
Currently our polling interval is, I think, slower than the normal task polling, so I think we're ok all around on this.
👍
For Reporting, there is generally a lot of "idle time". Is a search more performant than an update_by_query? If so, perhaps we could do a simple search to see if there's anything to run prior to running the update_by_query?
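A sketch of that pre-check idea, assuming a cheap `_count` request gates the heavier `update_by_query` claim; the field names, the task type, and the loose client shape are assumptions:

```ts
// Sketch of a cheap pre-flight check before the heavier update_by_query claim.
interface EsCountClient {
  count(params: unknown): Promise<any>;
}

async function hasPendingReportingTasks(es: EsCountClient, index: string): Promise<boolean> {
  const result = await es.count({
    index,
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.taskType': 'report:execute' } },
            { term: { 'task.status': 'idle' } },
          ],
        },
      },
    },
  });
  const count = result.body?.count ?? result.count ?? 0;
  return count > 0;
}

// Only pay for the update_by_query when the pre-flight check finds work:
// if (await hasPendingReportingTasks(es, '.kibana_task_manager')) { /* run the claim */ }
```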
Pinging @elastic/kibana-app-services (Team:AppServices)
That's an interesting question... Worth investigating further though.
This has now been delivered 🎉
Research approaches to satisfy #54916 in order to come up with level of effort and timeline.