Improve TaskBatcher performance in case of a datacenter failure #41406
Labels
:Distributed Indexing/Distributed
A catch all label for anything in the Distributed Area. Please avoid if you can.
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
We have a production cluster containing several hundred shards distributed between several datacenters. In case of a datacenter failure it takes 10-15 min for the cluster to converge and resume normal functioning, which is not quite acceptable.
Stack trace analysis shows that all
transport_server_worker.default
threads are blocked onTaskBatcher.submitTasks()
during that downtime interval:At the same time
TaskBatcher
queue contains tens of thousands tasks.There are two major causes for the bottleneck:
TaskBatcher
effectively serializes attempts to submit state update tasks due to use ofserialized(tasksPerBatchingKey)
TaskBatcher
compares each added task to all the existing tasks inO(n)
to ensure no duplicate existsThe problem can be easily reproduced using
v5.6.11
release.TaskBatcher
remains unchanged inmaster
.The text was updated successfully, but these errors were encountered: