
Optimize TaskBatcher behavior in case of a datacenter failure. #41407

Closed · wants to merge 2 commits

Conversation

@incubos commented Apr 22, 2019

Closes #41406:

  • Replace global synchronized(tasksPerBatchingKey) with fine-grained ConcurrentMap facilities
  • Much faster check for task duplicates in the common case
  • TaskBatcher functional behavior (ordered tasks with identity) remains intact

Tested in production using v5.6.11 release.
CLA signed. gradle check passed.

* Replace the global synchronized lock with ConcurrentMap facilities
* Faster duplicate task check in the common case
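A minimal sketch of the locking change described above (class and method names are illustrative, not the actual Elasticsearch code): `ConcurrentHashMap.compute` locks only the bucket for the given batching key, so submissions under different keys no longer serialize on one global monitor.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: the real TaskBatcher stores task objects with
// identity semantics; strings stand in for them here.
class BatcherSketch {
    private final ConcurrentHashMap<Object, Set<String>> tasksPerBatchingKey =
        new ConcurrentHashMap<>();

    void submit(Object batchingKey, String task) {
        // compute() locks only this key's bucket instead of the whole map,
        // so different batching keys can be submitted concurrently.
        tasksPerBatchingKey.compute(batchingKey, (key, existing) -> {
            Set<String> tasks = existing == null ? new LinkedHashSet<>() : existing;
            if (!tasks.add(task)) {
                throw new IllegalStateException("task [" + task + "] is already queued");
            }
            return tasks;
        });
    }

    int queuedTaskCount(Object batchingKey) {
        Set<String> tasks = tasksPerBatchingKey.get(batchingKey);
        return tasks == null ? 0 : tasks.size();
    }
}
```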
@jasontedor (Member) left a comment
I left a code style review comment.

@colings86 colings86 added the :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. label Apr 23, 2019
@elasticmachine (Collaborator) commented
Pinging @elastic/es-distributed

@ywelsch (Contributor) left a comment

Have you benchmarked the approach here and checked whether it works better under high contention?

if (duplicateTask != null) {
    throw new IllegalStateException("task [" + duplicateTask.describeTasks(
        Collections.singletonList(existing)) + "] with source [" + duplicateTask.source + "] is already queued");
}
LinkedHashMap::new));
A contributor commented:

why LinkedHashMap? Do you care about insertion order?

@incubos (Author) replied:

I have been trying to understand that, and I am not 100% sure whether we should care.
The insertion order was maintained in the original version with the help of LinkedHashSet, though, and I was not brave enough to relax the semantics.
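A minimal illustration of the semantics under discussion (class name is hypothetical): LinkedHashMap iterates in insertion order, whereas a plain HashMap gives no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Demonstrates why LinkedHashMap (like the original LinkedHashSet) matters
// if the batcher's callers rely on submission order.
class InsertionOrderDemo {
    static List<String> keyOrder() {
        Map<String, Integer> tasks = new LinkedHashMap<>();
        tasks.put("task-c", 1);
        tasks.put("task-a", 2);
        tasks.put("task-b", 3);
        // Iteration order matches insertion order: task-c, task-a, task-b.
        // A plain HashMap would make no such guarantee.
        return new ArrayList<>(tasks.keySet());
    }
}
```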

@incubos (Author) commented Jul 1, 2019

I have not benchmarked this approach, but I noticed that all transport_server_worker.default threads were blocked on TaskBatcher.submitTasks() during a datacenter outage (see #41406). I agree that the synchronization code becomes more complicated.
After applying this patch, we reduced cluster convergence time after a datacenter failure from 15 minutes to 3 minutes.
We can try upgrading to 6.4.0+ first, as you suggested in #41406.

@ywelsch (Contributor) commented Jul 2, 2019

Ok, I would prefer that you try out 6.4.0+ first then, as I'm on the fence about whether the changes here are needed, or whether we can get away with something much simpler, e.g. just a faster check for task duplicates.
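The "much simpler" alternative hinted at here could look like the following sketch (names are illustrative, not the actual Elasticsearch code): keep the existing global lock, but back each per-key queue with a LinkedHashSet so the duplicate check is an O(1) `add()` instead of a linear scan, while submission order is still preserved.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

// Hypothetical sketch: global synchronization is kept, but LinkedHashSet
// makes the duplicate check O(1) while preserving submission order.
class SimplerDuplicateCheck {
    private final Map<Object, LinkedHashSet<String>> tasksPerBatchingKey = new HashMap<>();

    synchronized void submit(Object batchingKey, String taskIdentity) {
        LinkedHashSet<String> tasks =
            tasksPerBatchingKey.computeIfAbsent(batchingKey, k -> new LinkedHashSet<>());
        // add() returns false on a duplicate, replacing a linear scan.
        if (!tasks.add(taskIdentity)) {
            throw new IllegalStateException("task [" + taskIdentity + "] is already queued");
        }
    }

    synchronized boolean isQueued(Object batchingKey, String taskIdentity) {
        LinkedHashSet<String> tasks = tasksPerBatchingKey.get(batchingKey);
        return tasks != null && tasks.contains(taskIdentity);
    }
}
```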

@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
@ywelsch ywelsch closed this May 27, 2020
Development

Successfully merging this pull request may close these issues.

Improve TaskBatcher performance in case of a datacenter failure
6 participants