
Optimize TaskBatcher behavior in case of a datacenter failure. #41407

Closed · wants to merge 2 commits

Conversation

@incubos commented Apr 22, 2019

Closes #41406:

  • Replace global synchronized(tasksPerBatchingKey) with fine-grained ConcurrentMap facilities
  • Much faster check for task duplicates in the common case
  • TaskBatcher functional behavior (ordered tasks with identity) remains intact

Tested in production using v5.6.11 release.
CLA signed. gradle check passed.

* Replace the global synchronized lock with ConcurrentMap facilities
* Faster duplicate task check in the common case
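A minimal sketch of the locking change described above (class and method names are illustrative, not the actual Elasticsearch code): `ConcurrentHashMap.compute` locks only the bucket for the given batching key, so submissions under different keys no longer serialize on one global monitor.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: the real TaskBatcher stores task objects with
// identity semantics; strings stand in for them here.
class BatcherSketch {
    private final ConcurrentHashMap<Object, Set<String>> tasksPerBatchingKey =
        new ConcurrentHashMap<>();

    void submit(Object batchingKey, String task) {
        // compute() locks only this key's bucket instead of the whole map,
        // so different batching keys can be submitted concurrently.
        tasksPerBatchingKey.compute(batchingKey, (key, existing) -> {
            Set<String> tasks = existing == null ? new LinkedHashSet<>() : existing;
            if (!tasks.add(task)) {
                throw new IllegalStateException("task [" + task + "] is already queued");
            }
            return tasks;
        });
    }

    int queuedTaskCount(Object batchingKey) {
        Set<String> tasks = tasksPerBatchingKey.get(batchingKey);
        return tasks == null ? 0 : tasks.size();
    }
}
```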
@jasontedor (Member) left a comment
I left a code style review comment.

@colings86 colings86 added the :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. label Apr 23, 2019
@elasticmachine (Collaborator) commented
Pinging @elastic/es-distributed

@ywelsch (Contributor) left a comment

Have you benchmarked the approach here and checked whether it works better under high contention?

if (duplicateTask != null) {
    throw new IllegalStateException("task [" + duplicateTask.describeTasks(
        Collections.singletonList(existing)) + "] with source [" + duplicateTask.source + "] is already queued");
}
LinkedHashMap::new));
A contributor commented:

why LinkedHashMap? Do you care about insertion order?

@incubos (Author) replied:

I have been trying to understand that, and I am not 100% sure whether we should care.
The insertion order was maintained in the original version with the help of LinkedHashSet, though, and I was not brave enough to relax the semantics.
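A minimal illustration of the semantics under discussion (class name is hypothetical): LinkedHashMap iterates in insertion order, whereas a plain HashMap gives no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Demonstrates why LinkedHashMap (like the original LinkedHashSet) matters
// if the batcher's callers rely on submission order.
class InsertionOrderDemo {
    static List<String> keyOrder() {
        Map<String, Integer> tasks = new LinkedHashMap<>();
        tasks.put("task-c", 1);
        tasks.put("task-a", 2);
        tasks.put("task-b", 3);
        // Iteration order matches insertion order: task-c, task-a, task-b.
        // A plain HashMap would make no such guarantee.
        return new ArrayList<>(tasks.keySet());
    }
}
```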

@incubos (Author) commented Jul 1, 2019

I have not benchmarked this approach, but I noticed that all transport_server_worker.default threads were blocked on TaskBatcher.submitTasks() during a datacenter outage (see #41406). I agree that the synchronization code becomes more complicated.
After applying this patch, we reduced cluster convergence time after a datacenter failure from 15 minutes to 3 minutes.
We can try upgrading to 6.4.0+ first, as you suggested in #41406.

@ywelsch (Contributor) commented Jul 2, 2019

Ok, I would prefer that you try out 6.4.0+ first then, as I'm on the fence about whether the changes here are needed, or whether we can get away with something much simpler, e.g. just a faster check for task duplicates.
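The "much simpler" alternative hinted at here could look like the following sketch (names are illustrative, not the actual Elasticsearch code): keep the existing global lock, but back each per-key queue with a LinkedHashSet so the duplicate check is an O(1) `add()` instead of a linear scan, while submission order is still preserved.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

// Hypothetical sketch: global synchronization is kept, but LinkedHashSet
// makes the duplicate check O(1) while preserving submission order.
class SimplerDuplicateCheck {
    private final Map<Object, LinkedHashSet<String>> tasksPerBatchingKey = new HashMap<>();

    synchronized void submit(Object batchingKey, String taskIdentity) {
        LinkedHashSet<String> tasks =
            tasksPerBatchingKey.computeIfAbsent(batchingKey, k -> new LinkedHashSet<>());
        // add() returns false on a duplicate, replacing a linear scan.
        if (!tasks.add(taskIdentity)) {
            throw new IllegalStateException("task [" + taskIdentity + "] is already queued");
        }
    }

    synchronized boolean isQueued(Object batchingKey, String taskIdentity) {
        LinkedHashSet<String> tasks = tasksPerBatchingKey.get(batchingKey);
        return tasks != null && tasks.contains(taskIdentity);
    }
}
```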

@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
@ywelsch ywelsch closed this May 27, 2020
Development

Successfully merging this pull request may close these issues.

Improve TaskBatcher performance in case of a datacenter failure
6 participants