Fix Concurrent Task Insertion in pendingCompletionTaskGroups #16834
...a/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisorStateTest.java
I think the logic may be incorrect. We should expect the following test (that I have written) to pass with the current logic and it does.
I think it is because you're handling the task group but not the start offsets as part of your logic.
It might be helpful if you could also add comments explaining what this method does, why concurrent insertion was previously failing, and why changing the order of steps helps.
Hey @AmatyaAvadhanula, thanks for identifying this case, which I missed.
Thanks for the fix, and the changes @hardikbajaj!
I was just wondering if we could somehow test this without adding a test-only method. LGTM otherwise.
LGTM! Thank you, @hardikbajaj
…16834) Fix streaming task failures that may arise due to concurrent task insertion in pendingCompletionTaskGroups
…16834) (#219) Fix streaming task failures that may arise due to concurrent task insertion in pendingCompletionTaskGroups apache@1cf3f4b
Fixes #16727
Description
Fixed a thread synchronisation issue in addDiscoveredTaskToPendingCompletionTaskGroups so that pendingCompletionTaskGroups is properly locked while a task group is being initialised.
Multiple threads with different task ids hit this function at the same time. Each reads a stale copy of the concurrent hash map and creates a new pending completion task group for tasks with the same group id, which ideally should all be added to a single TaskGroup.
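The stale-read pattern described above can be reproduced in isolation. The following is a minimal sketch, not the Druid code: the class `StaleReadSketch`, the list `taskGroups`, and the method `addTaskUnsafely` are hypothetical names, and a `CyclicBarrier` is used only to force the bad interleaving deterministically (both threads read before either writes).

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CyclicBarrier;

public class StaleReadSketch {
    // Hypothetical stand-in: each inner list is one "task group" for the same group id.
    static final List<List<String>> taskGroups = new CopyOnWriteArrayList<>();
    // Forces both threads to finish their read before either one writes.
    static final CyclicBarrier bothHaveRead = new CyclicBarrier(2);

    static void addTaskUnsafely(String taskId) {
        boolean noGroupYet = taskGroups.isEmpty(); // read
        try {
            bothHaveRead.await();                  // the other thread performs its read too
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        if (noGroupYet) {
            // Write based on a read that is stale by now: each thread creates its own group.
            List<String> group = new CopyOnWriteArrayList<>();
            group.add(taskId);
            taskGroups.add(group);
        } else {
            taskGroups.get(0).add(taskId);
        }
    }

    static int run() {
        taskGroups.clear();
        Thread t1 = new Thread(() -> addTaskUnsafely("A1"));
        Thread t2 = new Thread(() -> addTaskUnsafely("A2"));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return taskGroups.size();
    }

    public static void main(String[] args) {
        // Two single-task groups instead of one group holding A1 and A2.
        System.out.println("task groups created: " + run()); // prints 2
    }
}
```

In real code there is no barrier, so the bad interleaving only happens occasionally, which is what makes this kind of bug hard to spot.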
To properly synchronize pendingCompletionTaskGroups across multiple threads, all updates need to happen inside the compute block, because it locks Map[key] and performs the write based on a locked read. This synchronises the value across all running threads.
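As a rough illustration of the compute-based approach, here is a minimal sketch using `ConcurrentHashMap.compute`, which runs the remapping function atomically for the given key. The class `ComputeSketch`, the map `pendingGroups`, and its simplified value type (a plain list of task ids per group id) are assumptions made for this example; the actual supervisor code tracks richer TaskGroup objects and, as noted in the review, start offsets as well.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

public class ComputeSketch {
    // Hypothetical stand-in: group id -> task ids belonging to that group.
    static final ConcurrentHashMap<Integer, List<String>> pendingGroups = new ConcurrentHashMap<>();

    static void addTask(int groupId, String taskId) {
        // The read (existing) and the write (returned value) happen atomically per key,
        // so no thread can act on a stale view of this entry.
        pendingGroups.compute(groupId, (id, existing) -> {
            List<String> tasks = (existing == null) ? new CopyOnWriteArrayList<>() : existing;
            tasks.add(taskId);
            return tasks;
        });
    }

    static int run(int threadCount) {
        pendingGroups.clear();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            String taskId = "A" + i;
            threads[i] = new Thread(() -> {
                try {
                    start.await(); // release all threads at once to maximise contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                addTask(0, taskId);
            });
            threads[i].start();
        }
        start.countDown();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return pendingGroups.get(0).size();
    }

    public static void main(String[] args) {
        // All tasks land in the single entry for group 0.
        System.out.println("tasks in group 0: " + run(8));
    }
}
```

The design point is that the existence check and the insertion live inside one remapping function, so there is no window between "is there a group yet?" and "create one".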
addDiscoveredTaskToPendingCompletionTaskGroups is not properly thread-synchronized, and updates are made on the basis of stale reads. When a supervisor config is submitted, there are cases where, instead of adding all tasks to the same TaskGroup, it creates multiple copies of TaskGroups. For example, if A1 and A2 are consuming from the same partition and are in the same group, they can end up in separate TaskGroups.
This happens because, while initialising a new TaskGroup, the threads rely on a stale read: multiple threads executing simultaneously and adding tasks to pendingCompletionTaskGroups can create new task groups instead of adding to the existing one. This behaviour defeats the purpose of task replication: if one of these single-task task groups fails for some reason, the Overlord treats the entire task group as failed and also kills the actively reading tasks, in order to resume ingestion from the last published segment.