
Investigate potential race conditions in KafkaSupervisor #5919

Closed

surekhasaharan opened this issue Jun 28, 2018 · 3 comments
surekhasaharan commented Jun 28, 2018

In KafkaSupervisor, the taskGroups map is accessed by multiple threads but modified only on the main exec thread. In checkTaskDuration, the keys of taskGroups are retrieved and passed to checkpointTaskGroup, which executes on the workerExec thread and returns a Future. The potential race condition: while the future is executing, a groupId might be removed from taskGroups. Likely places for this race are KafkaSupervisor.checkpointTaskGroup() and KafkaSupervisor.verifyAndMergeCheckpoints().
Related issue #5900
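
To make the hazard concrete, here is a minimal, self-contained sketch of the pattern (illustrative only; this is not Druid code, though the field names mirror the supervisor's): a key observed during iteration can be removed by the main exec thread before the asynchronous work runs, so the callback can observe null.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only -- not Druid code.
public class TaskGroupRaceSketch
{
  private final Map<Integer, Object> taskGroups = new ConcurrentHashMap<>();
  private final ExecutorService workerExec = Executors.newSingleThreadExecutor();

  void checkTaskDuration()
  {
    for (final Integer groupId : taskGroups.keySet()) {
      // The work below runs later on workerExec; nothing stops the main exec
      // thread from removing groupId in the meantime.
      workerExec.submit(() -> {
        final Object taskGroup = taskGroups.get(groupId); // may now be null
        System.out.println(taskGroup.toString());         // potential NPE
      });
    }
  }

  // Meanwhile, on the main exec thread, finished groups are removed
  // (in the real code, moved to pendingCompletionTaskGroups).
  void onTaskGroupFinished(int groupId)
  {
    taskGroups.remove(groupId);
  }
}

Note that ConcurrentHashMap makes each individual call safe, but it cannot make the get-then-use sequence atomic across threads.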

@surekhasaharan surekhasaharan changed the title Investigate potential NPE due to race conditions in KafkaSupervisor Investigate potential race conditions in KafkaSupervisor Jun 28, 2018
jihoonson (Contributor) commented
I'll add some more details. Here is a snippet of KafkaSupervisor.checkpointTaskGroup():

  private ListenableFuture<Map<Integer, Long>> checkpointTaskGroup(final int groupId, final boolean finalize)
  {
    final TaskGroup taskGroup = taskGroups.get(groupId);
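    // NOTE: taskGroup is captured here on the calling thread; by the time the
    // transform below runs on workerExec, the group may already have been
    // removed from taskGroups (e.g., moved to pendingCompletionTaskGroups).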
...
    return Futures.transform(
        Futures.successfulAsList(pauseFutures), new Function<List<Map<Integer, Long>>, Map<Integer, Long>>()
        {
          @Nullable
          @Override
          public Map<Integer, Long> apply(List<Map<Integer, Long>> input)
          {
            // 3) Build a map of the highest offset read by any task in the group for each partition
            final Map<Integer, Long> endOffsets = new HashMap<>();
            for (int i = 0; i < input.size(); i++) {
              Map<Integer, Long> result = input.get(i);

              if (result == null || result.isEmpty()) { // kill tasks that didn't return a value
                String taskId = pauseTaskIds.get(i);
                log.warn("Task [%s] failed to respond to [pause] in a timely manner, killing task", taskId);
                killTask(taskId);
                taskGroup.tasks.remove(taskId);

              } else { // otherwise build a map of the highest offsets seen
...

So, taskGroup is retrieved on the first line and then used inside the future, which executes on workerExec. The potential race condition is that taskGroup might no longer be in taskGroups by the time the future runs. This matters because taskGroups also carries state: taskGroup instances are moved from taskGroups to pendingCompletionTaskGroups once they finish reading. I'm not sure whether this is intentional.
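
One defensive pattern (a sketch only, assuming Guava executors as in the snippet above; this is not Druid's actual fix) is to re-validate inside the callback that the captured taskGroup is still the one registered under groupId, and skip the work if it has gone stale:

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;

// Illustrative sketch only -- not Druid's actual fix.
public class StaleGroupCheckSketch
{
  static class TaskGroup {}

  private final Map<Integer, TaskGroup> taskGroups = new ConcurrentHashMap<>();
  private final ListeningExecutorService workerExec =
      MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor());

  ListenableFuture<Boolean> checkpointTaskGroup(final int groupId)
  {
    final TaskGroup taskGroup = taskGroups.get(groupId); // captured on the calling thread
    if (taskGroup == null) {
      return Futures.immediateFuture(false); // group already gone
    }

    return workerExec.submit(() -> {
      // By the time this runs, the main exec thread may have moved the group
      // out of taskGroups; compare identities rather than assume presence.
      if (taskGroups.get(groupId) != taskGroup) {
        return false; // stale reference: skip the work instead of hitting an NPE
      }
      // ...checkpoint against a group that is known to still be active...
      return true;
    });
  }
}

An alternative is to confine all reads and writes of taskGroups to the single-threaded main exec, which eliminates the race at the cost of extra thread hops; either way, the get-then-use sequence has to be made consistent.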

github-actions bot commented Jul 2, 2023

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions bot added the stale label Jul 2, 2023
github-actions bot commented

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
