KAFKA-12495: IncrementalCooperativeAssignor#handleLostAssignments invokes logic for lost assignments even when there are no lost assignments #14000
Conversation
Hey Chris, I tagged you for this minor PR since you have context around these changes.
Hey @vamossagar12, I don't think that this comment is incorrect or confusing enough to warrant a stand-alone PR. If you have other substantive changes in this area, then we can re-examine this comment at that time.
Thanks @gharris1727. Hmm, the meaning of the variable and its usage in the comment is slightly off in this case; updating it would be more accurate IMO. WDYT? Also, there is a bug where empty lost assignments combined with a follow-up rebalance after connector deletions can lead to these lines getting printed: Ideally nothing should happen if lost assignments are empty. I haven't had the chance to look at fixing it, though. I will file a ticket later when I have some time.
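To make the behavior being described concrete, here is a hypothetical, heavily simplified sketch of the guard in question. Only the identifiers `lostAssignments` and `revokedInPrevious` and the quoted log line come from the discussion; the class, method signatures, and return values are illustrative assumptions, not the actual Kafka Connect code.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical, simplified sketch of the handleLostAssignments guard discussed
// above; not the real IncrementalCooperativeAssignor implementation.
public class LostAssignmentsSketch {

    // Buggy variant: when lostAssignments is empty but revokedInPrevious is
    // true, the early exit is skipped and a misleading message is produced.
    static String handleLostAssignments(Set<String> lostAssignments,
                                        boolean revokedInPrevious) {
        if (lostAssignments.isEmpty() && !revokedInPrevious) {
            return "nothing to do";
        }
        if (lostAssignments.isEmpty()) {
            return "No worker seems to have departed the group during the rebalance...";
        }
        return "reassigning lost tasks";
    }

    // Fixed variant: exit as soon as lost assignments are empty, regardless of
    // whether the previous rebalance revoked anything.
    static String handleLostAssignmentsFixed(Set<String> lostAssignments,
                                             boolean revokedInPrevious) {
        if (lostAssignments.isEmpty()) {
            return "nothing to do";
        }
        return "reassigning lost tasks";
    }

    public static void main(String[] args) {
        Set<String> empty = Collections.emptySet();
        // Follow-up rebalance after a revocation: nothing was actually lost.
        System.out.println(handleLostAssignments(empty, true));      // misleading message
        System.out.println(handleLostAssignmentsFixed(empty, true)); // nothing to do
    }
}
```

With the fixed guard, the follow-up rebalance after connector deletions takes the early exit, so the confusing "No worker seems to have departed the group" line never appears.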
Hey Greg, let me know what you think about my above comment, especially updating the comment in question to
…d has revocations and lost assignments is empty
```diff
  addNewEmptyWorkers("worker3");
  performStandardRebalance();
- assertTrue(assignor.delay > 0);
+ assertEquals(40, assignor.delay); // First successive revoking rebalance.
```
Added an elaborate check over here.
```diff
  addNewEmptyWorkers("worker5");
  performStandardRebalance();
- assertTrue(assignor.delay > 40);
+ assertEquals(40, assignor.delay); // First successive revoking rebalance.
```
IMO this happens because we weren't unsetting the counter when the delay goes back to 0. This is the first successive revoking rebalance, so the delay should be 40. I say this because this case is similar to worker2 -> worker3 joining above, where the delay was 40; it should be the same here as well.
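A minimal sketch of the delay/counter interaction being argued here. The field and method names are hypothetical, the base delay of 40 is taken from the assertions above, and the doubling schedule is an illustrative assumption about the exponential backoff, not the assignor's actual arithmetic.

```java
// Hypothetical sketch: a counter of successive revoking rebalances drives an
// exponential backoff delay. The bug: the counter was not unset when the delay
// expired, so a later revocation resumed the backoff instead of starting over
// as a "first successive revoking rebalance".
public class RevocationBackoffSketch {
    static final int BASE_DELAY = 40;

    int successiveRevocations = 0;
    int delay = 0;

    void onRevokingRebalance() {
        successiveRevocations++;
        // First revoking rebalance -> 40, second -> 80, third -> 160, ...
        delay = BASE_DELAY << (successiveRevocations - 1);
    }

    void onDelayExpired() {
        delay = 0;
        successiveRevocations = 0; // the fix: reset the counter with the delay
    }
}
```

With the counter reset on expiry, a worker joining after the expiry triggers a first successive revoking rebalance again, so the delay is exactly 40 rather than something greater than 40.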
```java
// Sixth assignment with sixth worker joining after the expiry.
// Should revoke
time.sleep(assignor.delay);
```
I removed this to prove that the changes still work correctly and the delay gets extended further.
@gharris1727 I updated the PR to incorporate the changes I was talking about here. Note that the original comment correction is not needed anymore.
Test failures are unrelated.
@gharris1727, bumping this one again. It would be nice to fix this since the log line
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
This PR is being marked as stale since it has not had any activity in 90 days. If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
This PR has been closed since it has not had any activity in 120 days. If you feel like this |
This PR addresses a few issues I noticed with #12561. To provide some context, #12561 provided a mechanism to allow successive revoking rebalances, as long as those successive rebalances happened after an exponential backoff timer set after the first successive revoking rebalance. However, there were some things I noticed while doing some testing which need fixing. These are:

1. After a revoking rebalance, the `revokedInPrevious` flag would be set to true. But on the follow-up rebalance, when there are no lost assignments, the condition `lostAssignments.isEmpty() && !revokedInPrevious` would no longer be true, because `revokedInPrevious` would be true. This typically culminates in printing the line `No worker seems to have departed the group during the rebalance...`, which doesn't make sense. Ideally we should exit this condition as soon as lost assignments are empty. Note that this change as such is harmless, but the line's presence can seem very confusing. This can be checked in the logs as follows:

   Log lines after connector deletions:

   Log lines in the subsequent rebalance:

   This PR fixes it by removing the check on the `revokedInPrevious` flag from that condition.

2. The unsetting of the `revokedInPrevious` flag within `handleLostAssignments` has been removed. Instead, it has now been moved to the beginning of the rebalance start loop. That makes the changes a lot easier to reason about, with no loss of the functionality that the original PR aimed to solve.

3. There was also an inconsistency between the flag and the counter being used for unsetting the revocation state. In one place only the flag is unset, but at a couple of other places the counter is also reset. IMO both should be done, since we are signifying that this round has no revoking rebalance. This PR addresses that concern as well.

4. Some general refactoring to move the revoking-rebalance unsetting into a separate method.

5. Modified one of the tests. Note that only one of the tests needed modification, which was failing due to 3 IMO. Other tests worked without any changes.
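The idea of unsetting both the flag and the counter in one place could be sketched as follows. The class and field names are hypothetical illustrations of the refactoring described above, not the actual PR diff.

```java
// Hypothetical sketch: resetting the flag and the counter together in one
// helper, so every call site consistently signifies "this round has no
// revoking rebalance" instead of resetting the flag in one place and the
// counter in another.
public class AssignorStateSketch {
    boolean revokedInPrevious = false;
    int numSuccessiveRevokingRebalances = 0;

    // Called at the beginning of each rebalance round, and anywhere else the
    // revocation state must be cleared.
    void resetRevocationState() {
        revokedInPrevious = false;
        numSuccessiveRevokingRebalances = 0;
    }
}
```

Centralizing the reset keeps the flag and counter from drifting apart, which is the inconsistency point 3 above calls out.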