KAFKA-17109: Move lock backoff retry to streams TaskManager #17209

Open
wants to merge 5 commits into
base: trunk

aliehsaeedii
Contributor

This PR aims to resolve the issue introduced by #17116.

@mumrah
Contributor

mumrah commented Sep 16, 2024

@aliehsaeedii please update the PR title to have a description of the patch. Thanks!

@aliehsaeedii aliehsaeedii changed the title KAFKA-17109 KAFKA-17109: Move lock backoff retry to streams TaskManager Sep 16, 2024
Contributor

@cadonna cadonna left a comment

Thanks for the PR, @aliehsaeedii !

Here is my feedback.

I am missing unit tests.

}

public boolean canAttempt(final long nowMs) {
    return nowMs - lastAttemptMs >= EXPONENTIAL_BACKOFF.backoff(attempts);
Contributor

nit:

Suggested change
return nowMs - lastAttemptMs >= EXPONENTIAL_BACKOFF.backoff(attempts);
return nowMs - lastAttemptMs >= EXPONENTIAL_BACKOFF.backoff(attempts);

Comment on lines 1016 to 1017
stateUpdater.add(task);
taskIdToBackoffRecord.remove(task.id());
Contributor

Minor:
I would swap those two lines. Once the task is initialized, the backoff can be removed.

taskIdToBackoffRecord.remove(task.id());
} else {
log.trace("Task {} is still not allowed to retry acquiring the state directory lock", task.id());
handleUnsuccessfulLockAcquiring(task, nowMs);
Contributor

Is this correct?
Every time initialization is skipped because the back-off has not yet elapsed, the time of the last attempt is still updated to the current time. If we assume one attempt per poll interval and the poll interval is shorter than the back-off time, the task will never be initialized.
Assume the last unsuccessful attempt occurred at time 200 and the current call to canTryLock() happens 100ms later, at time 300. Furthermore, assume the current back-off is 250. Then canTryLock() returns false, because 300 - 200 >= 250 is not true. The last attempt is nevertheless updated to 300, and the back-off grows exponentially with the increased number of attempts (let's say to 500). If you try again 100ms later, at 400, canTryLock() again returns false, because 400 - 300 >= 500 is still not true, and it will not be true on any later call either. You should only update the back-off record if you actually attempted to initialize the task and the attempt was unsuccessful, not when you skipped the attempt due to the back-off.
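
The scenario in this comment can be reproduced with a small, self-contained sketch. It is not the code from the PR: `BackoffSkipDemo`, its nested `BackoffRecord`, and `firstAttemptMs` are hypothetical stand-ins, and the back-off is assumed to be jitter-free with a 250ms base so that the numbers match the example (250ms, then 500ms).

```java
public class BackoffSkipDemo {

    /** Simplified stand-in for the PR's BackoffRecord: no jitter, 250 ms base (assumption). */
    static final class BackoffRecord {
        private int attempts;
        private long lastAttemptMs;

        BackoffRecord(final int attempts, final long lastAttemptMs) {
            this.attempts = attempts;
            this.lastAttemptMs = lastAttemptMs;
        }

        /** 250 ms after the first failure, 500 ms after the second, and so on, capped at 10 s. */
        long backoffMs() {
            final int exponent = Math.max(0, Math.min(attempts - 1, 5));
            return Math.min(10_000L, 250L << exponent);
        }

        boolean canAttempt(final long nowMs) {
            return nowMs - lastAttemptMs >= backoffMs();
        }

        void recordFailure(final long nowMs) {
            attempts++;
            lastAttemptMs = nowMs;
        }
    }

    /**
     * Polls every 100 ms from t=300 to t=2000 and returns the time of the first poll
     * at which an attempt is allowed, or -1 if no poll is ever allowed.
     * With updateOnSkip = true, a skipped poll is recorded like a failed attempt,
     * which is the behaviour questioned in the review.
     */
    static long firstAttemptMs(final boolean updateOnSkip) {
        // One unsuccessful attempt at t=200, so the current back-off is 250 ms.
        final BackoffRecord record = new BackoffRecord(1, 200L);
        for (long nowMs = 300L; nowMs <= 2000L; nowMs += 100L) {
            if (record.canAttempt(nowMs)) {
                return nowMs;
            }
            if (updateOnSkip) {
                record.recordFailure(nowMs);
            }
        }
        return -1L;
    }

    public static void main(final String[] args) {
        System.out.println("record updated on skip:         " + firstAttemptMs(true));
        System.out.println("record updated on failure only: " + firstAttemptMs(false));
    }
}
```

Under these assumptions, updating the record on every skipped poll means no poll in the window is ever allowed to retry, while updating it only after a real failed attempt allows the retry at t=500, the first poll at least 250ms after t=200.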

public static class BackoffRecord {
    private long attempts;
    private long lastAttemptMs;
    private static final ExponentialBackoff EXPONENTIAL_BACKOFF = new ExponentialBackoff(1, 2, 10000, 0.5);
Contributor

Should the exponential back-off be specified in terms of poll time? Something like

new ExponentialBackoff(pollTime, 2, 10000, 0.5);

If it is too much trouble to get that config into the task manager, just choose something larger than 1ms; 1ms sounds really small. The sequence of back-offs would be 1ms, 2ms, 4ms, 8ms, 16ms, 32ms, 64ms, 128ms. At the same time, with default configs, task initialization is attempted every 100ms. So there would not be much improvement over the current situation, because during the first 7 poll iterations you would still attempt to initialize the task.
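
The arithmetic behind this comment can be checked with a short sketch. It deliberately ignores jitter and does not use Kafka's `ExponentialBackoff`; `BackoffSequenceDemo`, `backoffs`, and `countShorterThan` are hypothetical helpers that only model "initial interval times multiplier to the power of the attempt, capped at a maximum".

```java
import java.util.ArrayList;
import java.util.List;

public class BackoffSequenceDemo {

    /** Returns the first `count` jitter-free back-offs: initialMs * multiplier^i, capped at maxMs. */
    static List<Long> backoffs(final long initialMs, final int multiplier, final long maxMs, final int count) {
        final List<Long> result = new ArrayList<>();
        long current = initialMs;
        for (int i = 0; i < count; i++) {
            result.add(Math.min(current, maxMs));
            current = Math.min(current * multiplier, maxMs);
        }
        return result;
    }

    /** Counts the back-offs that are shorter than one poll interval and so cause no extra delay. */
    static long countShorterThan(final List<Long> backoffs, final long pollMs) {
        return backoffs.stream().filter(b -> b < pollMs).count();
    }

    public static void main(final String[] args) {
        final long pollMs = 100L; // poll interval assumed in the review comment

        final List<Long> fromOneMs = backoffs(1L, 2, 10_000L, 8);
        System.out.println(fromOneMs + " -> shorter than poll: " + countShorterThan(fromOneMs, pollMs));

        final List<Long> fromPollMs = backoffs(pollMs, 2, 10_000L, 8);
        System.out.println(fromPollMs + " -> shorter than poll: " + countShorterThan(fromPollMs, pollMs));
    }
}
```

With a 1ms initial interval, seven of the first eight back-offs are shorter than the 100ms poll interval, which matches the "first 7 poll iterations" in the comment. Starting from the poll interval instead, none are.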

@aliehsaeedii
Contributor Author

aliehsaeedii commented Sep 16, 2024

> Thanks for the PR, @aliehsaeedii !
>
> Here is my feedback.
>
> I am missing unit tests.

Thanks @cadonna. Unit tests are added and the review comments are addressed.

@@ -2116,4 +2132,37 @@ boolean needsInitializationOrRestoration() {
void addTask(final Task task) {
tasks.addTask(task);
}

private boolean canTryLock(final TaskId taskId, final long nowMs) {
Contributor

Sorry, I forgot to add this comment before in my review. Could you please rename this method to canTryInitializeTask()? I think that makes more sense.

Contributor Author

@cadonna makes sense!

Contributor

@cadonna cadonna left a comment

Thanks for the updates!

Here are my comments.

@@ -1006,14 +1014,22 @@ private void addTasksToStateUpdater() {
}

private void addTaskToStateUpdater(final Task task) {
final long nowMs = System.currentTimeMillis();
Contributor

Here, you need to use

Suggested change
final long nowMs = System.currentTimeMillis();
final long nowMs = time.milliseconds();

We inject the time object at creation, so that we can control time for example in tests.
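
As a rough illustration of why an injected time source helps, here is a minimal sketch. It does not use Kafka's `Time` or `MockTime`; `TimeSource`, `ManualTime`, and `RetryGate` are hypothetical stand-ins written only to keep the example self-contained, and the fixed 250ms back-off is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

public class InjectedTimeDemo {

    /** Hypothetical minimal time source, standing in for the injected time object. */
    interface TimeSource {
        long milliseconds();
    }

    /** Test-only time source that moves forward only when told to. */
    static final class ManualTime implements TimeSource {
        private final AtomicLong nowMs;

        ManualTime(final long startMs) {
            this.nowMs = new AtomicLong(startMs);
        }

        @Override
        public long milliseconds() {
            return nowMs.get();
        }

        void advance(final long deltaMs) {
            nowMs.addAndGet(deltaMs);
        }
    }

    /** Hypothetical component that reads time only through the injected source. */
    static final class RetryGate {
        private final TimeSource time;
        private final long backoffMs;
        private final long lastAttemptMs;

        RetryGate(final TimeSource time, final long backoffMs) {
            this.time = time;
            this.backoffMs = backoffMs;
            this.lastAttemptMs = time.milliseconds(); // treat construction as the last attempt
        }

        boolean canAttempt() {
            return time.milliseconds() - lastAttemptMs >= backoffMs;
        }
    }

    public static void main(final String[] args) {
        final ManualTime time = new ManualTime(1_000L);
        final RetryGate gate = new RetryGate(time, 250L);

        System.out.println(gate.canAttempt()); // false: no time has passed yet
        time.advance(250L);
        System.out.println(gate.canAttempt()); // true: the back-off has elapsed
    }
}
```

Because the gate never calls `System.currentTimeMillis()` directly, a test can step through the back-off deterministically without sleeping and without mocking the back-off record itself.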

Comment on lines +156 to +159
/* For testing */
void setTaskIdToBackoffRecord(final Map<TaskId, BackoffRecord> taskIdToBackoffRecord) {
    this.taskIdToBackoffRecord = taskIdToBackoffRecord;
}
Contributor

I do not think you need this method if you control time as described in my comment on line 1017.

.inState(State.RUNNING).build();
final TasksRegistry tasks = mock(TasksRegistry.class);
when(tasks.drainPendingTasksToInit()).thenReturn(mkSet(task00, task01));
final TaskManager.BackoffRecord backoffRecord = mock(TaskManager.BackoffRecord.class);
Contributor

You do not need this mock. You can advance time with the time object.
