
refactor: actor wait barrier manager inject barrier #17613

Merged
merged 12 commits into from
Jul 19, 2024

Conversation

wenym1
Contributor

@wenym1 wenym1 commented Jul 8, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Currently, we erase the mutation of the barrier in the remote output, and in the remote input we restore the barrier mutation by fetching it from the LocalBarrierWorker. This ensures that no actor on a CN starts processing a barrier before the LocalBarrierWorker on that CN has received the barrier injected from the meta node.

In this PR, we extend this mechanism to local input/output, to ensure that the mutation of each actor is always obtained from the LocalBarrierWorker, or ultimately from the meta node. This capability is not leveraged yet, but it gives us the flexibility for different actors to receive different mutations for the same barrier.

Since both the remote and local exchange dispatchers now erase the mutation from the barrier, we can make the mutation type of the barrier generic. For a barrier in the exchange dispatcher, the mutation is (), with the type alias type DispatcherBarrier = BarrierInner<()>; for a barrier in an actor, the mutation is the original one. In this way, we statically ensure that the mutation is erased in the dispatcher, and that we always fetch the mutation from the LocalBarrierWorker before emitting the barrier to the real processing logic of the actors. The Message enum also gains a generic mutation type parameter, indicating whether the message is flowing within actors or through the dispatcher.
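The type-level erasure described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual RisingWave definitions: the struct fields, the `into_dispatcher`/`with_mutation` helpers, and the use of `Option<String>` as the mutation type are all simplifications.

```rust
// Illustrative sketch: the mutation type is a generic parameter, so a
// dispatcher-side barrier carries `()` and an actor-side barrier carries
// the full mutation. Forgetting to restore the mutation becomes a compile
// error rather than a runtime bug.
#[derive(Debug, Clone, PartialEq)]
pub struct BarrierInner<M> {
    pub epoch: u64,
    pub mutation: M,
}

/// In the exchange dispatcher the mutation is statically erased.
pub type DispatcherBarrier = BarrierInner<()>;

/// Within an actor the barrier carries the real mutation (simplified to a
/// string here).
pub type Barrier = BarrierInner<Option<String>>;

impl Barrier {
    /// Erase the mutation before the barrier enters an exchange.
    pub fn into_dispatcher(self) -> DispatcherBarrier {
        BarrierInner { epoch: self.epoch, mutation: () }
    }
}

impl DispatcherBarrier {
    /// Restore the mutation fetched from the LocalBarrierWorker.
    pub fn with_mutation(self, mutation: Option<String>) -> Barrier {
        BarrierInner { epoch: self.epoch, mutation }
    }
}

fn main() {
    let b = Barrier { epoch: 1, mutation: Some("scale".to_owned()) };
    // The dispatcher-side type cannot carry a mutation at all.
    let pruned: DispatcherBarrier = b.into_dispatcher();
    let restored = pruned.with_mutation(Some("scale".to_owned()));
    assert_eq!(restored.epoch, 1);
}
```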

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Collaborator

@hzxa21 hzxa21 left a comment


LGTM. The idea is straightforward.

@hzxa21 hzxa21 requested a review from BugenZhao July 11, 2024 10:37
@BugenZhao
Member

Can you please elaborate more in the PR body?

Comment on lines +133 to +138
yield process_msg(msg, |barrier| {
mutation_subscriber.get_or_insert_with(|| {
local_barrier_manager.subscribe_barrier_mutation(self_actor_id, barrier)
})
})
.await?;
Member


There used to be no need for local exchanges to acquire mutation from the channel because we don't prune it on the sender side. So this seems to be a regression. 🤔

Contributor Author


There shouldn't be too much regression when we have #17613. When a local input receives a barrier, in most cases the mutation has already been sent by the LocalBarrierWorker to the mutation subscription channel.
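The reasoning above can be illustrated with a toy subscription channel. This is a hypothetical sketch, not the actual `subscribe_barrier_mutation` implementation: the worker broadcasts (epoch, mutation) pairs as soon as a barrier is injected, and an input that receives a pruned barrier drains the channel until it finds the matching epoch, so by the time a local input sees the barrier the lookup is usually non-blocking.

```rust
use std::collections::HashMap;
use std::sync::mpsc;

// Toy stand-in for the mutation subscription: the LocalBarrierWorker side
// sends (epoch, mutation) pairs; the input side looks mutations up by epoch,
// buffering any pairs for other epochs it drains along the way.
struct MutationSubscriber {
    rx: mpsc::Receiver<(u64, String)>,
    buffered: HashMap<u64, String>,
}

impl MutationSubscriber {
    fn get_mutation(&mut self, epoch: u64) -> String {
        loop {
            if let Some(m) = self.buffered.remove(&epoch) {
                return m;
            }
            // Blocks only if the worker has not injected this epoch yet.
            let (e, m) = self.rx.recv().expect("LocalBarrierWorker dropped");
            self.buffered.insert(e, m);
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut sub = MutationSubscriber { rx, buffered: HashMap::new() };
    // The worker injects the mutation before the input sees the barrier,
    // so the lookup returns immediately.
    tx.send((1, "add-fragment".to_owned())).unwrap();
    assert_eq!(sub.get_mutation(1), "add-fragment");
}
```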

@BugenZhao BugenZhao requested a review from yezizp2012 July 12, 2024 06:34
@wenym1
Contributor Author

wenym1 commented Jul 16, 2024

Can you please elaborate more in the PR body?

Updated.

In brief, the motivation of this PR is to provide the flexibility for different actors to have different mutations. This will be helpful in the partial checkpoint implementation, and for #15490.

For #15490, a rough idea in my mind is that, when we do a scale, we don't need to inject the mutation from the source via the newly injected barrier. We can first locate the most upstream fragment that needs to receive the scale mutation, then figure out the smallest epoch that has not started being processed by any actor in this fragment, and then inject the mutation into the actors of this fragment by sending a different mutation from the LocalBarrierWorker.
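The epoch-selection step of this rough idea could look something like the sketch below. It is purely illustrative; `pick_injection_epoch` and its inputs are hypothetical names, not anything in the codebase.

```rust
// Hypothetical sketch: given, for each actor of the target fragment, the
// latest epoch it has already started processing, and the epochs still
// pending injection, pick the smallest pending epoch that no actor has
// started yet. That epoch is a safe point to attach the scale mutation.
fn pick_injection_epoch(started: &[u64], pending: &[u64]) -> Option<u64> {
    // The injection epoch must be later than anything already started.
    let max_started = started.iter().copied().max().unwrap_or(0);
    pending.iter().copied().filter(|&e| e > max_started).min()
}

fn main() {
    // Some actor has started epoch 4, so the earliest usable epoch is 5.
    assert_eq!(pick_injection_epoch(&[3, 4], &[4, 5, 6]), Some(5));
}
```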

@wenym1 wenym1 requested a review from BugenZhao July 16, 2024 10:16
@BugenZhao
Member

and then figure out the smallest epoch that has not started being processed by any actor in this fragment,

Sounds interesting but delicate under asynchronous checkpointing. 🤣

@wenym1
Contributor Author

wenym1 commented Jul 17, 2024

Sounds interesting but delicate under asynchronous checkpointing. 🤣

Do you mean our current global explicit try_wait_epoch during scaling? This can be resolved by explicitly calling try_wait_epoch after the vnode bitmap has changed. On a vnode bitmap update, executors can first yield the barrier, then update the vnode bitmap, and then locally call try_wait_epoch. Since the barrier has already been yielded, executors can assume that this epoch will eventually finish and will not be blocked by the in-place try_wait_epoch.
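The ordering argued here can be sketched with stub types (illustrative only, not actual executor code): the barrier is yielded downstream before the local wait, so the wait cannot keep the epoch from completing.

```rust
// Stub sketch of the proposed ordering: 1) yield the barrier, 2) apply the
// vnode bitmap update, 3) only then wait for the epoch locally.
#[derive(Debug, PartialEq)]
enum Message {
    Barrier { epoch: u64 },
}

struct Executor {
    vnode_bitmap: Vec<bool>,
}

impl Executor {
    fn on_barrier(&mut self, epoch: u64, new_bitmap: Option<Vec<bool>>, out: &mut Vec<Message>) {
        // 1. Yield the barrier downstream first, so the epoch can complete
        //    regardless of what this executor does next.
        out.push(Message::Barrier { epoch });
        if let Some(bitmap) = new_bitmap {
            // 2. Apply the vnode bitmap update.
            self.vnode_bitmap = bitmap;
            // 3. Wait locally; this cannot deadlock because of step 1.
            self.try_wait_epoch(epoch);
        }
    }

    fn try_wait_epoch(&self, _epoch: u64) {
        // Stub: the real call waits for storage to commit the epoch.
    }
}

fn main() {
    let mut exec = Executor { vnode_bitmap: vec![true, false] };
    let mut out = Vec::new();
    exec.on_barrier(42, Some(vec![false, true]), &mut out);
    assert_eq!(out, vec![Message::Barrier { epoch: 42 }]);
    assert_eq!(exec.vnode_bitmap, vec![false, true]);
}
```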

@yezizp2012
Member

Sounds interesting but delicate under asynchronous checkpointing. 🤣

Do you mean our current global explicit try_wait_epoch during scaling? This can be resolved by explicitly calling try_wait_epoch after the vnode bitmap has changed. On a vnode bitmap update, executors can first yield the barrier, then update the vnode bitmap, and then locally call try_wait_epoch. Since the barrier has already been yielded, executors can assume that this epoch will eventually finish and will not be blocked by the in-place try_wait_epoch.

It sounds like correctness can be guaranteed.

and then figure out the smallest epoch that has not started being processed by any actor in this fragment,

I agree that extending this mechanism to local input/output can benefit partial checkpoint and the drop and cancel commands a lot. Figuring out which epoch or barrier the mutation should be attached to requires more work in the global and local barrier managers, because the actors of the target fragment may be distributed across different compute nodes. I haven't thought of a good solution to ensure certainty yet.

Anyway, the impl LGTM.

@wenym1
Contributor Author

wenym1 commented Jul 18, 2024

@yezizp2012 @BugenZhao Is it ok to merge this PR?

Member

@yezizp2012 yezizp2012 left a comment


LGTM

Member

@BugenZhao BugenZhao left a comment


Rest LGTM

@yezizp2012
Member

From the recovery test failure, it is highly likely that there was a problem with the recovery process. Retrying the slt run 5 or 10 more times should already be enough. 🤔

@wenym1
Contributor Author

wenym1 commented Jul 19, 2024

From the recovery test failure, it is highly likely that there was a problem with the recovery process. Retrying the slt run 5 or 10 more times should already be enough. 🤔

Indeed. I repeatedly saw scale actors failed error=scale_actors failed to acquire reschedule_lock. It looks like we are entering some kind of deadlock.

@wenym1
Contributor Author

wenym1 commented Jul 19, 2024

The recovery test has passed. The previous cause was that, on recovery, the catalog manager only notifies failure at the beginning of recovery. If a create-mv DDL attaches a finish notifier to the catalog manager after the failure has been notified, the notifier will never be notified. The create-mv DDL holds the reschedule read lock, so recovery can never acquire the reschedule write lock; recovery then keeps retrying without notifying the failure, causing a deadlock.

@wenym1 wenym1 enabled auto-merge July 19, 2024 06:29
@wenym1 wenym1 added this pull request to the merge queue Jul 19, 2024
Merged via the queue into main with commit bd3b9a1 Jul 19, 2024
30 of 31 checks passed
@wenym1 wenym1 deleted the yiming/actor-wait-barrier-inject branch July 19, 2024 06:53