Concurrent access to snapshotstore causes partial snapshots to get committed #6255
Comments
Related issue #5611
I guess we need a form of lock when a component is going to modify (i.e. delete, move, etc.) a pending snapshot? e.g.
Is the last point feasible for all components? It probably is for the snapshot director (which will probably be uninstalled beforehand, so all good), but would it be for the follower role? I see we delete pending snapshots when we receive a new one, and when stopping the follower role (so when transitioning to inactive or leader?). I guess the
A follow-up question, though a little out of scope: we have a fallback in case moving is not atomic, right? This means we could potentially move the folder only partially? I suppose we need a way to mark that the snapshot was correctly committed in general, no? So:
That would cover cases where we cannot guarantee an atomic move, right?
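To make the "mark it as committed" idea concrete, here is a minimal sketch, assuming a marker file is written only after the move has finished. The class name `SnapshotCommitSketch` and the marker name `COMMITTED` are hypothetical and not taken from the Zeebe code base. Note that a marker only detects an interrupted move; it would not notice a pending directory that was already incomplete before the move, which is where a checksum would help.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration only; not part of the Zeebe code base. */
final class SnapshotCommitSketch {
  // Hypothetical marker file name.
  static final String MARKER = "COMMITTED";

  /** Moves the pending directory to its final place, then writes the marker. */
  static void commit(Path pending, Path committed) {
    try {
      try {
        Files.move(pending, committed, StandardCopyOption.ATOMIC_MOVE);
      } catch (AtomicMoveNotSupportedException e) {
        // Fallback: a plain move, which may be observed half-done by readers.
        Files.move(pending, committed);
      }
      // Written last, so its presence implies the move above completed.
      Files.createFile(committed.resolve(MARKER));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** A snapshot directory without the marker is treated as partial. */
  static boolean isCommitted(Path snapshotDir) {
    return Files.isRegularFile(snapshotDir.resolve(MARKER));
  }
}
```

On startup, a store following this scheme could delete every snapshot directory for which `isCommitted` returns false.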
Here is a summary of our discussion. There are two issues here:
Potential solutions: Detecting partial snapshots
As a first step, we will evaluate moving FilebasedSnapshotStore to an actor.
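As a rough illustration of why an actor would help, the sketch below funnels every store operation through one thread, so a purge and a commit can never interleave. It uses a plain single-threaded executor as a stand-in for an actor; the class `SerializedSnapshotStore` and its methods are hypothetical and do not reflect the real store's API.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical actor-style store; names are illustrative, not Zeebe's API. */
final class SerializedSnapshotStore implements AutoCloseable {
  // Stand-in for an actor: all operations run one at a time on this thread.
  private final ExecutorService actor = Executors.newSingleThreadExecutor();
  // Only touched from the actor thread, so no extra locking is needed.
  private final Set<String> pending = new HashSet<>();
  private final Set<String> committed = new HashSet<>();

  CompletableFuture<Void> addPending(String id) {
    return CompletableFuture.runAsync(() -> pending.add(id), actor);
  }

  /** What the follower transition would call: drop all pending snapshots. */
  CompletableFuture<Void> purgePending() {
    return CompletableFuture.runAsync(pending::clear, actor);
  }

  /** Completes with true if committed, false if the snapshot was already purged. */
  CompletableFuture<Boolean> commit(String id) {
    return CompletableFuture.supplyAsync(
        () -> {
          if (!pending.remove(id)) {
            return false; // purged earlier: refuse instead of committing a broken snapshot
          }
          committed.add(id);
          return true;
        },
        actor);
  }

  @Override
  public void close() {
    actor.shutdown();
  }
}
```

With this shape, a commit that arrives after a purge fails cleanly, and a purge that arrives during a commit simply waits its turn.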
#3670 (comment)
Broker-1 initiates taking a snapshot. It takes the snapshot in the pending directory and waits for the follow-up event to be committed.
Raft transitions to follower. On transitioning to follower, it deletes all pending snapshots, including the latest pending snapshot that the SnapshotDirector is waiting to commit. But it fails to delete that snapshot completely. This is probably due to a concurrent operation by the SnapshotDirector, but it is not clear from the logs. So we have a snapshot in pending which is partially deleted.
Leader services are still running because the role transitions in raft and in the zeebe services are asynchronous. (The following error is expected during a failover, and we handle this.)
At some point the follow-up event is committed. The SnapshotDirector tries to commit the snapshot, but the snapshot in the pending directory is partially deleted.
As a result the committed snapshot is broken.
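The sequence above can be replayed sequentially on a temporary directory to show how a partial delete followed by a move yields a broken committed snapshot. This is only an illustrative sketch: the class `PartialSnapshotReplay`, the directory names, and the file names are made up, and the real failure involved two components racing rather than a fixed order.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical replay of the interleaving; names are illustrative only. */
final class PartialSnapshotReplay {
  /** Returns the sorted file names that end up in the committed snapshot. */
  static List<String> replay() {
    try {
      Path root = Files.createTempDirectory("snapshots");
      Path pending = Files.createDirectory(root.resolve("pending-1"));
      // The leader writes a pending snapshot with three (made-up) files.
      for (String name : List.of("000001.sst", "CURRENT", "MANIFEST")) {
        Files.writeString(pending.resolve(name), "data");
      }
      // The follower transition starts deleting but only removes one file.
      Files.delete(pending.resolve("CURRENT"));
      // The director commits by moving whatever is left in pending.
      Path committed = root.resolve("snapshot-1");
      Files.move(pending, committed);
      try (Stream<Path> files = Files.list(committed)) {
        return files
            .map(p -> p.getFileName().toString())
            .sorted()
            .collect(Collectors.toList());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The returned list lacks `CURRENT`, which mirrors the broken committed snapshot: the move itself succeeds, so nothing signals that the content is incomplete.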
Root cause:
Concurrent access to snapshots does not work correctly. While the pending snapshot is being deleted by the follower role in raft, the SnapshotDirector is trying to commit the same snapshot, which is now partially deleted.
The concurrent access to the snapshot happened because role transitions in raft are asynchronous with role transitions in zeebe. As a result, while raft is already running follower operations, zeebe may still be running leader services.