Concurrent access to snapshotstore causes partial snapshots to get committed #6255

Closed
deepthidevaki opened this issue Feb 4, 2021 · 4 comments · Fixed by #6283
Labels
kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround

Comments

@deepthidevaki
Contributor

#3670 (comment)

Broker-1 initiates taking a snapshot. It takes the snapshot in the pending directory and waits for the follow-up event to be committed.

2021-01-14 02:52:52.877 [Broker-1-SnapshotDirector-3] [Broker-1-zb-fs-workers-1] INFO  io.zeebe.logstreams.snapshot - Finished taking snapshot, need to wait until last written event position 21488378670 is committed, current commit position is 21488376579. After that snapshot can be marked as valid.

Raft transitions to follower. On transitioning to follower, it deletes all pending snapshots, including the latest pending snapshot that the SnapshotDirector is waiting to commit. However, it fails to delete it completely, probably due to a concurrent operation by the SnapshotDirector, though this is not clear from the logs. So we have a snapshot in the pending directory which is partially deleted.

at io.atomix.raft.roles.PassiveRole.abortPendingSnapshots(PassiveRole.java:418) ~[atomix-cluster-0.25.2.jar:0.25.2]
at io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore.purgePendingSnapshots(FileBasedSnapshotStore.java:112) ~[zeebe-snapshots-0.25.2.jar:0.25.2]
java.nio.file.NoSuchFileException: /usr/local/zeebe/data/raft-partition/partitions/3/pending/10926499-195-1610591887661-21488376573-21488376579.tmp/229345.sst
2021-01-14 02:52:52.878 [] [raft-server-1-raft-partition-partition-3] ERROR io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore - Failed to delete not completed (orphaned) snapshot /usr/local/zeebe/data/raft-partition/partitions/3/pending/10926499-195-1610591887661-21488376573-21488376579.tmp
2021-01-14 02:53:00.272 [] [raft-server-2-raft-partition-partition-3] WARN  io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-3}{role=FOLLOWER} - Poll request to 0 failed: java.util.concurrent.TimeoutException: Request type raft-partition-partition-3-poll timed out in 5000 milliseconds

Leader services are still running because the role transitions of Raft and of the Zeebe services are asynchronous. (The following error is expected during a failover, and we handle it.)

java.lang.IllegalStateException: Unexpected position 21488378687 was encountered after position 21488378677 when appending positions <21488378687, 21488378688>.
2021-01-14 02:53:46.001 [] [raft-server-1-raft-partition-partition-3] ERROR io.zeebe.logstreams - Failed to append block with last event position 21488378688.

At some point the follow-up event is committed and the SnapshotDirector tries to commit the snapshot. But the snapshot in the pending directory has been partially deleted.

2021-01-14 03:02:10.879 [Broker-1-SnapshotDirector-3] [Broker-1-zb-fs-workers-1] INFO  io.zeebe.logstreams.snapshot - Current commit position 21488378670 is greater than 21488378670, snapshot 10926499-195-1610591887661-21488376573-21488376579 is valid and has been persisted.

As a result, the committed snapshot is broken.

Root cause:
Concurrent access to snapshots does not work correctly. While the pending snapshot is being deleted by the follower role in Raft, the SnapshotDirector is trying to commit the same snapshot, which is by then partially deleted.
The concurrent access happened because role transitions in Raft are asynchronous with role transitions in Zeebe. As a result, while Raft is already running follower operations, Zeebe may still be running leader services.
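
For illustration, here is a minimal sketch of how the two code paths can interleave, in plain Java with hypothetical `purgePendingSnapshot`/`persistSnapshot` helpers (these are not the actual Zeebe methods):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public final class SnapshotRaceSketch {

  // Raft follower path: purges a pending snapshot one file at a time (not atomic).
  static void purgePendingSnapshot(final Path pendingDir) throws IOException {
    final List<Path> files;
    try (Stream<Path> stream = Files.list(pendingDir)) {
      files = stream.collect(Collectors.toList());
    }
    for (final Path file : files) {
      Files.deleteIfExists(file); // can interleave with the commit below
    }
    Files.deleteIfExists(pendingDir);
  }

  // SnapshotDirector path: "commits" by moving the pending directory into place.
  static void persistSnapshot(final Path pendingDir, final Path committedDir) throws IOException {
    // If the purge above already removed some files, the committed snapshot is partial.
    Files.move(pendingDir, committedDir, StandardCopyOption.ATOMIC_MOVE);
  }
}
```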

@deepthidevaki added the kind/bug, scope/broker, Impact: Data, and severity/critical labels on Feb 4, 2021
@deepthidevaki
Contributor Author

Related issue #5611

@npepinpe
Member

npepinpe commented Feb 4, 2021

I guess we need a form of lock when a component is going to modify (i.e. delete, move, etc.) a pending snapshot?

e.g.

  • Component wants exclusive access to the snapshot to modify it
  • Tries to lock it so it cannot be modified by anyone else
  • If the lock is acquired, modify it (delete it, move it)
  • If the lock is not acquired, back off and retry later (do not block)

Is the last point feasible for all components? It probably is for the snapshot director (which will probably get uninstalled before, so all good), but would it be for the follower role? I see we delete pending snapshots when we receive a new one, and when stopping the follower role (so transitioning to inactive or leader?).

I guess the SnapshotStore would need to own the lock per snapshot? That way we can remove locks for pending snapshots when removing them, and we have a central lock that works for all threads.
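
A rough sketch of what such a store-owned, non-blocking lock per pending snapshot could look like; the class and method names (`PendingSnapshotLocks`, `tryModify`) are made up for illustration and are not existing Zeebe API:

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical store-owned registry of per-snapshot locks.
final class PendingSnapshotLocks {
  private final Map<Path, ReentrantLock> locks = new ConcurrentHashMap<>();

  // Runs the modification (delete, move, ...) only if nobody else holds the lock;
  // returns false so the caller can back off and retry later instead of blocking.
  boolean tryModify(final Path snapshotDir, final Consumer<Path> modification) {
    final ReentrantLock lock = locks.computeIfAbsent(snapshotDir, p -> new ReentrantLock());
    if (!lock.tryLock()) {
      return false;
    }
    try {
      modification.accept(snapshotDir);
      return true;
    } finally {
      lock.unlock();
    }
  }

  // Drop the lock entry once the pending snapshot itself is gone.
  void remove(final Path snapshotDir) {
    locks.remove(snapshotDir);
  }
}
```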

A follow-up question, though a little out of scope: we have a fallback in case moving is not atomic, right? This means we could potentially move the folder and only partially complete it?

I suppose we need a way to mark that the snapshot was correctly committed in general, no? So:

  1. Commit snapshot
  2. Start moving the snapshot
  3. On complete, mark it as complete with, uh, a mark file, or maybe write the checksum as a file? That has the nice side effect that we don't need to recompute it every time
  4. Delete the previous snapshot
  5. On recover, check for the mark file
  6. If present, delete any previous snapshots (if any)
  7. If not present, check if a previous snapshot is present
  8. If present, delete the incomplete snapshot
  9. If not present, "crash" because we now have only an incomplete one which cannot be used

That would cover cases where we cannot guarantee an atomic move, right?
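
A sketch of what steps 3 and 5-9 could look like, with an invented `CHECKSUM` marker file name and helpers (not the format Zeebe actually ended up with):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.CRC32;

final class SnapshotMarker {
  // Invented file name, just to illustrate the idea of a completion marker.
  private static final String CHECKSUM_FILE = "CHECKSUM";

  // Step 3: after moving the snapshot, write a checksum file that marks it as complete.
  static void markComplete(final Path snapshotDir) throws IOException {
    final CRC32 crc = new CRC32();
    final List<Path> files;
    try (Stream<Path> stream = Files.list(snapshotDir)) {
      files = stream.sorted().collect(Collectors.toList());
    }
    for (final Path file : files) {
      crc.update(Files.readAllBytes(file));
    }
    Files.writeString(
        snapshotDir.resolve(CHECKSUM_FILE), Long.toString(crc.getValue()), StandardCharsets.UTF_8);
  }

  // Steps 5-9: a snapshot directory without the marker is incomplete and must not be used.
  static boolean isComplete(final Path snapshotDir) {
    return Files.exists(snapshotDir.resolve(CHECKSUM_FILE));
  }
}
```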

@deepthidevaki
Contributor Author

Here is a summary of our discussion.

There are two issues here:

  1. Concurrent access causes corruption
  2. There is no way to detect corrupted snapshots

Potential solutions:
Ensure concurrent access is safe:
1. Locks/synchronous access to the snapshot store - an anti-pattern for our actor model.
2. Make FileBasedSnapshotStore an actor - how does it work with the Raft threading model?
3. Other fine-grained concurrency control within each snapshot object - difficult to test, easy to make mistakes.

Detecting partial snapshots

  1. Add a marker file that indicates the snapshot is valid.
    • In addition, maybe also add a checksum file

As a first step, we will evaluate moving FileBasedSnapshotStore to an actor.
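
The gist of that approach, sketched here with a plain single-threaded executor rather than Zeebe's actor scheduler (the class and method names are illustrative, not the real FileBasedSnapshotStore API): all mutations of pending snapshots, both purging by the Raft role and persisting by the SnapshotDirector, are funneled through one thread so they can no longer interleave.

```java
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: serialize all snapshot-store operations on a single thread.
final class ActorLikeSnapshotStore {
  private final ExecutorService storeThread =
      Executors.newSingleThreadExecutor(r -> new Thread(r, "snapshot-store"));

  // Called from the Raft follower role; runs on the store's thread.
  CompletableFuture<Void> purgePendingSnapshots() {
    return CompletableFuture.runAsync(this::deleteAllPending, storeThread);
  }

  // Called from the SnapshotDirector; runs on the same thread, so it can never
  // interleave with a purge that is already in progress.
  CompletableFuture<Void> persist(final Path pendingSnapshot) {
    return CompletableFuture.runAsync(() -> moveToCommitted(pendingSnapshot), storeThread);
  }

  private void deleteAllPending() {
    // delete all pending snapshot directories
  }

  private void moveToCommitted(final Path pendingSnapshot) {
    // move the pending snapshot into the committed directory
  }
}
```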

@ghost closed this as completed in 359aec0 on Feb 11, 2021
ghost pushed a commit that referenced this issue Feb 12, 2021
6331: [Backport stable/0.26] fix(snapshot): make FilebasedSnapshotStore an actor r=MiguelPires a=deepthidevaki

Backport of #6283

closes #6255 

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
ghost pushed a commit that referenced this issue Feb 12, 2021
6332: [Backport stable/0.25] fix(snapshot): make FilebasedSnapshotStore an actor r=MiguelPires a=deepthidevaki

## Description

Backport of #6283 

closes #6255 

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
github-merge-queue bot pushed a commit that referenced this issue Mar 14, 2024
… from 3.1.6 to 3.2.2 (#6255)

* chore(deps): bump org.springframework.boot:spring-boot-starter-parent

Bumps [org.springframework.boot:spring-boot-starter-parent](https://github.com/spring-projects/spring-boot) from 3.1.6 to 3.2.2.
- [Release notes](https://github.com/spring-projects/spring-boot/releases)
- [Commits](spring-projects/spring-boot@v3.1.6...v3.2.2)

---
updated-dependencies:
- dependency-name: org.springframework.boot:spring-boot-starter-parent
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix(identity): explicitly list request parameters

* fix(health/test): provide user details bean

* fix(test): use qualifier to identify the expected bean

* fix(auth/test): don't pass response (content-length) to http entity

* fix: adjust to changed nested jar support

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Roman <roman.smirnov@camunda.com>
This issue was closed.