Concurrent access to snapshotstore causes partial snapshots to get committed #6255

Closed
deepthidevaki opened this issue Feb 4, 2021 · 4 comments · Fixed by #6283
Labels
kind/bug Categorizes an issue or PR as a bug scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround

Comments

@deepthidevaki
Contributor

#3670 (comment)

Broker-1 initiates taking a snapshot. It takes the snapshot in the pending directory and waits for the follow-up event to be committed.

2021-01-14 02:52:52.877 [Broker-1-SnapshotDirector-3] [Broker-1-zb-fs-workers-1] INFO  io.zeebe.logstreams.snapshot - Finished taking snapshot, need to wait until last written event position 21488378670 is committed, current commit position is 21488376579. After that snapshot can be marked as valid.

Raft transitions to follower. On transitioning to follower, it deletes all pending snapshots, including the latest pending snapshot that the SnapshotDirector is waiting to commit. However, it fails to delete it completely, probably due to a concurrent operation by the SnapshotDirector, though this is not clear from the logs. So we have a snapshot in the pending directory which is partially deleted.

at io.atomix.raft.roles.PassiveRole.abortPendingSnapshots(PassiveRole.java:418) ~[atomix-cluster-0.25.2.jar:0.25.2]
at io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore.purgePendingSnapshots(FileBasedSnapshotStore.java:112) ~[zeebe-snapshots-0.25.2.jar:0.25.2]
java.nio.file.NoSuchFileException: /usr/local/zeebe/data/raft-partition/partitions/3/pending/10926499-195-1610591887661-21488376573-21488376579.tmp/229345.sst
2021-01-14 02:52:52.878 [] [raft-server-1-raft-partition-partition-3] ERROR io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore - Failed to delete not completed (orphaned) snapshot /usr/local/zeebe/data/raft-partition/partitions/3/pending/10926499-195-1610591887661-21488376573-21488376579.tmp
2021-01-14 02:53:00.272 [] [raft-server-2-raft-partition-partition-3] WARN  io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-3}{role=FOLLOWER} - Poll request to 0 failed: java.util.concurrent.TimeoutException: Request type raft-partition-partition-3-poll timed out in 5000 milliseconds

Leader services are still running because the role transitions of Raft and of the Zeebe services are asynchronous. (The following error is expected during a failover, and we handle it.)

java.lang.IllegalStateException: Unexpected position 21488378687 was encountered after position 21488378677 when appending positions <21488378687, 21488378688>.
2021-01-14 02:53:46.001 [] [raft-server-1-raft-partition-partition-3] ERROR io.zeebe.logstreams - Failed to append block with last event position 21488378688.

At some point the follow-up event is committed and the SnapshotDirector tries to commit the snapshot. But the snapshot in the pending directory has been partially deleted.

2021-01-14 03:02:10.879 [Broker-1-SnapshotDirector-3] [Broker-1-zb-fs-workers-1] INFO  io.zeebe.logstreams.snapshot - Current commit position 21488378670 is greater than 21488378670, snapshot 10926499-195-1610591887661-21488376573-21488376579 is valid and has been persisted.

As a result, the committed snapshot is broken.

Root cause:
Concurrent access to snapshots does not work correctly. While the pending snapshot is being deleted by the follower role in Raft, the SnapshotDirector is trying to commit the same snapshot, which is by then partially deleted.
The concurrent access happened because role transitions in Raft are asynchronous with role transitions in Zeebe. As a result, while Raft is already running follower operations, Zeebe may still be running leader services.
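
For illustration, here is a minimal sketch of how the two code paths can interleave, in plain Java with hypothetical `purgePendingSnapshot`/`persistSnapshot` helpers (these are not the actual Zeebe methods):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public final class SnapshotRaceSketch {

  // Raft follower path: purges a pending snapshot one file at a time (not atomic).
  static void purgePendingSnapshot(final Path pendingDir) throws IOException {
    final List<Path> files;
    try (Stream<Path> stream = Files.list(pendingDir)) {
      files = stream.collect(Collectors.toList());
    }
    for (final Path file : files) {
      Files.deleteIfExists(file); // can interleave with the commit below
    }
    Files.deleteIfExists(pendingDir);
  }

  // SnapshotDirector path: "commits" by moving the pending directory into place.
  static void persistSnapshot(final Path pendingDir, final Path committedDir) throws IOException {
    // If the purge above already removed some files, the committed snapshot is partial.
    Files.move(pendingDir, committedDir, StandardCopyOption.ATOMIC_MOVE);
  }
}
```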

@deepthidevaki added the kind/bug, scope/broker, Impact: Data, and severity/critical labels on Feb 4, 2021
@deepthidevaki
Contributor Author

Related issue #5611

@npepinpe
Member

npepinpe commented Feb 4, 2021

I guess we need a form of lock when a component is going to modify (i.e. delete, move, etc.) a pending snapshot?

e.g.

  • Component wants exclusive access to the snapshot to modify it
  • Tries to lock it so it cannot be modified by anyone else
  • If the lock is acquired, modify it (delete it, move it)
  • If the lock is not acquired, back off and retry later (do not block)

Is the last point feasible for all components? It probably is for the snapshot director (which will probably get uninstalled before, so all good), but would it be for the follower role? I see we delete pending snapshots when we receive a new one, and when stopping the follower role (so transitioning to inactive or leader?).

I guess the SnapshotStore would need to own the lock per snapshot? That way we can remove locks for pending snapshots when removing them, and we have a central lock that works for all threads.
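
A rough sketch of what such a store-owned, non-blocking lock per pending snapshot could look like; the class and method names (`PendingSnapshotLocks`, `tryModify`) are made up for illustration and are not existing Zeebe API:

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical store-owned registry of per-snapshot locks.
final class PendingSnapshotLocks {
  private final Map<Path, ReentrantLock> locks = new ConcurrentHashMap<>();

  // Runs the modification (delete, move, ...) only if nobody else holds the lock;
  // returns false so the caller can back off and retry later instead of blocking.
  boolean tryModify(final Path snapshotDir, final Consumer<Path> modification) {
    final ReentrantLock lock = locks.computeIfAbsent(snapshotDir, p -> new ReentrantLock());
    if (!lock.tryLock()) {
      return false;
    }
    try {
      modification.accept(snapshotDir);
      return true;
    } finally {
      lock.unlock();
    }
  }

  // Drop the lock entry once the pending snapshot itself is gone.
  void remove(final Path snapshotDir) {
    locks.remove(snapshotDir);
  }
}
```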

A follow-up question, though a little out of scope: we have a fallback in case moving is not atomic, right? This means we could potentially move the folder and only partially complete it?

I suppose we need a way to mark that the snapshot was correctly committed in general, no? So:

  1. Commit snapshot
  2. Start moving the snapshot
  3. On complete, mark it as complete with, uh, a mark file, or maybe write the checksum as a file? That has the nice side effect that we don't need to recompute it every time
  4. Delete the previous snapshot
  5. On recover, check for the mark file
  6. If present, delete any previous snapshots (if any)
  7. If not present, check if a previous snapshot is present
  8. If present, delete the incomplete snapshot
  9. If not present, "crash" because we now have only an incomplete one which cannot be used

That would cover cases where we cannot guarantee an atomic move, right?
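
A sketch of what steps 3 and 5-9 could look like, with an invented `CHECKSUM` marker file name and helpers (not the format Zeebe actually ended up with):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.CRC32;

final class SnapshotMarker {
  // Invented file name, just to illustrate the idea of a completion marker.
  private static final String CHECKSUM_FILE = "CHECKSUM";

  // Step 3: after moving the snapshot, write a checksum file that marks it as complete.
  static void markComplete(final Path snapshotDir) throws IOException {
    final CRC32 crc = new CRC32();
    final List<Path> files;
    try (Stream<Path> stream = Files.list(snapshotDir)) {
      files = stream.sorted().collect(Collectors.toList());
    }
    for (final Path file : files) {
      crc.update(Files.readAllBytes(file));
    }
    Files.writeString(
        snapshotDir.resolve(CHECKSUM_FILE), Long.toString(crc.getValue()), StandardCharsets.UTF_8);
  }

  // Steps 5-9: a snapshot directory without the marker is incomplete and must not be used.
  static boolean isComplete(final Path snapshotDir) {
    return Files.exists(snapshotDir.resolve(CHECKSUM_FILE));
  }
}
```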

@deepthidevaki
Contributor Author

Here is a summary of our discussion.

There are two issues here:

  1. Concurrent access causes corruption
  2. There is no way to detect corrupted snapshots

Potential solutions:
Ensure concurrent access is safe:
1. Locks/synchronous access to the snapshot store - an anti-pattern for our actor model.
2. Make FileBasedSnapshotStore an actor - how does it work with the Raft threading model?
3. Other fine-grained concurrency control within each snapshot object - difficult to test, easy to make mistakes.

Detecting partial snapshots

  1. Add a marker file that indicates the snapshot is valid.
    • In addition, maybe also add a checksum file

As a first step, we will evaluate moving FileBasedSnapshotStore to an actor.
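
The gist of that approach, sketched here with a plain single-threaded executor rather than Zeebe's actor scheduler (the class and method names are illustrative, not the real FileBasedSnapshotStore API): all mutations of pending snapshots, both purging by the Raft role and persisting by the SnapshotDirector, are funneled through one thread so they can no longer interleave.

```java
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: serialize all snapshot-store operations on a single thread.
final class ActorLikeSnapshotStore {
  private final ExecutorService storeThread =
      Executors.newSingleThreadExecutor(r -> new Thread(r, "snapshot-store"));

  // Called from the Raft follower role; runs on the store's thread.
  CompletableFuture<Void> purgePendingSnapshots() {
    return CompletableFuture.runAsync(this::deleteAllPending, storeThread);
  }

  // Called from the SnapshotDirector; runs on the same thread, so it can never
  // interleave with a purge that is already in progress.
  CompletableFuture<Void> persist(final Path pendingSnapshot) {
    return CompletableFuture.runAsync(() -> moveToCommitted(pendingSnapshot), storeThread);
  }

  private void deleteAllPending() {
    // delete all pending snapshot directories
  }

  private void moveToCommitted(final Path pendingSnapshot) {
    // move the pending snapshot into the committed directory
  }
}
```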

@ghost closed this as completed in 359aec0 on Feb 11, 2021
ghost pushed a commit that referenced this issue Feb 12, 2021
6331: [Backport stable/0.26] fix(snapshot): make FilebasedSnapshotStore an actor r=MiguelPires a=deepthidevaki

Backport of #6283

closes #6255 

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
ghost pushed a commit that referenced this issue Feb 12, 2021
6332: [Backport stable/0.25] fix(snapshot): make FilebasedSnapshotStore an actor r=MiguelPires a=deepthidevaki

## Description

Backport of #6283 

closes #6255 

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
github-merge-queue bot pushed a commit that referenced this issue Mar 14, 2024
… from 3.1.6 to 3.2.2 (#6255)

* chore(deps): bump org.springframework.boot:spring-boot-starter-parent

Bumps [org.springframework.boot:spring-boot-starter-parent](https://github.com/spring-projects/spring-boot) from 3.1.6 to 3.2.2.
- [Release notes](https://github.com/spring-projects/spring-boot/releases)
- [Commits](spring-projects/spring-boot@v3.1.6...v3.2.2)

---
updated-dependencies:
- dependency-name: org.springframework.boot:spring-boot-starter-parent
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix(identity): explicitly list request parameters

* fix(health/test): provide user details bean

* fix(test): use qualifier to identify the expected bean

* fix(auth/test): don't pass response (content-length) to http entity

* fix: adjust to changed nested jar support

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Roman <roman.smirnov@camunda.com>
This issue was closed.