Improve guarantees of experimental flag to disable flush #11423
Labels
kind/toil
Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.
version:8.1.7
Marks an issue as being completely or in parts released in 8.1.7
version:8.2.0-alpha4
Marks an issue as being completely or in parts released in 8.2.0-alpha4
version:8.2.0
Marks an issue as being completely or in parts released in 8.2.0
Comments
Additional thoughts from @npepinpe:
Similarly, we can look into restore consistency levels.
As I said, maybe it's a bad idea at the moment to provide different recovery guarantees 😄
ghost pushed a commit that referenced this issue on Jan 25, 2023:
11466: refactor(broker): refactor snapshot director for readability r=deepthidevaki a=deepthidevaki

## Description

The chain of actions required to take a snapshot is difficult to understand because of the inability to compose actor futures. In this PR, we refactored it with a rudimentary way of composing actor futures. I hope it is more readable this way. We have applied a similar technique in `BackupService`. There is no behavior change.

With this change, it would be easy to allow multiple ongoing snapshots. But to keep the behavior as it is, we still do not allow multiple ongoing snapshots.

For #11423, we have to flush the journal before the snapshot is committed. With this PR, it is easy to add that step before committing the snapshot.

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
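The "chain of actions" the PR describes can be sketched with standard Java futures. This is illustrative only: Zeebe's snapshot director composes its own `ActorFuture` type, not `CompletableFuture`, and the step names and return values below are hypothetical stand-ins.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of composing the snapshot steps as a future chain.
// Zeebe uses ActorFuture; CompletableFuture stands in here for readability.
class SnapshotSteps {
    // Hypothetical step: take a transient snapshot, yielding its position.
    static CompletableFuture<Long> takeTransientSnapshot() {
        return CompletableFuture.completedFuture(42L);
    }

    // Hypothetical step: flush the journal up to the snapshot position.
    static CompletableFuture<Void> flushJournal(long position) {
        return CompletableFuture.completedFuture(null);
    }

    // Hypothetical step: commit the snapshot once its events are durable.
    static CompletableFuture<String> commitSnapshot(long position) {
        return CompletableFuture.completedFuture("snapshot@" + position);
    }

    // Each step runs only after the previous one completes, mirroring the
    // "chain of actions" the PR makes composable.
    static CompletableFuture<String> takeSnapshot() {
        return takeTransientSnapshot()
            .thenCompose(pos -> flushJournal(pos).thenApply(v -> pos))
            .thenCompose(SnapshotSteps::commitSnapshot);
    }
}
```

Composing the steps this way is also what makes it trivial to insert a new step (such as the journal flush for #11423) before the commit.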
ghost pushed a commit that referenced this issue on Jan 25, 2023.
This was referenced Jan 26, 2023
ghost pushed a commit that referenced this issue on Jan 30, 2023:
11499: [Backport stable/8.1] Flush log before committing a snapshot if raft flush is disabled r=deepthidevaki a=backport-action

# Description

Backport of #11487 to `stable/8.1`. Relates to #11423.

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@users.noreply.github.com>
This issue was closed.
Users can set an experimental flag to disable raft flush, trading safety guarantees for better performance. When flush is disabled, there is no guarantee that a committed record is persisted after a node crash. Sometimes users want to choose performance over consistency and are ok with partial data loss. However, with flush disabled there are currently scenarios where the system ends up in a non-recoverable state, even when users accept partial data loss since the last flush.

A partition's snapshot is recoverable only if the follow-up events up to a particular position are available in the logs. If we disable flush, we cannot guarantee that those events are persisted to disk at the time the snapshot is taken. In that case, Zeebe cannot recover from the snapshot and cannot process anything on that partition. It might require manual action to clean up the partition's data, and the partition can suffer a full data loss.

To allow users to use this experimental flag while accepting a chance of partial data loss across restarts, we should ensure that the journal is flushed at least when taking a snapshot. Then, even if data written after the last flush is lost, the partition will be able to restart from the last snapshot.
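The proposed invariant can be sketched as: when periodic raft flush is disabled, flush the journal before the snapshot is committed, so every event the snapshot depends on is durable. This is a minimal, self-contained sketch; the `Journal` and `SnapshotDirector` classes below are hypothetical stand-ins, not Zeebe's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a journal with a buffered (unflushed)
// region and a persisted (on-disk) region.
class Journal {
    private final List<String> buffered = new ArrayList<>();
    private final List<String> persisted = new ArrayList<>();

    void append(String record) { buffered.add(record); }

    // Persist all buffered records; models an fsync of the journal.
    void flush() {
        persisted.addAll(buffered);
        buffered.clear();
    }

    int persistedCount() { return persisted.size(); }
}

// Hypothetical snapshot director enforcing the invariant from the issue:
// if periodic raft flush is disabled, flush before committing a snapshot.
class SnapshotDirector {
    private final Journal journal;
    private final boolean raftFlushDisabled;
    int committedSnapshotAt = -1; // position covered by the last snapshot

    SnapshotDirector(Journal journal, boolean raftFlushDisabled) {
        this.journal = journal;
        this.raftFlushDisabled = raftFlushDisabled;
    }

    void commitSnapshot() {
        if (raftFlushDisabled) {
            // Without this flush, events needed to recover from the
            // snapshot could be lost on a crash, leaving the partition
            // in a non-recoverable state.
            journal.flush();
        }
        committedSnapshotAt = journal.persistedCount();
    }
}
```

Under this scheme, data appended after the last flush can still be lost on a crash, but the committed snapshot is always recoverable, which is exactly the weakened-but-safe guarantee the issue asks for.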