stability: improve recovery after a crash of the majority of servers #1
Comments
That may be an issue with filesystem setup. For instance, when SSD/NVMe is in place, Linux allows you to configure buffer caches for disk I/O. It means that even if you write some bytes to FileStream/SafeFileHandle, they may not yet be written to the disk. Persistent WAL has the WriteThrough flag to skip any intermediate OS-level buffers when writing to the disk. However, it's always up to the OS I/O layer. |
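As a rough sketch of what write-through means at the .NET level (not the dotNext API itself; the file name and buffer size are made up):

```csharp
using System.IO;

// Ask the OS to bypass its write-back cache for this file. The device's own
// cache is still outside our control, so durability ultimately depends on the
// whole storage stack.
const string path = "wal-partition.bin"; // hypothetical file name
using var stream = new FileStream(
    path,
    FileMode.OpenOrCreate,
    FileAccess.ReadWrite,
    FileShare.Read,
    bufferSize: 4096,
    FileOptions.WriteThrough | FileOptions.Asynchronous);

await stream.WriteAsync(new byte[] { 1, 2, 3 });
await stream.FlushAsync(); // flush FileStream's own managed buffer as well
```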
@freddyrios , look at dotnet#94 (comment). Looks like the more accurate calculation of heartbeat timeout on the leader node (now it includes subtraction of the replication duration) leads to a more stable cluster. Anyway, this is a step forward to our goal. |
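If I understand the change correctly, the idea is roughly the following (sketch only, with made-up names and values; `ReplicateToFollowersAsync` is a stand-in, not the library's API):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Keep the heartbeat cadence stable by subtracting the time already spent on
// replication from the configured heartbeat period.
var heartbeatPeriod = TimeSpan.FromMilliseconds(150); // assumed value
var clock = new Stopwatch();

// Stand-in for a real replication round; here it just sleeps for a while.
static Task ReplicateToFollowersAsync() => Task.Delay(20);

for (var i = 0; i < 10; i++)
{
    clock.Restart();
    await ReplicateToFollowersAsync();
    var remaining = heartbeatPeriod - clock.Elapsed;
    if (remaining > TimeSpan.Zero)
        await Task.Delay(remaining); // otherwise start the next round immediately
}
```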
Here is a new branch where I'm working on a new protocol on top of TCP: https://github.com/dotnet/dotNext/tree/feature/new-raft-tcp |
The deployment had default Raspbian, so it makes sense there could be something like that at play for that log issue (the one about an index being out of range or not found in the log).
|
I am on a phone/train, so had trouble fully following.
However, what I understood is that the response time trying to send
heartbeats/entries to one follower prevented the leader from sending
heartbeats/entries at the expected rate to a different follower.
I could see addressing that helping the stability for some of the scenarios
we have hit.
One thing I noticed when playing with the Raft simulator is that heartbeats do not always align across different nodes. So depending on what was going on, it looked a lot like the leader could keep the heartbeat frequency independently for each follower. I suspect that decoupling is an alternate way to deal with it. However, it does make sense that it would keep trying at the right frequency, even if one of the appends happened to be a bit heavy due to the entries being sent.
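A minimal sketch of that decoupling idea (illustrative names only, not the dotNext implementation): each follower gets its own loop and timer, so a heavy append to one follower cannot delay heartbeats to another.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var followers = new[] { "node-b", "node-c" }; // assumed endpoints
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));

async Task HeartbeatLoopAsync(string follower, CancellationToken token)
{
    using var timer = new PeriodicTimer(TimeSpan.FromMilliseconds(150));
    while (await timer.WaitForNextTickAsync(token))
    {
        // Stand-in for AppendEntries; a slow call here only delays this follower.
        await Task.Delay(Random.Shared.Next(5, 50), token);
        Console.WriteLine($"heartbeat sent to {follower}");
    }
}

var loops = followers.Select(f => HeartbeatLoopAsync(f, cts.Token)).ToArray();
try { await Task.WhenAll(loops); } catch (OperationCanceledException) { }
```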
For healthy followers, the time to append entries is supposed to be an order of magnitude less than the election timeout. If appends are slow enough to matter for healthy nodes, then I guess it could make some of the Raft paper's assumptions problematic.
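For reference, the timing requirement from the Raft paper is usually written as:

```
broadcastTime ≪ electionTimeout ≪ MTBF
```

where broadcastTime is the average time it takes to send RPCs to the other servers and receive responses, and MTBF is the average time between failures of a single server.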
Another independent but related stability area I have thought about is: what happens when appending entries to a follower takes long (due to the size and number of entries, for example)? Can this result in the follower becoming a candidate even though the leader is actively communicating with it? (I'm not sure what the right behavior for such a case is, as the problem appending those entries could be on either side.)
|
Yes, it can. We can mitigate this by stopping the timer countdown on the follower node during execution of |
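Something along these lines, I assume (a sketch only, not the actual follower state machine): the follower does not let its election timeout fire while it is still busy handling a request from the leader, and resets the countdown once handling completes.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

sealed class ElectionTimeoutTracker
{
    private readonly TimeSpan timeout;
    private readonly Stopwatch sinceLeaderContact = Stopwatch.StartNew();
    private int busy; // > 0 while a request from the leader is being processed

    public ElectionTimeoutTracker(TimeSpan timeout) => this.timeout = timeout;

    // Call at the start of processing an AppendEntries request; dispose when done.
    public IDisposable Suspend()
    {
        Interlocked.Increment(ref busy);
        return new Resume(this);
    }

    public bool ShouldStartElection()
        => Volatile.Read(ref busy) == 0 && sinceLeaderContact.Elapsed > timeout;

    private sealed class Resume : IDisposable
    {
        private readonly ElectionTimeoutTracker owner;
        internal Resume(ElectionTimeoutTracker owner) => this.owner = owner;

        public void Dispose()
        {
            owner.sinceLeaderContact.Restart(); // leader contact just completed
            Interlocked.Decrement(ref owner.busy);
        }
    }
}
```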
@freddyrios , done: dotnet@a6ab902 |
@sakno I have created a new branch with the same example but on top of the latest develop changes: https://github.com/copenhagenatomics/dotNext/tree/feature/bigstruct-example-rebased There are 2 more branches with some exploration I did on stability (already rebased on the latest develop changes), especially given the fix you shared related to heartbeats: https://github.com/copenhagenatomics/dotNext/tree/feature/bigstruct-slowspeed It seems:
What I find particularly odd is that it can push entries at such a high rate in the first configuration, yet run into timeout trouble recovering a node in the second configuration. It makes me wonder whether, beyond a timeout, something is getting stuck (but I have no evidence of this). I'm also not sure what could explain 2 out of 3 nodes being much more reliable in configuration 3 vs. 1. Even though configuration 1 is more of a stress test, I can't see why that would be special when taking out 2 nodes vs. 1 node in comparison to configuration 3. |
@freddyrios , I'm still working on the new TCP transport. I expect to finish it over the holidays. After that, I'll start testing and stabilizing the release using the use cases you mentioned previously. All patches and the new transport will be released as |
Alright, the TCP transport is re-implemented now. Of course, I'll do the necessary tests using your examples. Now I think it's reasonable to implement a recovery procedure for the persistent Write-Ahead Log. Here is my plan:
P.S.: Regarding our protection from bit flips: I think SHA is too heavyweight for our purposes. It's a crypto-strong hash that is typically not chosen for data consistency checks. FNV-1a or CRC is a much better fit and more performant. |
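For example, a 64-bit FNV-1a over the entry payload is only a few lines (a sketch; the hash actually chosen for the library may differ):

```csharp
using System;

static ulong Fnv1a64(ReadOnlySpan<byte> data)
{
    const ulong offsetBasis = 14695981039346656037UL;
    const ulong prime = 1099511628211UL;

    var hash = offsetBasis;
    foreach (var b in data)
    {
        hash ^= b;
        hash *= prime; // overflow wraps, which is exactly what FNV expects
    }
    return hash;
}

// Usage: persist the checksum next to the log entry and compare it on read.
var payload = new byte[] { 0xDE, 0xAD, 0xBE, 0xEF };
var stored = Fnv1a64(payload);
Console.WriteLine(stored == Fnv1a64(payload) ? "intact" : "corrupted");
```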
Verification of |
My investigation shows that storing checksum for
Anyway, a fault-tolerant WAL is another field of theoretical and practical research. I didn't have a goal to write a WAL for such cases. The current implementation just works, at least under normal circumstances. However, we can have a workaround to simulate a simple checkpoint. Before every append operation, we can save a WriteInProgress indicator to an external file. The indicator can have a size of 1 byte. When the operation is finished, we just reverse this indicator. In case of an accidental shutdown, the file will keep the WriteInProgress indicator. That's enough to decide that the entire WAL is probably broken and to skip all the data in it to get a fresh setup. All these things are out of scope at the moment, so I'll focus on stability of the cluster itself. |
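A sketch of that indicator file (the file name and byte values are assumptions, not the actual dotNext format):

```csharp
using System;
using System.IO;

const string markerPath = "wal.write-in-progress"; // hypothetical file name

void BeginAppend() => File.WriteAllBytes(markerPath, new byte[] { 1 });
void EndAppend() => File.WriteAllBytes(markerPath, new byte[] { 0 });

bool PreviousAppendWasInterrupted()
{
    if (!File.Exists(markerPath))
        return false;
    var marker = File.ReadAllBytes(markerPath);
    return marker.Length > 0 && marker[0] == 1;
}

// On startup: if the marker still says "in progress", the last append never
// completed, so the WAL is probably broken and should be discarded.
if (PreviousAppendWasInterrupted())
    Console.WriteLine("Incomplete append detected; starting from a fresh WAL.");

BeginAppend();
// ... append the log entry here ...
EndAppend();
```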
Found the root cause of this issue, even in the case of a graceful shutdown of the node. The problem is in the snapshot when it is installed at an index that is greater than the first index in the partition. In that case, the partition becomes completely invalid because the binary offset is calculated incorrectly within the partition file.
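A simplified illustration of why the offset breaks (the constants and formula here are assumptions for the sketch, not the actual dotNext partition format):

```csharp
using System;

// Entries inside a partition file are located by their distance from the
// partition's first index. If a snapshot overwrites entries in the middle of
// the partition, the effective first index changes but the formula does not,
// so the computed offset points at the wrong bytes.
const long firstIndexInPartition = 100; // assumed
const int entrySlotSize = 512;          // assumed fixed slot size

static long OffsetOf(long index, long firstIndex, int slotSize)
    => (index - firstIndex) * slotSize;

Console.WriteLine(OffsetOf(105, firstIndexInPartition, entrySlotSize)); // 2560
```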
|
Nice catch. One issue I hit on our side in the past with ARM was trying to use MemoryMarshal.Cast: https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.memorymarshal.cast?view=net-6.0. |
Fortunately, there is only a small amount of such code. It is used on the hot path of program execution, where serialization/deserialization of log entry metadata is required. |
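A hedged illustration of the concern (the metadata layout below is hypothetical): reinterpreting a byte span as wider integers can produce unaligned element accesses, which is what tends to bite on ARM, whereas field-by-field reads via BinaryPrimitives are alignment-safe.

```csharp
using System;
using System.Buffers.Binary;

Span<byte> metadata = stackalloc byte[17];

// Potentially problematic on ARM: the slice below is not 8-byte aligned, and
// MemoryMarshal.Cast<byte, long>(metadata.Slice(1)) would hand out a span whose
// element accesses are unaligned.

// Alignment-safe alternative: read each field explicitly (hypothetical layout).
long term = BinaryPrimitives.ReadInt64LittleEndian(metadata.Slice(1));
long length = BinaryPrimitives.ReadInt64LittleEndian(metadata.Slice(9));
Console.WriteLine($"{term} {length}");
```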
Test branches have been rebased to the latest changes (if updating local copies, you might want to delete them locally and pull them fresh):
The first example uses only default settings, except for BufferSize, which is now 4096*2, a bit larger than one of the big entries (8000). Some short testing showed better stability, including in the scenario where 2 nodes crash. The second example adds a 1-second delay after writing 16 entries. Note this is equivalent to the old slow + frequent snapshots setup, as the 50 records per partition was not changed (based on input that this is bad on its own). Short testing ran into 2 of the below crashes during re-elections. Restarting the crashed node recovered successfully.
|
I caught another exception:
I'll fix that shortly. |
The previous issue has been fixed. |
Found a new one:
I'm working on it. |
The root cause was an incorrect skip of the last log entry when it is not consumed by the follower. Fixed and pushed to
Found the root cause of the failed assertion. Here are the steps to reproduce:
Raft allows rewriting of uncommitted log entries only. This is by design and described in the Raft paper. In the described scenario, the WAL uses its internal cache when a read operation is called. However, the actual content is stored on the disk with a different length. How to fix: invalidate the cache at the specific index when a rewrite happens. |
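The fix as I understand it, sketched with an illustrative cache (not the actual dotNext WAL cache): every cached entry at or after the rewritten index must be dropped, otherwise reads keep returning the stale in-memory copy whose on-disk length has changed.

```csharp
using System;
using System.Collections.Concurrent;

var cache = new ConcurrentDictionary<long, byte[]>();

// Read path: serve from the cache, falling back to disk (stubbed here).
byte[] ReadFromDisk(long index) => new byte[] { (byte)index };
byte[] Read(long index) => cache.GetOrAdd(index, ReadFromDisk);

// Rewrite path: uncommitted entries starting at 'startIndex' are overwritten,
// so every cached entry at or after that index is invalidated.
void InvalidateFrom(long startIndex)
{
    foreach (var index in cache.Keys)
    {
        if (index >= startIndex)
            cache.TryRemove(index, out _);
    }
}

Read(5);
Read(6);
InvalidateFrom(6); // index 6 will be re-read from disk next time
Console.WriteLine(cache.ContainsKey(6)); // False
```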
All the previously shared scenarios are working now. Additional testing shows these new scenarios:
|
2nd fixed. |
3rd issue. I've investigated WAL dumps from ARM devices. The filesystem (I think it was ext4) was unable to restore some of the partition files. That's why the index of the last log entry stored in the
In the worst case, the app can implement incremental backups. But I think option 2 is enough for our purposes. |
@freddyrios , you mentioned the performance of read/write operations in the WAL. Unfortunately, Linux historically did not have a good syscall for async disk I/O. There were many attempts to provide one, without great success. In recent kernel versions, the io_uring interface has been added; however, it is not supported by .NET: dotnet/runtime#51985 |
@sakno closing this issue as done, as all the raised scenarios have been addressed. The only thing remaining is around node restarts usually triggering an election, but this is not a priority at the moment. |
This relates to the scenario mentioned at dotnet#89 (comment).
Scenario: running the modified example on 3 nodes, killing 2 nodes (Ctrl-C) and then restarting one of them.
Expected: consistently resumes normal operation when at least 2 out of 3 nodes are running again
Actual: the cluster often does not recover consistently
Notes: