
stability: improve recovery after a crash of the majority of servers #1

Closed · freddyrios opened this issue Jan 10, 2022 · 27 comments
Labels: enhancement (New feature or request)

@freddyrios commented Jan 10, 2022

This relates to the scenario mentioned at dotnet#89 (comment).

Scenario: running the modified example on 3 nodes, kill 2 nodes (Ctrl+C) and then restart one of them.
Expected: the cluster consistently resumes normal operation once at least 2 out of 3 nodes are running again.
Actual: the cluster often does not recover consistently.

Notes:

  • it is usually possible to recover the cluster by restarting the different nodes multiple times
  • we also saw a failure where a cluster that had been running without errors over the weekend failed a while after we connected to the 3 nodes. In this case, 2 of the nodes logged request timeouts, while the third logged an error from the data modifier and produced no further output until it was restarted (even after restarting the other nodes first). Given how busy the run was, a speculative idea: perhaps the extra resource usage, or a timing difference introduced by writing to the console, made it easier to hit some limit and trigger a race condition
  • we unintentionally ran a different test where we pulled the power of one of the nodes. When recovering it, its log was broken, which looked clearly different from the failures above. Unfortunately I did not keep the message, but it was something about an index being out of range or not found in the log.
@sakno (Collaborator) commented Jan 11, 2022

but it was something about an index being out of range or not found in the log.

That may be an issue with the filesystem setup. For instance, when an SSD/NVMe is in place, Linux allows you to configure buffer caches for disk I/O. It means that even if you write some bytes to FileStream/SafeFileHandle, they may not yet be written to the disk. The persistent WAL has a WriteThrough flag to skip any intermediate OS-level buffers when writing to the disk. However, it's always up to the OS I/O layer.
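For illustration, here is a minimal sketch (not the dotNext WAL code) of the plain .NET mechanism such a flag maps to: FileOptions.WriteThrough asks the OS to bypass its write-back cache, so a completed write has reached the device (or its controller) rather than just the page cache. The file name is hypothetical.

```csharp
using System.IO;

// Minimal sketch, assuming plain FileStream usage rather than the dotNext WAL internals.
var options = new FileStreamOptions
{
    Mode = FileMode.Append,
    Access = FileAccess.Write,
    Options = FileOptions.WriteThrough | FileOptions.Asynchronous, // bypass the OS write-back cache
};

await using var stream = new FileStream("wal.partition", options); // hypothetical file name
await stream.WriteAsync(new byte[] { 1, 2, 3 });
await stream.FlushAsync(); // flushes the managed buffer; the OS cache was already bypassed
```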

@sakno (Collaborator) commented Jan 11, 2022

@freddyrios , look at dotnet#94 (comment). It looks like the more accurate calculation of the heartbeat timeout on the leader node (it now includes subtraction of the replication duration) leads to a more stable cluster. Anyway, this is a step forward toward our goal.

@sakno (Collaborator) commented Jan 11, 2022

Here is a new branch where I'm working on a new protocol on top of TCP: https://github.com/dotnet/dotNext/tree/feature/new-raft-tcp

@freddyrios (Author) commented Jan 12, 2022 via email

@freddyrios (Author) commented Jan 12, 2022 via email

@sakno (Collaborator) commented Jan 12, 2022

what happens when appending entries to a follower takes long (due to the size and amount of entries, for example)? Can this result in the follower becoming a candidate even though the leader is actively communicating with it?

Yes, it can. We can mitigate this by stopping the election timer on the follower node during execution of AppendEntriesAsync (see the sketch below).
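A hypothetical sketch of that mitigation (illustrative only, not the actual dotNext implementation): the follower suspends its election countdown while it processes an AppendEntries request from the current leader, so a long append cannot push it into a spurious election.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch of a suspendable election timer; names and structure are hypothetical.
sealed class ElectionTimer
{
    private readonly TimeSpan timeout;
    private readonly Action startElection;
    private CancellationTokenSource cts = new();

    public ElectionTimer(TimeSpan timeout, Action startElection)
        => (this.timeout, this.startElection) = (timeout, startElection);

    public void Suspend() => cts.Cancel(); // stop counting while AppendEntriesAsync runs

    public void Resume() // restart the countdown from zero
    {
        cts = new CancellationTokenSource();
        _ = RunAsync(cts.Token);
    }

    private async Task RunAsync(CancellationToken token)
    {
        try
        {
            await Task.Delay(timeout, token);
            startElection(); // no heartbeat arrived in time: become a candidate
        }
        catch (OperationCanceledException)
        {
            // suspended or reset before the timeout elapsed
        }
    }
}

// Usage inside the follower's AppendEntries handler (sketch):
// timer.Suspend();
// try { await ApplyEntriesAsync(entries); }
// finally { timer.Resume(); }
```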

@sakno (Collaborator) commented Jan 12, 2022

@freddyrios , done: dotnet@a6ab902

sakno added the enhancement label on Jan 12, 2022
@freddyrios (Author) commented Jan 14, 2022

@sakno I have created a new branch with the same example but on top of the latest develop changes: https://github.com/copenhagenatomics/dotNext/tree/feature/bigstruct-example-rebased

There are 2 more branches with some exploration I did on stability (already rebased on the latest develop changes), especially given the fix you shared related to heartbeats.

https://github.com/copenhagenatomics/dotNext/tree/feature/bigstruct-slowspeed
https://github.com/copenhagenatomics/dotNext/tree/feature/bigstruct-slowspeed-frequentsnapshots

It seems:

  • bigstruct-example-rebased recovers fine from 1 node out of 3 going down, even before the latest changes. Maybe because the very frequent writes act as very frequent heartbeats. Recovery from 2 out of 3 still fails often.
  • bigstruct-slowspeed often fails to recover from 1 out of 3, but the latest fixes improve the situation for the other nodes, as they are no longer disrupted by the node that keeps failing. It usually gives "warn: DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer[74022] Request has timed out".
  • bigstruct-slowspeed-frequentsnapshots tends to recover fine from both 1 out of 3 and 2 out of 3 with the latest changes (I saw 1 failure, but it seems hard to reproduce, so maybe some timing that is more unlikely in this configuration/load).

What I find particularly odd is that it can push entries at such a high rate in the first configuration, yet run into timeout trouble recovering a node in the second configuration. It makes me wonder whether, rather than just a timeout, something is getting stuck (but I have no evidence of this).

I'm not sure what could explain 2 out of 3 being much more reliable in configuration 3 vs. configuration 1. Even though configuration 1 is more of a stress test, I can't see why that would matter when taking out 2 nodes vs. 1 node compared to configuration 3.

@sakno (Collaborator) commented Jan 14, 2022

@freddyrios , I'm still working on the new TCP transport. I expect to finish it over the holidays. After that, I'll start testing and stabilizing the release using the use cases you mentioned previously. All patches and the new transport will be released as version 4.2.0.

@sakno (Collaborator) commented Jan 15, 2022

Alright, the TCP transport is re-implemented now. Of course, I'll do the necessary tests using your examples. Now I think it's reasonable to implement a recovery procedure for the persistent Write-Ahead Log. Here is my plan:

  1. Implement a checksum for the node's state file: node.state
  2. Basic functionality for the recovery procedure:
    2.1. If the node's state file is corrupted, the entire WAL can be considered invalid. We can clean up the WAL.
    2.2. Add a ClearAsync method to PersistentState
    2.3. Add the ability to override the InitializeAsync method of the persistent WAL. For each log entry, we can embed a checksum or even duplicate the data. Therefore, at startup we will be able to validate all committed log entries. To be more precise, we can do that with existing functionality through ReplayAsync, which allows warming up the underlying state machine. During initialization, we can easily verify the checksum and throw an exception. In the very worst case, we would still be able to run the node, even with an empty WAL.

P.S.: Regarding our protection from bit flips: I think SHA is too heavyweight for our purposes. It's a crypto-strong hash, typically not used for data consistency checks. FNV-1a or CRC is much better and more performant.
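As an illustration of item 1, here is a minimal sketch of a CRC-32 guard for node.state, assuming the System.IO.Hashing package; the file layout (payload followed by a 4-byte checksum) and the method names are hypothetical, not the dotNext format.

```csharp
using System;
using System.IO;
using System.IO.Hashing; // NuGet: System.IO.Hashing (CRC-32 and other non-cryptographic hashes)

static void WriteStateWithChecksum(string path, ReadOnlySpan<byte> payload)
{
    var crc = new Crc32();
    crc.Append(payload);

    using var stream = new FileStream(path, FileMode.Create, FileAccess.Write);
    stream.Write(payload);
    stream.Write(crc.GetCurrentHash()); // 4 trailing bytes
    stream.Flush(flushToDisk: true);    // push past the OS cache as well
}

static bool VerifyStateChecksum(string path)
{
    byte[] content = File.ReadAllBytes(path);
    if (content.Length < sizeof(uint))
        return false; // too short to even contain a checksum: treat the WAL as invalid

    var payload = content.AsSpan(0, content.Length - sizeof(uint));
    var stored = content.AsSpan(content.Length - sizeof(uint));

    var crc = new Crc32();
    crc.Append(payload);
    return stored.SequenceEqual(crc.GetCurrentHash());
}
```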

@sakno (Collaborator) commented Jan 15, 2022

Verification of node.state could protect against accidental power-off and incorrect indexes within the log.

@sakno (Collaborator) commented Jan 15, 2022

My investigation shows that storing a checksum for node.state makes little sense. In general, we can't handle an accidental power-off gracefully unless:

  1. Disk writes are transactional and the underlying file system supports copy-on-write semantics
  2. Or the WAL supports reliable writes via the insertion of checkpoints and a checksum for each checkpoint

Anyway, a fault-tolerant WAL is another field of theoretical and practical research. I didn't have a goal to write a WAL for such cases. The current implementation just works, at least under normal circumstances.

However, we can have a workaround to simulate a simple checkpoint (see the sketch below). Before every append operation, we can save a WriteInProgress indicator to an external file. The indicator can have a size of 1 byte. When the operation is finished, we just reverse this indicator. In case of an accidental shutdown, the file will still hold the WriteInProgress indicator. That's enough to decide that the entire WAL is probably broken and to skip all the data in it to get a fresh setup. All of this is out of scope at the moment, so I'll focus on the stability of the cluster itself.
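A hypothetical sketch of that WriteInProgress marker (illustrative names, not the dotNext implementation): a 1-byte flag file toggled around every append; finding it set at startup means the process died mid-write and the WAL should be treated as suspect.

```csharp
using System.IO;

sealed class WriteMarker
{
    private readonly string path;

    public WriteMarker(string walDirectory)
        => path = Path.Combine(walDirectory, "write.flag"); // hypothetical file name

    // Call right before appending to the WAL.
    public void BeginWrite() => WriteFlag(1);

    // Call after the append has completed successfully.
    public void EndWrite() => WriteFlag(0);

    // At startup: a non-zero flag means a previous append never finished,
    // so the WAL is probably broken and can be cleared for a fresh setup.
    public bool IsDirty()
    {
        if (!File.Exists(path))
            return false;

        byte[] content = File.ReadAllBytes(path);
        return content.Length > 0 && content[0] != 0;
    }

    private void WriteFlag(byte value)
    {
        using var stream = new FileStream(path, FileMode.Create, FileAccess.Write);
        stream.WriteByte(value);
        stream.Flush(flushToDisk: true); // the flag itself must reach the disk
    }
}
```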

@sakno (Collaborator) commented Jan 16, 2022

When recovering it, its log was broken, which looked clearly different from the failures above.

Found the root cause of this issue; it can occur even in the case of a graceful shutdown of the node. The problem is in snapshot installation, when the snapshot is installed for an index that is greater than the first index in the partition. In that case, the partition becomes completely invalid because the binary offset within the partition file is calculated incorrectly.

@sakno (Collaborator) commented Jan 16, 2022

dotnet@ebd3b96:

  • Fixed unaligned pointer access on ARM devices (dereferencing an unaligned pointer there can cause a SEGFAULT)
  • Fixed the bug I mentioned previously: incorrect calculation of the write position within the partition file when the caller tries to insert a log entry in the middle of the partition after snapshot installation.

@freddyrios (Author) commented:

  • Fixed unaligned pointer access on ARM devices (dereferencing an unaligned pointer there can cause a SEGFAULT)

Nice catch. One issue I hit on our side in the past with ARM was trying to use MemoryMarshal.Cast: https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.memorymarshal.cast?view=net-6.0.

@sakno (Collaborator) commented Jan 17, 2022

Fortunately, there is only a small amount of such code. It is used on the hot path of program execution, where serialization/deserialization of log entry metadata is required.

@freddyrios (Author) commented:

Test branches have been rebased onto the latest changes (if updating local copies, you might want to delete them locally and pull them fresh):

The first example uses only default settings, except for BufferSize, which is now 4096*2, a bit larger than one of the big entries (8000). Some short testing showed better stability, including in the scenario where 2 nodes crash.

The second example adds a 1-second delay after writing 16 entries. Note this is equivalent to the old slow + frequent snapshots configuration, as the 50 records per partition was not changed (based on the earlier input that this value is bad on its own). Short testing ran into 2 of the crashes below during re-elections. Restarting the crashed node recovered successfully.

Process terminated. Assertion failed.
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.LogEntry..ctor(MemoryOwner`1& cachedContent, LogEntryMetadata& metadata, Int64 index) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.LogEntry.cs:line 44
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Partition.Read(Int32 sessionId, Int64 absoluteIndex, LogEntryReadOptimizationHint hint) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.Partition.cs:line 218
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.ApplyAsync(Int32 sessionId, Int64 startIndex, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\MemoryBasedStateMachine.cs:line 455
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.ApplyAsync(Int32 sessionId, Int64 startIndex, CancellationToken token)
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.ApplyAsync(Int32 sessionId, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\MemoryBasedStateMachine.cs:line 479
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.<>c__DisplayClass20_0.<<CommitAsync>g__CommitAndCompactSequentiallyAsync|0>d.MoveNext() in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\MemoryBasedStateMachine.cs:line 311
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.<>c__DisplayClass20_0.<CommitAsync>g__CommitAndCompactSequentiallyAsync|0()
   at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.CommitAsync(Nullable`1 endIndex, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\MemoryBasedStateMachine.cs:line 293
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.CommitAsync(Int64 endIndex, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\PersistentState.cs:line 718
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.cs:line 465
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token)
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster.DotNext.Net.Cluster.Consensus.Raft.TransportServices.ILocalMember.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.DefaultImpl.cs:line 272
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpServer.cs:line 246
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.ProcessRequestAsync(MessageType type, ProtocolStream protocol, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpServer.cs:line 177
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpServer.cs:line 134
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)       
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetExistingTaskResult(Task`1 task, TResult result)
   at System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1.SetResult(TResult result)
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.ProtocolStream.ReadMessageTypeAsync(CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\TransportServices\ConnectionOriented\ProtocolStream.ReadOperations.cs:line 93
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.ThreadPool.<>c.<.cctor>b__86_0(Object state)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
   at System.Net.Sockets.SocketAsyncEventArgs.FinishOperationAsyncSuccess(Int32 bytesTransferred, SocketFlags flags)
   at System.Net.Sockets.SocketAsyncEventArgs.TransferCompletionCallbackCore(Int32 bytesTransferred, Byte[] socketAddress, Int32 socketAddressSize, SocketFlags receivedFlags, SocketError socketError)
   at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()
Aborted

@sakno (Collaborator) commented Jan 19, 2022

I caught another exception:

System.IO.InternalBufferOverflowException: System error.
         at DotNext.Buffers.SpanReader`1.Read(Int32 count) in /home/roman/projects/dotnext/src/DotNext/Buffers/SpanReader.cs:line 201
         at DotNext.IO.FileReader.Read[T]() in /home/roman/projects/dotnext/src/DotNext.IO/IO/FileReader.Binary.cs:line 22
         at DotNext.IO.FileReader.ReadAsync[T](CancellationToken token) in /home/roman/projects/dotnext/src/DotNext.IO/IO/FileReader.Binary.cs:line 42
         at RaftNode.SimplePersistentState.UpdateValue(LogEntry entry) in /home/roman/projects/dotnext/src/examples/RaftNode/SimplePersistentState.cs:line 75
         at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.ApplyAsync(Int32 sessionId, Int64 startIndex, CancellationToken token) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/MemoryBasedStateMachine.cs:line 456
         at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.<>c__DisplayClass20_0.<<CommitAsync>g__CommitAndCompactSequentiallyAsync|0>d.MoveNext() in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/MemoryBasedStateMachine.cs:line 311
      --- End of stack trace from previous location ---
         at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 465
         at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 246
         at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 251
         at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 134

I'll fix that shortly.

@sakno (Collaborator) commented Jan 20, 2022

The previous issue has been fixed.

@sakno (Collaborator) commented Jan 20, 2022

Found a new one:

DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer[74028]
      Failed to process request from 127.0.0.1:42824
      System.InvalidOperationException: Unknown Raft message type 251
         at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /home/roman/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 134

I'm working on it.

@sakno (Collaborator) commented Jan 20, 2022

The root cause was an incorrect skip of the last log entry when it is not consumed by the follower. Fixed and pushed to the develop branch.

@sakno (Collaborator) commented Jan 20, 2022

Found the root cause of the failed assertion. Here are the steps to reproduce:

  1. Append a non-empty entry at index X with the addToCache=true argument, but do not commit it
  2. Rewrite the log entry at index X with an entry of size 0
  3. Request the entry from the WAL

Raft allows rewriting uncommitted log entries only. This is by design and described in the Raft paper. In the described scenario, the WAL serves the read from its internal cache. However, the actual content stored on disk has a different length.

How to fix: invalidate the cache at the specific index when a rewrite happens (see the sketch below).
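A hypothetical sketch of that fix (illustrative names, not the dotNext code): the in-memory copy kept by the append path must be evicted whenever the entry at that index is overwritten, so reads fall back to the on-disk content.

```csharp
using System.Collections.Concurrent;

sealed class EntryCache
{
    private readonly ConcurrentDictionary<long, byte[]> cache = new();

    public void Add(long index, byte[] content) => cache[index] = content;

    public bool TryGet(long index, out byte[]? content) => cache.TryGetValue(index, out content);

    // Must be called when an uncommitted entry at an existing index is rewritten;
    // otherwise a later read returns stale cached bytes whose length no longer
    // matches what is stored in the partition file.
    public void Invalidate(long index) => cache.TryRemove(index, out _);
}

// Append path (sketch):
// if (index <= lastIndex)        // rewriting the uncommitted tail
//     cache.Invalidate(index);
// WriteToDisk(index, entry);
// cache.Add(index, entry.Content);
```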

@freddyrios (Author) commented:

All the previously shared scenarios are working now.

Additional testing surfaced these new scenarios:

  1. (low priority) a node starting again seems to always cause an election, which makes me wonder whether prevoting is happening or whether this is some other unintended consequence of that node starting
  2. (critical) cutting power to 1 out of 3 nodes: the 2 alive nodes did not resume for 13+ minutes. Even then, the cluster only recovered after the device was powered on again, even though the raft node program was not started on the device (the OS seems to explicitly reset the connection in this case)
  3. broken log when cutting power, regardless of the WriteThrough setting. Even if alternative explanations might appear, the ease of reproducing it could hint at a less unavoidable issue being at play, so it is at least worth exploring.

@sakno (Collaborator) commented Jan 21, 2022

The 2nd issue has been fixed.

@sakno (Collaborator) commented Jan 22, 2022

Regarding the 3rd issue: I've investigated the WAL dumps from the ARM devices. The file system (I think it was ext4) was unable to restore some of the partition files. That's why the index of the last log entry stored in the node.state file is larger than what the partition file actually contains. There is no silver bullet for this issue. However, there are a few practices that we can use (see the sketch after this list):

  1. The persistent WAL supports live backups; see the PersistentState.CreateBackupAsync and PersistentState.RestoreFromBackupAsync methods
  2. On initialization, if the WAL is broken, we can just call ClearAsync and join as a follower. The latest cluster state will be replicated as soon as possible.

In the worst case, the app can implement incremental backups. But I think option 2 is enough for our purposes.
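A hypothetical sketch of option 2 (the interface is a stand-in, not the released dotNext API; InitializeAsync is assumed to throw on a corrupted log, and ClearAsync is the method proposed earlier in this thread): on startup, try to replay the WAL; if it turns out to be broken, wipe it and rejoin as a follower so the leader re-replicates the latest state.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction over the WAL; the real code would use a PersistentState subclass.
interface IWriteAheadLog
{
    Task InitializeAsync(CancellationToken token); // replays the log; assumed to throw on corruption
    Task ClearAsync(CancellationToken token);      // drops all entries, as proposed above
}

static async Task InitializeWalAsync(IWriteAheadLog wal, CancellationToken token)
{
    try
    {
        // Replay committed entries into the state machine; fails if the on-disk log
        // is inconsistent (bad checksum, truncated partition file, ...).
        await wal.InitializeAsync(token);
    }
    catch (Exception e) when (e is not OperationCanceledException)
    {
        // The local log is unrecoverable: start empty. The node then joins as a
        // follower and receives the latest snapshot and entries from the leader.
        await wal.ClearAsync(token);
    }
}
```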

@sakno (Collaborator) commented Jan 23, 2022

@freddyrios , you mentioned the performance of read/write operations in the WAL. Unfortunately, Linux historically did not have a good syscall for async disk I/O. There were many attempts to provide one, without any great success. In recent kernel versions, io_uring has been added; however, it is not yet supported by .NET: dotnet/runtime#51985
It means that all asynchronous calls are transformed into synchronous calls scheduled via the ThreadPool.

@freddyrios (Author) commented:

@sakno I'm closing this issue as done, as all the raised scenarios have been addressed. The only thing remaining is that node restarts usually trigger an election, but this is not a priority at the moment.
