stability: improve recovery after a crash of the majority of servers #1
Comments
That may be an issue with filesystem setup. For instance, when SSD/NVMe is in place, Linux allows you to configure buffer caches for disk I/O. It means that even if you write some bytes to FileStream/SafeFileHandle, they may not yet be written to the disk. Persistent WAL has the WriteThrough flag to skip any intermediate OS-level buffers when writing to the disk. However, it's always up to the OS I/O layer. |
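As a rough sketch of what write-through means at the .NET level (not the dotNext API itself; the file name and buffer size are made up):

```csharp
using System.IO;

// Ask the OS to bypass its write-back cache for this file. The device's own
// cache is still outside our control, so durability ultimately depends on the
// whole storage stack.
const string path = "wal-partition.bin"; // hypothetical file name
using var stream = new FileStream(
    path,
    FileMode.OpenOrCreate,
    FileAccess.ReadWrite,
    FileShare.Read,
    bufferSize: 4096,
    FileOptions.WriteThrough | FileOptions.Asynchronous);

await stream.WriteAsync(new byte[] { 1, 2, 3 });
await stream.FlushAsync(); // flush FileStream's own managed buffer as well
```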
@freddyrios , look at dotnet#94 (comment). Looks like the more accurate calculation of heartbeat timeout on the leader node (now it includes subtraction of the replication duration) leads to a more stable cluster. Anyway, this is a step forward to our goal. |
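If I understand the change correctly, the idea is roughly the following (sketch only, with made-up names and values; `ReplicateToFollowersAsync` is a stand-in, not the library's API):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Keep the heartbeat cadence stable by subtracting the time already spent on
// replication from the configured heartbeat period.
var heartbeatPeriod = TimeSpan.FromMilliseconds(150); // assumed value
var clock = new Stopwatch();

// Stand-in for a real replication round; here it just sleeps for a while.
static Task ReplicateToFollowersAsync() => Task.Delay(20);

for (var i = 0; i < 10; i++)
{
    clock.Restart();
    await ReplicateToFollowersAsync();
    var remaining = heartbeatPeriod - clock.Elapsed;
    if (remaining > TimeSpan.Zero)
        await Task.Delay(remaining); // otherwise start the next round immediately
}
```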
Here is a new branch where I'm working on a new protocol on top of TCP: https://github.com/dotnet/dotNext/tree/feature/new-raft-tcp |
The deployment had default Raspbian, so it makes sense there could be something like that at play for that log issue (the one about an index being out of range or not found in the log).
|
I am on a phone/train, so had trouble fully following.
However, what I understood is that the response time trying to send
heartbeats/entries to one follower prevented the leader from sending
heartbeats/entries at the expected rate to a different follower.
I could see addressing that helping the stability for some of the scenarios
we have hit.
One thing I noticed when playing with the Raft simulator is that heartbeats do not always align across different nodes. So depending on what was going on, it looked a lot like the leader could keep the heartbeat frequency independently for each follower. I suspect that decoupling is an alternate way to deal with it. However, it does make sense that it would keep trying at the right frequency, even if one of the appends happened to be a bit heavy due to the entries being sent.
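A minimal sketch of that decoupling idea (illustrative names only, not the dotNext implementation): each follower gets its own loop and timer, so a heavy append to one follower cannot delay heartbeats to another.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var followers = new[] { "node-b", "node-c" }; // assumed endpoints
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));

async Task HeartbeatLoopAsync(string follower, CancellationToken token)
{
    using var timer = new PeriodicTimer(TimeSpan.FromMilliseconds(150));
    while (await timer.WaitForNextTickAsync(token))
    {
        // Stand-in for AppendEntries; a slow call here only delays this follower.
        await Task.Delay(Random.Shared.Next(5, 50), token);
        Console.WriteLine($"heartbeat sent to {follower}");
    }
}

var loops = followers.Select(f => HeartbeatLoopAsync(f, cts.Token)).ToArray();
try { await Task.WhenAll(loops); } catch (OperationCanceledException) { }
```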
For healthy followers, the time to append entries is supposed to be an order of magnitude less than the election timeout. If appends are slow enough to matter for healthy nodes, then I guess it could make some of the Raft paper's assumptions problematic.
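For reference, the timing requirement from the Raft paper is usually written as:

```
broadcastTime ≪ electionTimeout ≪ MTBF
```

where broadcastTime is the average time it takes to send RPCs to the other servers and receive responses, and MTBF is the average time between failures of a single server.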
Another independent but related stability area I have thought about is: what happens when appending entries to a follower takes long (due to the size and number of entries, for example)? Can this result in the follower becoming a candidate even though the leader is actively communicating with it? (I'm not sure what the right behavior for such a case is, as the problem appending those entries could be on either side.)
|
Yes, it can. We can mitigate this by stopping the timer countdown on the follower node during execution of |
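Something along these lines, I assume (a sketch only, not the actual follower state machine): the follower does not let its election timeout fire while it is still busy handling a request from the leader, and resets the countdown once handling completes.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

sealed class ElectionTimeoutTracker
{
    private readonly TimeSpan timeout;
    private readonly Stopwatch sinceLeaderContact = Stopwatch.StartNew();
    private int busy; // > 0 while a request from the leader is being processed

    public ElectionTimeoutTracker(TimeSpan timeout) => this.timeout = timeout;

    // Call at the start of processing an AppendEntries request; dispose when done.
    public IDisposable Suspend()
    {
        Interlocked.Increment(ref busy);
        return new Resume(this);
    }

    public bool ShouldStartElection()
        => Volatile.Read(ref busy) == 0 && sinceLeaderContact.Elapsed > timeout;

    private sealed class Resume : IDisposable
    {
        private readonly ElectionTimeoutTracker owner;
        internal Resume(ElectionTimeoutTracker owner) => this.owner = owner;

        public void Dispose()
        {
            owner.sinceLeaderContact.Restart(); // leader contact just completed
            Interlocked.Decrement(ref owner.busy);
        }
    }
}
```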
@freddyrios , done: dotnet@a6ab902 |
@sakno I have created a new branch with the same example but on top of the latest develop changes: https://github.com/copenhagenatomics/dotNext/tree/feature/bigstruct-example-rebased There are 2 more branches with some exploration I did on stability (already rebased on the latest develop changes), especially given the fix you shared related to heartbeats: https://github.com/copenhagenatomics/dotNext/tree/feature/bigstruct-slowspeed It seems:
What I find particularly odd is that it can push entries at such a high rate in the first configuration, yet run into timeout trouble recovering a node in the second configuration. It makes me wonder whether, beyond a timeout, something is getting stuck (but I have no evidence of this). I'm also not sure what could explain 2 out of 3 nodes being much more reliable in configuration 3 vs. 1. Even though configuration 1 is more of a stress test, I can't see why that would be special when taking out 2 nodes vs. 1 node in comparison to configuration 3. |
@freddyrios , I'm still working on the new TCP transport. I expect to finish it over the holidays. After that, I'll start testing and stabilizing the release using the use cases you mentioned previously. All patches and the new transport will be released as |
Alright, the TCP transport is re-implemented now. Of course, I'll do the necessary tests using your examples. Now I think it's reasonable to implement a recovery procedure for the persistent Write-Ahead Log. Here is my plan:
P.S.: Regarding our protection from bit flips: I think SHA is too heavyweight for our purposes. It's a crypto-strong hash that is typically not chosen for data consistency checks. FNV-1a or CRC is a much better fit and more performant. |
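For example, a 64-bit FNV-1a over the entry payload is only a few lines (a sketch; the hash actually chosen for the library may differ):

```csharp
using System;

static ulong Fnv1a64(ReadOnlySpan<byte> data)
{
    const ulong offsetBasis = 14695981039346656037UL;
    const ulong prime = 1099511628211UL;

    var hash = offsetBasis;
    foreach (var b in data)
    {
        hash ^= b;
        hash *= prime; // overflow wraps, which is exactly what FNV expects
    }
    return hash;
}

// Usage: persist the checksum next to the log entry and compare it on read.
var payload = new byte[] { 0xDE, 0xAD, 0xBE, 0xEF };
var stored = Fnv1a64(payload);
Console.WriteLine(stored == Fnv1a64(payload) ? "intact" : "corrupted");
```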
Verification of |
My investigation shows that storing checksum for
Anyway, a fault-tolerant WAL is another field of theoretical and practical research. I didn't have a goal to write a WAL for such cases. The current implementation just works, at least under normal circumstances. However, we can have a workaround to simulate a simple checkpoint. Before every append operation, we can save a WriteInProgress indicator to an external file. The indicator can have a size of 1 byte. When the operation is finished, we just reverse this indicator. In case of an accidental shutdown, the file will keep the WriteInProgress indicator. That's enough to decide that the entire WAL is probably broken and to skip all the data in it to get a fresh setup. All these things are out of scope at the moment, so I'll focus on stability of the cluster itself. |
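A sketch of that indicator file (the file name and byte values are assumptions, not the actual dotNext format):

```csharp
using System;
using System.IO;

const string markerPath = "wal.write-in-progress"; // hypothetical file name

void BeginAppend() => File.WriteAllBytes(markerPath, new byte[] { 1 });
void EndAppend() => File.WriteAllBytes(markerPath, new byte[] { 0 });

bool PreviousAppendWasInterrupted()
{
    if (!File.Exists(markerPath))
        return false;
    var marker = File.ReadAllBytes(markerPath);
    return marker.Length > 0 && marker[0] == 1;
}

// On startup: if the marker still says "in progress", the last append never
// completed, so the WAL is probably broken and should be discarded.
if (PreviousAppendWasInterrupted())
    Console.WriteLine("Incomplete append detected; starting from a fresh WAL.");

BeginAppend();
// ... append the log entry here ...
EndAppend();
```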
Found the root cause of this issue, even in the case of a graceful shutdown of the node. The problem is in the snapshot when it is installed at an index that is greater than the first index in the partition. In that case, the partition becomes completely invalid because the binary offset is calculated incorrectly within the partition file.
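A simplified illustration of why the offset breaks (the constants and formula here are assumptions for the sketch, not the actual dotNext partition format):

```csharp
using System;

// Entries inside a partition file are located by their distance from the
// partition's first index. If a snapshot overwrites entries in the middle of
// the partition, the effective first index changes but the formula does not,
// so the computed offset points at the wrong bytes.
const long firstIndexInPartition = 100; // assumed
const int entrySlotSize = 512;          // assumed fixed slot size

static long OffsetOf(long index, long firstIndex, int slotSize)
    => (index - firstIndex) * slotSize;

Console.WriteLine(OffsetOf(105, firstIndexInPartition, entrySlotSize)); // 2560
```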
|
Nice catch. One issue I hit on our side in the past with ARM was trying to use MemoryMarshal.Cast: https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.memorymarshal.cast?view=net-6.0. |
Fortunately, there is only a small amount of such code. It is used on the hot path of program execution, where serialization/deserialization of log entry metadata is required. |
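A hedged illustration of the concern (the metadata layout below is hypothetical): reinterpreting a byte span as wider integers can produce unaligned element accesses, which is what tends to bite on ARM, whereas field-by-field reads via BinaryPrimitives are alignment-safe.

```csharp
using System;
using System.Buffers.Binary;

Span<byte> metadata = stackalloc byte[17];

// Potentially problematic on ARM: the slice below is not 8-byte aligned, and
// MemoryMarshal.Cast<byte, long>(metadata.Slice(1)) would hand out a span whose
// element accesses are unaligned.

// Alignment-safe alternative: read each field explicitly (hypothetical layout).
long term = BinaryPrimitives.ReadInt64LittleEndian(metadata.Slice(1));
long length = BinaryPrimitives.ReadInt64LittleEndian(metadata.Slice(9));
Console.WriteLine($"{term} {length}");
```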
Test branches have been rebased to the latest changes (if updating local copies, you might want to delete them locally and pull them fresh):
The first example uses only default settings, except for BufferSize, which is now 4096*2, a bit larger than one of the big entries (8000). Some short testing showed better stability, including in the scenario where 2 nodes crash. The second example adds a 1-second delay after writing 16 entries. Note this is equivalent to the old slow + frequent snapshots setup, as the 50 records per partition was not changed (based on input that this is bad on its own). Short testing ran into 2 of the below crashes during re-elections. Restarting the crashed node recovered successfully.
|
I caught another exception:
I'll fix that shortly. |
The previous issue has been fixed. |
Found a new one:
I'm working on it. |
The root cause was an incorrect skip of the last log entry when it is not consumed by the follower. Fixed and pushed to
Found the root cause of the failed assertion. Here are the steps to reproduce:
Raft allows rewriting of uncommitted log entries only. This is by design and described in the Raft paper. In the described scenario, the WAL uses its internal cache when a read operation is called. However, the actual content is stored on the disk with a different length. How to fix: invalidate the cache at the specific index when a rewrite happens. |
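The fix as I understand it, sketched with an illustrative cache (not the actual dotNext WAL cache): every cached entry at or after the rewritten index must be dropped, otherwise reads keep returning the stale in-memory copy whose on-disk length has changed.

```csharp
using System;
using System.Collections.Concurrent;

var cache = new ConcurrentDictionary<long, byte[]>();

// Read path: serve from the cache, falling back to disk (stubbed here).
byte[] ReadFromDisk(long index) => new byte[] { (byte)index };
byte[] Read(long index) => cache.GetOrAdd(index, ReadFromDisk);

// Rewrite path: uncommitted entries starting at 'startIndex' are overwritten,
// so every cached entry at or after that index is invalidated.
void InvalidateFrom(long startIndex)
{
    foreach (var index in cache.Keys)
    {
        if (index >= startIndex)
            cache.TryRemove(index, out _);
    }
}

Read(5);
Read(6);
InvalidateFrom(6); // index 6 will be re-read from disk next time
Console.WriteLine(cache.ContainsKey(6)); // False
```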
All the previously shared scenarios are working now. Additional testing shows these new scenarios:
|
2nd fixed. |
3rd issue. I've investigated WAL dumps from ARM devices. The filesystem (I think it was ext4) was unable to restore some of the partition files. That's why the index of the last log entry stored in the
In the worst case, the app can implement incremental backups. But I think option 2 is enough for our purposes. |
@freddyrios , you mentioned the performance of read/write operations in the WAL. Unfortunately, Linux historically did not have a good syscall for async disk I/O. There were many attempts to provide one, without great success. In recent kernel versions, the io_uring interface has been added; however, it is not supported by .NET: dotnet/runtime#51985 |
@sakno closing this issue as done, as all the raised scenarios have been addressed. The only thing remaining is around node restarts usually triggering an election, but this is not a priority at the moment. |
This relates to the scenario mentioned at dotnet#89 (comment).
Scenario: running the modified example on 3 nodes, killing 2 nodes (Ctrl-C) and then restarting one of them.
Expected: consistently resumes normal operation when at least 2 out of 3 nodes are running again
Actual: the cluster often does not recover consistently
Notes: