panic in `src/membership/membership.rs:309:51` #1242

sebadob · 2024-09-05T09:19:39Z

Describe the bug
I was able to get a panic using openraft when I would not expect it. I modified my application in a way that new nodes will auto-join existing clusters, if they are configured to do that. When I do a rolling release on Kubernetes and the pods restart fast enough, I always get into the situation where openraft panics, but it is pretty specific and not easily reproducible.

The code panics here:

openraft-0.9.15/src/membership/membership.rs:309:51:
called `Option::unwrap()` on a `None` value

To Reproduce
Steps to reproduce the behavior:
It is pretty hard to reproduce, but probably easy to fix, since it's just an .unwrap() on an Option<_>. I don't think that I can give an easy and quick guide to reproduce it here without quite a big of setup in advance. However, I can try to do this if actually needed. Probably just removing the unwrap() will solve it.

Expected behavior
To not panic.

Actual behavior

2024-09-05T09:06:48.865227Z  INFO change_membership: openraft::raft::impl_raft_blocking_write: change_membership: start to commit joint config changes=ReplaceAllVoters({3}) retain=true
2024-09-05T09:06:48.865299Z  INFO openraft::core::raft_core: received RaftMsg::ChangeMembership: openraft::core::raft_core::RaftCore<_, _, _, _>::handle_api_msg members=ReplaceAllVoters({3}) retain=true
thread 'tokio-runtime-worker' panicked at /home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/openraft-0.9.15/src/membership/membership.rs:309:51:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Env (please complete the following information):

openraft v0.19.5 with serde and storage-v2 features enabled
I don't know if it exists on main since I can't easily test because of many breaking changes.

Additional Information

What my application is doing in that szenario is, that I had a cluster with in-memory cache, that was running fine. I then trigger a rolling release on Kubernetes by specifying a new container image tag for instance. The nodes start to shut down and restart one by one as usual. My application code then does a check during startup if the node is a raft cluster member and if not, it tries to connect to the other nodes from the given config to find out the leader and then tries to join the cluster using the API on the leader node. Since in this situation, it is an in-memory cache, the node is un-initialized with each restart of course, because it can't track its state. It then tries to join the cluster on the leader, which I guess still thinks that this node is still a cluster member, while the node itself did a restart and lost its state.

I would assume that a new "join as learner" would simply change the config in a way that the node will become a learner again without any panics or issues. I guess I could do a workaround in this case and check on the leader first, if the joining node currently exists in the config and remove it first, but I have not tried this yet. In the end, it is just an unwrap on some option inside the membership.rs:309.

The text was updated successfully, but these errors were encountered:

github-actions · 2024-09-05T09:19:53Z

👋 Thanks for opening this issue!

Get help or engage by:

/help : to print help messages.
/assignme : to assign this issue to you.

drmingdrmer · 2024-09-05T12:21:54Z

I'm not very sure, but the panic you're encountering is likely due to the node losing its state upon startup. When this happens, RaftStateMachine::applied_state() returns an empty configuration, which leads to the panic at:

https://github.com/datafuselabs/openraft/blob/5f5d7e9f60ff2e86af078d2c20401636de2c19f7/openraft/src/storage/helper.rs#L194

This situation arises because Openraft currently doesn't prevent Raft::change_membership() from being called when the node is in an uninitialized state. There's no API to directly set a membership value that would cause self.get_joint_config().last() to return None.

To address this issue, I am going to implement a safeguard that prevents Raft::change_membership() from being called on an uninitialized node. This should help avoid the panic and ensure more robust behavior during node startup and state recovery.

sebadob · 2024-09-05T13:26:28Z

I had another look at it and it happens in the following order:

k8s starts node 1, 2, 3
node 1 ends up as leader
rolling released triggered
- node 3 is shut down and restarted
- node 2 is shut down and restarted
- node 1 is shut down and restarted
- node 2 + 3 crash (only once) with the panic
- node 1 is fine
- k8s restarts the crashed node 2 + 3 and then they end up being healthy without any panics

So it's probably a race condition on my side as well, because the restarts happen quite fast, maybe even within the heartbeat interval, I am not sure.

drmingdrmer · 2024-09-05T15:39:40Z

I added a check to change_membership() when the node is not initialized:

Fix: Prevent panic when calling Raft::change_membership() on uninitialized node #1243

It should fix the panic, but calling Raft::change_membership() on a non-initialized node still gets an error. : ()

sebadob · 2024-09-05T16:07:44Z

Thanks! An error is not an issue, it will simply retry on error after a short timeout.

This situation is pretty hard to debug, because I can't reproduce it manually and it does not happen consistently, but I guess it includes a race condition somehow as already mentioned.

Edit:

I will test again by referring to the git branch tomorrow, but it should be fine, so I guess we can close the issue.

drmingdrmer self-assigned this Sep 5, 2024

drmingdrmer closed this as completed Sep 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

panic in `src/membership/membership.rs:309:51` #1242

panic in `src/membership/membership.rs:309:51` #1242

sebadob commented Sep 5, 2024

github-actions bot commented Sep 5, 2024

drmingdrmer commented Sep 5, 2024

sebadob commented Sep 5, 2024 •

edited

Loading

drmingdrmer commented Sep 5, 2024

sebadob commented Sep 5, 2024 •

edited

Loading

panic in src/membership/membership.rs:309:51 #1242

panic in src/membership/membership.rs:309:51 #1242

Comments

sebadob commented Sep 5, 2024

github-actions bot commented Sep 5, 2024

drmingdrmer commented Sep 5, 2024

sebadob commented Sep 5, 2024 • edited Loading

drmingdrmer commented Sep 5, 2024

sebadob commented Sep 5, 2024 • edited Loading

panic in `src/membership/membership.rs:309:51` #1242

panic in `src/membership/membership.rs:309:51` #1242

sebadob commented Sep 5, 2024 •

edited

Loading

sebadob commented Sep 5, 2024 •

edited

Loading