Leader election after node restart #683
Comments
Can you give detailed steps to reproduce the problem?
|
You should keep the jraft data in persistent storage. The data contains the commit logs and metadata; if they are corrupted, the node will copy the logs from the leader and replay them to recover the state machine. The snapshot helps speed up the recovery procedure. |
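As a rough illustration of "keep the data instead of deleting it on startup", the following sketch prepares the storage directories without wiping them. The class name `RaftDataDirs`, the `prepare` helper and the directory names `log` and `raft_meta` are assumptions made for this example and are not part of jraft; only JDK calls are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RaftDataDirs {
    // Hypothetical helper: creates the sub-directories if they are missing
    // and never deletes what a previous run left behind.
    static Path[] prepare(Path dataDir) throws IOException {
        Path log = dataDir.resolve("log");
        Path meta = dataDir.resolve("raft_meta");
        // createDirectories is a no-op when the directory already exists.
        Files.createDirectories(log);
        Files.createDirectories(meta);
        return new Path[] { log, meta };
    }

    public static void main(String[] args) throws IOException {
        // Assumed location; on Kubernetes this would be a mounted persistent volume.
        Path[] dirs = prepare(Paths.get(args.length > 0 ? args[0] : "raft-data"));
        System.out.println("log dir:  " + dirs[0].toAbsolutePath());
        System.out.println("meta dir: " + dirs[1].toAbsolutePath());
    }
}
```

The resulting paths would then be passed to jraft's `NodeOptions` (for example through `setLogUri` and `setRaftMetaUri`) when the node is created.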
Hi, thanks for your replies! @killme2008, is there much data that needs to be replayed when I only use leader election and no KV store? The reason I decided to delete the raft data on startup is that I thought it might lead to issues, |
Please attach the new leader's log. I think it would help us find the problem. There are not many logs if you only use jraft for election, but deleting the data every time may hurt election speed. |
Hi, I have reproduced the problem and fetched the logs of all election nodes. |
How did you kill the process? Because the time range contained in the log is limited, I do not know whether device-sub-1 finally returned to normal. Can you check the health of the rpcServer in this case? For example, whether the port is accessible, or whether a client can successfully connect to it. netstat is helpful if you are working on Linux. |
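A quick way to check whether the port is accessible from another pod or host is a plain TCP connect with a timeout. The sketch below uses only the JDK; the class name `PortProbe` and the default port `8081` are placeholders chosen for this example, not values from the logs.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    // Returns true if a TCP connection to host:port can be opened within timeoutMs.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers connection refused, timeouts and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8081; // placeholder port
        System.out.println(host + ":" + port + " reachable: " + isReachable(host, port, 2000));
    }
}
```

A successful connect only shows that something is listening on the port; it does not prove that the RPC server behind it is processing requests.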
Hi @fengjiachun, all services run on kubernetes, on virtual machines based on KVM. Actually, this is a problem that I already had with grpc-java in a different service (see grpc/grpc-java#8574). Another option would be to switch back from grpc to bolt. |
In the linked grpc-java issue, the real cause seems to be how jetcd uses grpc (with its own DNS resolver, not grpc). Having a look at jraft-extension/rpc-grpc-impl/src/main/java/com/alipay/sofa/jraft/rpc/impl/GrpcClient.java,
I have an idea: What do you think? @Excpt0r @killme2008 |
The bug was fixed in 1.6.4 |
I think it will work. But the JVM caches DNS name lookups, so we must recommend that users set it to a smaller value by
|
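For context, the JVM's DNS cache lifetime is controlled by the `networkaddress.cache.ttl` security property (and `networkaddress.cache.negative.ttl` for failed lookups). Whether this is the setting the comment above had in mind is an assumption; the sketch below only shows the mechanism, and the class name `DnsCacheTtl` and the values 10 and 5 are illustrative.

```java
import java.security.Security;

public class DnsCacheTtl {
    // Sets how long (in seconds) successful and failed DNS lookups are cached.
    // This should run before the first name lookup, because the JVM reads the
    // cache policy when its address-resolution classes are initialised.
    static void configure(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl", String.valueOf(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl", String.valueOf(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        configure(10, 5); // illustrative values, not a recommendation from the thread
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

The same properties can also be set in the JRE's `java.security` file, which avoids depending on call order inside the application.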
Hi @fengjiachun I think I found another occurrence of the described issue, even though it seemed to be fixed. In the attached logs, I did the following:
Maybe I will keep bolt rpc instead of grpc, but it's probably good to track the problem anyway. Runtime is OpenJDK 11.0.11. Cheers and thanks. |
I haven't found the exact reason, but can you make sure that this IP (10.130.0.21) is not re-enabled? From the logs, this IP still exists and the kernel responds to the RST. gRPC doesn't change the channel state to
|
Hi, the application runs on kubernetes, which means it has virtual networking on the container level plus networking on the VM level. I'm wondering where the "OUTBOUND RST_STREAM" comes from, and what it is a reaction to. I'm not sure that the IP still exists and that device-sub-2 received a (TCP) RST. This RST message looks more like something that grpc is doing, and the direction is outgoing, so where is the kernel involved? |
Your question
I use sofa-jraft for the sole purpose of running a leader election between instances of myapp.
One test is to shut down 1 of 3 myapp instances and observe the other 2 instances elect a new leader. This works.
After that, when restarting the shut-down instance, I would expect it to be re-integrated into the raft cluster as a follower.
But I see the following behaviour
The last three lines are repeated over and over again.
Questions
1) Are a persistent volume and the "snapshot" feature needed so that the restarted instance can be re-integrated into the raft cluster?
Currently I don't persist the jraft data and snapshots are disabled; myapp even deletes any existing data in the config directory during startup (for local environments).
2) What is the expected behaviour: how is the restarted instance re-integrated, and how does it get the current data from the cluster?
Environment
Java version (java -version): openjdk version "11.0.11"