Leader election after node restart #683

Excpt0r · 2021-09-09T12:43:04Z

Your question

I use sofa-jraft for the sole purpose to have a leader election between instances of myapp.
One test is to shut down 1 of 3 myapp instances, and observe the other 2 instances elect a new leader - works.
After that when restarting the shutdown instance, I would expect it to be re-integrated in the raft cluster as a follower.
But I see the following behaviour

Starts FSMCaller successfully.
Do not set snapshot uri, ignore initSnapshotStorage.
LeaderElection/myapp-0.myapp-headless:8888> init, term=0, lastLogId=LogId [index=0, term=0], conf=myapp-0.myapp-headless:8888,myapp-1.myapp-headless:8888,myapp-2.myapp-headless:8888, oldConf=.
myapp-0.myapp-headless:8888> term 0 start preVote

Channel in TRANSIENT_FAILURE state: myapp-1.myapp-headless:8888.
Channel in SHUTDOWN state: myapp-1.myapp-headless:8888.
Peer myapp-1.myapp-headless:8888 is connected.

Channel in TRANSIENT_FAILURE state: myapp-2.myapp-headless:8888.
Channel in SHUTDOWN state: myapp-2.myapp-headless:8888.
Peer myapp-2.myapp-headless:8888 is connected.

received invalid PreVoteResponse from myapp-1.myapp-headless:8888, term 2, expect=0.
LocalRaftMetaStorage | INFO | Save raft meta, path=/tmp/meta, term=2, votedFor=0.0.0.0:0, cost time=12 ms
received invalid PreVoteResponse from myapp-2.myapp-headless:8888, term=0, currTerm=2

myapp-0.myapp-headless:8888> term 2 start preVote.
received PreVoteResponse from myapp-1.myapp-headless:8888, term=2, granted=false.
received PreVoteResponse from myapp-2.myapp-headless:8888, term=2, granted=false.

The last three lines will be repeatet over and over again.

Questions
1)
Is a persistent volume and the "snapshot" feature needed, that the restarted instance can be re-integrated in the raft cluster?
Currently I don't persist the jraft data and snapshots are disabled, myapp is even deleting any existing data in config-directory during startup (for local environments)
2)
What would be the expected behaviour, how the restarted instance is re-integrated and gets the current data from cluster?

Environment

SOFAJRaft version: 1.3.7
JVM version (e.g. java -version): openjdk version "11.0.11"

The text was updated successfully, but these errors were encountered:

fengjiachun · 2021-09-15T07:26:19Z

Can you give a detailed step to reproduce the problem?
My test results are as follows:

2021-09-15 15:09:08 [main] INFO  FSMCallerImpl:201 - Starts FSMCaller successfully.
2021-09-15 15:09:08 [main] WARN  NodeImpl:560 - Do not set snapshot uri, ignore initSnapshotStorage.
2021-09-15 15:09:08 [main] INFO  NodeImpl:1085 - Node <counter/127.0.0.1:8083> init, term=0, lastLogId=LogId [index=0, term=0], conf=127.0.0.1:8081,127.0.0.1:8082,127.0.0.1:8083, oldConf=.
2021-09-15 15:09:08 [main] INFO  RaftGroupService:136 - Start the RaftGroupService successfully.
Started counter server at port:8083
2021-09-15 15:09:09 [counter/PeerPair[127.0.0.1:8083 -> 127.0.0.1:8082]-AppendEntriesThread0] INFO  LocalRaftMetaStorage:132 - Save raft meta, path=/tmp/server3/raft_meta, term=6, votedFor=0.0.0.0:0, cost time=11 ms
2021-09-15 15:09:09 [counter/PeerPair[127.0.0.1:8083 -> 127.0.0.1:8082]-AppendEntriesThread0] WARN  NodeImpl:1953 - Node <counter/127.0.0.1:8083> reject term_unmatched AppendEntriesRequest from 127.0.0.1:8082, term=6, prevLogIndex=999, prevLogTerm=6, localPrevLogTerm=0, lastLogIndex=0, entriesSize=0.
2021-09-15 15:09:09 [JRaft-FSMCaller-Disruptor-0] INFO  StateMachineAdapter:89 - onStartFollowing: LeaderChangeContext [leaderId=127.0.0.1:8082, term=6, status=Status[ENEWLEADER<10011>: Raft node receives message from new leader with higher term.]].
2021-09-15 15:09:09 [counter/PeerPair[127.0.0.1:8083 -> 127.0.0.1:8082]-AppendEntriesThread0] WARN  NodeImpl:1953 - Node <counter/127.0.0.1:8083> reject term_unmatched AppendEntriesRequest from 127.0.0.1:8082, term=6, prevLogIndex=999, prevLogTerm=6, localPrevLogTerm=0, lastLogIndex=0, entriesSize=0.
2021-09-15 15:09:09 [counter/PeerPair[127.0.0.1:8083 -> 127.0.0.1:8082]-AppendEntriesThread0] WARN  NodeImpl:1953 - Node <counter/127.0.0.1:8083> reject term_unmatched AppendEntriesRequest from 127.0.0.1:8082, term=6, prevLogIndex=999, prevLogTerm=6, localPrevLogTerm=0, lastLogIndex=0, entriesSize=0.
2021-09-15 15:09:10 [counter/PeerPair[127.0.0.1:8083 -> 127.0.0.1:8082]-AppendEntriesThread0] INFO  Recyclers:52 - -Djraft.recyclers.maxCapacityPerThread: 4096.
2021-09-15 15:09:10 [JRaft-FSMCaller-Disruptor-0] INFO  CounterStateMachine:102 - Added value=0 by delta=106 at logIndex=3
2021-09-15 15:09:10 [JRaft-FSMCaller-Disruptor-0] INFO  CounterStateMachine:102 - Added value=106 by delta=5 at logIndex=4
2021-09-15 15:09:10 [JRaft-FSMCaller-Disruptor-0] INFO  CounterStateMachine:102 - Added value=111 by delta=15 at logIndex=5

killme2008 · 2021-09-16T03:33:30Z

You should keep jraft data in persistent storage,the data contains the commit logs and metadata, if they corrupts, the node will copied logs from leader and replay them to recover the state machine .The snapshot helps speeding up the recover procedure.

Excpt0r · 2021-09-17T10:44:13Z

Hi,

thanks for your replies!
Sorry, I forgot to mention that the problem is not often reproducible.

@killme2008 is it much data that needs to be replayed, when I only use leader election and no KV store?
Will deleting the old raft on startup only slow down the sync, or could that be the reason for the observed problem?

The reason I decided to delete the raft data on startup is, that I thought it might lead to issues,
when all cluster nodes are shutdown intentionally, and started again (with previous raft data).
I made the experience that it's not easy to get stateful deployments (i.e. etcd, zookeeper) back in correct state with previous data.
Just wanted to avoid such cases, and any special handling that might be needed.

killme2008 · 2021-09-17T11:58:38Z

Please attach the new leader's log. I think it would help us to find out the problem.

There is no much logs if you just use the jraft for election,but everytime you delete the data that may hurt election speed.

Excpt0r · 2021-09-28T12:51:43Z

Hi,

I have reproduced the problem and fetched the logs of all election nodes.
After initial successful election, device-sub-1 became leader - this is the node that was then killed at 2021-09-28T11:20:45
The node device-sub-2 became new leader.
The logs attached for node device-sub-1 is after killing, while restarting and unsuccessfully trying to reconnect to the cluster.

election-log.tar.gz

fengjiachun · 2021-09-29T04:17:39Z

How did you kill the process?
From the log information, it seems that device-sub-1 did not start rpcServer successfully, so it can not receive leader‘s messages, but can send messages to others using GRPC client (vote messages).

Because the time range contained in the log is limited, I do not know whether device-sub-1 finally returns to normal successfully.

Can you check the health of the rpcServer in this case? For example, whether the port is accessible or whether a client successfully connects to it， etc..

netstat is helpful if you are work on a linux.

Excpt0r · 2021-10-01T13:09:00Z

Hi @fengjiachun

all services run on kubernetes on virtual machines based on KVM.
My test is executing a "virsh destroy" (like unplugging the machine, not a clean shutdown) on the worker node that contains the current election leader.
I analysed the rpc server connectivity as you suggested, as the new leader is not able to connect to the old leader that restarted.
I can connect to the restarted node from leader nodes container via telnet, but the running election process cannot connect.
-> Reason: the election node (grpc) tries to connect to the old IP (killing and restarting the old leader leads to a new pod IP)

Actually that is a problem that I already had with grpc-java in a different service (see grpc/grpc-java#8574)
Hopefully there will be a suggestion on how to change my grpc config or a bugfix, that can be applied to jraft as well.

Another option would be to switch back from grpc to bolt.
Is the bugfix for bolt that you mentioned in this comment already implemented and released?
#599 (comment)
Beside the bugfix, would there be any advantage or disadvantage to use bolt vs grpc, in my scenario where only a leader election functionality is needed?

Excpt0r · 2021-10-07T13:47:59Z

In the linked issue for grpc-java the real cause seems to be how jetcd is using grpc (with an own dns resolver, not grpc).
So I gathered more logs from jraft/grpc to see how dns resolving is done.
I can see the log message "Resolved address" only at the application startup (see attachment), not after failed connections.

Having a look at jraft-extension/rpc-grpc-impl/src/main/java/com/alipay/sofa/jraft/rpc/impl/GrpcClient.java,
it seems that in getChannel() the grpc channel is constructed based on IP. So to detect a changed DNS entry, jraft would need to update the endpoints/IPs, right?

jraft-grpc.log

fengjiachun · 2021-10-08T03:45:06Z

I have an idea:
Delete the channel when it becomes TRANSIENT_FAILURE and reinitialize the connection the next time it is called.

What do you think? @Excpt0r @killme2008

fengjiachun · 2021-10-08T03:48:00Z

Is the bugfix for bolt that you mentioned in this comment already implemented and released?

The bug fixed at 1.6.4

killme2008 · 2021-10-08T03:50:27Z

I have an idea: Delete the channel when it becomes TRANSIENT_FAILURE and reinitialize the connection the next time it is called.

What do you think? @Excpt0r @killme2008

I think it will work. But jvm caches DNS name lookups, we must recommend user to set it to a smaller value by networkaddress.cache.ttl.

fengjiachun · 2021-10-09T11:24:19Z

I have an idea: Delete the channel when it becomes TRANSIENT_FAILURE and reinitialize the connection the next time it is called.

What do you think? @Excpt0r @killme2008

channel#resetConnectBackoff is equivalent to reinitializing a connection #690

Excpt0r · 2022-02-22T14:15:29Z

Hi @fengjiachun

I think I found another occurrence of the described issue, even though it seemed to be fixed.
To rule out a problem on application/env level, I used jraft with bolt rpc instead grpc, and the problem does not appear.

In the attached logs, I did the following:

Kill the node with pod x-device-sub-1 (current raft leader)
Election works and x-device-sub-2 is voted new leader
Start the node again, pod x-device-sub-1 is restarted (new IP 10.130.0.9)
At 15:57:13h x-device-sub-1 is connected to other peers device-sub-2 and device-sub-0
At the same time, we see at device-sub-2 that the old IP is referenced for device-sub-0 heartbearts (10.130.0.21)
At 15:58:13h (one minute later), device-sub-2 still shows this IP in the logs.
We also see at device-sub-0:
ignore PreVoteRequest from van-device-sub-1.van-device-sub-headless:8888, term=1, currTerm=3, because the leader van-device-sub-2.van-device-sub-headless:8888's lease is still valid.

Maybe I will keep bolt rpc in favor of grpc, but it's probably good to track the problem anyway.

Runtime is OpenJdk 11.0.11
jraft v1.3.9
java.security:
networkaddress.cache.ttl=10
networkaddress.cache.negative.ttl=10

Cheers and thanks.

logs.zip

fengjiachun · 2022-03-15T10:18:32Z

I haven't found the exact reason, but can you try to make sure this IP is not re-enabled (10.130.0.21), because from the logs, this IP still exists and the kernel responds to the RST. gRPC doesn't change the channel state to TRANSIENT_FAILURE when it received a RST (which I'm also rather confused about).

| 2022-02-21T15:56:49,620 | grpc-default-worker-ELG-3-1 | io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler | DEBUG | [id: 0x174b8491, L:/10.129.2.34:46478 - R:x-device-sub-1.x-device-sub-headless/10.130.0.21:8888] OUTBOUND RST_STREAM: streamId=691 errorCode=8

Excpt0r · 2022-03-28T15:04:41Z

Hi,

the application runs on kubernetes, which means it has virtual networking on container level + networking on VM level.
I don't think I can influence the behaviour of that stack in regards to handling of previous pod IPs.

I'm wondering where the "OUTBOUND RST_STREAM" comes from, as reaction to what?
We see a prevote request/response between the restarted leader device-sub-1 and the follower device-sub-0.
But I don't see that between device-sub-1 and device-sub-2, which sends the OUTBOUND RST_STREAM.

I'm not sure if the IP still exists and device-sub-2 received a (TCP) RST. This RST message looks more like something that grpc is doing, and the direction is outgoing, where is the kernel involved?

fengjiachun mentioned this issue Oct 9, 2021

fix: grpc conn refresh #690

Merged

killme2008 closed this as completed in #690 Nov 23, 2021

Excpt0r mentioned this issue Mar 15, 2022

Leader election after node restart #793

Closed

fengjiachun reopened this Mar 15, 2022

luckyxiaoqiang mentioned this issue Jul 17, 2023

不同 Raft 集群非预期通信导致 Raft 状态异常 #1012

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Leader election after node restart #683

Leader election after node restart #683

Excpt0r commented Sep 9, 2021 •

edited

Loading

fengjiachun commented Sep 15, 2021

killme2008 commented Sep 16, 2021

Excpt0r commented Sep 17, 2021

killme2008 commented Sep 17, 2021

Excpt0r commented Sep 28, 2021

fengjiachun commented Sep 29, 2021

Excpt0r commented Oct 1, 2021

Excpt0r commented Oct 7, 2021

fengjiachun commented Oct 8, 2021

fengjiachun commented Oct 8, 2021

killme2008 commented Oct 8, 2021

fengjiachun commented Oct 9, 2021

Excpt0r commented Feb 22, 2022

fengjiachun commented Mar 15, 2022

Excpt0r commented Mar 28, 2022

Leader election after node restart #683

Leader election after node restart #683

Comments

Excpt0r commented Sep 9, 2021 • edited Loading

Your question

Environment

fengjiachun commented Sep 15, 2021

killme2008 commented Sep 16, 2021

Excpt0r commented Sep 17, 2021

killme2008 commented Sep 17, 2021

Excpt0r commented Sep 28, 2021

fengjiachun commented Sep 29, 2021

Excpt0r commented Oct 1, 2021

Excpt0r commented Oct 7, 2021

fengjiachun commented Oct 8, 2021

fengjiachun commented Oct 8, 2021

killme2008 commented Oct 8, 2021

fengjiachun commented Oct 9, 2021

Excpt0r commented Feb 22, 2022

fengjiachun commented Mar 15, 2022

Excpt0r commented Mar 28, 2022

Excpt0r commented Sep 9, 2021 •

edited

Loading