In standalone mode, service startup fails with the jraft error “No leader for raft group naming_persistent_service” #12504

Closed
damonj opened this issue Aug 16, 2024 · 11 comments

@damonj

damonj commented Aug 16, 2024

After upgrading from 2.4.0.1 to 2.4.1, the service reports this error on startup:

Caused by: com.alibaba.nacos.api.exception.NacosException: failed to req API:/api//nacos/v1/ns/instance after all servers([...:6001]) tried: server is DOWNnow, detailed error message: Optional[No leader for raft group naming_persistent_service, please see logs alipay-jraft.log or naming-raft.log to see details.]
at com.alibaba.nacos.client.naming.net.NamingProxy.reqAPI(NamingProxy.java:496)
at com.alibaba.nacos.client.naming.net.NamingProxy.reqAPI(NamingProxy.java:401)
at com.alibaba.nacos.client.naming.net.NamingProxy.reqAPI(NamingProxy.java:397)
at com.alibaba.nacos.client.naming.net.NamingProxy.registerService(NamingProxy.java:212)
at com.alibaba.nacos.client.naming.NacosNamingService.registerInstance(NacosNamingService.java:207)
at com.alibaba.cloud.nacos.registry.NacosServiceRegistry.register(NacosServiceRegistry.java:64)

@KomachiSion
Collaborator

Check the corresponding logs to see why no leader was elected.

@KomachiSion
Collaborator

Did the local machine's IP change? Raft persists its metadata, so if the previous IP is no longer reachable, leader election cannot succeed.

@karsonto
Contributor

You can try deleting the nacos data folder under user.home and then starting again.

@damonj
Author

damonj commented Aug 22, 2024

Check the corresponding logs to see why no leader was elected.
2024-08-16 21:54:56,140 INFO Initializes the Raft protocol, raft-config info : {"data":{},"members":["10.1.1.2:5004"],"selfMember":"10.1.1.2:5004"}

2024-08-16 21:54:57,459 INFO ========= The raft protocol is starting... =========

2024-08-16 21:54:58,832 INFO ========= The raft protocol start finished... =========

2024-08-16 21:55:03,130 INFO create raft group : naming_persistent_service

2024-08-16 21:55:04,452 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service', leader='10.1.1.2:5004', term=1, raftClusterInfo=[10.1.1.2:5004]}

2024-08-16 21:55:04,567 INFO create raft group : naming_persistent_service_v2

2024-08-16 21:55:04,891 INFO create raft group : naming_instance_metadata

2024-08-16 21:55:04,928 INFO This Raft event changes : RaftEvent{groupId='naming_persistent_service_v2', leader='10.1.1.2:5004', term=1, raftClusterInfo=[10.1.1.2:5004]}

2024-08-16 21:55:05,262 INFO This Raft event changes : RaftEvent{groupId='naming_instance_metadata', leader='10.1.1.2:5004', term=1, raftClusterInfo=[10.1.1.2:5004]}

2024-08-16 21:55:05,267 INFO create raft group : naming_service_metadata

2024-08-16 21:55:05,552 INFO This Raft event changes : RaftEvent{groupId='naming_service_metadata', leader='10.1.1.2:5004', term=1, raftClusterInfo=[10.1.1.2:5004]}

2024-08-16 21:55:05,780 ERROR Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_persistent_service
at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:605)
at com.alipay.sofa.jraft.core.CliServiceImpl.getPeers(CliServiceImpl.java:498)
at com.alibaba.nacos.core.distributed.raft.JRaftServer.registerSelfToCluster(JRaftServer.java:353)
at com.alibaba.nacos.core.distributed.raft.JRaftServer.lambda$createMultiRaftGroup$0(JRaftServer.java:264)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2024-08-16 21:57:59,048 INFO shutdown jraft server

@KomachiSion
Collaborator

Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_persistent_service

The metadata contains 10.1.1.2:5004. Check your local IP; it is probably not this one.

@damonj
Author

damonj commented Aug 23, 2024

Failed to join the cluster, retry...

java.lang.IllegalStateException: Fail to get leader of group naming_persistent_service

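The metadata contains 10.1.1.2:5004. Check your local IP; it is probably not this one.

The IP is fine. After rolling back to 2.4.0.1 the error no longer occurs; however, there are also environments where the upgrade to 2.4.1 succeeded.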

@KomachiSion
Collaborator

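The log is actually quite clear. At startup, the node found 10.1.1.2:5004 listed as leader in the metadata, so it tried to add itself to the cluster. Joining the cluster has to be written into the metadata through the leader, but the join failed because no leader was found, which contradicts the metadata that was loaded.

From this we can conclude that at that moment the node could not connect to 10.1.1.2:5004 to fetch the latest group information and metadata, and there was no leader at that IP to renew heartbeats. In the end no leader was found and the node did not join the cluster.

You can follow @karsonto's suggestion: remove the local data directory and retry. Also take another look at the alipay-jraft log; you may find that the new leader IP or port shown there differs from the earlier one.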
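The IP comparison suggested above can be sketched in Python. This is only an illustration, not part of Nacos: the function names are hypothetical, the regular expression is derived from the single `raft-config info` line posted earlier in this thread, and the set of local IPs is passed in by the caller rather than detected.

```python
import json
import re

# Pattern for the startup line shown in the log above. The JSON layout
# (members / selfMember) is assumed from that one sample line.
RAFT_CONFIG_RE = re.compile(r"raft-config info : (\{.*\})")


def parse_raft_config(log_line):
    """Return (self_member, members) from a 'raft-config info' line, or None."""
    match = RAFT_CONFIG_RE.search(log_line)
    if match is None:
        return None
    config = json.loads(match.group(1))
    return config["selfMember"], config["members"]


def self_member_matches(self_member, local_ips):
    """True if the host part of 'ip:port' is one of the given local IPs."""
    host, _, _port = self_member.rpartition(":")
    return host in local_ips


line = ('2024-08-16 21:54:56,140 INFO Initializes the Raft protocol, raft-config info : '
        '{"data":{},"members":["10.1.1.2:5004"],"selfMember":"10.1.1.2:5004"}')
self_member, members = parse_raft_config(line)
# The local IP set here is made up for the demo; supply the machine's real addresses.
print(self_member, members, self_member_matches(self_member, {"10.1.1.2", "127.0.0.1"}))
```

If the check returns False for the addresses the machine currently has, that would be consistent with the stale-metadata explanation; if it returns True, as the reporter says is the case here, the cause lies elsewhere.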

@akinlau

akinlau commented Aug 28, 2024

I also upgraded, from 2.3.2 to 2.4.1, and alipay-jraft.log keeps reporting errors:
2024-08-28 12:38:44,979 INFO Node <naming_persistent_service_v2/192.168.1.2:8895> term 0 start preVote.

2024-08-28 12:38:44,980 WARN Node <naming_persistent_service_v2/192.168.1.2:8895> PreVote to 192.168.1.3:8895 error: Status[ENOENT<1012>: Peer id not found: 192.168.1.3:8895, group: naming_persistent_service_v2].

2024-08-28 12:38:44,980 WARN Node <naming_persistent_service_v2/192.168.1.2:8895> PreVote to 192.168.1.4:8895 error: Status[ENOENT<1012>: Peer id not found: 192.168.1.4:8895, group: naming_persistent_service_v2].

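I tried deleting every file under the data directory and restarting, with the same result. Querying the state through the API reports that the server is down: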
curl -X GET 'http://192.168.1.2:9895/nacos/v1/ns/raft/state'
server is DOWNnow, detailed error message: Optional[No leader for raft group naming_persistent_service, please see logs alipay-jraft.log or naming-raft.log to see details.]

Rolling back to 2.3.2 works normally.
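To see at a glance which peers are rejecting pre-votes, the WARN lines above can be grouped by raft group. The following Python sketch assumes the line format is exactly the one in the excerpt posted here; the function name and the regular expression are illustrative, not part of jraft or Nacos.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ... WARN Node <group/ip:port> PreVote to ip:port error: Status[ENOENT<1012>: ...
PREVOTE_ERROR_RE = re.compile(
    r"Node <(?P<group>[^/]+)/(?P<node>[^>]+)> PreVote to (?P<peer>\S+) error: "
    r"Status\[(?P<code>\w+)<\d+>"
)


def failed_prevote_peers(log_lines):
    """Map each raft group to a sorted list of (peer, status_code) pre-vote failures."""
    failures = defaultdict(set)
    for line in log_lines:
        match = PREVOTE_ERROR_RE.search(line)
        if match:
            failures[match.group("group")].add((match.group("peer"), match.group("code")))
    return {group: sorted(peers) for group, peers in failures.items()}


sample = [
    "2024-08-28 12:38:44,979 INFO Node <naming_persistent_service_v2/192.168.1.2:8895> term 0 start preVote.",
    "2024-08-28 12:38:44,980 WARN Node <naming_persistent_service_v2/192.168.1.2:8895> PreVote to 192.168.1.3:8895 error: Status[ENOENT<1012>: Peer id not found: 192.168.1.3:8895, group: naming_persistent_service_v2].",
    "2024-08-28 12:38:44,980 WARN Node <naming_persistent_service_v2/192.168.1.2:8895> PreVote to 192.168.1.4:8895 error: Status[ENOENT<1012>: Peer id not found: 192.168.1.4:8895, group: naming_persistent_service_v2].",
]
print(failed_prevote_peers(sample))
```

Repeated failures collapse into one entry per peer, so a long log reduces to a short list of peers and status codes to investigate.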

@wsldl123292

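The IP is fine. After rolling back to 2.4.0.1 the error no longer occurs; however, there are also environments where the upgrade to 2.4.1 succeeded.

I see the same behavior. I upgraded Nacos to 2.4.1 in three places: two succeeded, and one hit this same error. Rolling back to 2.4.0 fixed it.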

@KomachiSion
Collaborator

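#12573 improved the Server Status validation logic: requests to interfaces that do not use raft are no longer intercepted outright, which keeps the core functionality available.

However, if jraft remains unable to elect a leader, the features that depend on raft will still be broken and unusable. The raft leader election problem itself has to be fixed manually.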

@KomachiSion closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 2, 2024
@ltjfk

ltjfk commented Nov 19, 2024

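Root cause analysis:
This error stems from the Raft algorithm that Nacos uses. Raft elects a leader and records the cluster addresses from the previous startup. If the server's IP address changes, the cluster addresses recorded by Raft become invalid, so a leader cannot be elected correctly.

Solution:
Delete the protocol folder inside the data folder under the Nacos root directory to clear the stale cluster address records.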
————————————————

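Copyright notice: this is an original article by the blogger, licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.

Original link: https://blog.csdn.net/m0_47256162/article/details/142651875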
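
The cleanup suggested in this thread (clearing the protocol folder under data) can be sketched in Python as follows. Two things here are assumptions rather than anything stated above: the helper renames the folder to a timestamped backup instead of deleting it, so the old raft metadata can be restored, and it expects Nacos to be stopped before it runs. The function name and the example path are hypothetical.

```python
import shutil
import time
from pathlib import Path


def move_protocol_dir_aside(nacos_home):
    """Rename <nacos_home>/data/protocol to a timestamped backup.

    Returns the backup path, or None if there was no protocol directory.
    Assumes the Nacos process has already been stopped.
    """
    protocol_dir = Path(nacos_home) / "data" / "protocol"
    if not protocol_dir.is_dir():
        return None
    backup = protocol_dir.with_name("protocol.bak-" + time.strftime("%Y%m%d-%H%M%S"))
    if backup.exists():
        # Avoid moving into an existing backup created within the same second.
        raise FileExistsError(str(backup))
    shutil.move(str(protocol_dir), str(backup))
    return backup


# Example call (the path is a placeholder; use the actual Nacos root directory):
# move_protocol_dir_aside("/opt/nacos")
```

Note that one commenter above reports that clearing the data directory did not help in their case, so this addresses stale metadata only.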
