
Nacos 2.1.0 cluster: with more than 41K Dubbo instances registered, individual nodes frequently show inconsistent instance counts; suspected off-heap (direct) memory leak #12792

Open
goto456 opened this issue Oct 29, 2024 · 5 comments



goto456 commented Oct 29, 2024

Environment

Nacos version: 2.1.0
11-node cluster; each node has 16 cores and 32 GB RAM
Registered instances: 41K (Dubbo instances)
Memory-related JVM settings: -Xms12g -Xmx12g -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=320m -XX:MaxDirectMemorySize=12g

Problem description

Individual nodes in the cluster report inconsistent instance counts (20K~30K fewer instances than the other nodes). Restarting the affected node restores consistency, but after a few days of running the mismatch reappears, as shown below:
[image: per-node registered instance counts, with one node noticeably lower than the rest]

The protocol-distro.log on the affected node contains the following exception:

2024-10-17 05:06:55,492 WARN [DISTRO] Sync data change failed.
io.grpc.StatusRuntimeException: UNKNOWN
        at io.grpc.Status.asRuntimeException(Status.java:533)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:497)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:521)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:641)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:529)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:703)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:692)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 12868124951, max: 12884901888)
        at io.grpc.netty.shaded.io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:725)
        at io.grpc.netty.shaded.io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:680)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:758)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:734)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:245)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena.allocate(PoolArena.java:227)
        at io.grpc.netty.shaded.io.netty.buffer.PoolArena.reallocate(PoolArena.java:397)
        at io.grpc.netty.shaded.io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:119)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:310)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:281)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1118)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1111)
        at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1102)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:96)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:281)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:483)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)
        at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
        at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 common frames omitted

Host memory monitoring shows the machine's available memory steadily decreasing:
[image: host available-memory trend, declining over time]
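One way to confirm whether the growth is in the direct memory tracked by the shaded Netty is to poll Netty's own counter, which is the same counter the OutOfDirectMemoryError above reports against. A minimal sketch, assuming the bundled Netty 4.1 exposes PlatformDependent.usedDirectMemory(); the DirectMemoryProbe class name is made up for illustration:

```java
import io.grpc.netty.shaded.io.netty.util.internal.PlatformDependent;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DirectMemoryProbe {
    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Periodically log the direct memory the shaded Netty has reserved vs. its limit.
        // usedDirectMemory() reflects the internal counter that triggers
        // OutOfDirectMemoryError once it reaches maxDirectMemory().
        timer.scheduleAtFixedRate(() -> {
            long used = PlatformDependent.usedDirectMemory();
            long max = PlatformDependent.maxDirectMemory();
            System.out.printf("netty direct memory: used=%d max=%d (%.1f%%)%n",
                    used, max, max > 0 ? 100.0 * used / max : 0.0);
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```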

Attempted fixes:

  1. Following the configuration recommended in the Nacos FAQ (https://nacos.io/blog/faq/nacos-user-question-history12682/), I set the JVM heap (-Xmx) to 22G and MaxDirectMemorySize to 7G. The problem still occurs, and the same exception still appears in the log.

  2. I found similar issues (io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 1006633239, max: 1020067840) #10352, and io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 2097152 byte(s) of direct memory (used: 501219607, max: 503316480) #10854) and adjusted the off-heap cache trim interval with -Dio.grpc.netty.shaded.io.netty.allocation.cacheTrimIntervalMillis=60000 (from the default of 2 min to 1 min). The problem still occurs afterwards, and the same exception still appears in the log.

Other exception logs

The nacos.log on the affected node also contains many entries like the following:

WARN Exception in onHeadersRead()
java.util.concurrent.RejectedExecutionException: Task io.grpc.internal.SerializingExecutor@4a6ee6d0 rejected from java.util.concurrent.ThreadPoolExecutor@1ebb611[Running, pool size = 256, active threads = 0, queued tasks = 16369, completed tasks = 13013149647]
        at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
        at io.grpc.internal.SerializingExecutor.schedule(SerializingExecutor.java:93)
        at io.grpc.internal.SerializingExecutor.execute(SerializingExecutor.java:86)
        at io.grpc.internal.ServerImpl$ServerTransportListenerImpl.streamCreatedInternal(ServerImpl.java:588)
        at io.grpc.internal.ServerImpl$ServerTransportListenerImpl.streamCreated(ServerImpl.java:476)
        at io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler.onHeadersRead(NettyServerHandler.java:450)
        at io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler.access$900(NettyServerHandler.java:101)
        at io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler$FrameListener.onHeadersRead(NettyServerHandler.java:807)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onHeadersRead(DefaultHttp2ConnectionDecoder.java:373)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onHeadersRead(Http2InboundFrameLogger.java:65)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader$1.processFragment(DefaultHttp2FrameReader.java:457)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readHeadersFrame(DefaultHttp2FrameReader.java:464)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:254)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:160)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:174)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:378)
        at io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
        at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:483)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)
        at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
        at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)
@KomachiSion (Collaborator)

You can upgrade and try again. Newer versions optimized the data push path: when a connection is found to have push problems, the server no longer keeps pushing to it; it first disconnects to release resources, then resumes pushing after the client reconnects.

@kabiyangyang

Could you explain the specific root cause in more detail? I've run into exactly the same problem. @KomachiSion

@woshioosm

@KomachiSion "可以升级版本后重试一下, 新版本之前对数据推送做过优化, 发现推送有问题的连接,不会一直发起推送,会先断开连接释放资源,重连后再继续推送。" 同样碰到 有具体解决的issue 或者相关 commit 可以发一下么?

@KomachiSion (Collaborator)

It's fairly involved. In short, gRPC uses off-heap memory through both a global cache and thread-local caches. If a client stops acknowledging TCP packets, the usable TCP window stays at 0, so the data gRPC tries to send keeps accumulating in off-heap buffers while waiting for the connection to become writable. That off-heap memory is only released once the window opens again and the pending requests are flushed, or once the TCP connection is closed.

When a connection fails, for example the peer hangs or there is a network fault, so earlier requests were sent but never receive a TCP ACK, all subsequent requests pile up in the TCP send queue and in off-heap buffers. The thread cache for that connection fills up, and allocation then spills over into the global cache.

The resulting symptom is exactly this: off-heap usage keeps growing until all direct memory is used up and cannot be released, and new requests then fail with a direct OOM.

In one of the last 2.2.x releases, or in 2.3, Nacos added an optimization in GrpcClient: if a connection stays in the zero-window state, it proactively closes the connection and resumes data push only after reconnecting.
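The exact Nacos patch isn't identified in this thread, but the idea can be sketched roughly: watch a channel's writability and close it if it has been unwritable (e.g. the peer's TCP window stuck at 0) for too long, so the direct buffers queued on it can be freed. This is a hypothetical illustration using plain Netty APIs, not the actual Nacos implementation:

```java
import io.netty.channel.Channel;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Illustration only: closes a channel whose outbound side has been
 * unwritable for too long, so Netty can release the direct buffers
 * queued on it; the client is expected to reconnect afterwards.
 */
public class StuckConnectionWatchdog {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void watch(Channel channel, long maxUnwritableMillis) {
        final long[] unwritableSince = { -1L };
        timer.scheduleAtFixedRate(() -> {
            if (channel.isWritable()) {
                unwritableSince[0] = -1L;                       // back to normal, reset the clock
            } else if (unwritableSince[0] < 0) {
                unwritableSince[0] = System.currentTimeMillis(); // writes started backing up
            } else if (System.currentTimeMillis() - unwritableSince[0] > maxUnwritableMillis) {
                // Closing frees the pending write queue; pushing resumes after reconnect.
                channel.close();
            }
        }, 1, 1, TimeUnit.SECONDS);
    }
}
```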

I don't remember which PR exactly; I only know the latest version should have addressed this problem. It's worth a try.

@KomachiSion (Collaborator)

The problem above is very likely because Dubbo consumers subscribe to a very large number of providers, and some consumer client runs into problems such as GC pauses, so it cannot process the data Nacos pushes and stops returning TCP ACKs. That triggers the scenario described above and in turn affects other connections and data synchronization.

Try analyzing naming-push.log to see whether a few consumer IPs are stuck in a push-failure (push-timeout) state. Restarting just those consumers can also release this part of the memory.
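A rough way to do that count is sketched below; the exact naming-push.log line format isn't shown in this thread, so the failure keywords and the IP extraction are assumptions to adjust against the real log:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Illustration only: count suspected push failures per client IP in naming-push.log. */
public class PushFailureCounter {
    private static final Pattern IP = Pattern.compile("\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b");

    public static void main(String[] args) throws IOException {
        String logPath = args.length > 0 ? args[0] : "naming-push.log";
        Map<String, Integer> failuresByIp = new TreeMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(logPath))) {
            lines.filter(l -> l.contains("fail") || l.contains("timeout"))
                 .forEach(l -> {
                     Matcher m = IP.matcher(l);
                     if (m.find()) {
                         failuresByIp.merge(m.group(), 1, Integer::sum);
                     }
                 });
        }
        // IPs with large counts are the consumers to inspect or restart first.
        failuresByIp.forEach((ip, count) -> System.out.println(ip + "\t" + count));
    }
}
```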
