[Agent][Bug] DEADLINE_EXCEEDED errors cause high CPU usage (busy-spin in MpscUnboundedArrayQueue.poll) #13011

hzhaop · 2025-01-24T03:43:27Z

Description:
We encountered a situation where the SkyWalking Java agent logged repeated DEADLINE_EXCEEDED errors while trying to call ServiceManagementClient. Shortly after those errors appeared, multiple threads began consuming 100% of the CPU indefinitely. Thread dumps show these busy threads are stuck in org.apache.skywalking.apm.dependencies.io.netty.util.internal.shaded.org.jctools.queues.BaseMpscLinkedArrayQueue.poll (i.e., MpscUnboundedArrayQueue.poll) within the Netty event loop.

Below are the relevant log messages and a snippet of one such thread stack trace:

SkywalkingAgent-7-ServiceManagementClient-0 ServiceManagementClient : ServiceManagementClient execute fail. 
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 29.999930989s. [closed=[], open=[[remote_addr=/xxxxx:xxx]]]
	at org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271)
	at org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252)
	at org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165)
	at org.apache.skywalking.apm.network.management.v3.ManagementServiceGrpc$ManagementServiceBlockingStub.keepAlive(ManagementServiceGrpc.java:253)
	at org.apache.skywalking.apm.agent.core.remote.ServiceManagementClient.run(ServiceManagementClient.java:121)
	at org.apache.skywalking.apm.util.RunnableWithExceptionProtection.run(RunnableWithExceptionProtection.java:33)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

WARN 2025-01-23 23:06:12.862 DataCarrier.DEFAULT.Consumer.0.Thread GRPCStreamServiceStatus : Collector traceSegment service doesn't response in 70 seconds. 
ERROR 2025-01-23 23:06:23.940 SkywalkingAgent-7-ServiceManagementClient-0 ServiceManagementClient : ServiceManagementClient execute fail.

And one of the 100% CPU threads shows the following stack:

"grpc-nio-worker-ELG-1-2" #29 daemon prio=5 os_prio=0 tid=0x0000fffadc710800 nid=0x3f0dd5 runnable [0x0000fffabbffd000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.skywalking.apm.dependencies.io.netty.util.internal.shaded.org.jctools.queues.BaseMpscLinkedArrayQueue.poll(BaseMpscLinkedArrayQueue.java:340)
	at org.apache.skywalking.apm.dependencies.io.netty.util.internal.shaded.org.jctools.queues.MpscUnboundedArrayQueue.poll(MpscUnboundedArrayQueue.java:23)
	at org.apache.skywalking.apm.dependencies.io.netty.util.concurrent.SingleThreadEventExecutor.pollTaskFrom(SingleThreadEventExecutor.java:216)
	at org.apache.skywalking.apm.dependencies.io.netty.util.concurrent.SingleThreadEventExecutor.pollTask(SingleThreadEventExecutor.java:211)
	at org.apache.skywalking.apm.dependencies.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:483)
	at org.apache.skywalking.apm.dependencies.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at org.apache.skywalking.apm.dependencies.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.skywalking.apm.dependencies.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.skywalking.apm.dependencies.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

These threads appear to be in a busy-spin loop (RUNNABLE state) under BaseMpscLinkedArrayQueue.poll(), using 100% CPU.
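
One way to confirm from inside the JVM which threads are spinning is to sample per-thread CPU time over a short window. The following is a minimal sketch using the standard `java.lang.management.ThreadMXBean`. The class name `HotThreadSampler`, the one-second window, and the 50% threshold are illustrative assumptions and not part of the SkyWalking agent. It must run in the same JVM as the affected threads, and it assumes the JVM supports thread CPU time measurement.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: reports how much CPU time each live thread used in a window.
final class HotThreadSampler {

    /** Returns CPU nanoseconds consumed per thread id during the sampling window. */
    static Map<Long, Long> sampleCpuNanos(long windowMillis) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            // getThreadCpuTime returns -1 if the thread is dead or measurement is disabled.
            before.put(id, mx.getThreadCpuTime(id));
        }
        Thread.sleep(windowMillis);
        Map<Long, Long> delta = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            long now = mx.getThreadCpuTime(id);
            Long then = before.get(id);
            if (now >= 0 && then != null && then >= 0) {
                delta.put(id, now - then);
            }
        }
        return delta;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long windowMillis = 1000;
        for (Map.Entry<Long, Long> e : sampleCpuNanos(windowMillis).entrySet()) {
            // Share of one core used by this thread during the window.
            double pct = 100.0 * e.getValue() / (windowMillis * 1_000_000.0);
            ThreadInfo info = mx.getThreadInfo(e.getKey());
            if (info != null && pct > 50.0) {
                System.out.printf("%s state=%s cpu=%.0f%%%n",
                        info.getThreadName(), info.getThreadState(), pct);
            }
        }
    }
}
```

A thread that stays RUNNABLE and uses close to 100% of the window across repeated samples is consistent with a busy-spin rather than a short burst of work.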

Environment:

SkyWalking Agent version: 8.14.0-0b52256
Java version: openjdk 1.8.0_191
OS version: kernel: 4.19.90 aarch64

Additional Context or Screenshots:

The logs show repeated "Collector traceSegment service doesn't response in xxx seconds. ..." warnings.
Immediately afterwards, the grpc-nio-worker-ELG-* threads go into RUNNABLE state in a loop.
This behavior appears consistently once the collector side times out.
I can telnet to the OAP server port and the connection succeeds. However, after restarting the OAP server, netstat -anp shows no connection to the OAP server.
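
The telnet check above can also be done from Java, which is handy on hosts without telnet. This is a minimal sketch using only `java.net.Socket`; the class name `TcpProbe` and the timeout value are illustrative assumptions, and the host and port are taken from the command line rather than assumed. A successful connect only shows that the port is reachable at the TCP level; it says nothing about whether the gRPC service behind it is responding.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: a plain TCP reachability check, similar to "telnet host port".
final class TcpProbe {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, unreachable, or timed out.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java TcpProbe <host> <port>");
            return;
        }
        boolean ok = canConnect(args[0], Integer.parseInt(args[1]), 3000);
        System.out.println(args[0] + ":" + args[1] + (ok ? " reachable" : " not reachable"));
    }
}
```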

Could you please advise if this is a known bug or a configuration issue? Any recommended workaround or fix would be appreciated.

wu-sheng · 2025-01-24T06:45:57Z

wu-sheng
Jan 24, 2025
Collaborator

This is a kind of timeout. It means your server (OAP) can't process the data in time.
But the client is still trying to send data.

5 replies

wu-sheng Jan 24, 2025
Collaborator

What I can't understand is this: queue#poll is a very cheap operation in terms of CPU, so it shouldn't be able to consume all the CPU.

hzhaop Jan 24, 2025
Author

netty/netty#13137
netty/netty#11956

It seems there is a bug in Netty on older JDK versions and arm64?

wu-sheng Jan 24, 2025
Collaborator

Maybe. That is within Netty's scope; we don't know.

hzhaop Jan 24, 2025
Author

Starting from Netty version 4.1.112.Final, the JCTools version bundled with Netty seems to have fixed this issue (JCTools/JCTools@6e2a486). Can SkyWalking update its Netty dependency to version 4.1.112.Final or later? @wu-sheng

hzhaop Jan 24, 2025
Author

Sorry, I didn't notice that the master branch has already updated Netty to 4.1.115.Final. I will try the latest version.
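
To check whether a given Netty version string is at or above the 4.1.112.Final threshold mentioned above, a simple numeric comparison is enough. This is a minimal sketch; the class `NettyVersionCheck` and its methods are hypothetical helpers, it assumes version strings of the form `major.minor.patch.Qualifier`, and it does not show how to read the version actually bundled in the agent.

```java
// Hypothetical helper: compares Netty-style version strings such as "4.1.115.Final".
final class NettyVersionCheck {

    /** Parses the first three numeric components (major, minor, patch). */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] numbers = new int[3];
        for (int i = 0; i < 3; i++) {
            numbers[i] = Integer.parseInt(parts[i]);
        }
        return numbers;
    }

    /** Returns true if actual is the same as or newer than minimum. */
    static boolean isAtLeast(String actual, String minimum) {
        int[] a = parse(actual);
        int[] m = parse(minimum);
        for (int i = 0; i < 3; i++) {
            if (a[i] != m[i]) {
                return a[i] > m[i];
            }
        }
        return true; // all three components are equal
    }

    public static void main(String[] args) {
        // 4.1.115.Final is the version reported on the master branch in this thread.
        System.out.println(isAtLeast("4.1.115.Final", "4.1.112.Final"));
    }
}
```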
