Labels: :Distributed Coordination/Network, >test-failure
Description
The following build failed in a very strange way: https://scans.gradle.com/s/4prwec7zf6pba/
Seemingly all nodes of the internal test cluster keep getting stuck in accept calls on non-blocking server sockets.
The build log is full of failed connections and the following stuck-thread report:
1> [2019-06-19T16:22:22,985][WARN ][o.e.t.n.MockNioTransport ] [node_sm1] Potentially blocked execution on network thread [elasticsearch[node_sm1][transport_worker][T#2]] [2401 milliseconds]:
--
1> sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
1> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
1> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory$$Lambda$1624/1917852998.run(Unknown Source)
1> java.security.AccessController.doPrivileged(Native Method)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory.accept(ChannelFactory.java:223)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory.acceptNioChannel(ChannelFactory.java:180)
1> org.elasticsearch.nio.ChannelFactory.acceptNioChannel(ChannelFactory.java:55)
1> org.elasticsearch.nio.ServerChannelContext.acceptChannels(ServerChannelContext.java:47)
1> org.elasticsearch.nio.EventHandler.acceptChannel(EventHandler.java:45)
1> org.elasticsearch.transport.nio.TestEventHandler.acceptChannel(TestEventHandler.java:51)
1> org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:227)
1> org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
1> org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
1> org.elasticsearch.nio.NioSelectorGroup$$Lambda$1545/393896253.run(Unknown Source)
1> java.lang.Thread.run(Thread.java:748)
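For context, the warning above is produced by the test transport's blocked-thread check, which flags a network thread that has spent too long inside a single handler. The following is only a rough sketch of that general technique, not the actual MockNioTransport/TestEventHandler code; the class and constant names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog illustrating how a "potentially blocked execution" warning
// can be produced: record when a thread enters a handler, and periodically check
// how long any thread has been stuck inside one.
public final class BlockedHandlerWatchdog {
    private static final long WARN_THRESHOLD_MILLIS = 1000; // assumed threshold

    // thread -> wall-clock time at which it entered a handler
    private final Map<Thread, Long> inFlight = new ConcurrentHashMap<>();
    private final ScheduledExecutorService checker = Executors.newSingleThreadScheduledExecutor();

    public BlockedHandlerWatchdog() {
        checker.scheduleAtFixedRate(this::check, 1, 1, TimeUnit.SECONDS);
    }

    // Wrap a handler (e.g. an accept or read callback) so its execution time is tracked.
    public void run(Runnable handler) {
        Thread current = Thread.currentThread();
        inFlight.put(current, System.currentTimeMillis());
        try {
            handler.run();
        } finally {
            inFlight.remove(current);
        }
    }

    private void check() {
        long now = System.currentTimeMillis();
        inFlight.forEach((thread, start) -> {
            long elapsed = now - start;
            if (elapsed > WARN_THRESHOLD_MILLIS) {
                StringBuilder sb = new StringBuilder("Potentially blocked execution on network thread [")
                        .append(thread.getName()).append("] [").append(elapsed).append(" milliseconds]:");
                for (StackTraceElement frame : thread.getStackTrace()) {
                    sb.append("\n    ").append(frame);
                }
                System.err.println(sb);
            }
        });
    }
}
```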
It is not immediately clear to me how we could get into these calls blocking. It doesn't seem to be a deadlock on some selector lock, since no thread leaks are reported for the failing tests (though it could be that interrupting the node's thread pools clears up all the stuck sys calls). So far this seems to be a one-time thing as far as I can see.
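What makes the trace surprising is that on a channel configured as non-blocking, accept() is documented to return immediately (null if no connection is pending) rather than park the calling thread. A minimal standalone sketch of the expected behavior, independent of the Elasticsearch NIO wrappers:

```java
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingAcceptDemo {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);                 // non-blocking: accept() must not block
        server.bind(new InetSocketAddress("127.0.0.1", 0));

        // With no pending connection, a non-blocking accept() is expected to
        // return null immediately instead of blocking in the native accept call.
        SocketChannel accepted = server.accept();
        System.out.println("accept() returned: " + accepted);   // typically "null"

        server.close();
    }
}
```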