Multiple Test Failures from Blocked accept0 Syscalls (Debian CI Runs only) #43387

@original-brownbear

Description

The following build: https://scans.gradle.com/s/4prwec7zf6pba/ failed in a very strange way.
Seemingly all nodes of the internal test cluster keep getting stuck in accept calls on non-blocking server sockets.
The build log is full of failed connections and the following stuck-thread report:

1> [2019-06-19T16:22:22,985][WARN ][o.e.t.n.MockNioTransport ] [node_sm1] Potentially blocked execution on network thread [elasticsearch[node_sm1][transport_worker][T#2]] [2401 milliseconds]:
1> sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
1> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
1> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory$$Lambda$1624/1917852998.run(Unknown Source)
1> java.security.AccessController.doPrivileged(Native Method)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory.accept(ChannelFactory.java:223)
1> org.elasticsearch.nio.ChannelFactory$RawChannelFactory.acceptNioChannel(ChannelFactory.java:180)
1> org.elasticsearch.nio.ChannelFactory.acceptNioChannel(ChannelFactory.java:55)
1> org.elasticsearch.nio.ServerChannelContext.acceptChannels(ServerChannelContext.java:47)
1> org.elasticsearch.nio.EventHandler.acceptChannel(EventHandler.java:45)
1> org.elasticsearch.transport.nio.TestEventHandler.acceptChannel(TestEventHandler.java:51)
1> org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:227)
1> org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
1> org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
1> org.elasticsearch.nio.NioSelectorGroup$$Lambda$1545/393896253.run(Unknown Source)
1> java.lang.Thread.run(Thread.java:748)

It is not immediately clear to me how we could get into these calls blocking. It doesn't seem to be a deadlock on some selector lock, since no thread leaks are reported on the failing tests (though it could be that interrupting the node's thread pools clears up all the stuck syscalls). So far this seems to be a one-time thing as far as I can see.
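
For context, here is a minimal plain-JDK sketch of the pattern the trace walks through (a selector loop dispatching to accept on a non-blocking ServerSocketChannel). This is illustrative only, not the MockNioTransport/NioSelector code itself. The point is that accept() on a channel configured non-blocking is expected to return a connected channel or null immediately, not park the thread inside the native accept0 call:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingAcceptSketch {
    public static void main(String[] args) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(0));
        // After this call, accept() should never block in accept0:
        // it returns a SocketChannel if one is pending, or null.
        server.configureBlocking(false);

        Selector selector = Selector.open();
        server.register(selector, SelectionKey.OP_ACCEPT);

        // Simplified version of a selector run loop: block in select(),
        // then handle ready keys; only select() is supposed to park.
        while (selector.select() > 0) {
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    // Expected to complete immediately on a non-blocking channel.
                    SocketChannel accepted = server.accept();
                    if (accepted != null) {
                        accepted.close();
                    }
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
```

Given that setup, a multi-second stall inside accept0 would seem to require either the channel somehow being back in blocking mode or the syscall itself stalling in the kernel, though I have no evidence for either.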
