java.io.IOException: No file descriptors available #43

Closed
eskp opened this issue Apr 8, 2019 · 9 comments
Labels
bug Something isn't working

Comments


eskp commented Apr 8, 2019

We are seeing the following log after some time of operation (around 3 hours). We already raised the file descriptor limit to 8192 from the default 1024.
The fact that KafkaHQ can no longer resolve our Kafka hosts' DNS after the stack trace is a bit strange and suggests that the high number of file descriptors is related to DNS querying. Maybe those lookups are not being cached or something similar?

Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: 2019-04-08 22:48:22,523 WARN  pGroup-1-1 n.c.DefaultChannelPipeline An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: java.io.IOException: No file descriptors available
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.util.internal.SocketUtils$5.run(SocketUtils.java:110)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.util.internal.SocketUtils$5.run(SocketUtils.java:107)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at java.security.AccessController.doPrivileged(Native Method)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.util.internal.SocketUtils.accept(SocketUtils.java:107)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:143)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:75)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at java.lang.Thread.run(Thread.java:748)
Apr 08 15:48:23 ip-10-0-131-100 kafkahq-service: 2019-04-08 22:48:23,530 WARN  1-thread-9 o.a.k.clients.ClientUtils  Couldn't resolve server kafka.io:9092 from bootstrap.servers as DNS resolution failed for kafka.io
Apr 08 15:48:23 ip-10-0-131-100 kafkahq-service: 2019-04-08 22:48:23,531 WARN  1-thread-9 o.a.k.clients.ClientUtils  Couldn't resolve server kafka2.io:9092 from bootstrap.servers as DNS resolution failed for kafka2.io

eskp commented Apr 9, 2019

Furthermore, looking inside the container, I can see there are about 2600 file descriptors open right now for the java process, after about an hour of operation:

# ls -l /proc/<PID>/fd | wc -l
2648
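
If it helps to watch this figure over time, here is a minimal sketch (not part of KafkaHQ; the class name is made up) that logs the same count from inside the JVM on Linux by listing /proc/self/fd:

import java.io.File;

// Count open file descriptors of the current JVM process on Linux,
// mirroring the manual `ls -l /proc/<PID>/fd | wc -l` check above.
public class OpenFdCount {
    public static void main(String[] args) {
        String[] entries = new File("/proc/self/fd").list();
        // -1 means /proc was not readable (e.g. not on Linux)
        System.out.println("open fds: " + (entries == null ? -1 : entries.length));
    }
}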

Looking at the netstat output, it looks like the connections are not getting cleaned up:

...
tcp        1      0 172.17.0.5:48510        10.1.112.5:9092         CLOSE_WAIT  28/java
tcp        0      0 172.17.0.5:43066        10.1.111.5:9092         ESTABLISHED 28/java
tcp        1      0 172.17.0.5:58176        10.1.112.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:41282        10.1.110.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:58616        10.1.111.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:37968        10.1.112.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:34994        10.1.111.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:36620        10.1.112.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:37474        10.1.111.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:50596        10.1.111.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:45030        10.1.110.5:9092         CLOSE_WAIT  28/java
...

There are about 1700 of these:

# netstat -pant | grep java | wc -l
1751
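
For what it's worth, CLOSE_WAIT means the broker has already closed its end of the connection while the local process has not yet called close(), so each of these sockets keeps a file descriptor allocated until the JVM closes it or exits. Below is a minimal sketch of the two patterns (assumed for illustration only, not KafkaHQ's actual code) using the Kafka AdminClient:

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class LeakDemo {
    // Leaky pattern: a new client (and its broker sockets) per request, never closed.
    // Once the broker drops the idle connection, the socket sits in CLOSE_WAIT.
    static void leaky(Properties props) throws Exception {
        AdminClient client = AdminClient.create(props);
        client.describeCluster().nodes().get();
        // missing client.close()
    }

    // Clean pattern: try-with-resources closes the client and releases its sockets.
    static void clean(Properties props) throws Exception {
        try (AdminClient client = AdminClient.create(props)) {
            client.describeCluster().nodes().get();
        }
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1.io:9092");
        clean(props);
        leaky(props);
    }
}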

@tchiotludo tchiotludo added the question Further information is requested label Apr 9, 2019
tchiotludo (Owner) commented:

Really strange behavior!

Can you tell me a bit more?

  • What is the environment (k8s, Docker, Compose, ...)?
  • Where are the Kafka brokers located? (same host, ...)
  • Can you share a KafkaHQ configuration file snippet?
  • What is the result of a simple DNS resolution inside the container?
  • Some metrics about the Kafka cluster (number of topics × partitions, number of consumer groups)?

Have you been able to display the UI successfully?

The exception is raised directly by the Kafka client, so I think a plain Java API client (a simple consumer, for example) would lead to the same exception.

KafkaHQ keeps a connection cache: once connected, we keep the connection open.
Maybe this persistence can lead to a high number of open files if an exception occurs during connection.

But to be honest, it's really strange; I've had a KafkaHQ instance running for weeks without any restart and without this kind of error.
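
For illustration only (this is not the actual KafkaHQ source; the names are made up), a per-cluster connection cache of the kind described might look like the sketch below. Kept healthy, it holds exactly one long-lived client per cluster; a leak appears if requests create extra clients outside the cache, or if failed entries are replaced without being closed:

import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.admin.AdminClient;

public class ConnectionCache {
    private final Map<String, AdminClient> clients = new ConcurrentHashMap<>();

    // One long-lived AdminClient per cluster key, created lazily and reused.
    public AdminClient get(String clusterKey, Properties props) {
        return clients.computeIfAbsent(clusterKey, k -> AdminClient.create(props));
    }

    // Close every cached client on shutdown so their sockets are released.
    public void shutdown() {
        clients.values().forEach(AdminClient::close);
        clients.clear();
    }
}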


eskp commented Apr 10, 2019

  • Running inside AWS ECS in a container
  • 3 Kafka nodes, running in a different subnet. These are specified in bootstrap.servers and all show up in the UI with their IP addresses
  • Able to resolve the Kafka hosts and make a connection from the command line inside the container
  • KafkaHQ config:
kafkahq:
  server:
    access-log:
      enabled: false
  connections:
    kafka-cluster:
      properties:
        bootstrap.servers: "kafka1.io:9092,kafka2.io:9092,kafka3:9092"
  security:
    default-roles:
      - topic/read
      - node/read
      - topic/data/read
      - group/read
      - registry/read
    basic-auth:
      user:
        password: pass
        roles:
          - topic/read
          - topic/insert
          - topic/delete
          - topic/config/update
          - node/read
          - node/config/update
          - topic/data/read
          - topic/data/insert
          - topic/data/delete
          - group/read
          - group/delete
          - group/offsets/update
          - registry/read
          - registry/insert
          - registry/update
          - registry/delete
          - registry/version/delete
  • 16 consumer groups, 42 topics
  • We successfully see the UI, but over a period of 3 hours the memory and file descriptor counts grow. After that, stack traces appear in the UI and in the container logs saying no more file descriptors are available. Memory usage flattens out after this.


eskp commented Apr 10, 2019

Something is not getting cleaned up. Testing from a local container, the file descriptor count and memory usage grow on each page refresh. While the page is open and idle, all the counts stay the same. The reason it was maxing out in ECS is that there is a health check pinging the service every 15s. So file descriptors grow, never get cleaned up, and eventually the service crashes.
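
That timeline is consistent with a per-request leak. A back-of-envelope check (the per-request figure below is an assumption, not a measurement): with the health check firing every 15s and roughly 10 descriptors leaked per page load, an 8192 descriptor limit runs out in about three and a half hours, in the same ballpark as the ~3 hours reported above.

// Hypothetical estimate only; leakedPerRequest is an assumed value.
public class LeakEstimate {
    public static void main(String[] args) {
        int fdLimit = 8192;                    // raised ulimit from the first comment
        double leakedPerRequest = 10;          // assumed descriptors leaked per page load
        double requestsPerHour = 3600.0 / 15;  // ECS health check every 15 seconds
        double hours = fdLimit / (leakedPerRequest * requestsPerHour);
        System.out.printf("estimated hours until exhaustion: %.1f%n", hours);
    }
}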

tchiotludo (Owner) commented:

Thanks for the feedback, especially the health check information. I will have a look at it ASAP.

tchiotludo (Owner) commented:

Clearly got it, we have a real leak on every refresh 😢
Still searching for the cause for now, since the connections are created in the Kafka client thread ...

@tchiotludo tchiotludo added bug Something isn't working and removed question Further information is requested labels Apr 10, 2019
tchiotludo (Owner) commented:

It should be fixed in the latest version, 0.5.2!
Thanks for the report!


eskp commented Apr 11, 2019

Thanks a lot for the quick turnaround, and for a great application.

tchiotludo (Owner) commented:

@eskp, for your information, there is now a proper healthcheck endpoint: https://github.com/tchiotludo/kafkahq#monitoring-endpoint

ghost pushed a commit that referenced this issue Jun 19, 2020

AKHQ-12 - Topics - Data - Messages List - Backend