java.io.IOException: No file descriptors available #43

Closed
eskp opened this issue Apr 8, 2019 · 9 comments
Labels
bug Something isn't working

Comments


eskp commented Apr 8, 2019

We are seeing the following log after some time of operation (around 3 hours). We already raised the file descriptor limit to 8192 from the default 1024.
The fact that KafkaHQ can no longer resolve our Kafka hosts' DNS after the stack trace is a bit strange and suggests that the high number of file descriptors is related to DNS querying. Maybe those lookups are not being cached or something similar?

Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: 2019-04-08 22:48:22,523 WARN  pGroup-1-1 n.c.DefaultChannelPipeline An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: java.io.IOException: No file descriptors available
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.util.internal.SocketUtils$5.run(SocketUtils.java:110)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.util.internal.SocketUtils$5.run(SocketUtils.java:107)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at java.security.AccessController.doPrivileged(Native Method)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.util.internal.SocketUtils.accept(SocketUtils.java:107)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:143)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:75)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
Apr 08 15:48:22 ip-10-0-131-100 kafkahq-service: at java.lang.Thread.run(Thread.java:748)
Apr 08 15:48:23 ip-10-0-131-100 kafkahq-service: 2019-04-08 22:48:23,530 WARN  1-thread-9 o.a.k.clients.ClientUtils  Couldn't resolve server kafka.io:9092 from bootstrap.servers as DNS resolution failed for kafka.io
Apr 08 15:48:23 ip-10-0-131-100 kafkahq-service: 2019-04-08 22:48:23,531 WARN  1-thread-9 o.a.k.clients.ClientUtils  Couldn't resolve server kafka2.io:9092 from bootstrap.servers as DNS resolution failed for kafka2.io

eskp commented Apr 9, 2019

Furthermore, looking inside the container, I can see there are about 2600 file descriptors open right now for the java process, after about an hour of operation:

# ls -l /proc/<PID>/fd | wc -l
2648
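
If it helps to watch this figure over time, here is a minimal sketch (not part of KafkaHQ; the class name is made up) that logs the same count from inside the JVM on Linux by listing /proc/self/fd:

import java.io.File;

// Count open file descriptors of the current JVM process on Linux,
// mirroring the manual `ls -l /proc/<PID>/fd | wc -l` check above.
public class OpenFdCount {
    public static void main(String[] args) {
        String[] entries = new File("/proc/self/fd").list();
        // -1 means /proc was not readable (e.g. not on Linux)
        System.out.println("open fds: " + (entries == null ? -1 : entries.length));
    }
}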

Looking at the netstat output, it looks like the connections are not getting cleaned up:

...
tcp        1      0 172.17.0.5:48510        10.1.112.5:9092         CLOSE_WAIT  28/java
tcp        0      0 172.17.0.5:43066        10.1.111.5:9092         ESTABLISHED 28/java
tcp        1      0 172.17.0.5:58176        10.1.112.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:41282        10.1.110.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:58616        10.1.111.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:37968        10.1.112.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:34994        10.1.111.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:36620        10.1.112.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:37474        10.1.111.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:50596        10.1.111.5:9092         CLOSE_WAIT  28/java
tcp        1      0 172.17.0.5:45030        10.1.110.5:9092         CLOSE_WAIT  28/java
...

There are about 1700 of these:

# netstat -pant | grep java | wc -l
1751
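
For what it's worth, CLOSE_WAIT means the broker has already closed its end of the connection while the local process has not yet called close(), so each of these sockets keeps a file descriptor allocated until the JVM closes it or exits. Below is a minimal sketch of the two patterns (assumed for illustration only, not KafkaHQ's actual code) using the Kafka AdminClient:

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class LeakDemo {
    // Leaky pattern: a new client (and its broker sockets) per request, never closed.
    // Once the broker drops the idle connection, the socket sits in CLOSE_WAIT.
    static void leaky(Properties props) throws Exception {
        AdminClient client = AdminClient.create(props);
        client.describeCluster().nodes().get();
        // missing client.close()
    }

    // Clean pattern: try-with-resources closes the client and releases its sockets.
    static void clean(Properties props) throws Exception {
        try (AdminClient client = AdminClient.create(props)) {
            client.describeCluster().nodes().get();
        }
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1.io:9092");
        clean(props);
        leaky(props);
    }
}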

@tchiotludo tchiotludo added the question Further information is requested label Apr 9, 2019
tchiotludo (Owner) commented:

Really strange behavior!

Can you tell me a bit more?

  • What is the environment (k8s, Docker, Compose, ...)?
  • Where are the Kafka brokers located? (same host, ...)
  • Can you share a KafkaHQ configuration file snippet?
  • What is the result of a simple DNS resolution inside the container?
  • Some metrics about the Kafka cluster (number of topics × partitions, number of consumer groups)?

Have you been able to display the UI successfully?

The exception is raised directly by the Kafka client, so I think a plain Java API client (a simple consumer, for example) would lead to the same exception.

KafkaHQ keeps a connection cache: once connected, we keep the connection open.
Maybe this persistence can lead to a high number of open files if an exception occurs during connection.

But to be honest, it's really strange; I've had a KafkaHQ instance running for weeks without any restart and without this kind of error.
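
For illustration only (this is not the actual KafkaHQ source; the names are made up), a per-cluster connection cache of the kind described might look like the sketch below. Kept healthy, it holds exactly one long-lived client per cluster; a leak appears if requests create extra clients outside the cache, or if failed entries are replaced without being closed:

import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.admin.AdminClient;

public class ConnectionCache {
    private final Map<String, AdminClient> clients = new ConcurrentHashMap<>();

    // One long-lived AdminClient per cluster key, created lazily and reused.
    public AdminClient get(String clusterKey, Properties props) {
        return clients.computeIfAbsent(clusterKey, k -> AdminClient.create(props));
    }

    // Close every cached client on shutdown so their sockets are released.
    public void shutdown() {
        clients.values().forEach(AdminClient::close);
        clients.clear();
    }
}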


eskp commented Apr 10, 2019

  • Running inside AWS ECS in a container
  • 3 Kafka nodes, running in a different subnet. These are specified in bootstrap.servers and all show up in the UI with their IP addresses
  • Able to resolve the Kafka hosts and make a connection from the command line inside the container
  • KafkaHQ config:
kafkahq:
  server:
    access-log:
      enabled: false
  connections:
    kafka-cluster:
      properties:
        bootstrap.servers: "kafka1.io:9092,kafka2.io:9092,kafka3:9092"
  security:
    default-roles:
      - topic/read
      - node/read
      - topic/data/read
      - group/read
      - registry/read
    basic-auth:
      user:
        password: pass
        roles:
          - topic/read
          - topic/insert
          - topic/delete
          - topic/config/update
          - node/read
          - node/config/update
          - topic/data/read
          - topic/data/insert
          - topic/data/delete
          - group/read
          - group/delete
          - group/offsets/update
          - registry/read
          - registry/insert
          - registry/update
          - registry/delete
          - registry/version/delete
  • 16 consumer groups, 42 topics
  • We successfully see the UI, but over a period of 3 hours the memory and file descriptor counts grow. After that, stack traces appear in the UI and in the container logs saying no more file descriptors are available. Memory usage flattens out after this.


eskp commented Apr 10, 2019

Something is not getting cleaned up. Testing from a local container, the file descriptor count and memory usage grow on each page refresh. While the page is open and idle, all the counts stay the same. The reason it was maxing out in ECS is that there is a health check pinging the service every 15s. So file descriptors grow, never get cleaned up, and eventually the service crashes.
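
That timeline is consistent with a per-request leak. A back-of-envelope check (the per-request figure below is an assumption, not a measurement): with the health check firing every 15s and roughly 10 descriptors leaked per page load, an 8192 descriptor limit runs out in about three and a half hours, in the same ballpark as the ~3 hours reported above.

// Hypothetical estimate only; leakedPerRequest is an assumed value.
public class LeakEstimate {
    public static void main(String[] args) {
        int fdLimit = 8192;                    // raised ulimit from the first comment
        double leakedPerRequest = 10;          // assumed descriptors leaked per page load
        double requestsPerHour = 3600.0 / 15;  // ECS health check every 15 seconds
        double hours = fdLimit / (leakedPerRequest * requestsPerHour);
        System.out.printf("estimated hours until exhaustion: %.1f%n", hours);
    }
}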

tchiotludo (Owner) commented:

Thanks for the feedback, especially the health check information. I will have a look at it ASAP.

tchiotludo (Owner) commented:

Clearly got it, we have a real leak on every refresh 😢
Still searching for the cause for now, since the connections are created in the Kafka client thread ...

@tchiotludo tchiotludo added bug Something isn't working and removed question Further information is requested labels Apr 10, 2019
tchiotludo (Owner) commented:

It should be fixed in the latest version, 0.5.2!
Thanks for the report!


eskp commented Apr 11, 2019

Thanks a lot for the quick turnaround, and for a great application.

tchiotludo (Owner) commented:

@eskp, for your information, there is now a proper healthcheck endpoint: https://github.com/tchiotludo/kafkahq#monitoring-endpoint

ghost pushed a commit that referenced this issue Jun 19, 2020

AKHQ-12 - Topics - Data - Messages List - Backend