Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jmxfetch OutOfMemoryError being suppressed #519

Open
jk2l opened this issue Apr 17, 2024 · 13 comments
Open

Jmxfetch OutOfMemoryError being suppressed #519

jk2l opened this issue Apr 17, 2024 · 13 comments

Comments

@jk2l
Copy link

jk2l commented Apr 17, 2024

It seem that the system at some point failed and it never able to recover. the only resolution at the moment is to restart datadog-agent. I am suspecting the bug is caused by #432 where during the clean up this.connection set to null but the instance never re-initialize the connection.

I am not familiar with Java as i am not java developer, but i am thinking will replacing connection.getMBeanInfo to this.getConnection().getMBeanInfo fix this issue. but the getConnection() probably need to set this.connection too

2024-04-16 05:15:51 NZST | JMX | INFO | Instance | Refreshing bean list for jboss_wildfly
2024-04-16 05:16:51 NZST | JMX | WARN | App | Unable to collect metrics or refresh bean list.: java.util.concurrent.CancellationException
2024-04-16 05:16:51 NZST | JMX | WARN | App | Instance jboss_hcn didn't return any metrics. Maybe the server got disconnected ? Trying to reconnect.
2024-04-16 05:16:51 NZST | JMX | WARN | Instance | Cannot get attributes or class name for bean jboss.as:subsystem=infinispan,cache-container=hibernate,cache=hcn.ear/hcn-codetabledata-jpa#CodeTableJPAUnit.nz.govt.cyf.hcn.server.entity.codetables.PlanDurationType:
java.lang.NullPointerException: Cannot invoke "org.datadog.jmxfetch.Connection.getMBeanInfo(javax.management.ObjectName)" because "this.connection" is null
at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:573)
at org.datadog.jmxfetch.Instance.getMetrics(Instance.java:482)
at org.datadog.jmxfetch.MetricCollectionTask.call(MetricCollectionTask.java:26)
at org.datadog.jmxfetch.MetricCollectionTask.call(MetricCollectionTask.java:8)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)

Tested on:

  • 0.48.0
  • 0.49.0

========
JMXFetch

Information

runtime_version : 17.0.9
version : 0.48.0

Initialized checks

jboss_wildfly
- instance_name: jboss_hcn
  metric_count: 0
  service_check_count: 0
  message: Unable to instantiate or initialize instance jboss_hcn for an unknown reason.null
  status: ERROR
- instance_name: jboss_hcn
  metric_count: 0
  service_check_count: 0
  message: Unable to instantiate or initialize instance jboss_hcn for an unknown reason.null
  status: ERROR

Failed checks

no checks

one side effect happen is that i see two instances when it start to fail. but if it is running normally, it only show one instance

@jk2l
Copy link
Author

jk2l commented Apr 17, 2024

please note that this have been broken for all of our production environment recently when we updated our system to use 0.48.0 jmx patch

@jk2l
Copy link
Author

jk2l commented Apr 17, 2024

Also there were only one "Trying to reconnect.", it never happened again even if there is continuous failing

@jk2l
Copy link
Author

jk2l commented Apr 17, 2024

after the instance failed a while, there are new type of error messags

2024-04-17 17:05:58 NZST | JMX | WARN | App | No instance could be initiated. Retrying initialization.
2024-04-17 17:05:58 NZST | JMX | INFO | App | Found 0 config files
2024-04-17 17:05:58 NZST | JMX | INFO | App | Cleaning up instances...
2024-04-17 17:05:58 NZST | JMX | INFO | InstanceCleanupTask | Trying to cleanup: jboss_hcn
2024-04-17 17:05:58 NZST | JMX | INFO | App | Dealing with YAML config instances...
2024-04-17 17:05:58 NZST | JMX | INFO | App | Dealing with Auto-Config instances collected...
2024-04-17 17:05:58 NZST | JMX | INFO | App | Instantiating instance for: jboss_wildfly
2024-04-17 17:05:59 NZST | JMX | INFO | App | Started instance initialization...
2024-04-17 17:05:59 NZST | JMX | INFO | Instance | Trying to connect to JMX Server at jboss_hcn
2024-04-17 17:05:59 NZST | JMX | INFO | Instance | Connection closed or does not exist. Attempting to create a new connection...
2024-04-17 17:05:59 NZST | JMX | INFO | ConnectionFactory | Connecting using JMX Remote

2024-04-17 17:05:59 NZST | JMX | INFO | Connection | Connecting to: service:jmx:remote+http://10.190.196.69:9990
2024-04-17 17:05:59 NZST | JMX | INFO | App | Completed instance initialization...
2024-04-17 17:05:59 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.ExecutionException: java.nio.channels.ClosedSelectorException
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.datadog.jmxfetch.App.processRecoveryResults(App.java:998)
at org.datadog.jmxfetch.App$5.invoke(App.java:944)
at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
at org.datadog.jmxfetch.App.init(App.java:934)
at org.datadog.jmxfetch.App.start(App.java:421)
at org.datadog.jmxfetch.App.run(App.java:179)
at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
at org.datadog.jmxfetch.App.main(App.java:92)
Caused by: java.nio.channels.ClosedSelectorException
at java.base/sun.nio.ch.SelectorImpl.ensureOpen(SelectorImpl.java:80)
at java.base/sun.nio.ch.SelectorImpl.implRegister(SelectorImpl.java:239)
at java.base/sun.nio.ch.SelectorImpl.register(SelectorImpl.java:215)
at java.base/java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:236)
at java.base/java.nio.channels.SelectableChannel.register(SelectableChannel.java:260)
at org.xnio.nio.WorkerThread.registerChannel(WorkerThread.java:743)
at org.xnio.nio.WorkerThread.openTcpStreamConnection(WorkerThread.java:262)
at org.xnio.XnioIoThread.internalOpenTcpStreamConnection(XnioIoThread.java:254)
at org.xnio.XnioIoThread.openStreamConnection(XnioIoThread.java:226)
at org.xnio.XnioWorker.openStreamConnection(XnioWorker.java:405)
at org.jboss.remoting3.remote.RemoteConnectionProvider.createSslConnection(RemoteConnectionProvider.java:246)
at org.jboss.remoting3.remote.HttpUpgradeConnectionProvider.createSslConnection(HttpUpgradeConnectionProvider.java:135)
at org.jboss.remoting3.remote.RemoteConnectionProvider.connect(RemoteConnectionProvider.java:206)
at org.jboss.remoting3.EndpointImpl.connect(EndpointImpl.java:649)
at org.jboss.remoting3.EndpointImpl.connect(EndpointImpl.java:561)
at org.jboss.remoting3.EndpointImpl.connect(EndpointImpl.java:549)
at org.jboss.remotingjmx.RemotingConnector.internalRemotingConnect(RemotingConnector.java:268)
at org.jboss.remotingjmx.RemotingConnector.internalConnect(RemotingConnector.java:156)
at org.jboss.remotingjmx.RemotingConnector.connect(RemotingConnector.java:103)
at java.management/javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270)
at org.datadog.jmxfetch.Connection.createConnection(Connection.java:57)
at org.datadog.jmxfetch.RemoteConnection.(RemoteConnection.java:101)
at org.datadog.jmxfetch.ConnectionFactory.createConnection(ConnectionFactory.java:40)
at org.datadog.jmxfetch.Instance.getConnection(Instance.java:455)
at org.datadog.jmxfetch.Instance.init(Instance.java:468)
at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)

2024-04-17 17:05:59 NZST | JMX | WARN | App | Unable to instantiate or initialize instance jboss_hcn for an unknown reason.null

@carlosroman
Copy link
Contributor

What version of the Agent are you using?

@jk2l
Copy link
Author

jk2l commented Apr 18, 2024

0.48.0 - Agent 6.49.1 - Commit: 1790cab - Serialization version: v5.0.97 - Go version: go1.20.10
0.49.0 - Agent 6.52.1 - Commit: 51dd448 - Serialization version: v5.0.104 - Go version: go1.21.8

@jk2l
Copy link
Author

jk2l commented Apr 18, 2024

i also tried to reproduce disconnect connections and force this fail via iptables DROP. but it don't seem to be able to reproduce this reconnect issue

updating 0.49.0 server with Agent 7.52.1 - Commit: 51dd448 - Serialization version: v5.0.104 - Go version: go1.21.8

@jk2l
Copy link
Author

jk2l commented Apr 18, 2024

here is log if using iptables DROP

24-04-18 16:54:28 NZST | JMX | WARN | App | No instance could be initiated. Retrying initialization.
2024-04-18 16:54:28 NZST | JMX | INFO | App | Found 0 config files
2024-04-18 16:54:28 NZST | JMX | INFO | App | Cleaning up instances...
2024-04-18 16:54:28 NZST | JMX | INFO | InstanceCleanupTask | Trying to cleanup: jboss_hcn
2024-04-18 16:54:28 NZST | JMX | INFO | App | Dealing with YAML config instances...
2024-04-18 16:54:28 NZST | JMX | INFO | App | Dealing with Auto-Config instances collected...
2024-04-18 16:54:28 NZST | JMX | INFO | App | Instantiating instance for: jboss_wildfly
2024-04-18 16:54:28 NZST | JMX | INFO | App | Started instance initialization...
2024-04-18 16:54:28 NZST | JMX | INFO | Instance | Trying to connect to JMX Server at jboss_hcn
2024-04-18 16:54:28 NZST | JMX | INFO | Instance | Connection closed or does not exist. Attempting to create a new connection...
2024-04-18 16:54:28 NZST | JMX | INFO | ConnectionFactory | Connecting using JMX Remote
2024-04-18 16:54:28 NZST | JMX | INFO | Connection | Connecting to: service:jmx:remote+http://10.190.196.69:9990
2024-04-18 16:54:58 NZST | JMX | INFO | App | Completed instance initialization...
2024-04-18 16:54:58 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.ExecutionException: java.io.IOException: Operation failed with status WAITING
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.datadog.jmxfetch.App.processRecoveryResults(App.java:998)
at org.datadog.jmxfetch.App$5.invoke(App.java:944)
at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
at org.datadog.jmxfetch.App.init(App.java:934)
at org.datadog.jmxfetch.App.start(App.java:421)
at org.datadog.jmxfetch.App.run(App.java:179)
at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
at org.datadog.jmxfetch.App.main(App.java:92)
Caused by: java.io.IOException: Operation failed with status WAITING
at org.jboss.remotingjmx.RemotingConnector.internalRemotingConnect(RemotingConnector.java:276)
at org.jboss.remotingjmx.RemotingConnector.internalConnect(RemotingConnector.java:156)
at org.jboss.remotingjmx.RemotingConnector.connect(RemotingConnector.java:103)
at java.management/javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270)
at org.datadog.jmxfetch.Connection.createConnection(Connection.java:57)
at org.datadog.jmxfetch.RemoteConnection.(RemoteConnection.java:101)
at org.datadog.jmxfetch.ConnectionFactory.createConnection(ConnectionFactory.java:40)
at org.datadog.jmxfetch.Instance.getConnection(Instance.java:455)
at org.datadog.jmxfetch.Instance.init(Instance.java:468)
at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)

2024-04-18 16:54:58 NZST | JMX | WARN | App | Unable to instantiate or initialize instance jboss_hcn. Is the target JMX Server or JVM running? Operation failed with status WAITING
2024-04-18 16:55:05 NZST | JMX | INFO | App | update is in order - updating timestamp: 1713416105

@jk2l
Copy link
Author

jk2l commented Apr 29, 2024

it seem the issue was jmxfetch out of heap space. i think the OutOfMemory error should not be under debug mode error logging

here is before and after turning on/off debug mode

2024-04-12 11:51:15 NZST | JMX | INFO | Instance | Trying to collect bean list for the first time for JMX Server at jboss_hcn
2024-04-12 11:51:20 NZST | JMX | INFO | Instance | Connected to JMX Server at jboss_hcn with 3556 beans
2024-04-12 11:52:16 NZST | JMX | INFO | App | Completed instance initialization...
2024-04-12 11:52:16 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.CancellationException
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:121)
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.datadog.jmxfetch.App.processRecoveryResults(App.java:998)
        at org.datadog.jmxfetch.App$5.invoke(App.java:944)
        at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
        at org.datadog.jmxfetch.App.init(App.java:934)
        at org.datadog.jmxfetch.App.start(App.java:421)
        at org.datadog.jmxfetch.App.run(App.java:179)
        at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
        at org.datadog.jmxfetch.App.main(App.java:92)

Debug mode on

2024-04-27 08:09:10 NZST | JMX | INFO | Connection | Connecting to: service:jmx:remote+http://10.190.197.136:9990
2024-04-27 08:09:10 NZST | JMX | INFO | Instance | Trying to collect bean list for the first time for JMX Server at jboss_hcn
2024-04-27 08:09:10 NZST | JMX | DEBUG | Connection | Querying bean names on scope: java.lang:*
2024-04-27 08:09:10 NZST | JMX | DEBUG | Connection | Querying bean names on scope: java.nio:type=BufferPool,*
2024-04-27 08:09:10 NZST | JMX | DEBUG | Connection | Querying bean names on scope: jboss.as:*
Exception in thread "remoting-jmx client-thread-50" java.lang.OutOfMemoryError: Java heap space
2024-04-27 08:10:10 NZST | JMX | INFO | App | Completed instance initialization...
2024-04-27 08:10:10 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.CancellationException
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:121)
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.datadog.jmxfetch.App.processRecoveryResults(App.java:1024)
        at org.datadog.jmxfetch.App$5.invoke(App.java:970)
        at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
        at org.datadog.jmxfetch.App.init(App.java:960)
        at org.datadog.jmxfetch.App.start(App.java:417)
        at org.datadog.jmxfetch.App.run(App.java:181)
        at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
        at org.datadog.jmxfetch.App.main(App.java:94)

@jk2l jk2l changed the title JMXFetch is failing Jmxfetch OutOfMemoryError being surpressed Apr 29, 2024
@jk2l jk2l changed the title Jmxfetch OutOfMemoryError being surpressed Jmxfetch OutOfMemoryError being suppressed Apr 29, 2024
@jk2l
Copy link
Author

jk2l commented Apr 29, 2024

Updated issue title to reflect the issue better

@carlosroman
Copy link
Contributor

Thanks for the detailed analysis of the issue. No 100% sure why that error message only appears if debug enabled in the Agent. Will have a look at it.

@jk2l
Copy link
Author

jk2l commented May 17, 2024

@carlosroman Update

Increased agent memory from 200MB to 600MB, it is still dying. atm i am not sure is there some sort of memory leak happening in datadog agent

the first error produced have more detailed stack trace, but subsequence error do not have stack trace for heap space exception

2024-05-15 16:58:30 NZST | JMX | DEBUG | Connection | Querying bean names on scope: jboss.as:*
Exception in thread "remoting-jmx client-thread-82" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.lang.StringLatin1.toChars(StringLatin1.java:74)
        at java.base/java.lang.String.toCharArray(String.java:4103)
        at java.management/javax.management.ObjectName.construct(ObjectName.java:441)
        at java.management/javax.management.ObjectName.readObject(ObjectName.java:1177)
        at java.base/java.lang.invoke.LambdaForm$DMH/0x00007f7668200400.invokeSpecial(LambdaForm$DMH)
        at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766827c400.invoke(LambdaForm$MH)
        at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766823c800.invoke_MT(LambdaForm$MH)
        at org.jboss.marshalling.reflect.JDKSpecific$SerMethods.callReadObject(JDKSpecific.java:171)
        at org.jboss.marshalling.reflect.SerializableClass.callReadObject(SerializableClass.java:252)
        at org.jboss.marshalling.river.RiverUnmarshaller.doInitSerializable(RiverUnmarshaller.java:1997)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadNewObject(RiverUnmarshaller.java:1601)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:441)
        at org.jboss.marshalling.river.BlockUnmarshaller.readObject(BlockUnmarshaller.java:149)
        at org.jboss.marshalling.river.BlockUnmarshaller.readObject(BlockUnmarshaller.java:135)
        at org.jboss.marshalling.MarshallerObjectInputStream.readObjectOverride(MarshallerObjectInputStream.java:53)
        at org.jboss.marshalling.river.RiverObjectInputStream.readObjectOverride(RiverObjectInputStream.java:307)
        at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:500)
        at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:467)
        at java.base/java.util.HashSet.readObject(HashSet.java:345)
        at java.base/java.lang.invoke.LambdaForm$DMH/0x00007f7668200400.invokeSpecial(LambdaForm$DMH)
        at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766827b000.invoke(LambdaForm$MH)
        at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766823c800.invoke_MT(LambdaForm$MH)        at org.jboss.marshalling.reflect.JDKSpecific$SerMethods.callReadObject(JDKSpecific.java:171)        at org.jboss.marshalling.reflect.SerializableClass.callReadObject(SerializableClass.java:252)        at org.jboss.marshalling.river.RiverUnmarshaller.doInitSerializable(RiverUnmarshaller.java:1997)        at org.jboss.marshalling.river.RiverUnmarshaller.doReadNewObject(RiverUnmarshaller.java:1601)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:441)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:373)
        at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:41)
        at org.jboss.remotingjmx.protocol.v2.ClientCommon$MarshalledResponseHandler.readValue(ClientCommon.java:224)        at org.jboss.remotingjmx.protocol.v2.ClientCommon$BaseResponseHandler.handle(ClientCommon.java:176)        at org.jboss.remotingjmx.protocol.v2.ClientCommon$MessageReceiver$1.run(ClientCommon.java:123)2024-05-15 16:59:30 NZST | JMX | INFO | App | Completed instance initialization...2024-05-15 16:59:30 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.CancellationException        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:121)        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.datadog.jmxfetch.App.processRecoveryResults(App.java:1024)
        at org.datadog.jmxfetch.App$5.invoke(App.java:970)
        at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
        at org.datadog.jmxfetch.App.init(App.java:960)
        at org.datadog.jmxfetch.App.start(App.java:417)
        at org.datadog.jmxfetch.App.run(App.java:181)
        at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
        at org.datadog.jmxfetch.App.main(App.java:94)

@phongvq
Copy link

phongvq commented May 24, 2024

@jk2l side question, do you know how I can limit the number of Heatbeat JMX fetch threads?

@jk2l
Copy link
Author

jk2l commented May 27, 2024

@jk2l side question, do you know how I can limit the number of Heatbeat JMX fetch threads?

nope, i am still figuring how this whole thing work atm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants