Jmxfetch OutOfMemoryError being suppressed #519
Please note that this has been broken across our entire production environment since we recently updated our systems to the 0.48.0 JMX patch. |
Also, "Trying to reconnect." was logged only once; it never appeared again even though the failures continued. |
After the instance had been failing for a while, a new type of error message appeared:
|
What version of the Agent are you using? |
0.48.0 - Agent 6.49.1 - Commit: 1790cab - Serialization version: v5.0.97 - Go version: go1.20.10 |
I also tried to reproduce the dropped connections by forcing failures with iptables DROP, but I could not reproduce this reconnect issue on a server updated to 0.49.0 with Agent 7.52.1 - Commit: 51dd448 - Serialization version: v5.0.104 - Go version: go1.21.8 |
Here is the log when using iptables DROP:
|
It seems the issue was JMXFetch running out of heap space. I think the OutOfMemoryError should not be logged only in debug mode. Here is the output before and after turning debug mode on and off:
Debug mode on
|
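To illustrate the point about log levels, here is a minimal sketch of reporting a heap-space failure at error level regardless of whether debug logging is enabled, while keeping the full stack trace at debug level. It does not reflect JMXFetch's actual code: `OomLoggingSketch`, `CapturingHandler`, and `runTask` are hypothetical names, `java.util.logging` is used only to keep the example self-contained, and the error is thrown artificially rather than by exhausting the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch; none of these names come from JMXFetch.
class OomLoggingSketch {
    /** Collects records in memory so the example can inspect what was logged. */
    static class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                records.add(record);
            }
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }

    /** Runs a task and reports heap exhaustion at SEVERE even when debug logging is off. */
    static boolean runTask(Runnable task, Logger log) {
        try {
            task.run();
            return true;
        } catch (OutOfMemoryError e) {
            // Always visible: the failure itself is not hidden behind a debug-level message.
            log.log(Level.SEVERE, "Task failed: " + e);
            // The full stack trace is attached only at debug (FINE) level.
            log.log(Level.FINE, "Stack trace for task failure", e);
            return false;
        }
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("oom-sketch");
        log.setUseParentHandlers(false);
        log.setLevel(Level.INFO); // debug mode off
        CapturingHandler handler = new CapturingHandler();
        log.addHandler(handler);

        // Simulate the failure without actually exhausting the heap.
        runTask(() -> { throw new OutOfMemoryError("Java heap space"); }, log);

        for (LogRecord r : handler.records) {
            System.out.println(r.getLevel() + " " + r.getMessage());
        }
    }
}
```

With the logger at `Level.INFO`, only the error-level record reaches the handler; lowering the level to `Level.FINE` would additionally emit the record carrying the stack trace.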
Updated the issue title to better reflect the issue. |
Thanks for the detailed analysis of the issue. Not 100% sure why that error message only appears if |
@carlosroman Update: I increased the agent memory from 200MB to 600MB, and it is still dying. At the moment I am not sure whether there is some sort of memory leak in the Datadog agent. The first error produced has a more detailed stack trace, but subsequent errors do not include a stack trace for the heap space exception.
|
@jk2l Side question: do you know how I can limit the number of Heartbeat JMX fetch threads? |
Nope, I am still figuring out how this whole thing works at the moment. |
It seems that the system failed at some point and was never able to recover. The only resolution at the moment is to restart datadog-agent. I suspect the bug is caused by #432, where during cleanup this.connection is set to null but the instance never re-initializes the connection.
I am not a Java developer, so I am not familiar with Java, but I am wondering whether replacing
connection.getMBeanInfo
with
this.getConnection().getMBeanInfo
would fix this issue. However, getConnection() would probably need to set this.connection too.
Tested on:
One side effect is that I see two instances when it starts to fail, whereas when it is running normally it shows only one instance.
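The idea proposed above (going through getConnection() and having it set this.connection again after cleanup) can be sketched roughly as follows. This is an illustration under assumptions, not JMXFetch's real code: `ReconnectSketch`, `FakeConnection`, `LazyReconnectingInstance`, `describe`, and `getConnectCount` are hypothetical names, and the connection is a stand-in that returns a string instead of real MBean info.

```java
import java.util.function.Supplier;

// Hypothetical sketch; the types below are stand-ins, not JMXFetch classes.
class ReconnectSketch {
    /** Stand-in for a connection that can answer getMBeanInfo-style calls. */
    interface FakeConnection {
        String getMBeanInfo(String beanName);
    }

    /** Instance wrapper that lazily re-creates its connection after cleanup. */
    static class LazyReconnectingInstance {
        private final Supplier<FakeConnection> connector; // opens a fresh connection
        private FakeConnection connection;                // null before first use and after cleanUp()
        private int connectCount = 0;

        LazyReconnectingInstance(Supplier<FakeConnection> connector) {
            this.connector = connector;
        }

        /** Returns the cached connection, opening and caching a new one if it was cleared. */
        synchronized FakeConnection getConnection() {
            if (connection == null) {
                connection = connector.get();
                connectCount++;
            }
            return connection;
        }

        /** Mirrors the cleanup described above: the cached connection is dropped. */
        synchronized void cleanUp() {
            connection = null;
        }

        /** Goes through getConnection() instead of reading the field directly. */
        String describe(String beanName) {
            return getConnection().getMBeanInfo(beanName);
        }

        synchronized int getConnectCount() {
            return connectCount;
        }
    }

    public static void main(String[] args) {
        LazyReconnectingInstance instance =
            new LazyReconnectingInstance(() -> name -> "info:" + name);

        System.out.println(instance.describe("java.lang:type=Memory"));
        instance.cleanUp(); // the cached connection is now null
        // A direct field access would throw NullPointerException here;
        // the accessor reconnects instead.
        System.out.println(instance.describe("java.lang:type=Memory"));
        System.out.println("connections opened: " + instance.getConnectCount());
    }
}
```

Synchronizing the accessor and the cleanup is one way to avoid two threads each opening a connection at the same time; whether that relates to the duplicate instances seen above is not established here.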