-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add Leak Tracking Infrastructure #67688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Leak Tracking Infrastructure #67688
Conversation
This commit adds leak tracking infrastructure that enables assertions about the state of objects at GC time (simplified version of what Netty uses to track `ByteBuf` instances). This commit uses the infrastructure to improve the quality of leak checks for page recycling in the mock nio transport (the logic in `org.elasticsearch.common.util.MockPageCacheRecycler#ensureAllPagesAreReleased` does not run for all tests and tracks too little information to allow for debugging what caused a specific leak in most cases due to the lack of an equivalent of the added `#touch` logic). This is elastic#67502
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
@elasticmachine update branch |
|
@elasticmachine update branch |
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Armin, there is some deep voodoo here but I think it's basically ok. I left a few comments. Is there any way we can test that this really does detect and log leaks as we expect?
libs/core/src/main/java/org/elasticsearch/common/util/concurrent/AbstractRefCounted.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/bytes/ReleasableBytesReference.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/bytes/ReleasableBytesReference.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/transport/LeakTracker.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/transport/LeakTracker.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/transport/LeakTracker.java
Outdated
Show resolved
Hide resolved
I knew you were gonna ask this :D I'm not sure there is a nice and clean way of checking this. I don't think in G1GC we really do have a way of forcing a GC. But I guess we can build a test that just allocates small byte arrays or so until an expected leak is reported. That's the best I can think of, I'm a little fearful this will have some unexpected instability to it, but technically it should be fine I think. |
|
Thanks for taking a look (and fixing precommit). All points but the test addressed now I think, not sure we really want a test here that is based on having to stress the JVM (see above comment)? |
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more minor comments
libs/core/src/main/java/org/elasticsearch/common/util/concurrent/AbstractRefCounted.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/bytes/ReleasableBytesReference.java
Show resolved
Hide resolved
| * Side Public License, v 1. | ||
| */ | ||
|
|
||
| package org.elasticsearch.transport; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be in server/src/main or could we have it in test/framework instead? Not sure if you have plans to use it more generally in future, but if we're not going to test it we really should keep it out of production code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, at least not right now. I used this a lot via ad-hoc testing for all kinds of classes when working on #67502 that's how it ended up in prod-code, but for now it's only used in tests -> moved it :)
|
Thanks David, all addressed now I think :) |
|
Jenkins run elasticsearch-ci/1 (unrelated + known geoip thing) |
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Thanks David! |
This commit adds leak tracking infrastructure that enables assertions about the state of objects at GC time (simplified version of what Netty uses to track `ByteBuf` instances). This commit uses the infrastructure to improve the quality of leak checks for page recycling in the mock nio transport (the logic in `org.elasticsearch.common.util.MockPageCacheRecycler#ensureAllPagesAreReleased` does not run for all tests and tracks too little information to allow for debugging what caused a specific leak in most cases due to the lack of an equivalent of the added `#touch` logic). Co-authored-by: David Turner <david.turner@elastic.co>
This commit adds leak tracking infrastructure that enables assertions about the state of objects at GC time (simplified version of what Netty uses to track `ByteBuf` instances). This commit uses the infrastructure to improve the quality of leak checks for page recycling in the mock nio transport (the logic in `org.elasticsearch.common.util.MockPageCacheRecycler#ensureAllPagesAreReleased` does not run for all tests and tracks too little information to allow for debugging what caused a specific leak in most cases due to the lack of an equivalent of the added `#touch` logic). Co-authored-by: David Turner <david.turner@elastic.co>
This commit adds leak tracking infrastructure that enables assertions
about the state of objects at GC time (simplified version of what Netty
uses to track
ByteBufinstances).This commit uses the infrastructure to improve the quality of leak
checks for page recycling in the mock NIO transport (the logic in
org.elasticsearch.common.util.MockPageCacheRecycler#ensureAllPagesAreReleaseddoes not run for all tests and tracks too little information to allow for debugging
what caused a specific leak in most cases due to the lack of an equivalent of the added
#touchlogic).Added to production code to make it reusable for ad-hoc debugging of production classes and allow for a possible follow-up to run similar checks in production (e.g. like Netty allows for checking a small fraction of all
ByteBuffor leaks).This was very helpful for debugging buffer pooling issues in more complicated scenarios like #67502 but could also be used for leak-checking assertions on other things (transport request listener leaks, searchable snapshot cache chunks etc.).
Example logging on leak is pretty much what Netty logs e.g. if adding an extra ref count increment for a recovery file chunk this failure would log and trip test assertions: