@original-brownbear
Contributor

This commit adds leak tracking infrastructure that enables assertions
about the state of objects at GC time (simplified version of what Netty
uses to track ByteBuf instances).
This commit uses the infrastructure to improve the quality of leak
checks for page recycling in the mock NIO transport. The existing logic in
org.elasticsearch.common.util.MockPageCacheRecycler#ensureAllPagesAreReleased
does not run for all tests and, lacking an equivalent of the added #touch logic,
tracks too little information to debug what caused a specific leak in most cases.

The infrastructure is added to production code to make it reusable for ad-hoc debugging of production classes and to allow for a possible follow-up that runs similar checks in production (e.g. like Netty allows checking a small fraction of all ByteBuf instances for leaks).

This was very helpful for debugging buffer pooling issues in more complicated scenarios like #67502 but could also be used for leak-checking assertions on other things (transport request listener leaks, searchable snapshot cache chunks etc.).

Example logging on a leak is pretty much what Netty logs. For example, adding an extra ref count increment for a recovery file chunk would log this failure and trip test assertions:

[2021-01-19T17:59:43,279][ERROR][o.e.t.LeakTracker        ] [node_t1] LEAK: resource was not cleaned up before it was garbage-collected.
Recent access records: 
#1:
	org.elasticsearch.common.bytes.ReleasableBytesReference.decRef(ReleasableBytesReference.java:72)
	org.elasticsearch.indices.recovery.RecoveryFileChunkRequest.decRef(RecoveryFileChunkRequest.java:153)
	org.elasticsearch.transport.InboundHandler$1.onAfter(InboundHandler.java:242)
	org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onAfter(ThreadContext.java:717)
	org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:41)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	java.base/java.lang.Thread.run(Thread.java:832)
#2:
	org.elasticsearch.common.bytes.ReleasableBytesReference.decRef(ReleasableBytesReference.java:72)
	org.elasticsearch.indices.recovery.MultiFileWriter$FileChunk.close(MultiFileWriter.java:204)
	org.elasticsearch.indices.recovery.MultiFileWriter$FileChunkWriter.writeChunk(MultiFileWriter.java:237)
	org.elasticsearch.indices.recovery.MultiFileWriter.writeFileChunk(MultiFileWriter.java:78)
	org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:502)
	org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:478)
	org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:447)
	org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72)
	org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:227)
	org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:739)
	org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	java.base/java.lang.Thread.run(Thread.java:832)
#3:
	org.elasticsearch.common.bytes.ReleasableBytesReference.retain(ReleasableBytesReference.java:76)
	org.elasticsearch.indices.recovery.MultiFileWriter$FileChunk.<init>(MultiFileWriter.java:197)
	org.elasticsearch.indices.recovery.MultiFileWriter.writeFileChunk(MultiFileWriter.java:78)
	org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:502)
	org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:478)
	org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:447)
	org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72)
	org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:227)
	org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:739)
	org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	java.base/java.lang.Thread.run(Thread.java:832)
#4:
	org.elasticsearch.common.bytes.ReleasableBytesReference.incRef(ReleasableBytesReference.java:62)
	org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:454)
	org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:447)
	org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72)
	org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:227)
	org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:739)
	org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	java.base/java.lang.Thread.run(Thread.java:832)
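
A minimal sketch of the tracking idea described above, not the class added by this PR: each tracked resource gets a weak-reference handle that records a stack trace on every `touch`, and a handle that the GC enqueues before `close` was called is reported as a leak. The names `LeakTrackerSketch`, `Leak`, `track`, `touch`, `close`, `reportLeaks` and `liveCount` are assumptions made for illustration; only the general approach (weak references plus access records, as in Netty's leak detection) comes from the description.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified leak tracker; an illustration only. */
final class LeakTrackerSketch {

    /** Handle for one tracked resource. It references the resource weakly so the GC can still collect it. */
    static final class Leak extends WeakReference<Object> {
        private final Deque<Throwable> records = new ArrayDeque<>();
        private final Set<Leak> live;

        Leak(Object resource, ReferenceQueue<Object> queue, Set<Leak> live) {
            super(resource, queue);
            this.live = live;
            touch("tracked");
            live.add(this); // keeps the handle itself strongly reachable until closed or reported
        }

        /** Records the current stack trace so a later report shows who accessed the resource. */
        synchronized void touch(String hint) {
            records.addFirst(new Throwable(hint)); // most recent access first
        }

        /** Marks the resource as released; returns false if it was already closed or reported. */
        boolean close() {
            clear(); // a cleared reference is never enqueued by the GC
            return live.remove(this);
        }

        synchronized String report() {
            StringBuilder sb = new StringBuilder(
                "LEAK: resource was not cleaned up before it was garbage-collected.\nRecent access records:\n");
            int i = 1;
            for (Throwable t : records) {
                sb.append('#').append(i++).append(": ").append(t.getMessage()).append('\n');
                for (StackTraceElement frame : t.getStackTrace()) {
                    sb.append('\t').append(frame).append('\n');
                }
            }
            return sb.toString();
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Set<Leak> live = ConcurrentHashMap.newKeySet();

    Leak track(Object resource) {
        return new Leak(resource, queue, live);
    }

    /** Drains handles the GC has enqueued; each one still in the live set was collected without close(). */
    int reportLeaks() {
        int leaks = 0;
        Leak ref;
        while ((ref = (Leak) queue.poll()) != null) {
            if (live.remove(ref)) {
                System.err.println(ref.report());
                leaks++;
            }
        }
        return leaks;
    }

    int liveCount() {
        return live.size();
    }

    public static void main(String[] args) {
        LeakTrackerSketch tracker = new LeakTrackerSketch();
        Object page = new Object();
        Leak leak = tracker.track(page);
        leak.touch("incRef");
        leak.close(); // released properly, so nothing is reported for this resource
        System.out.println("leaks reported: " + tracker.reportLeaks());
    }
}
```

In a test, `reportLeaks()` would typically be called from a teardown hook after the GC has had a chance to run, with a non-zero result tripping an assertion.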

This commit adds leak tracking infrastructure that enables assertions
about the state of objects at GC time (simplified version of what Netty
uses to track `ByteBuf` instances).
This commit uses the infrastructure to improve the quality of leak
checks for page recycling in the mock nio transport (the logic in
`org.elasticsearch.common.util.MockPageCacheRecycler#ensureAllPagesAreReleased`
does not run for all tests and tracks too little information to allow for debugging
what caused a specific leak in most cases due to the lack of an equivalent of the added
`#touch` logic).

This is elastic#67502
@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Network Http and internode communication implementations v8.0.0 v7.12.0 labels Jan 19, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jan 19, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@mark-vieira
Contributor

@elasticmachine update branch

@DaveCTurner
Contributor

@elasticmachine update branch

Contributor

@DaveCTurner DaveCTurner left a comment

Thanks Armin, there is some deep voodoo here but I think it's basically ok. I left a few comments. Is there any way we can test that this really does detect and log leaks as we expect?

@original-brownbear
Contributor Author

original-brownbear commented Feb 23, 2021

> Is there any way we can test that this really does detect and log leaks as we expect?

I knew you were gonna ask this :D I'm not sure there is a nice and clean way of checking this. I don't think we really have a way of forcing a GC with G1GC. But I guess we can build a test that just allocates small byte arrays or so until an expected leak is reported. That's the best I can think of. I'm a little fearful this will have some unexpected instability to it, but technically it should be fine I think.
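
The allocation-pressure idea from this comment could look roughly like the sketch below. It is not a test from this PR: `GcPressureSketch`, `trackUnreachable` and `awaitEnqueued` are hypothetical names, and a plain `WeakReference` with a `ReferenceQueue` stands in for whatever the leak tracker uses internally. As the comment warns, the outcome depends on when the JVM decides to collect, so a generous timeout is assumed.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.time.Duration;

/** Hypothetical helper: waits for the GC to notice an unreachable object by creating allocation pressure. */
final class GcPressureSketch {

    // Written on every iteration so the JIT cannot optimize the allocations away.
    static volatile byte[] sink;

    /** Registers a fresh object that is unreachable as soon as this method returns. */
    static WeakReference<Object> trackUnreachable(ReferenceQueue<Object> queue) {
        return new WeakReference<>(new Object(), queue);
    }

    /** Allocates small arrays until some reference shows up on the queue or the timeout elapses. */
    static boolean awaitEnqueued(ReferenceQueue<Object> queue, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() - deadline < 0) {
            for (int i = 0; i < 10_000; i++) {
                sink = new byte[1024];
            }
            if (queue.poll() != null) {
                return true; // the GC cleared and enqueued a reference
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        WeakReference<Object> ref = trackUnreachable(queue);
        System.out.println("collected: " + awaitEnqueued(queue, Duration.ofSeconds(30)));
        // The reference object itself must stay reachable, otherwise it is never enqueued.
        Reference.reachabilityFence(ref);
    }
}
```

A real test built this way would assert on the leak report rather than on the raw queue, and would still need to tolerate slow or delayed collections.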

@original-brownbear
Contributor Author

Thanks for taking a look (and fixing precommit). All points but the test are addressed now, I think. I'm not sure we really want a test here that is based on having to stress the JVM (see the comment above).

Contributor

@DaveCTurner DaveCTurner left a comment

Just a few more minor comments

* Side Public License, v 1.
*/

package org.elasticsearch.transport;
Contributor

Does this need to be in server/src/main or could we have it in test/framework instead? Not sure if you have plans to use it more generally in future, but if we're not going to test it we really should keep it out of production code.

Contributor Author

Not really, at least not right now. I used this a lot for ad-hoc testing of all kinds of classes when working on #67502, which is how it ended up in prod code, but for now it's only used in tests -> moved it :)

@original-brownbear
Contributor Author

Thanks David, all addressed now I think :)

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/1 (unrelated + known geoip thing)

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@original-brownbear
Contributor Author

Thanks David!

@original-brownbear original-brownbear merged commit c2370ff into elastic:master Feb 25, 2021
@original-brownbear original-brownbear deleted the leak-detection-logic branch February 25, 2021 10:41
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 25, 2021

Co-authored-by: David Turner <david.turner@elastic.co>
original-brownbear added a commit that referenced this pull request Feb 25, 2021

Co-authored-by: David Turner <david.turner@elastic.co>
@original-brownbear original-brownbear restored the leak-detection-logic branch April 18, 2023 21:03

Labels

:Distributed Coordination/Network Http and internode communication implementations Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests v7.13.0 v8.0.0-alpha1
