
Conversation

@original-brownbear
Contributor

@original-brownbear original-brownbear commented Jan 14, 2021

This makes the handling of bulk shard requests stop allocating a new byte[] for every document source. That per-document copying needlessly doubles the effective peak memory use for any bulk request not handled on a coordinating node (there the bytes are already pooled via the REST layer). With larger documents, copying the source into an unpooled byte[] could also lead to humongous allocations.

Unlike pooling the bytes on the coordinating node, pooling these requests per shard should behave a lot more nicely in theory, since the requests (BulkShardRequest) get processed "in one piece" anyway and, as far as I can see, need to be retained in full for retries in any case.

In some quick-and-dirty benchmarking of the PMC benchmark on a 3-node cluster, I saw a significant drop in GC activity with this change and a massive reduction in the number of byte[] allocated per unit of time.

WIP: this was a holiday project, still needs some cleanup here and there,
and should be broken up into pieces since it introduces tricky new test infrastructure.

But overall this commit works. It makes all the source bytes in bulk requests
shared throughout the lifecycle of the bulk request instead of being copied to
unpooled buffers when deserializing indexing requests. This essentially cuts the peak
memory use for bulk requests almost in half (though retaining buffers for longer
reduces that effect somewhat compared to the GC's ability to collect individual
byte arrays at a finer granularity). Still, overall this reduces GC times by 2/3 when
running the Rally PMC indexing track, reduces the risk of humongous allocations with G1GC,
and makes memory use less spiky, so on balance I think this is the direction to go.
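
Roughly, the idea looks like this (an illustrative sketch only, not the actual classes in this PR; `RefCountedBytes` and `SourceSlice` are made-up names): each parsed indexing request holds a ref-counted, zero-copy slice of the shared network buffer instead of its own byte[] copy, and the backing page goes back to the pool once every slice has been released.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Made-up sketch of the pooling idea: one ref-counted network page, many
// zero-copy slices, one release per slice instead of one byte[] copy per doc.
final class RefCountedBytes {
    private final byte[] pooledPage; // stands in for a recycled network page
    private final AtomicInteger refCount = new AtomicInteger(1);

    RefCountedBytes(byte[] pooledPage) {
        this.pooledPage = pooledPage;
    }

    byte[] bytes() {
        assert refCount.get() > 0 : "accessing released bytes";
        return pooledPage;
    }

    void retain() {
        if (refCount.getAndIncrement() <= 0) {
            throw new IllegalStateException("already released");
        }
    }

    void release() {
        if (refCount.decrementAndGet() == 0) {
            // last reference gone: hand pooledPage back to the page recycler
        }
    }
}

// Each parsed index request keeps a slice view instead of copying the source.
record SourceSlice(RefCountedBytes buffer, int offset, int length) {
    static SourceSlice of(RefCountedBytes buffer, int offset, int length) {
        buffer.retain(); // one reference per outstanding document source
        return new SourceSlice(buffer, offset, length);
    }

    void close() {
        buffer.release(); // once all slices close, the page is recycled
    }
}
```

The reason doing this per BulkShardRequest works out is that its lifecycle is simple: the slices are created when the request is deserialized and all released together when the shard-level operation (or its retry) completes.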
@original-brownbear original-brownbear added the >non-issue, WIP, :Distributed Indexing/CRUD, and :Distributed Coordination/Network labels Jan 14, 2021
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 19, 2021
This commit adds leak-tracking infrastructure that enables assertions
about the state of objects at GC time (a simplified version of what Netty
uses to track `ByteBuf` instances).
This commit uses the infrastructure to improve the quality of leak
checks for page recycling in the mock nio transport: the logic in
`org.elasticsearch.common.util.MockPageCacheRecycler#ensureAllPagesAreReleased`
does not run for all tests and, lacking an equivalent of the added
`#touch` logic, tracks too little information to debug what caused a
specific leak in most cases.

This is elastic#67502
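
For illustration, GC-time leak tracking along these lines can be sketched as follows (made-up names, loosely modeled on Netty's `ResourceLeakDetector` rather than the code added in this commit): a `PhantomReference` per tracked object carries a trail of `touch` notes, and if the GC reaps the reference before the object was explicitly released, the trail tells you where the leak happened.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Made-up sketch in the spirit of Netty's ResourceLeakDetector, not the actual
// infrastructure added in this commit: track objects with phantom references
// and report a trail of #touch notes for anything the GC reaps before release.
final class LeakTracker {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Set<Leak> live = ConcurrentHashMap.newKeySet(); // keeps Leaks reachable

    static final class Leak extends PhantomReference<Object> {
        private final Deque<String> trail = new ArrayDeque<>();
        private final Set<Leak> live;
        private volatile boolean released;

        private Leak(Object tracked, ReferenceQueue<Object> queue, Set<Leak> live) {
            super(tracked, queue);
            this.live = live;
        }

        // Record where the tracked object was last seen; this is the context a
        // leak report can show, unlike a plain "N pages not released" counter.
        synchronized void touch(String hint) {
            trail.addLast(hint);
        }

        // Called on proper close: a released Leak is cleared and never enqueued.
        void release() {
            released = true;
            live.remove(this);
            clear();
        }
    }

    Leak track(Object resource) {
        Leak leak = new Leak(resource, queue, live);
        live.add(leak);
        return leak;
    }

    // Poll after a GC (e.g. in test teardown) and fail on unreleased objects.
    void assertNoLeaks() {
        Leak leak;
        while ((leak = (Leak) queue.poll()) != null) {
            live.remove(leak);
            if (leak.released == false) {
                throw new AssertionError("leaked object, last touches: " + leak.trail);
            }
        }
    }
}
```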
@mark-vieira
Contributor

@elasticmachine update branch

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 25, 2021
Added this assertion to have an easier time debugging
work on elastic#67502 and found that we were accessing `refcount == 0`
bytes in the `SSLOutboundBuffer` so I fixed that buffer to not
keep references to released pages.
original-brownbear added a commit that referenced this pull request Feb 25, 2021
Added this assertion to have an easier time debugging
work on #67502 and found that we were accessing `refcount == 0`
bytes in the `SSLOutboundBuffer` so I fixed that buffer to not
keep references to released pages.
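
For illustration, the shape of the bug and the fix is roughly this (made-up names, not the actual `SSLOutboundBuffer` code): releasing a page while still keeping a pointer to it is what let reads hit `refcount == 0` bytes, and the assertion makes that fail fast.

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicInteger;

// Made-up sketch of the bug class fixed here: a buffer that releases a page
// but keeps a reference to it can read recycled bytes; dropping the pointer
// and the refcount together, guarded by an assertion, prevents that.
final class OutboundBufferSketch {
    static final class Page {
        final byte[] bytes = new byte[16 * 1024];
        final AtomicInteger refCount = new AtomicInteger(1);

        byte[] read() {
            // Fail fast on use-after-release instead of silently reading
            // a page that may already be recycled and overwritten.
            assert refCount.get() > 0 : "accessing refcount == 0 page";
            return bytes;
        }

        void release() {
            if (refCount.decrementAndGet() == 0) {
                // page goes back to the recycler; any later read() trips the assert
            }
        }
    }

    private final ArrayDeque<Page> queued = new ArrayDeque<>();

    void enqueue(Page page) {
        queued.addLast(page);
    }

    void pageFlushed() {
        // Release and forget the page in one step so no stale reference survives.
        queued.removeFirst().release();
    }

    byte[] nextPageToWrite() {
        return queued.peekFirst().read();
    }
}
```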
@original-brownbear original-brownbear marked this pull request as ready for review February 27, 2021 20:14
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) label Feb 27, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

TimeValue next = backoff.next();
logger.trace("Retry of bulk request scheduled in {} ms.", next.millis());
// Schedule the retry and take an extra reference so the request's pooled
// source bytes stay valid until the retried execution has completed.
retryCancellable = scheduler.schedule(() -> this.execute(bulkRequestForRetry), next, ThreadPool.Names.SAME);
bulkRequestForRetry.incRef();
@original-brownbear
Contributor Author

I initially thought this was inefficient relative to what we currently do, but:

Retrying will cause the whole request to be retained instead of just those sub-requests that actually get retried. I think this is less of an issue in the real world than it sounds, though, given where we currently are. If the retry happens on a coordinating node, we already have this problem (since we use shared bytes that are tracked as one unit for the whole bulk request). If the retry happens for a replication request, then the replication request is the same unit of work as the primary bulk request received, so that seems fine as well. There is certainly room here for retaining bytes more selectively, but I don't think this PR makes us worse off; see the sketch below for the lifecycle this implies.
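
A made-up sketch of that lifecycle (the real code uses the project's own scheduling and ref-counting utilities; these names are illustrative): one extra reference per scheduled retry, released when that attempt finishes.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Made-up sketch of whole-request retention across retries: each pending retry
// co-owns the request via one extra reference, so the pooled source bytes stay
// valid exactly as long as some attempt may still read them, and no longer.
interface RefCounted {
    void incRef();
    void decRef();
}

final class RetrySketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    <R extends RefCounted> void retryLater(R request, Runnable attempt, long delayMillis) {
        request.incRef(); // the pending retry now co-owns the request's bytes
        scheduler.schedule(() -> {
            try {
                attempt.run(); // re-executes the whole request "in one piece"
            } finally {
                request.decRef(); // balance the retry's reference
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
    }
}
```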

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/2 (unrelated ml)

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/1

@original-brownbear original-brownbear marked this pull request as draft February 27, 2021 21:37
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-obsolete (Team:Distributed (Obsolete))

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination label Jan 30, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
