Reduce Memory Required for Sending Aggregation Responses from Data Nodes #85053
Conversation
These responses can grow quite large. We dealt with this fact on the reader side by retaining pooled buffers when reading the agg responses from the wire. On the sending side, though, we still allocate unpooled buffers for serialization because we need to include a size prefix in the serialized message. We could probably make this more efficient by changing the wire format to some kind of chunked encoding in a follow-up, but that would require additional complex changes on the reader side. For now, this PR serializes the bytes to pooled buffers right away and then copies the pooled buffer to the outbound buffer page by page, releasing pages as they get copied. This cuts the peak memory use of sending these messages almost in half and removes unpooled buffer allocations. This is particularly useful when a data node has many shards for an aggregations query and might send a number of these responses in parallel and/or in rapid succession.
Relates #77466 and #72370
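To make the page-by-page copy concrete, here is a minimal standalone sketch of the idea. Everything in it is hypothetical: PagedBuffer, PAGE_SIZE, and drainTo are illustrative names, the 16 KiB page size is an assumption, and the real implementation uses Elasticsearch's recycler-backed streams shown in the review below. The point it illustrates is just that pages are handed to the output one at a time and released as soon as they have been copied.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified model of a pooled page buffer: the serialized message is
// collected into fixed-size pages and then drained into the outbound stream page by
// page, with each page dropped as soon as its bytes have been copied. This keeps
// peak memory close to one copy of the message instead of two.
final class PagedBuffer {
    static final int PAGE_SIZE = 16 * 1024; // assumed page size, for illustration only

    private final Deque<byte[]> pages = new ArrayDeque<>();
    private int bytesInLastPage = 0;

    void write(byte[] data) {
        int offset = 0;
        while (offset < data.length) {
            if (pages.isEmpty() || bytesInLastPage == PAGE_SIZE) {
                pages.addLast(new byte[PAGE_SIZE]); // a real pool would acquire a recycled page here
                bytesInLastPage = 0;
            }
            int toCopy = Math.min(PAGE_SIZE - bytesInLastPage, data.length - offset);
            System.arraycopy(data, offset, pages.peekLast(), bytesInLastPage, toCopy);
            bytesInLastPage += toCopy;
            offset += toCopy;
        }
    }

    // Copy the buffered bytes to `out` page by page, discarding each page right after
    // it has been copied rather than holding the whole buffer until the end.
    void drainTo(OutputStream out) throws IOException {
        while (pages.isEmpty() == false) {
            byte[] page = pages.pollFirst();
            int length = pages.isEmpty() ? bytesInLastPage : PAGE_SIZE;
            out.write(page, 0, length);
            // a real pool would return the page to the recycler here
        }
    }

    public static void main(String[] args) throws IOException {
        PagedBuffer buffer = new PagedBuffer();
        buffer.write("serialized aggregation response bytes".getBytes());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        buffer.drainTo(out);
        System.out.println(out.size() + " bytes copied");
    }
}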
Pinging @elastic/es-analytics-geo (Team:Analytics)
Pinging @elastic/es-distributed (Team:Distributed)
Hi @original-brownbear, I've created a changelog YAML for you.
nik9000 left a comment:
This seems like a good idea. As would a chunked write.
I don't know the recycler stuff well enough to approve the PR though.
Force-pushed from 117c6b0 to 31b1836.
DaveCTurner left a comment:
I wonder, is there a big obstacle to adding support for chunking here? Would seem nicer than this stop-gap.
I suggested a shorter implementation too (not tested it, though, so beware).
final int size = tmp.size();
writeVInt(size);
final int bytesInLastPage;
final int remainder = size % pageSize;
final int adjustment;
if (remainder != 0) {
    adjustment = 1;
    bytesInLastPage = remainder;
} else {
    adjustment = 0;
    bytesInLastPage = pageSize;
}
final int pageCount = (size / tmp.pageSize) + adjustment;
for (int i = 0; i < pageCount - 1; i++) {
    Recycler.V<BytesRef> p = tmp.pages.get(i);
    final BytesRef b = p.v();
    writeBytes(b.bytes, b.offset, b.length);
    tmp.pages.set(i, null).close();
}
Recycler.V<BytesRef> p = tmp.pages.get(pageCount - 1);
final BytesRef b = p.v();
writeBytes(b.bytes, b.offset, bytesInLastPage);
tmp.pages.set(pageCount - 1, null).close();
Do we need to special-case the last page like that?
Suggested change:

int size = tmp.size();
writeVInt(size);
int tmpPage = 0;
while (size > 0) {
    final Recycler.V<BytesRef> p = tmp.pages.get(tmpPage);
    final BytesRef b = p.v();
    final int writeSize = Math.min(size, b.length);
    writeBytes(b.bytes, b.offset, writeSize);
    tmp.pages.set(tmpPage, null).close();
    size -= writeSize;
    tmpPage++;
}
tested this and it works fine :) => updated
It makes the read side quite a bit trickier because we have to join the chunks there somehow, whereas right now we just need to read a single slice and we're good. Also, we had performance complaints about the read side here in the past (can't find the issue right now, but it was literally about the impact of a slightly slower …) => I don't think it's an insurmountable challenge, but it's not something I'd have time for right now. I just found this when doing some sizing experiments around aggs, and this is what I can do as far as a 30-minute fix goes :)

Also, maybe a stop-gap is fine here for now, because we're hopefully going to have better message chunking support in the not-too-distant future anyway, so we can address it here when adding that.

I generally have my doubts about these messages to some degree. They seem larger than necessary (as in tens of MBs, or even 100M+ in some cases), and it might be better to look into whether that's absolutely necessary before making the format really complicated?
I'm just thinking about something like this:

public ReleasableBytesReference readChunkedReleasableBytesReference() throws IOException {
    if (getVersion().onOrAfter(Version.V_8_2_0)) {
        final var chunks = new ArrayList<ReleasableBytesReference>();
        try {
            do {
                final var chunk = readReleasableBytesReference();
                if (chunk.length() == 0) {
                    final var chunksArray = chunks.toArray(ReleasableBytesReference[]::new);
                    final var result = new ReleasableBytesReference(
                        CompositeBytesReference.of(chunksArray),
                        Releasables.wrap(chunksArray)
                    );
                    chunks.clear();
                    return result;
                } else {
                    chunks.add(chunk);
                }
            } while (true);
        } finally {
            Releasables.close(chunks);
        }
    } else {
        return readReleasableBytesReference();
    }
}

It wouldn't have to be chunking individual pages (although even that might not be so bad).
@DaveCTurner the performance issue is that we have larger chunks inbound than outbound, so doing the above will likely be felt on reading. This is something that would be more fun to address if we could just move outbound closer to Netty and write in larger-than-16k chunks there as well.
DaveCTurner left a comment:
We could write in 1 MiB chunks too; they don't need to be single-page jobs - really anything to put an O(1) bound on the additional memory usage on the sender side. But I'm not accounting for other changes you're planning here, so this LGTM as an interim measure.
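For reference, a minimal sketch of what such a bounded-chunk encoding could look like on the wire, pairing with the readChunkedReleasableBytesReference() idea above: length-prefixed chunks of at most some fixed size, terminated by a zero-length chunk. ChunkedWriter, CHUNK_SIZE, and the plain int length prefix are assumptions for illustration only; Elasticsearch's actual transport format and vint encoding are not used here.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of a bounded-chunk wire encoding: each chunk carries a length
// prefix and at most CHUNK_SIZE bytes, and a zero-length chunk marks the end of the
// value. The format lets the sender flush after each chunk instead of buffering the
// whole message, which is what puts an O(1) bound on additional sender-side memory.
final class ChunkedWriter {
    static final int CHUNK_SIZE = 1024 * 1024; // 1 MiB, as suggested; any fixed bound works

    static void writeChunked(byte[] payload, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        int offset = 0;
        while (offset < payload.length) {
            int length = Math.min(CHUNK_SIZE, payload.length - offset);
            data.writeInt(length);        // stand-in for the real (vint) length prefix
            data.write(payload, offset, length);
            offset += length;
        }
        data.writeInt(0);                 // terminator: zero-length chunk
        data.flush();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeChunked(new byte[3 * CHUNK_SIZE + 123], out);
        System.out.println(out.size() + " bytes on the wire");
    }
}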
Thanks Nik + David!

These responses can grow quite large. We dealt with this fact on the reader side by
retaining pooled buffers when reading the agg responses from the wire. On the sending side
though we still allocate unpooled buffers for serialization because we need to serialize
including a size-prefix. We could probably make this more efficient by changing the wire
format to some kind of chunked encoding in a follow-up but that would require additional
complex changes on the reader side. For now, this PR serializes the bytes to pooled buffers
right away and then copies the pooled buffer to the outbound buffer page-by-page,
releasing pages as they get copied. This cuts the peak-memory use of sending these messages
almost in half and removes unpooled buffer allocations. This is particularly useful
when a data node has many shards for an aggs query and might send a number of these responses
in parallel and/or rapid succession.
relates #77466 and #72370