Add explicit memory usage cap for the disk queue #30069

Open
Tracked by #118 ...
faec opened this issue Jan 27, 2022 · 6 comments
Labels
debugging · estimation:Month (Task that represents a month of work) · Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)

Comments

@faec
Contributor

faec commented Jan 27, 2022

Describe the enhancement: The disk queue necessarily uses in-memory queues for messages waiting to be written to disk, or to be written to the output after reading from disk. The amount of in-memory data can be capped, but (as with the memory queue) currently only by specifying a maximum number of events; there is no way to specify a maximum number of bytes, so the degree of control is dependent on having consistent / predictable event sizes. We should add parameters that can specify the maximum as a number of bytes instead, so it is easy to correctly tune a node's memory usage independent of its specific event flow.
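As a rough illustration of what this could look like in configuration (the `*_bytes` setting names here are purely hypothetical, not a committed schema; only `queue.disk.max_size`, `read_ahead`, and `write_ahead` exist today):

```yaml
queue.disk:
  max_size: 10GB      # existing cap on on-disk queue size
  write_ahead: 2048   # existing event-count cap on the intake buffer
  read_ahead: 512     # existing event-count cap on the output buffer
  # Hypothetical byte-based equivalents proposed by this issue:
  write_ahead_bytes: 16MB
  read_ahead_bytes: 16MB
```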

Describe a specific use case for the enhancement or feature: This feature was not a priority in the initial release because the most urgent use case we understood at the time was data persistence. However, we are now seeing it deployed as an alternative to the memory queue specifically to work around the constraints of low-memory nodes, and these configuration options could greatly increase its effectiveness in that setting.

Technical sketch

There are two changes needed to support this feature, one for the intake queue (where events wait to be written to disk), and one for the output queue (where events that have been read from disk wait to be assigned to an output worker).

Intake queue

This change should be fairly simple: in libbeat/publisher/pipeline/queue/diskqueue/core_loop.go, the function diskQueue.canAcceptFrameOfSize() decides whether to accept a new event into the intake queue (diskQueue.pendingFrames). Currently its check is based only on the number of events already in the queue, but adding a size check would be straightforward (there are already helpers that calculate the size of the intake queue).

Output queue

This will require more significant changes, but is still feasible. The output queue is currently in the channel readerLoop.output in libbeat/publisher/pipeline/queue/diskqueue/reader_loop.go. Right now the simple event-count cap is implemented only by setting the buffer size of the channel.

To make it aware of size constraints:

  1. The core loop (core_loop.go) must track how many bytes have been read from disk that have not yet been claimed by a consumer. This will likely be a byte counter in the main diskQueue structure. (The number of bytes read / allocated by the readerLoop as it reads from disk is already reported back in the readerLoopResponse, but currently it is only used to track queue position; handleReaderLoopResponse should also update the number of outstanding bytes in memory.)
  2. The event consumer (consumer.go), which reads from readerLoop.output, must inform the core loop of how much data it has claimed (which can then be subtracted from the total outstanding).
  3. The readerLoopRequest sent in core_loop.go:maybeReadPending must now calculate its endPosition based not only on how much total data is available, but on how much memory is free.

Subtlety: currently the reader loop can send events to the outputs while they are still being read, i.e. before it has sent the response to the core loop confirming how many bytes are used. Thus, "acknowledgements" from the consumer may come in before we have confirmation from the reader loop that the memory was occupied in the first place.

This is ok, however (as long as the books are balanced): when the core loop sends the reader loop request with the memory cap, that memory should already be considered "used", and thus we can safely "free" that memory quota when it is claimed by a consumer. The exact byte count in the reader loop response is only needed to detect when the real memory use is less than the amount reserved when sending the request.

@faec faec added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jan 27, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz
Member

cmacknz commented Feb 10, 2022

Moved into 8.3, when the shipper work will start; we'll have to touch the queue at that point anyway. This was needed to make room for a customer request in 8.2.

@pierrehilbert
Collaborator

We should start by doing this for the memory queue; that will reduce the complexity of this task and let us change the estimate from L to something smaller.

@botelastic

botelastic bot commented Jan 31, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jan 31, 2024
@zez3

zez3 commented Feb 1, 2024

Has this already been implemented?

@botelastic botelastic bot removed the Stalled label Feb 1, 2024
@pierrehilbert
Collaborator

No it has not been implemented yet.
