Add explicit memory usage cap for the disk queue #30069

Open
Tracked by #118 ...
faec opened this issue Jan 27, 2022 · 6 comments
Labels
debugging · estimation:Month (Task that represents a month of work) · Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)

Comments

@faec
Contributor

faec commented Jan 27, 2022

Describe the enhancement: The disk queue necessarily uses in-memory queues for messages waiting to be written to disk, or to be written to the output after reading from disk. The amount of in-memory data can be capped, but (as with the memory queue) currently only by specifying a maximum number of events; there is no way to specify a maximum number of bytes, so the degree of control is dependent on having consistent / predictable event sizes. We should add parameters that can specify the maximum as a number of bytes instead, so it is easy to correctly tune a node's memory usage independent of its specific event flow.
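As a rough illustration of what this could look like in configuration (the `*_bytes` setting names here are purely hypothetical, not a committed schema; only `queue.disk.max_size`, `read_ahead`, and `write_ahead` exist today):

```yaml
queue.disk:
  max_size: 10GB      # existing cap on on-disk queue size
  write_ahead: 2048   # existing event-count cap on the intake buffer
  read_ahead: 512     # existing event-count cap on the output buffer
  # Hypothetical byte-based equivalents proposed by this issue:
  write_ahead_bytes: 16MB
  read_ahead_bytes: 16MB
```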

Describe a specific use case for the enhancement or feature: This feature was not a priority in the initial release because the most urgent use case we understood at the time was data persistence. However, we are now seeing it deployed as an alternative to the memory queue specifically to work around the constraints of low-memory nodes, and these configuration options could greatly increase its effectiveness in that setting.

Technical sketch

There are two changes needed to support this feature, one for the intake queue (where events wait to be written to disk), and one for the output queue (where events that have been read from disk wait to be assigned to an output worker).

Intake queue

This change should be fairly simple: in libbeat/publisher/pipeline/queue/diskqueue/core_loop.go, the function diskQueue.canAcceptFrameOfSize() decides whether to accept a new event into the intake queue (diskQueue.pendingFrames). Currently its check is based only on the number of events already in the queue, but adding a size check would be straightforward (there are already helpers that calculate the size of the intake queue).

Output queue

This will require more significant changes, but is still feasible. The output queue is currently in the channel readerLoop.output in libbeat/publisher/pipeline/queue/diskqueue/reader_loop.go. Right now the simple event-count cap is implemented only by setting the buffer size of the channel.

To make it aware of size constraints:

  1. The core loop (core_loop.go) must track how many bytes have been read from disk that have not yet been claimed by a consumer. This will likely be a byte counter in the main diskQueue structure. (The number of bytes read / allocated by the readerLoop as it reads from disk is already reported back in the readerLoopResponse, but currently it is only used to track queue position; handleReaderLoopResponse should also update the number of outstanding bytes in memory.)
  2. The event consumer (consumer.go), which reads from readerLoop.output, must inform the core loop of how much data it has claimed (which can then be subtracted from the total outstanding).
  3. The readerLoopRequest sent in core_loop.go:maybeReadPending must now calculate its endPosition based not only on how much total data is available, but on how much memory is free.

Subtlety: currently the reader loop can send events to the outputs while they are still being read, i.e. before it has sent the response to the core loop confirming how many bytes are used. Thus, "acknowledgements" from the consumer may come in before we have confirmation from the reader loop that the memory was occupied in the first place.

This is ok, however (as long as the books are balanced): when the core loop sends the reader loop request with the memory cap, that memory should already be considered "used", and thus we can safely "free" that memory quota when it is claimed by a consumer. The exact byte count in the reader loop response is only needed to detect when the real memory use is less than the amount reserved when sending the request.

@faec faec added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jan 27, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz
Member

cmacknz commented Feb 10, 2022

Moved into 8.3, when the shipper work will start; we'll have to touch the queue at that point anyway. This was needed to make room for a customer request in 8.2.

@pierrehilbert
Collaborator

We should start by doing this for the memory queue; that will reduce the complexity of this task and let us change the estimate from L to something smaller.

@botelastic

botelastic bot commented Jan 31, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jan 31, 2024
@zez3

zez3 commented Feb 1, 2024

Has this already been implemented?

@botelastic botelastic bot removed the Stalled label Feb 1, 2024
@pierrehilbert
Collaborator

No it has not been implemented yet.
