Add basic metrics to disk queue #33471

fearful-symmetry · 2022-10-27T18:33:47Z

What does this PR do?

This is part of the fix for elastic/elastic-agent-shipper#11

This adds basic health and status metrics to the disk queue: a byte limit, a current byte count, an occupied_read metric, and a report of the oldest ID in the queue.

I've never looked at the disk queue before this, and I still only kind of understand it, so there are probably some issues lurking around in here. In particular, there are no mutexes or anything in the segments code, and the comments do mention restricting how things like sizeOnDisk() are called, so I'm a tad concerned about how we're making multiple passes through the various queueSegment types.

This also adds a new field to the queue.Metrics type, as the unacked read field was previously events-only.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

elasticmachine · 2022-10-27T18:33:50Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

mergify · 2022-10-27T18:34:22Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

elasticmachine · 2022-10-27T19:25:28Z

💚 Build Succeeded

Build stats

Start Time: 2022-11-03T20:05:49.191+0000
Duration: 77 min 30 sec

Test stats 🧪

Test	Results
Failed	0
Passed	23741
Skipped	1951
Total	25692

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

fearful-symmetry · 2022-10-27T20:52:27Z

Tests are working locally, not sure what's up with CI....

cmacknz

Did a first pass based on my knowledge of the disk queue internals.

libbeat/publisher/queue/queue.go

libbeat/publisher/queue/diskqueue/queue.go

libbeat/publisher/queue/diskqueue/segments.go

cmacknz · 2022-11-02T19:25:16Z

libbeat/publisher/queue/diskqueue/segments.go

+	oldSegments = append(oldSegments, segments.writing...)
+	oldSegments = append(oldSegments, segments.acked...)
+
+	sort.Slice(oldSegments, func(i, j int) bool { return oldSegments[i].id < oldSegments[j].id })


Is this taking the oldest segment ID, or the oldest event ID? I think it is the former, and we want the latter.

We report the oldest event ID in the memory queue, which has no concept of a segment:

beats/libbeat/publisher/queue/memqueue/eventloop.go

Lines 155 to 161 in df5c62f

	// If the queue is empty, we report the "oldest" ID as the next
	// one that will be assigned. Otherwise, we report the ID attached
	// to the oldest queueEntry.
	oldestEntryID := l.nextEntryID
	if oldestEntry := l.buf.OldestEntry(); oldestEntry != nil {
		oldestEntryID = oldestEntry.id
	}

Probably implementing this requires elastic/elastic-agent-shipper#27 to be done first

Ah, you're right. @leehinman Do you think there's any value in reporting the oldest segment ID? The more I think about it, I'm kind of leaning towards "no."

No, I don't think the oldest segment ID will be useful. It is also easy to find out: the name of each segment file is <id no>.seg, so just looking at the queue directory gives us this info.
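
To illustrate the point about file names, here is a minimal sketch of recovering the oldest segment ID from a directory listing. It assumes only the `<id no>.seg` naming mentioned above; `oldestSegmentID` is a hypothetical helper, not a function in the disk queue, and in practice the names would come from something like `os.ReadDir` on the queue directory.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// oldestSegmentID returns the smallest numeric ID among file names of the
// form "<id>.seg". Names that do not match that pattern are skipped.
// The boolean is false when no segment file name was found.
// Hypothetical helper for illustration only.
func oldestSegmentID(names []string) (uint64, bool) {
	var oldest uint64
	found := false
	for _, name := range names {
		if !strings.HasSuffix(name, ".seg") {
			continue
		}
		id, err := strconv.ParseUint(strings.TrimSuffix(name, ".seg"), 10, 64)
		if err != nil {
			continue // not a numeric segment name
		}
		if !found || id < oldest {
			oldest, found = id, true
		}
	}
	return oldest, found
}

func main() {
	// Example listing; real names would be read from the queue directory.
	names := []string{"12.seg", "7.seg", "notes.txt", "9.seg"}
	if id, ok := oldestSegmentID(names); ok {
		fmt.Println("oldest segment ID:", id)
	}
}
```

Since the same information is available from a plain directory listing, this supports dropping the metric rather than maintaining it inside the queue.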

cmacknz · 2022-11-02T19:26:32Z

libbeat/publisher/queue/queue.go

@@ -47,6 +47,8 @@ type Metrics struct {

 	//UnackedConsumedEvents is the count of events that an output consumer has read, but not yet ack'ed
 	UnackedConsumedEvents opt.Uint
+	//UnackedConsumedBytes is the count of bytes that an output consumer has read, but not yet ack'ed
+	UnackedConsumedBytes opt.Uint

 	//OldestActiveTimestamp is the timestamp of the oldest item in the queue.
 	OldestActiveTimestamp common.Time


Can you remove OldestActiveTimestamp from this struct? We are not implementing it, since it requires too many changes to the internals of both queues.

That'll require making changes in elastic-agent-shipper, so we may want to do that in a separate PR?

Yes, let's do a separate PR to clean this up.

cmacknz · 2022-11-02T19:30:47Z

libbeat/publisher/queue/diskqueue/segments.go

+func (segments *diskQueueSegments) unACKedReadBytes() uint64 {
+	var acc uint64
+	for _, seg := range segments.acking {
+		acc += seg.byteCount


I'm not sure this is right either, because this will count all the bytes in a segment, which may cover every event in the segment file. Ideally we would want this to increment event by event; otherwise it will just always be the size of the segment file currently being sent.

The disk queue is a collection of segment files, each segment file containing multiple frames (events): https://github.com/elastic/beats/blob/main/libbeat/publisher/queue/diskqueue/docs/on-disk-structures.md

If this is difficult to implement in a useful way, I am in favour of just not implementing the metric at this point.

@leehinman can you chime in here? I'm kind of going off the assumption that if we're reporting values in bytes, it doesn't matter too much how those byte segments map to individual events, but I don't know enough about the backend of the disk queue to know if "the current size of whatever segment chunk just got sent" is useful or not.

I don't think the number of events (or bytes) that have been read but not acked would be very useful. It should always be the same as the amount of data the output has sent but hasn't heard back about yet, so the output should already have this info.
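
To make the difference discussed in this thread concrete, here is a rough sketch that contrasts summing whole segments with counting frame by frame. The `frame` and `segment` types are simplified, hypothetical stand-ins (not the real `queueSegment`), and segment headers and per-frame overhead are ignored.

```go
package main

import "fmt"

// frame is a hypothetical stand-in for one encoded event in a segment file.
type frame struct {
	size  uint64 // encoded size in bytes
	read  bool   // handed to an output consumer
	acked bool   // acknowledged by the output
}

// segment is a hypothetical stand-in for a disk queue segment.
type segment struct {
	frames []frame
}

// byteCount is the total size of every frame in the segment.
func (s segment) byteCount() uint64 {
	var total uint64
	for _, f := range s.frames {
		total += f.size
	}
	return total
}

// segmentLevelUnacked sums whole segments, as the diff under review does.
func segmentLevelUnacked(segs []segment) uint64 {
	var acc uint64
	for _, s := range segs {
		acc += s.byteCount()
	}
	return acc
}

// frameLevelUnacked counts only frames that were read but not yet acked.
func frameLevelUnacked(segs []segment) uint64 {
	var acc uint64
	for _, s := range segs {
		for _, f := range s.frames {
			if f.read && !f.acked {
				acc += f.size
			}
		}
	}
	return acc
}

func main() {
	segs := []segment{{frames: []frame{
		{size: 100, read: true, acked: true}, // already acknowledged
		{size: 100, read: true},              // in flight
		{size: 100},                          // not read yet
	}}}
	fmt.Println("segment-level:", segmentLevelUnacked(segs))
	fmt.Println("frame-level:  ", frameLevelUnacked(segs))
}
```

With one frame in flight, the segment-level sum reports the full segment while the frame-level count reports only the in-flight frame, which is the gap raised above.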

leehinman · 2022-11-03T17:47:47Z

libbeat/publisher/queue/diskqueue/queue.go

+}
+
+// metrics response from the disk queue
+type metricsRequestResponse struct {


I think we could add number of items in the queue. On startup the segments are read, and they contain a count field. We could update that total when we add & ack events in the queue.

I'm not sure oldestEntryID is very helpful. The ID isn't a time, so if I told you the oldest ID was 13075, what would you do with that information? I think I'd rather see a gauge that shows me the rate at which things are being added to the queue and vacated from the queue, as well as the total number in the queue.

For OccupiedRead, is this number of events that we have sent but haven't received an ACK for?

I'm not sure oldestEntryID is very helpful. The ID isn't a time, so if I told you the oldest ID was 13075, what would you do with that information? I think I'd rather see a gauge that shows me the rate at which things are being added to the queue and vacated from the queue, as well as the total number in the queue.

I was thinking the same. The only value of oldestEntryID is that we can look to see if it is changing to know if the queue is draining. We will probably get very similar information from observing the current queue size, but the oldestEntryID would let us catch the edge case where the queue is almost always full but data is still moving through it.

I am not opposed to removing this metric, but I think it has some small value.

I think we could add number of items in the queue. On startup the segments are read, and they contain a count field. We could update that total when we add & ack events in the queue

Having EventCount for the disk queue would be interesting, and would mean we can always rely on the EventCount metric being populated between both queue types which I think makes the metrics easier to use.
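
A minimal sketch of the event count idea above, assuming the per-segment counts are available at startup and that the counter is only touched from a single goroutine. `eventCounter` and its methods are hypothetical names, not part of the disk queue.

```go
package main

import "fmt"

// eventCounter is a hypothetical running total of events in the queue.
// It is not safe for concurrent use; this sketch assumes a single
// goroutine owns it.
type eventCounter struct {
	total uint64
}

// newEventCounter seeds the total from the count field of each segment
// read at startup.
func newEventCounter(segmentCounts []uint32) *eventCounter {
	c := &eventCounter{}
	for _, n := range segmentCounts {
		c.total += uint64(n)
	}
	return c
}

// added records n events written to the queue.
func (c *eventCounter) added(n uint64) { c.total += n }

// acked records n events acknowledged and removed, clamping at zero so a
// bookkeeping mismatch cannot underflow the counter.
func (c *eventCounter) acked(n uint64) {
	if n > c.total {
		c.total = 0
		return
	}
	c.total -= n
}

// count returns the current number of events in the queue.
func (c *eventCounter) count() uint64 { return c.total }

func main() {
	c := newEventCounter([]uint32{3, 5}) // two segments found at startup
	c.added(2)
	c.acked(4)
	fmt.Println("events in queue:", c.count())
}
```

Keeping a single running total avoids re-walking the segment lists on every metrics request, which also sidesteps some of the concern about making multiple passes over unsynchronized segment data.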

I'm reluctant to use oldestEntryID as an indicator of a draining queue. If for some reason an event can't be delivered but we have retries, then oldestEntryID might be static while the queue is still draining.

If oldestEntryID is only ever useful during edge cases, let's just remove it. It will be better to add some more obvious metrics for those, like watching an increasing retry counter in the output metrics.

fearful-symmetry · 2022-11-03T19:08:35Z

Alright @cmacknz, I had a brief Zoom chat with @leehinman, and we've cleared up a few things about how this should work:

I'm going to get rid of the oldest segment ID and OccupiedRead, since they're just not useful in this context
In a follow-up PR, I'm going to retool the Queue metrics struct to be a little less opinionated and give us more room to report metrics that are specific to a given queue type
Lee is going to update the meta-issue with some ideas for metrics that are specific to the disk queue that we can implement in the future

* add basic metrics to disk queue
* linter
* debugging testt CI issues
* tinkering with tests still
* remove unneeded metrics, clean up

add basic metrics to disk queue

3610c86

fearful-symmetry added the Team:Elastic-Agent and v8.6.0 labels Oct 27, 2022

fearful-symmetry requested review from faec and leehinman October 27, 2022 18:33

fearful-symmetry requested a review from a team as a code owner October 27, 2022 18:33

fearful-symmetry self-assigned this Oct 27, 2022

botelastic bot added and removed the needs_team label Oct 27, 2022

linter

b1d0ad8

fearful-symmetry added 2 commits October 27, 2022 15:42

debugging testt CI issues

9c82d69

tinkering with tests still

28f0d43

cmacknz reviewed Nov 2, 2022

leehinman reviewed Nov 3, 2022

remove unneeded metrics, clean up

1eb30b6

fearful-symmetry requested review from leehinman and cmacknz November 4, 2022 17:34

leehinman approved these changes Nov 9, 2022

fearful-symmetry merged commit 2e61bb8 into elastic:main Nov 9, 2022

pierrehilbert mentioned this pull request Feb 7, 2023

Implement queue metrics for the disk queue elastic/elastic-agent-shipper#145

Open

chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023

Add basic metrics to disk queue (#33471)

2c1ca62

* add basic metrics to disk queue
* linter
* debugging testt CI issues
* tinkering with tests still
* remove unneeded metrics, clean up
