Add basic metrics to disk queue #33471
Changes from 4 commits
3610c86
b1d0ad8
9c82d69
28f0d43
1eb30b6
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -470,6 +470,35 @@ func (segments *diskQueueSegments) sizeOnDisk() uint64 { | |||||||||||||||
return total | ||||||||||||||||
} | ||||||||||||||||
|
||||||||||||||||
// Iterates through all segment types to find the oldest ID in the queue | ||||||||||||||||
func (segments *diskQueueSegments) oldestIDOnDisk() segmentID { | ||||||||||||||||
// not all segment types are pre-sorted, hence us appending some, taking [0] from others | ||||||||||||||||
fearful-symmetry marked this conversation as resolved.
|
||||||||||||||||
oldSegments := []*queueSegment{} | ||||||||||||||||
|
||||||||||||||||
if len(segments.reading) > 0 { | ||||||||||||||||
oldSegments = append(oldSegments, segments.reading[0]) | ||||||||||||||||
} | ||||||||||||||||
if len(segments.acking) > 0 { | ||||||||||||||||
oldSegments = append(oldSegments, segments.acking[0]) | ||||||||||||||||
} | ||||||||||||||||
|
||||||||||||||||
oldSegments = append(oldSegments, segments.writing...) | ||||||||||||||||
oldSegments = append(oldSegments, segments.acked...) | ||||||||||||||||
|
||||||||||||||||
sort.Slice(oldSegments, func(i, j int) bool { return oldSegments[i].id < oldSegments[j].id }) | ||||||||||||||||
Is this taking the oldest segment ID, or the oldest event ID? I think it is the former, and we want the latter. We report the oldest event ID in the memory queue, which has no concept of a segment: beats/libbeat/publisher/queue/memqueue/eventloop.go, lines 155 to 161 in df5c62f.
Probably implementing this requires elastic/elastic-agent-shipper#27 to be done first.

Ah, you're right. @leehinman Do you think there's any value in reporting the oldest segment ID? The more I think about it, I'm kind of leaning towards "no."

No, I don't think oldest ID will be useful. Also it is easy to find out, the name of segment file is |
||||||||||||||||
|
||||||||||||||||
return oldSegments[0].id | ||||||||||||||||
} | ||||||||||||||||
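The function above indexes `oldSegments[0]` after sorting, which would panic if every segment list were empty. A minimal sketch of an alternative, assuming simplified stand-in types (`segmentID` and `queueSegment` here carry only the `id` field, and `oldestSegmentID` is a hypothetical name, not part of the PR): scan for the minimum directly and report whether anything was found.

```go
package main

// segmentID and queueSegment are simplified stand-ins for the disk queue's
// types (assumption: only the id field matters for this sketch).
type segmentID uint64

type queueSegment struct {
	id segmentID
}

// oldestSegmentID returns the smallest ID across all the given segment
// lists. The boolean is false when every list is empty, so callers never
// index into an empty slice.
func oldestSegmentID(lists ...[]*queueSegment) (segmentID, bool) {
	var oldest segmentID
	found := false
	for _, list := range lists {
		for _, seg := range list {
			if !found || seg.id < oldest {
				oldest = seg.id
				found = true
			}
		}
	}
	return oldest, found
}
```

A single pass avoids both the intermediate slice and the sort, and it makes no assumption about which lists are pre-sorted.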
|
||||||||||||||||
// unACKedReadBytes returns the total count of bytes that have been read, but not ack'ed by the consumer | ||||||||||||||||
func (segments *diskQueueSegments) unACKedReadBytes() uint64 { | ||||||||||||||||
var acc uint64 | ||||||||||||||||
for _, seg := range segments.acking { | ||||||||||||||||
acc += seg.byteCount | ||||||||||||||||
I'm not sure this is right either, because this will count all the bytes in a segment which may consist of all the events in the segment file. Ideally we would want this to increment event by event, otherwise it will just always be the size of the segment file currently being sent. The disk queue is a collection of segment files, each segment file containing multiple frames (events): https://github.com/elastic/beats/blob/main/libbeat/publisher/queue/diskqueue/docs/on-disk-structures.md

If this is difficult to implement in a useful way, I am in favour of just not implementing the metric at this point.

@leehinman can you chime in here? I'm kind of going off the assumption that if we're reporting values in bytes, it doesn't matter too much how those byte segments map to individual events, but I don't know enough about the backend of the disk queue to know if "currently size of whatever segment chunk just got sent" is useful or not.

I don't think the amount of events (or bytes) that have been read but not acked would be very useful. It should always be the same as amount of data the output has sent to but hasn't heard back yet. So the output should have this info. |
||||||||||||||||
} | ||||||||||||||||
return acc | ||||||||||||||||
} | ||||||||||||||||
|
||||||||||||||||
// segmentReader handles reading of segments. getReader sets up the | ||||||||||||||||
// reader and handles setting up the Reader to deal with the different | ||||||||||||||||
// schema version. With Schema version 2 there is the option for | ||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -47,6 +47,8 @@ type Metrics struct { | |
|
||
//UnackedConsumedEvents is the count of events that an output consumer has read, but not yet ack'ed | ||
UnackedConsumedEvents opt.Uint | ||
//UnackedConsumedBytes is the count of bytes that an output consumer has read, but not yet ack'ed | ||
fearful-symmetry marked this conversation as resolved.
|
||
UnackedConsumedBytes opt.Uint | ||
|
||
//OldestActiveTimestamp is the timestamp of the oldest item in the queue. | ||
OldestActiveTimestamp common.Time | ||
Can you remove

That'll require making changes in

Yes let's do a separate up PR to clean this up. |
||
|
I think we could add the number of items in the queue. On startup the segments are read, and they contain a count field. We could update that total when we add & ack events in the queue.
I'm not sure oldestEntryID is very helpful. The ID isn't a time, so if I told you the oldest ID was 13075, what would you do with that information? I think I'd rather see a gauge that shows me the rate at which things are being added to and vacated from the queue, as well as the total number in the queue.
For OccupiedRead, is this the number of events that we have sent but haven't received an ACK for?

I was thinking the same. The only value of oldestEntryID is that we can look to see if it is changing to know if the queue is draining. We will probably get very similar information from observing the current queue size, but the oldestEntryID would let us catch the edge case where the queue is almost always full but data is still moving through it.
I am not opposed to removing this metric, but I think it has some small value.
Having EventCount for the disk queue would be interesting, and would mean we can always rely on the EventCount metric being populated between both queue types which I think makes the metrics easier to use.

I'm reluctant to use oldestEntryID as an indicator of a draining queue. If for some reason an event can't be delivered but we have retries, then oldestEntryID might be static but the queue could be draining.
If oldestEntryID is only ever useful during edge cases, let's just remove it. It will be better to add some more obvious metrics for those, like watching an increasing retry counter in the output metrics.
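
The event-count idea from this thread (seed a total from the per-segment count fields at startup, then update it as events are added and ACKed) could be sketched roughly as below. This is only an illustration under stated assumptions: `queueCounters` and its methods are hypothetical names, and the per-segment counts are passed in as a plain slice rather than read from real segment headers.

```go
package main

import "sync"

// queueCounters is a hypothetical helper, not part of the disk queue.
// It keeps a running event count plus monotonically increasing totals.
type queueCounters struct {
	mu      sync.Mutex
	initial uint64 // events already on disk at startup
	added   uint64 // events added since startup
	acked   uint64 // events ACKed since startup
}

// newQueueCounters seeds the count from the per-segment count fields
// discovered at startup (assumed to be available as a slice here).
func newQueueCounters(segmentCounts []uint64) *queueCounters {
	c := &queueCounters{}
	for _, n := range segmentCounts {
		c.initial += n
	}
	return c
}

// eventAdded records one event written to the queue.
func (c *queueCounters) eventAdded() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.added++
}

// eventsACKed records n events acknowledged, clamped so the in-queue
// count can never go below zero.
func (c *queueCounters) eventsACKed(n uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if remaining := c.initial + c.added - c.acked; n > remaining {
		n = remaining
	}
	c.acked += n
}

// snapshot returns the current in-queue count and the running totals.
func (c *queueCounters) snapshot() (inQueue, added, acked uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.initial + c.added - c.acked, c.added, c.acked
}
```

Because `added` and `acked` only ever increase, a consumer of these metrics can derive add and drain rates from successive snapshots. That also covers the edge case raised above, where the queue stays nearly full while data is still moving through it.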