Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KIP 714 New Telemetry Metrics #4808
base: master
Are you sure you want to change the base?
KIP 714 New Telemetry Metrics #4808
Changes from 1 commit
80cec15
e5974c3
e979a1b
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many ways the user can call poll, see where
rd_kafka_app_poll_blocking
is called, those places aren't considered at the moment. So you can convertrd_kafka_app_poll_blocking
tord_kafka_app_poll_start
with a parameter that says if it's blocking that corresponds totimeout_ms
and do the calculation there.Given we're in the hot path let's reduce system calls, you can get the now value here and pass it to
rd_timeout_init0
and then pass it tord_kafka_app_poll_start
too.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this thing further, since we are only refactoring how are we changing the order to reduce system call in the hot path ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
System call here is for getting the monotonic clock, we want to do it only once to reduce number of these calls (it's already done in rd_timeout_init).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be useful for something else than telemetry so we can call it
rk->rk_ts_last_poll_start
also the check is if it non-zero given it's initialized to zero.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ts_fetch_cb_last
can be calledrk->rk_ts_last_poll_end
and set in two places: either when we callrd_kafka_app_polled
(except on subscribe) or just before the callback as done below. If it's non-zero it's not replaced inrd_kafka_app_polled
.idle_interval
difference must be the opposite. As we haven't overwrittents_fetch_last
it's still last fetch start not current one.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converting to double here brings an overhead, it's not necessary as we're then truncating it. It's enough to multiply idle_interval by 1000000 and divide by
poll_interval
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialize
poll_idle_ratio
to 0 and avoid doingidle_interval
calculation and this division ifpoll_interval
is zero.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this needs a discussion, as poll_interval cannot be exactly zero some slippage is expected so 1000 seems fine. Also we should not add false data points so only calculate the case where poll_interval > 1000. 1e7 -> 1e6 so the calculator uses SIX_ORDERS_MAGNITUDE to get value between 0 to 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be zero as it's in microseconds, it happenned to me while testing. 1e6 is a double so it will cause a conversion that we don't need here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rk_avg_poll_idle_ratio
(rk_avg_current
andrk_avg_rollover
) needs to be initialized onrd_kafka_new
if type is consumer and destroyed onrd_kafka_destroy_final
with same conditionThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following comment about
rk_ts_last_poll_end
this can be set only in else (callback) block.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line left with previous name is causing a compilation error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change this call according to comment below about
rk_avg_commit_latency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rebalance latency and commit latency need to be moved to
rd_kafka_s
as they're not per broker. Even for therkb_avg_commit_latency
we don't need an average per broker as the coordinator is only one broker at a time for this client's consumer group.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same: move rebalance latency and commit to
rd_kafka_s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This avg is in
rk
nowThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This state is not reached with 848 group so this switch block needs to be changed a bit: when current state is
RD_KAFKA_CGRP_JOIN_STATE_INIT
orRD_KAFKA_CGRP_JOIN_STATE_STEADY
, and new state is different because of previous condition, it sets the rebalance start time; when new state isRD_KAFKA_CGRP_JOIN_STATE_STEADY
it calculates the rebalance time and adds it to the avg.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't necessary with previous changes, it will handle both types of consumer groups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It hasn't been removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, means that I will make the change i just have not raised the Pr for that commit so left unresolved. These will be reflected with the latest commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's need to add
rebalance_latency_total
to keep previous value necessary to calculate delta temporalityThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we called them
rd_avg_current
andrd_avg_rollover
on the broker as it's a collection ofrd_avg_t
, use same namesThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to unit test calculation -> encode -> decode for other metrics too not only for
connects
so you can pass aset_metric_value
that takesrk
andrkb
that adds to the avgs. That function is called in place ofrkb->rkb_c.connects.val = 1;
we can make that calculated values need to be equal to 1 or you can pass expected value to this function.
Add these tests for new metrics only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a discussion.