
KIP 714 New Telemetry Metrics #4808

Open · wants to merge 3 commits into base: master

Conversation

mahajanadhitya (Contributor):
No description provided.

* Initial Commit

* changes

* minor

* changes

* Commit Latency Metric

* changes

* changes

* Style fix

---------

Co-authored-by: Anchit Jain <anjain@confluent.io>
@mahajanadhitya mahajanadhitya requested a review from a team as a code owner August 8, 2024 19:11
@emasab emasab (Collaborator) left a comment:

First round of comments

src/rdkafka_op.h (outdated; resolved)
src/rdkafka.c (outdated; resolved)
tests/0150-telemetry_mock.c (outdated; resolved)
src/rdkafka.c Outdated
@@ -3218,10 +3219,21 @@ rd_kafka_consume0(rd_kafka_t *rk, rd_kafka_q_t *rkq, int timeout_ms) {
rd_kafka_app_poll_blocking(rk);

rd_kafka_yield_thread = 0;
rd_ts_t now = rd_clock();
Collaborator:

There are many ways the user can call poll (see where rd_kafka_app_poll_blocking is called), and those call sites aren't considered at the moment. So you can convert rd_kafka_app_poll_blocking into rd_kafka_app_poll_start, with a parameter saying whether the call is blocking (derived from timeout_ms), and do the calculation there.

Given we're in the hot path, let's reduce system calls: you can get the now value here once, pass it to rd_timeout_init0, and then pass it to rd_kafka_app_poll_start too.

Contributor (author):

Can you explain this further? Since we are only refactoring, how does changing the order reduce system calls in the hot path?

Collaborator:

The system call here is the one that reads the monotonic clock; we want to make it only once to reduce the number of these calls (it's already made in rd_timeout_init).

src/rdkafka.c Outdated
@@ -3218,10 +3219,21 @@ rd_kafka_consume0(rd_kafka_t *rk, rd_kafka_q_t *rkq, int timeout_ms) {
rd_kafka_app_poll_blocking(rk);

rd_kafka_yield_thread = 0;
rd_ts_t now = rd_clock();
if (rk->rk_telemetry.ts_fetch_last != -1) {
Collaborator:

It can be useful for more than telemetry, so we can call it rk->rk_ts_last_poll_start. Also, the check should be for non-zero, given it's initialized to zero.

Contributor (author):

ok.

.rkb_avg_rebalance_latency,
&rkb->rkb_telemetry.rd_avg_current
.rkb_avg_rebalance_latency);
rd_avg_destroy(
Collaborator:

Fetch latency rollover is to be done only inside if (rk->rk_type == RD_KAFKA_CONSUMER).

Contributor (author):

ok.

Comment on lines 851 to 855
rd_avg_destroy(
&rk->rk_telemetry.rk_avg_rollover.rk_avg_poll_idle_ratio);
rd_avg_rollover(
&rk->rk_telemetry.rk_avg_current.rk_avg_poll_idle_ratio,
&rk->rk_telemetry.rk_avg_rollover.rk_avg_poll_idle_ratio);
Collaborator:

It's to be done only if rk->rk_type == RD_KAFKA_CONSUMER; also add the corresponding instructions for rk_avg_rebalance_latency and rk_avg_commit_latency.

Contributor (author):

ok.

Comment on lines 848 to 849
rk->rk_telemetry.ts_fetch_last = -1;
rk->rk_telemetry.ts_fetch_cb_last = -1;
Collaborator:

These two (renamed) fields don't need to be reset on push; they just need to be zero-initialized by default and used afterwards.

Contributor (author):

ok.

"consumer coordinator.",
.unit = "ms",
.is_int = rd_true,
.is_per_broker = rd_true,
Collaborator:

is_per_broker is rd_false everywhere except for node.request.latency; it's separate from where we store the values, see the KIP's labels.

Contributor (author):

ok

.is_per_broker = rd_true,
.type = RD_KAFKA_TELEMETRY_METRIC_TYPE_SUM},
[RD_KAFKA_TELEMETRY_METRIC_CONSUMER_FETCH_MANAGER_FETCH_LATENCY_AVG] =
{.name = "consumer.fetch.manager.fetch.latency.avg ",
Collaborator (@emasab, Aug 14, 2024):

Remove these additional spaces in metric names within this file

Suggested change
{.name = "consumer.fetch.manager.fetch.latency.avg ",
{.name = "consumer.fetch.manager.fetch.latency.avg",

Contributor (author):

ok.

@emasab emasab (Collaborator) left a comment:

Comments after the first round of changes, and about unit and mock tests

src/rdkafka.c Outdated
@@ -3889,6 +3901,7 @@ rd_kafka_op_res_t rd_kafka_poll_cb(rd_kafka_t *rk,

switch ((int)rko->rko_type) {
case RD_KAFKA_OP_FETCH:
rk->rk_telemetry.ts_fetch_cb_last = rd_clock();
Collaborator:

This line still uses the previous field name and is causing a compilation error.


if (rkbuf->rkbuf_reqhdr.ApiKey == RD_KAFKAP_Fetch) {
rd_avg_add(&rkb->rkb_rk->rk_telemetry.rd_avg_current
.rk_avg_fetch_latency,
Collaborator:

Fetch latency needs to stay per broker, as it's useful to have that information per broker, unlike rk_avg_commit_latency, which is updated by a single broker at a time (the group coordinator).

Contributor (author):

But it has is_per_broker set to false in the KIP table itself.

Contributor (author):

I will move it from rk to rkb


if(join_state == RD_KAFKA_CGRP_JOIN_STATE_STEADY){
rd_avg_add(&rkcg->rkcg_curr_coord->rkb_telemetry.rd_avg_current
.rkb_avg_rebalance_latency,
Collaborator:

This avg is in rk now

.rkb_avg_rebalance_latency,
rd_clock() - rkcg->rkcg_ts_rebalance_start);
}
switch ((int)rkcg->rkcg_join_state) {
Collaborator:

You can change the switch here and join it with the previous condition into an if ... else if, since they test different variables: join_state in one case and rkcg->rkcg_join_state in the other.

Contributor (author):

sure

src/rdkafka.c Outdated
@@ -939,7 +939,19 @@ void rd_kafka_destroy_final(rd_kafka_t *rk) {
/* Synchronize state */
rd_kafka_wrlock(rk);
rd_kafka_wrunlock(rk);
if(rk->rk_type == RD_KAFKA_CONSUMER){
Collaborator:

There's already an if condition for the consumer later on; add these instructions there.

Contributor (author):

ok.

* successful PushTelemetry requests.
* See `requests_expected` for detailed expected flow.
*/
void do_test_telemetry_get_subscription_push_telemetry_consumer(void) {
Collaborator:

Please add static to all functions in this file except the main one.

Collaborator:

This should be the existing do_test_telemetry_get_subscription_push_telemetry, but taking a rd_kafka_type_t type parameter, so we can call the same function first with the producer and then with the consumer.

Collaborator:

Testing with both the producer and the consumer should be done for all the tests in this file; that's the next thing you should work on.

Contributor (author):

These are not specific to the new metrics; I need to discuss them.

Comment on lines 237 to 241
mcluster = test_mock_cluster_new(1, &bootstraps);
rd_kafka_mock_telemetry_set_requested_metrics(mcluster,
expected_metrics, 1);
rd_kafka_mock_telemetry_set_push_interval(mcluster, push_interval);
rd_kafka_mock_start_request_tracking(mcluster);
Collaborator:

This part can be refactored into a create_mcluster function and used in all the tests.

Comment on lines 243 to 246
test_conf_init(&conf, NULL, 30);
test_conf_set(conf, "bootstrap.servers", bootstraps);
test_conf_set(conf, "debug", "telemetry");
consumer = test_create_handle(RD_KAFKA_CONSUMER, conf);
Collaborator:

Creating the handle can also be refactored and reused in all the tests. The consumer must subscribe to a topic that is created and pre-populated in advance, so we can see metrics in the logs of the consumer tests.

Contributor (author):

sure !


/* Poll for enough time for two pushes to be triggered, and a little
* extra, so 2.5 x push interval. */
test_poll_timeout(consumer, push_interval * 2.5);
Collaborator:

test_poll_timeout should call test_consumer_poll_timeout with the consumer, or test_produce_msgs with the producer, so we can see non-zero metrics in the logs and check whether they're correct. A later refactor could check the metric values automatically, but for this step that is enough.

* extra, so 2.5 x push interval. */
test_poll_timeout(consumer, push_interval * 2.5);

requests = rd_kafka_mock_get_requests(mcluster, &request_cnt);
Collaborator:

The requests can be fetched and destroyed directly inside test_telemetry_check_protocol_request_times, so we can just pass mcluster to it and remove this repeated code from all the subtests.

Contributor (author):

Since these depend on all the tests, I will come back to them in office hours.

@mahajanadhitya (Contributor, author):

Librdkafka KIP 714 New Metrics: addition of telemetry metrics
Branch name: feature/714_NewMetrics
Unit tests: $ TESTS=0000 make
Mock integration tests: $ TESTS=0150 make
KIP 714 Link
Note: the fetch coordinator latency metric has been changed to per_broker instead of per_rk, which diverges from the main KIP 714 table.
Any warnings from configure, make, or make install are not caused by my changes.

Formatting done via:
$ make style-fix

RD_KAFKA_TELEMETRY_METRIC_CONSUMER_POLL_IDLE_RATIO_AVG,
RD_KAFKA_TELEMETRY_METRIC_CONSUMER_COORDINATOR_COMMIT_LATENCY_AVG,
RD_KAFKA_TELEMETRY_METRIC_CONSUMER_COORDINATOR_COMMIT_LATENCY_MAX};
for (int i = 0; i < 6; i++) {
Member:

Initialize this before using it in the loop; it's causing a CI failure.

@anchitj anchitj (Member) left a comment:

Overall looks good; some minor changes needed. Please also fix the CI.

* between buf_enq0
* and writing to socket
*/
rd_avg_t rd_avg_fetch_latency; /**< Current fetch
Member:

This should be named rkb_avg_fetch_latency since this is at broker level.

} rd_avg_current;
struct {
rd_avg_t rkb_avg_rtt; /**< Rolled over RTT avg */
rd_avg_t
rkb_avg_throttle; /**< Rolled over throttle avg */
rd_avg_t rkb_avg_outbuf_latency; /**< Rolled over outbuf
* latency avg */
rd_avg_t rd_avg_fetch_latency; /**< Rolled over fetch
Member:

rkb_avg_fetch_latency

@@ -692,12 +695,37 @@ struct rd_kafka_s {
int *matched_metrics;
size_t matched_metrics_cnt;


Member:

remove

Comment on lines -42 to +54
rd_kafka_mock_request_t **requests_actual,
size_t actual_cnt,
rd_kafka_mock_cluster_t *mcluster,
rd_kafka_telemetry_expected_request_t *requests_expected,
size_t expected_cnt) {
size_t actual_cnt;
rd_kafka_mock_request_t **requests_actual =
rd_kafka_mock_get_requests(mcluster, &actual_cnt);
Member:

Why was this changed? We should either also remove the rd_kafka_mock_get_requests calls from the other methods, or keep this as it was.

@anchitj (Member) commented Sep 13, 2024:

Please also resolve the earlier comments if they've been fixed.
