Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tcp long connections metrics #1249

Open
wants to merge 62 commits into
base: main
Choose a base branch
from

Conversation

yp969803
Copy link
Contributor

@yp969803 yp969803 commented Feb 24, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:
The pr introduces new feature of tcp_long_conn metrics
Which issue(s) this PR fixes:

Fixes #1211

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
Yes

Tcp long connections metrics

@kmesh-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nlgwcy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yp969803
Copy link
Contributor Author

yp969803 commented Feb 24, 2025

@nlgwcy @LemmyHuang @xiangxinyong can you review the pr, have i writen correct ebpf code, i have done the changes which has been told in previous comments in the proposal.

Copy link
Collaborator

@LiZhenCheng9527 LiZhenCheng9527 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please present the results of the long connection metrics
  2. The original destination address problem is still unhandle. If you plan to optimize later, you can create a sub-issue under the existing lfx issue to track the task.

@@ -55,6 +55,7 @@ type MetricController struct {
EnableAccesslog atomic.Bool
EnableMonitoring atomic.Bool
EnableWorkloadMetric atomic.Bool
EnableLongTCPMetric atomic.Bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary to distinguish between metrics for long connections and short connections.
As long as metrics is turned on, we should handle all types of connections.

#define kmesh_perf_map km_perf_map
#define kmesh_perf_info km_perf_info
#define map_of_long_tcp_conns km_longtcpconns_map
#define long_tcp_conns_events km_longtcpconns_events
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BPF_OBJ_NAME_LEN = 16
So I think, this name of map is too long.

Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
…userspace

Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
…km_longconn_ev ringbuff

Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
@yp969803 yp969803 force-pushed the tcp_long_connection_code branch from b79a168 to a698cac Compare February 27, 2025 08:53
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
@yp969803 yp969803 force-pushed the tcp_long_connection_code branch 2 times, most recently from c274149 to 3df32ec Compare March 21, 2025 06:39
Signed-off-by: Yash Patel <yp969803@gmail.com>

rfac: sendmsg.c

Signed-off-by: Yash Patel <yp969803@gmail.com>

chore: generated bpf2go

Signed-off-by: Yash Patel <yp969803@gmail.com>

rfac: added is_managed_by_kmesh to sendmsg.c

Signed-off-by: Yash Patel <yp969803@gmail.com>
@yp969803 yp969803 force-pushed the tcp_long_connection_code branch from 3df32ec to e26073e Compare March 21, 2025 06:44
@yp969803
Copy link
Contributor Author

    bpf_map.go:60: bpf init failed: bpf Load failed: load program: permission denied:
        	; int sockops_active_prog(struct bpf_sock_ops *skops)
        	0: (bf) r6 = r1
        	; __u64 sock_cookie = bpf_get_socket_cookie(skops);
        	1: (85) call bpf_get_socket_cookie#46
        	2: (bf) r7 = r0
        	; if (skops->family != AF_INET && skops->family != AF_INET6)
        	3: (61) r1 = *(u32 *)(r6 +20)
        	; if (skops->family != AF_INET && skops->family != AF_INET6)
        	4: (bf) r2 = r1
        	5: (47) r2 |= 8
        	6: (15) if r2 == 0xa goto pc+1

@LiZhenCheng9527 @nlgwcy can you look at the above error during bpf_compaitliblity test, why is this happening, i have also shorten the program length so that number of instruction sets not exceeds

@yp969803
Copy link
Contributor Author

yp969803 commented Mar 22, 2025

so the problem arrises in the line of code, with the error

	303: (18) r1 = 0xffff8ff1b4577000     ; R1_w=map_ptr(map=km_orig_dst,ks=8,vs=36)
        	305: (85) call bpf_map_lookup_elem#1
        	R2 type=sock_or_null expected=fp, pkt, pkt_meta, map_key, map_value, mem, ringbuf_mem, buf, trusted_ptr_
        	processed 390 insns (limit 1000000) max_states_per_insn 1 total_states 33 peak_states 33 mark_read 5

when i try to comment these lines and run unit tests, verifier accept it. these lines are written previously in the sockops_prog

     __u64 *current_sk = (__u64 *)skops->sk;
        struct bpf_sock_tuple *dst = bpf_map_lookup_elem(&map_of_orig_dst, current_sk);

in the program

SEC("sockops_active")
int sockops_active_prog(struct bpf_sock_ops *skops)
{
    __u64 sock_cookie = bpf_get_socket_cookie(skops);

    if (skops->family != AF_INET && skops->family != AF_INET6)
        return 0;

    switch (skops->op) {
    case BPF_SOCK_OPS_TCP_CONNECT_CB:
        skops_handle_kmesh_managed_process(skops);
        break;

    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
        if (!is_managed_by_kmesh(skops))
            break;
        observe_on_connect_established(skops->sk, sock_cookie, OUTBOUND);
        if (bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_STATE_CB_FLAG) != 0
            || bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RETRANS_CB_FLAG) != 0
            || bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RTT_CB_FLAG) != 0) {
            BPF_LOG(ERR, SOCKOPS, "set sockops cb failed!\n");
        }
        __u64 *current_sk = (__u64 *)skops->sk;
        struct bpf_sock_tuple *dst = bpf_map_lookup_elem(&map_of_orig_dst, current_sk);
        if (dst != NULL)
            enable_encoding_metadata(skops);
        break;

    default:
        break;
    }
    return 0;
}

@nlgwcy @LiZhenCheng9527

Signed-off-by: Yash Patel <yp969803@gmail.com>

chore: generated bpf2go

Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
@yp969803 yp969803 marked this pull request as ready for review March 24, 2025 08:51
@yp969803 yp969803 force-pushed the tcp_long_connection_code branch 2 times, most recently from 9a01a5a to 1ebbcb3 Compare March 24, 2025 09:04
@@ -8,6 +8,7 @@

#include "bpf_common.h"
#include "config.h"
#include "workload.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think probe should not include workload to prevent cicular include

}

construct_orig_dst_info(sk, info);
construct_orig_dst_info(sk, &info_vals);
__builtin_memcpy(info, &info_vals, sizeof(struct tcp_probe_info));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not sure why memcpy here

Comment on lines 192 to 196
__u32 *index = bpf_map_lookup_elem(&map_of_soc_id_counter, &zero);
if (index && *index < MAP_SIZE_OF_TCP_CONNS) {
bpf_map_update_elem(&map_of_soc_id, index, &storage->sock_cookie, BPF_ANY);
*index += 1;
bpf_map_update_elem(&map_of_soc_id_counter, &zero, index, BPF_ANY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some comments about these two maps operation

}

static inline void
refresh_tcp_conn_info_on_state_change(struct bpf_tcp_sock *tcp_sock, struct sock_storage_data *storage, __u32 state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u make it use refresh_tcp_conn_info

__uint(max_entries, 1);
__type(key, int);
__type(value, __u64);
} tcp_conn_last_flush SEC(".maps");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please obey the kmesh mao naming convention

// Initialize last flush time if not set
bpf_map_update_elem(&tcp_conn_last_flush, &key, &init_time, BPF_ANY);
} else if ((now - *last_time) >= TIMER_INTERVAL_NS) {
flush_tcp_conns();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, this is flushing all the connections probe info, i think it is a little bit heavy.

Can you do that for this particular socket?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may get the bpf_tcp_sock from struct sk_msg_md

struct bpf_sock *sk = msg->sk;
struct bpf_tcp_sock *tcp_sk = bpf_skc_to_tcp_sock(sk);

@yp969803 yp969803 force-pushed the tcp_long_connection_code branch from 1ebbcb3 to 249fde9 Compare March 27, 2025 11:29
@hzxuzhonghu
Copy link
Member

@yp969803 Pleas rely if you have resolved that or have some diff idea

@yp969803
Copy link
Contributor Author

@hzxuzhonghu i have seen your comments now, i will resolve it asap !!

Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
@yp969803 yp969803 force-pushed the tcp_long_connection_code branch from 5883b74 to c9b2790 Compare March 28, 2025 11:15
@yp969803
Copy link
Contributor Author

yp969803 commented Mar 28, 2025

@hzxuzhonghu done the changes, most of the ebpf code is complete, just adding another hook for recv_bytes is left i will do this in end, from now i will start writing test.
@LiZhenCheng9527 @nlgwcy

Signed-off-by: Yash Patel <yp969803@gmail.com>
Copy link

codecov bot commented Mar 30, 2025

Codecov Report

Attention: Patch coverage is 10.16949% with 53 lines in your changes missing coverage. Please review.

Project coverage is 45.28%. Comparing base (e5ead63) to head (c0a044a).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
pkg/controller/telemetry/metric.go 1.92% 51 Missing ⚠️
pkg/controller/telemetry/accesslog.go 71.42% 2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (10.16%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Files with missing lines Coverage Δ
pkg/controller/telemetry/accesslog.go 90.32% <71.42%> (ø)
pkg/controller/telemetry/metric.go 48.36% <1.92%> (-3.07%) ⬇️

... and 2 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7fc5794...c0a044a. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Signed-off-by: Yash Patel <yp969803@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[lfx-mentorship-2025-Mar-May] Metrics for TCP Long Connection
4 participants