Beyla misinterpreting client call between two k8s services as a server span #679
Comments
Thanks for reporting this. I am away this week, but I will take a look at this on Monday. Do you know what ports are used for the connection? Are they by any chance above 32,000? I am asking because sometimes, when we initially attach, we see network events without first seeing the accept or the connect. So we try to guess whether it was client or server based on the ports used, assuming the client-side ephemeral port is higher. It should be only one event like this, but if you see more, then it's a different bug.
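The port-based guess described above can be pictured with a minimal C sketch (illustrative only; the function and type names are made up and this is not Beyla's actual code):

#include <stdint.h>

/* Possible roles for the local end of a connection whose accept/connect
 * event was never observed (e.g. the socket was already open when the
 * instrumentation attached). */
typedef enum { ROLE_CLIENT, ROLE_SERVER } conn_role_t;

/* Guess the role from the two TCP ports: ephemeral (client-side) ports are
 * assumed to be numerically higher than well-known server ports, so if the
 * local port is the higher one, treat the local side as the client. */
static conn_role_t guess_role(uint16_t local_port, uint16_t remote_port) {
    return (local_port > remote_port) ? ROLE_CLIENT : ROLE_SERVER;
}

Under this assumption, a call from an ephemeral port such as 48321 to 9080 would be classified as client-side; connections observed only after attach, without their accept/connect, are the suspect case here.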
Thanks for the quick reply
Ah interesting. The server is listening on 9080, which is correctly reported in the span as
Thanks for making that great guide; it was very easy to follow. I have been unable to reproduce it, but I think I know what the issue is and I have a fix in mind. I'm going to work on this today, but if I don't get a fix in, I should have time to finish this on Wednesday.
I made a fix attempt. I hope this resolves the issue; if it does, we can then manually close this issue. If not, I'll keep on digging.
I figured it might be tough to reproduce 🙂
That's awesome, thank you! Let me try it out from the PR and get back to you.
@grcevski I built a new docker image with just

$ kubectl logs -n beyla-repro daemonsets/beyla-agent | head -100
Found 3 pods, using pod/beyla-agent-6fznz
time=2024-03-18T19:41:59.759Z level=INFO msg="Grafana Beyla" Version=v1.4.0-7-gd01d645d Revision=d01d645d "OpenTelemetry SDK Version"=1.23.1
time=2024-03-18T19:41:59.760Z level=INFO msg="starting Beyla in Application Observability mode"
time=2024-03-18T19:41:59.870Z level=INFO msg="using hostname" component=traces.ReadDecorator function=instance_ID_hostNamePIDDecorator hostname=beyla-agent-6fznz
time=2024-03-18T19:41:59.870Z level=WARN msg="No route match patterns configured. Without route definitions Beyla will not be able to generate a low cardinality route for trace span names. For optimal experience, please define your application HTTP route patterns or enable the route 'heuristic' mode. For more information please see the documentation at: https://grafana.com/docs/beyla/latest/configure/options/#routes-decorator . If your application is only using gRPC you can ignore this warning." component=RoutesProvider
time=2024-03-18T19:41:59.873Z level=INFO msg="Starting main node" component=beyla.Instrumenter
time=2024-03-18T19:42:00.535Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/beyla pid=3192191
time=2024-03-18T19:42:05.069Z level=ERROR msg="couldn't trace process. Stopping process tracer" component=ebpf.ProcessTracer path=/beyla pid=3192191 error="loading and assigning BPF objects: field KprobeTcpSendmsg: program kprobe_tcp_sendmsg: load program: argument list too long: 892: (79) r1 = *(u64 *)(r7 +312): R0_w=mem(id=0,ref_obj_id=27,off=0,imm=0) R1_w=inv(id=0) R6=inv(id=24,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R7=map_value(id=0,off=0,ks=44,vs=448,imm=0) R8=inv(id=0) R9_w=mem(id=0,ref_obj_id=27,off=0,imm=0) R10=fp0 fp-16=mmmmmmmm fp-40=????mmmm fp-48=???????m fp-56=mmmmmmmm fp-64=????mmmm fp-72=mmmmmmmm fp-80=mmmmmmmm fp-88=mmmmmmmm fp-96=mmmmmmmm fp-104=mmmmmmmm fp-112=mmmmmmmm fp-120=mmmmmmmm fp-128=mmmmmmmm fp-152=mmmmmmmm fp-160=mmmmmmmm fp-184=mmmmmmmm fp-192=mmmmmmmm fp-200=mmmmmmmm fp-208=mmmmmmmm fp-216=mmmmmmmm fp-224=mmmmmmmm fp-232=mmmmmmmm fp-240=00000000 fp-248=mmmmm (truncated, 734 line(s) omitted)"
Error Log:
; int BPF_KPROBE(kprobe_tcp_sendmsg, struct sock *sk, struct msghdr *msg, size_t size) {
0: (79) r9 = *(u64 *)(r1 +96)
1: (79) r7 = *(u64 *)(r1 +104)
2: (79) r8 = *(u64 *)(r1 +112)
; u64 id = bpf_get_current_pid_tgid();
3: (85) call bpf_get_current_pid_tgid#14
4: (bf) r6 = r0
; u64 id = bpf_get_current_pid_tgid();
5: (7b) *(u64 *)(r10 -184) = r6
; u32 host_pid = id >> 32;
6: (77) r6 >>= 32
; u32 host_pid = id >> 32;
7: (63) *(u32 *)(r10 -128) = r6
; if (!filter_pids) {
8: (18) r1 = 0xffff8e2ac552eb10
10: (61) r1 = *(u32 *)(r1 +0)
R0_w=inv(id=1) R1_w=map_value(id=0,off=0,ks=4,vs=24,imm=0) R6_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R7_w=inv(id=0) R8_w=inv(id=0) R9_w=inv(id=0) R10=fp0 fp-128=inv fp-184_w=mmmmmmmm
; if (!filter_pids) {
11: (15) if r1 == 0x0 goto pc+175
last_idx 11 first_idx 0
regs=2 stack=0 before 10: (61) r1 = *(u32 *)(r1 +0)
12: (bf) r2 = r10
(can provide more logs if needed)

Am I missing a step to rebuild the BPF programs?
Hm, that's a strange error. What are your host OS and kernel version (if Linux) where you built the docker image? I can look more into it on Wednesday, but our tests seem to pass; it must be a local host toolchain problem somehow.
It's an internal Debian-based distro, kernel version 6.5.13. I can try building on a fresh VM with a vanilla OS version. Is this the right way to build a new image?

IMG_REGISTRY=us-central1-docker.pkg.dev IMG_ORG=myorg IMG_NAME=beyla-foo VERSION=pr-688 make image-build-push
That looks right to me, and it's a similar kernel to the one I use. Let me look into it.
Thanks. I tried rebuilding on a new VM with
and I'm seeing the same issue, so I think it's unrelated to the build environment. Some observations:
It seems I had left a debug print, from an earlier commit, in a loop that's unrolled. This sometimes causes issues on older kernels. I pushed an update; can you let me know if this resolves the issue?
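For anyone following along, a hypothetical sketch of the pattern being described (not the actual Beyla source): a debug print inside a fully unrolled loop is duplicated once per iteration, which inflates the instruction count the verifier has to check and can trip limits on older kernels.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/tcp_sendmsg")
int sketch_unrolled_debug(void *ctx) {
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        /* After full unrolling, this bpf_printk (itself several BPF
         * instructions) appears 8 times in the emitted program. */
        bpf_printk("iteration %d", i);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";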
I'm still seeing the same error on 797ebdb 🙁
Do you mind attaching the full log for me? It usually has a bit of useful info at the end. Thanks a ton!
Yup, no problem!
I just pushed another change; it looks like we make the program too large. I cut the loop count in half for http2 to see if that helps, but I'm going to need to rework the code a bit more to get back to 8 iterations. I guess our http2 test might fail now, but it should be good for trying out this other bug.
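A rough sketch of what halving the loop bound does to program size (made-up names and bound, not the actual Beyla code): with full unrolling, the emitted instructions grow roughly linearly with the bound, so dropping the HTTP/2 frame-scanning loop from 8 to 4 iterations roughly halves that portion of the program the verifier has to accept.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define FRAME_SCAN_LOOPS 4  /* hypothetical bound, temporarily halved from 8 */

SEC("kprobe/tcp_sendmsg")
int sketch_frame_scan(void *ctx) {
    int frames_seen = 0;
    #pragma unroll
    for (int i = 0; i < FRAME_SCAN_LOOPS; i++) {
        /* Real code would parse one HTTP/2 frame header per iteration;
         * each unrolled copy adds to the total instruction count. */
        frames_seen++;
    }
    bpf_printk("scanned up to %d frames", frames_seen);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";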
@grcevski that fixed the crash, and I can confirm the PR change fixed the original issue. I tried restarting the Beyla daemonset a few times and no longer see any client IPs in the

Really appreciate you working with me to solve this 😄
Thanks a bunch for testing this out; I really appreciate it! We can merge the fix as is, but I can try the image you have to see if I can put the loop back to 8 for searching HTTP2 frames.
I'm seeing Beyla produce spans with SpanKind SERVER that have the client k8s Service's ClusterIP address as the server.address. The server span should have the server's pod IP as the server.address, and the client should be the calling pod's IP. Because the service's ClusterIP is present, it makes me think it should be a client span.

I am able to reproduce this issue with the BookInfo sample app from Istio (not using Istio itself) between the reviews-v2 (Java) and ratings (NodeJS) apps. The bug occurs for the first few minutes of Beyla running and doesn't recur later.

For example, this OTLP ResourceSpan is being produced (SpanKind SERVER):
- k8s.pod.name: reviews-v2-b7dcd98fb-nzgp5
- client.address: 10.4.0.122 is pod reviews-v2-b7dcd98fb-nzgp5, which should be the server since it's a server span
- server.address: 10.8.11.125 is not a pod but the ratings service's ClusterIP

It seems like Beyla is misinterpreting an outgoing HTTP request made on the reviews-v2 pod to the ratings service as a server span.
Reproducer
See my GitHub branch main...aabmass:beyla:spankind-bug-repro. I am seeing this consistently in GKE every time I restart the Beyla Daemonset.