
Beyla misinterpreting client call between to k8s service as a server span #679

Closed
aabmass opened this issue Mar 12, 2024 · 17 comments · Fixed by #688
@aabmass

aabmass commented Mar 12, 2024

I'm seeing Beyla produce spans with SpanKind SERVER that have the client k8s Service's ClusterIP address as the server.address. The server span should have the server's pod IP as the server.address and the client should be the calling pod's IP. Because the service's ClusterIP is present, it makes me think it should be a client span.

I am able to reproduce this issue with the BookInfo sample app from Istio (not using Istio itself) between reviews-v2 (Java) and ratings (NodeJS) apps. The bug occurs for the first few minutes of Beyla running and doesn't recur later.

For example this OTLP is being produced:

OTLP ResourceSpan
Resource SchemaURL: https://opentelemetry.io/schemas/1.19.0
Resource attributes:
     -> k8s.deployment.name: Str(reviews-v2)
     -> k8s.namespace.name: Str(beyla-repro)
     -> k8s.node.name: Str(gke-cluster-1-pool-1-6a06028c-7xb3)
     -> k8s.pod.name: Str(reviews-v2-b7dcd98fb-nzgp5)
     -> k8s.pod.start_time: Str(2024-03-12 19:38:00 +0000 UTC)
     -> k8s.pod.uid: Str(ad795a70-093d-4d95-a74b-28cd85302177)
     -> k8s.replicaset.name: Str(reviews-v2-b7dcd98fb)
     -> service.instance.id: Str(beyla-agent-dcgx9-3039053)
     -> service.name: Str(reviews-v2)
     -> service.namespace: Str(beyla-repro)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(beyla)
ScopeSpans #0
ScopeSpans SchemaURL: 
InstrumentationScope github.com/grafana/beyla
...
Span #8
    Trace ID       : ea949df52a54718de0bef345ee7346ba
    Parent ID      : 
    ID             : 6aada77a343ebd5e
    Name           : GET /**
    Kind           : Server
    Start time     : 2024-03-12 21:22:43.552211672 +0000 UTC
    End time       : 2024-03-12 21:22:43.553105469 +0000 UTC
    Status code    : Unset
    Status message : 
Attributes:
     -> http.request.method: Str(GET)
     -> http.response.status_code: Int(200)
     -> url.path: Str(/ratings/0)
     -> client.address: Str(10.4.0.122)
     -> server.address: Str(10.8.11.125)
     -> server.port: Int(9080)
     -> http.request.body.size: Int(252)
     -> http.route: Str(/**)
  • SpanKind is SERVER
  • Resource k8s.pod.name: reviews-v2-b7dcd98fb-nzgp5
  • client.address: 10.4.0.122 is pod reviews-v2-b7dcd98fb-nzgp5, which should be the server since this is a server span
  • server.address: 10.8.11.125 is not a pod but the ratings service's ClusterIP

It seems like Beyla is misinterpreting an outgoing HTTP request made on the reviews-v2 pod to the ratings service as a server span.

Reproducer

See my Github branch main...aabmass:beyla:spankind-bug-repro. I am seeing this consistently in GKE every time I restart the Beyla Daemonset.

@aabmass aabmass changed the title from "Beyla misinterpreting client call between from k8s pod to service as a server span" to "Beyla misinterpreting client call between to k8s service as a server span" Mar 12, 2024
@grcevski
Contributor

Thanks for reporting this. I am away this week, but I will take a look on Monday. Do you know what ports are used for the connection? Are they by any chance above 32,000? I am asking because sometimes, when we initially attach, we see network events without first seeing the accept or the connect. In that case we try to guess whether it was client or server based on the ports used, assuming the client side's ephemeral port is higher.

It should be only one event like this; if you see more, then it's a different bug.

@grcevski grcevski self-assigned this Mar 13, 2024
@aabmass
Author

aabmass commented Mar 13, 2024

Thanks for the quick reply

I am asking because sometimes when we initially attach we see network events, without first seeing the accept or the connect. So we try to guess if it was client or server based on the ports used, assuming the client side ephemeral port is higher.

Ah interesting. The server is listening on 9080 which is correctly reported in the span as server.port: Int(9080). I am seeing a few spans like this but it does subside within a few minutes.

@grcevski
Contributor

Thanks for making that great guide, it was very easy to follow. I have been unable to reproduce it, but I think I know what the issue is and I have a fix in mind. I'm going to work on this today, but if I don't get a fix in, I should have time to finish this on Wednesday.

@grcevski
Contributor

I made a fix attempt. I hope this resolves the issue; if it does, we can then manually close this issue. If not, I'll keep on digging.

@aabmass
Author

aabmass commented Mar 18, 2024

I figured it might be tough to reproduce 🙂

I made a fix attempt. I hope this resolves the issue, if it does we can then manually close this issue. If not, I'll keep on digging.

That's awesome, thank you! Let me try it out from the PR and get back to you

@aabmass
Author

aabmass commented Mar 18, 2024

@grcevski I built a new docker image with just docker build -t <tag> . from the repo root in your PR. Beyla is crashing with logs like

$ kubectl logs -n beyla-repro daemonsets/beyla-agent | head -100                                                                                                                                                                                                                                    
Found 3 pods, using pod/beyla-agent-6fznz                                                                                                                                                                                                                                                                                                                                              
time=2024-03-18T19:41:59.759Z level=INFO msg="Grafana Beyla" Version=v1.4.0-7-gd01d645d Revision=d01d645d "OpenTelemetry SDK Version"=1.23.1                                                                                                                                                                                                                                           
time=2024-03-18T19:41:59.760Z level=INFO msg="starting Beyla in Application Observability mode"                                                                                                                                                                                                                                                                                        
time=2024-03-18T19:41:59.870Z level=INFO msg="using hostname" component=traces.ReadDecorator function=instance_ID_hostNamePIDDecorator hostname=beyla-agent-6fznz                                                                                                                                                                                                                      
time=2024-03-18T19:41:59.870Z level=WARN msg="No route match patterns configured. Without route definitions Beyla will not be able to generate a low cardinality route for trace span names. For optimal experience, please define your application HTTP route patterns or enable the route 'heuristic' mode. For more information please see the documentation at: https://grafana.com/docs/beyla/latest/configure/options/#routes-decorator . If your application is only using gRPC you can ignore this warning." component=RoutesProvider
time=2024-03-18T19:41:59.873Z level=INFO msg="Starting main node" component=beyla.Instrumenter                                                                                                                                                                                                                                                                                         
time=2024-03-18T19:42:00.535Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/beyla pid=3192191                                                                                                                                                                                                                                                           
time=2024-03-18T19:42:05.069Z level=ERROR msg="couldn't trace process. Stopping process tracer" component=ebpf.ProcessTracer path=/beyla pid=3192191 error="loading and assigning BPF objects: field KprobeTcpSendmsg: program kprobe_tcp_sendmsg: load program: argument list too long: 892: (79) r1 = *(u64 *)(r7 +312): R0_w=mem(id=0,ref_obj_id=27,off=0,imm=0) R1_w=inv(id=0) R6=inv(id=24,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R7=map_value(id=0,off=0,ks=44,vs=448,imm=0) R8=inv(id=0) R9_w=mem(id=0,ref_obj_id=27,off=0,imm=0) R10=fp0 fp-16=mmmmmmmm fp-40=????mmmm fp-48=???????m fp-56=mmmmmmmm fp-64=????mmmm fp-72=mmmmmmmm fp-80=mmmmmmmm fp-88=mmmmmmmm fp-96=mmmmmmmm fp-104=mmmmmmmm fp-112=mmmmmmmm fp-120=mmmmmmmm fp-128=mmmmmmmm fp-152=mmmmmmmm fp-160=mmmmmmmm fp-184=mmmmmmmm fp-192=mmmmmmmm fp-200=mmmmmmmm fp-208=mmmmmmmm fp-216=mmmmmmmm fp-224=mmmmmmmm fp-232=mmmmmmmm fp-240=00000000 fp-248=mmmmm (truncated, 734 line(s) omitted)"
Error Log:                                                                                                                                                                                                                                                                                                                                                                             
 ; int BPF_KPROBE(kprobe_tcp_sendmsg, struct sock *sk, struct msghdr *msg, size_t size) {                                                                                                                                                                                                                                                                                              
0: (79) r9 = *(u64 *)(r1 +96)                                                                                                                                                               
1: (79) r7 = *(u64 *)(r1 +104)                                                                                                                                                              
2: (79) r8 = *(u64 *)(r1 +112)                                                                                                                                                              
; u64 id = bpf_get_current_pid_tgid();                                                                                                                                                      
3: (85) call bpf_get_current_pid_tgid#14                                                                                                                                                    
4: (bf) r6 = r0                                                                                                                                                                             
; u64 id = bpf_get_current_pid_tgid();                                                                                                                                                      
5: (7b) *(u64 *)(r10 -184) = r6                                                                                                                                                             
; u32 host_pid = id >> 32;                                                                                                                                                                  
6: (77) r6 >>= 32                                                                                                                                                                           
; u32 host_pid = id >> 32;                                                                                                                                                                  
7: (63) *(u32 *)(r10 -128) = r6                                                                                                                                                             
; if (!filter_pids) {                                                                                                                                                                       
8: (18) r1 = 0xffff8e2ac552eb10                                                                                                                                                             
10: (61) r1 = *(u32 *)(r1 +0)                                                                                                                                                               
 R0_w=inv(id=1) R1_w=map_value(id=0,off=0,ks=4,vs=24,imm=0) R6_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R7_w=inv(id=0) R8_w=inv(id=0) R9_w=inv(id=0) R10=fp0 fp-128=inv fp-184_w=mmmmmmmm
; if (!filter_pids) {                                                                                                                                                                       
11: (15) if r1 == 0x0 goto pc+175                                                                                                                                                           
last_idx 11 first_idx 0                                                                                                                                                                     
regs=2 stack=0 before 10: (61) r1 = *(u32 *)(r1 +0)                                                                                                                                         
12: (bf) r2 = r10                                                                                                                                                                           
;

(can provide more logs if needed)

Am I missing a step to rebuild the BPF programs?

@grcevski
Contributor

Hm, that's a strange error. What's the host OS, and the kernel version if it's Linux, on the machine where you built the docker image? I can look more into it on Wednesday, but our tests seem to pass, so it must be a local toolchain problem somehow.

@aabmass
Author

aabmass commented Mar 19, 2024

It's an internal debian based distro, kernel version 6.5.13. I can try building on a fresh VM with a vanilla OS version. Is this the right way to build a new image?

IMG_REGISTRY=us-central1-docker.pkg.dev IMG_ORG=myorg IMG_NAME=beyla-foo VERSION=pr-688 make image-build-push

@grcevski
Contributor

That looks right to me and it's a similar kernel that I use. Let me look into it.

@aabmass
Author

aabmass commented Mar 19, 2024

Thanks. I tried rebuilding on a new VM with

Linux <host> 6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

and I'm seeing the same issue, so I think it's unrelated to the build environment. Some observations:

  • I don't see the issue when running Beyla locally on the 6.1 kernel
  • The GKE cluster is using this COS image
    $ uname -a                                                                                                                    
    Linux gke-cluster-1-pool-1-6a06028c-7xb3 5.15.133+ #1 SMP Sat Dec 30 13:01:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
    
  • I built Beyla from HEAD and don't see the issue at all when running in that same GKE cluster. So it seems related to the PR.

Full error log

@grcevski
Contributor

It seems I had left a debug print from an earlier commit in a loop that's unrolled; this sometimes causes issues on older kernels. I pushed an update; can you let me know if it resolves the issue?

@aabmass
Author

aabmass commented Mar 20, 2024

I'm still seeing the same error on 797ebdb 🙁

Details

time=2024-03-20T17:56:28.215Z level=INFO msg="Grafana Beyla" Version=v1.4.0-8-g797ebdb2 Revision=797ebdb2 "OpenTelemetry SDK Version"=1.23.1
time=2024-03-20T17:56:28.216Z level=INFO msg="starting Beyla in Application Observability mode"
time=2024-03-20T17:56:28.321Z level=INFO msg="using hostname" component=traces.ReadDecorator function=instance_ID_hostNamePIDDecorator hostname=beyla-agent-6fpwc
time=2024-03-20T17:56:28.321Z level=WARN msg="No route match patterns configured. Without route definitions Beyla will not be able to generate a low cardinality route for trace span names. For optimal experience, please define your application HTTP route patterns or enable the route 'heuristic' mode. For more information please see the documentation at: https://grafana.com/docs/beyla/latest/configure/options/#routes-decorator . If your application is only using gRPC you can ignore this warning." component=RoutesProvider
time=2024-03-20T17:56:28.340Z level=INFO msg="Starting main node" component=beyla.Instrumenter
time=2024-03-20T17:56:28.835Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/usr/local/bin/python3.7 pid=3005143
time=2024-03-20T17:56:28.836Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/usr/local/bin/ruby pid=3004755
time=2024-03-20T17:56:28.837Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/usr/local/bin/node pid=3004889
time=2024-03-20T17:56:28.838Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/usr/local/bin/python3.11 pid=3005753
time=2024-03-20T17:56:28.839Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/opt/java/openjdk/bin/java pid=3005017
time=2024-03-20T17:56:28.840Z level=INFO msg="new process for already instrumented executable" component=discover.TraceAttacher pid=3005143 child=[] exec=/usr/local/bin/python3.7
time=2024-03-20T17:56:28.841Z level=INFO msg="new process for already instrumented executable" component=discover.TraceAttacher pid=3005089 child=[] exec=/opt/java/openjdk/bin/java
time=2024-03-20T17:56:28.841Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/otelcol-contrib pid=3005755
time=2024-03-20T17:56:28.927Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/beyla pid=3675805
time=2024-03-20T17:56:36.278Z level=ERROR msg="couldn't trace process. Stopping process tracer" component=ebpf.ProcessTracer path=/usr/local/bin/python3.7 pid=3005143 error="loading and assigning BPF objects: field KprobeTcpSendmsg: program kprobe_tcp_sendmsg: load program: argument list too long: 892: (79) r1 = *(u64 *)(r7 +312): R0_w=mem(id=0,ref_obj_id=27,off=0,imm=0) R1_w=inv(id=0) R6=inv(id=24,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R7=map_value(id=0,off=0,ks=44,vs=448,imm=0) R8=inv(id=0) R9_w=mem(id=0,ref_obj_id=27,off=0,imm=0) R10=fp0 fp-16=mmmmmmmm fp-40=????mmmm fp-48=???????m fp-56=mmmmmmmm fp-64=????mmmm fp-72=mmmmmmmm fp-80=mmmmmmmm fp-88=mmmmmmmm fp-96=mmmmmmmm fp-104=mmmmmmmm fp-112=mmmmmmmm fp-120=mmmmmmmm fp-128=mmmmmmmm fp-152=mmmmmmmm fp-160=mmmmmmmm fp-184=mmmmmmmm fp-192=mmmmmmmm fp-200=mmmmmmmm fp-208=mmmmmmmm fp-216=mmmmmmmm fp-224=mmmmmmmm fp-232=mmmmmmmm fp-240=00000000 fp-248=mmmmm (truncated, 734 line(s) omitted)"
...

@grcevski
Contributor

Do you mind attaching the full log for me? It usually has a bit of useful info at the end.

thanks a ton!

@aabmass
Author

aabmass commented Mar 20, 2024

Yup no problem!

beyla-errorlog-797ebdb2.txt

@grcevski
Contributor

I just pushed another change; it looks like we made the program too large. I cut the loop iteration count in half for http2 to see if that helps, but I'm going to need to rework the code a bit more to get back to 8 iterations. I guess our http2 test might fail now, but it should be good enough for trying out this other bug.

@aabmass
Copy link
Author

aabmass commented Mar 22, 2024

@grcevski that fixed the crash, and I can confirm the PR change fixed the original issue. I tried restarting the Beyla DaemonSet a few times and no longer see any client IPs in server.address.

Really appreciate you working with me to solve this 😄

@grcevski
Contributor

Thanks a bunch for testing this out, I really appreciate it! We can merge the fix as is, but I can try the image you have to see if I can put the loop back to 8 for searching HTTP2 frames.
