Memory peaks on agent v1.19.2 #2452

zigmund · 2020-09-03T04:18:28Z

Describe the bug
One group of jaeger agents (3 replicas), deployed on kubernetes. Not so loaded system - 500-700 spans per second.
Used v1.13.1 for year without any issues, upgraded all componenet to 1.19.2 (also replaced Tchannel with gRPC) and agent memory usage become unstable. Before upgrade agent instances used as low as 16 mb with 64 limit, but after upgrade memory usage peaks appeared and agent got oomkilled. I raised limits, but even 512 mb is not enough.

After few hours and tens of oomkills, I downgraded agent to 1.18.1 version and see no issues so far.

To Reproduce
Steps to reproduce the behavior:

Deploy agent v1.19.2
Send 500-700 spans per second
???
Observe memory usage peaks.

Expected behavior
Agent memory usage close to lower versions.

Screenshots

Version (please complete the following information):

OS: Kubernetes nodes based on Ubuntu 20.04 LTS, 5.4.0-39 kernel
Jaeger version: 1.19.2
Deployment: Kubernetes

What troubleshooting steps did you try?
Collected /metrics, will provide if needed.

Additional context

jpkrohling · 2020-09-03T08:52:26Z

When you downgraded to 1.18.1, did you change any other configuration? Would you mind sharing your agent configuration?

zigmund · 2020-09-03T09:34:49Z

@jpkrohling the only agent parameter used before and after downgrade is --reporter.grpc.host-port=jaeger-collector:14250.

Also played for a while after got first OOMkills with parameters, but reverted changes singe they didn't helped much:

        - --processor.jaeger-binary.server-queue-size=...
        - --processor.jaeger-binary.workers=...
        - --processor.jaeger-compact.server-queue-size=...
        - --processor.jaeger-compact.workers=...
        - --processor.zipkin-compact.server-queue-size=...
        - --processor.zipkin-compact.workers=...

jpkrohling · 2020-09-03T09:54:52Z

I'll take a look at list of commits affecting the agent between 1.18.1 and 1.19.2, perhaps I can spot a suspect there. I'll also try to reproduce the problem here. Do you think it would be reproducible with minikube and a simple agent with 64Mi of requests , 256Mi limit? Jaeger's tracegen would probably generate enough load ;-)

jpkrohling · 2020-09-03T11:53:56Z

@zigmund I'll need your help here. I tried to reproduce your problem but didn't succeed. I have a local minikube cluster with 20Gi of memory available, initialized with:

$ minikube start --vm-driver kvm2 --cpus 6 --memory 20480 --container-runtime=crio --addons=ingress

Using the Jaeger Operator, I provisioned a Jaeger instance with the following spec:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: agent-as-daemonset
spec:
  agent:
    image: jaegertracing/jaeger-agent:latest
    strategy: DaemonSet
    resources:
      requests:
        memory: 64Mi
      limits:
        memory: 256Mi
  storage:
    options:
      memory:
        max-traces: 100000

And here's a trace generator, creating 1 trace per millisecond:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tracegen
  annotations:
    "sidecar.jaegertracing.io/inject": "false"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tracegen
  template:
    metadata:
      labels:
        app: tracegen
    spec:
      containers:
      - name: tracegen
        image: jaegertracing/jaeger-tracegen:1.19.2
        env:
        - name: JAEGER_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        args:
        - -pause=1ms
        - -duration=1h
        - -workers=1

The agent pod was up for the whole time, and the metrics shows me that the agent didn't even blink. I changed the Jaeger CR to use 1.18.1, to see if latest was in the same ballpark, and I couldn't spot any differences:

The first part of the graph is for an early test, where I set the tracegen to have 10 workers and no pause. I also forgot to set the pod limits and left for lunch. If I'm reading this data correctly, it shows about 200Mi of memory consumption for about 4M spans per minute.

The second part is jaegertracing/jaeger-agent:latest, with a 64Mi of requests and 256Mi of limit. The tracegen is with a pause of 1ms and 1 worker.

The third part is the same a second part, only with jaegertracing/jaeger-agent:1.18.1 as the image.

Here's how I need your help: are you able to reproduce your performance problems based on this setup? If you can't, could you perhaps then run a git bisect with the first good commit being v1.18.1 and the bad as being v1.19.2? Like:

$ git bisect start 
$ git bisect good v1.18.1
$ git bisect bad v1.19.2
$ make docker # this is the easiest, but will build everything, which might take a while
$ podman tag jaegertracing/jaeger-agent:latest quay.io/jpkroehling/jaeger-agent:issue2452-step1
$ podman push quay.io/jpkroehling/jaeger-agent:issue2452-step1

Change the last two commands accordingly, to tag and push to a namespace you own. At this point, configure your agent to use the image you just pushed.

Then, test it:

$ # test it
$ git bisect good # if the performance is good, or use "bad" if it shows the problem already

Rinse and repeat, until git says that it found the problematic commit.

If you don't have Jaeger source code locally to run the previous steps, you should probably do this first (see CONTRIBUTING.md):

$ git clone https://github.com/jaegertracing/jaeger.git
$ cd jaeger
$ git submodule update --init --recursive
$ make install-tools

zigmund · 2020-09-04T03:16:11Z

@jpkrohling I'll try to test with git bisect, but it will take some time.

zigmund · 2020-09-04T07:22:30Z

It's really hard to test good steps since it could take from minutes to much more time to catch first oom. But at least already reproduced oom's at the first two steps. At the moment I'm testing the third step for an hour already and I don't know if it is enough to be certain that the step is good.

jpkrohling · 2020-09-04T08:06:17Z

How fast did you get OOMs for the first two steps? If they are at around 15m, it might be safe to say that a commit is good if it's not crashing after 30m.

zigmund · 2020-09-04T08:11:31Z

@jpkrohling at the first two steps got first ooms in few minutes.
Got compile error at 4 and 5 steps, testing 6 step at the moment. It seems that be17169 affected memory usage behavior in out environment.

zigmund · 2020-09-04T09:05:00Z

@jpkrohling completed 6 step with hour+ test period.

be17169a417954a347408708182519e267618500 is the first bad commit
commit be17169a417954a347408708182519e267618500
Author: Pavol Loffay <ploffay@redhat.com>
Date:   Fri Jun 26 16:11:45 2020 +0200

    Upgrade thrift to 0.13 (#2311)
    
    * Upgrade thrift to 0.13
    
    Signed-off-by: Pavol Loffay <ploffay@redhat.com>
    
    * Add context to sampling managers
    
    Signed-off-by: Pavol Loffay <ploffay@redhat.com>

jpkrohling · 2020-09-04T10:12:28Z

cc @pavolloffay

pavolloffay · 2020-09-04T10:55:49Z

Here is the upgrade to thrift 0.13 PR #2311. The upgrade was done due to security issues in the old version and also it was needed in OpenTelemetry collector IIRC.

I am not sure what is causing the regression.

jpkrohling · 2020-09-07T09:25:54Z

@zigmund are you able to get us a memory profile for the running agent? The admin port has pprof endpoints, which you can browse like this:

$ go tool pprof -http localhost:6060 http://your-agent-host:14271/debug/pprof/heap

This will get a web UI on http://localhost:6060 for browsing your agent's memory.

zigmund · 2020-09-07T17:41:37Z

@jpkrohling I'll try but don't think it will help. Agent allocates large amount of memory and releases it next moment, so I have to catch that moment with profiler(?)

jpkrohling · 2020-09-08T09:05:08Z

That's true. Ideally, you'd capture it every N minutes, so that we'd get at least one profile when the memory is ramping up. There are a few experimental tools that can help with that, but I'm really not sure about their state, like conprof (cc @brancz).

Once we understand better where this memory usage is coming from, we can probably try to reproduce it ourselves, which is the first step to fix the problem.

brancz · 2020-09-08T09:48:50Z

Conprof is not well optimized itself but used by various organizations so it's definitely worth a try :)

zigmund · 2020-09-09T08:33:32Z

@jpkrohling lowered replicas to 1 and collected profile every second via forwarded port with curl:

while true; do curl http://localhost:14271/debug/pprof/heap -o `date +%s`.pb.gz; sleep 1; done

Raised limit to 2Gb to avoid oom but agent still been killed. :D
But at least I got few identical profiles right before oom. Here is one of them

jpkrohling · 2020-09-09T08:48:21Z

Looks interesting

I'll try to reproduce it here again. Just to confirm: data is flowing from the agent to the collector, right?

zigmund · 2020-09-09T08:50:34Z

@jpkrohling yes, agent sends data to collector.

jpkrohling · 2020-09-09T14:08:12Z

@zigmund it's still quite difficult to tell what's going on. The profile you sent seems to indicate that all memory is being allocated in the ReadString from Thrift's Compact protocol, which seems to indicate that you have a span bigger than 930MiB which I doubt is the case, especially because a batch has a max size equal to the limits of a UDP packet, typically 64KiB. Either that, or quite a bunch of spans are locked in that function, which also seems unlikely.

I tried to put as much pressure as I could on the agent, but it keeps with a minimal memory usage. I don't think Elasticsearch is part of the equation here, but I used it anyway as the backend, without luck. Even my memory profile is quite different than yours: every time I get a snapshot, it's mostly showing 1MiB here or 512KiB there, which is in line with the inflight data (like the image at the end).

Could you also get us a goroutine snapshot? If we have a single call to ReadString, it would probably be apparent there.

I would also like to ask you for your exact runtime configuration, like:

where are you deploying it? Bare metal? VM? Kubernetes?
the exact command used to start your Collector (or Deployment, in case of Kubernetes, or the Jaeger CR in case you provisioned it using the Jaeger Operator)
same for the Agent
which client are you using? Jaeger Go Client? OpenTelemetry? Something else?

zigmund · 2020-09-09T19:40:20Z

@jpkrohling

Could you also get us a goroutine snapshot?

Do you mean profile from /debug/pprof/goroutine?debug=2?

where are you deploying it? Bare metal? VM? Kubernetes?

Kubernetes

the exact command used to start your Collector (or Deployment, in case of Kubernetes, or the Jaeger CR in case you provisioned it using the Jaeger Operator)

Env vars passed via configmap:

SPAN_STORAGE_TYPE=elasticsearch
ES_SERVER_URLS=http://es-node1:9200,http://es-node2:9200,http://es-node3:9200
ES_NUM_SHARDS=3
ES_NUM_REPLICAS=0
ES_MAX_SPAN_AGE=24h

same for the Agent

Single argument: --reporter.grpc.host-port=jaeger-collector:14250

which client are you using? Jaeger Go Client? OpenTelemetry? Something else?

golang - https://github.com/jaegertracing/jaeger-client-go
node.js - https://github.com/jaegertracing/jaeger-client-node
php - https://github.com/jukylin/jaeger-php
nginx - https://github.com/opentracing-contrib/nginx-opentracing

jpkrohling · 2020-09-09T20:11:11Z

Do you mean profile from /debug/pprof/goroutine?debug=2?

Yes, please

golang - https://github.com/jaegertracing/jaeger-client-go
node.js - https://github.com/jaegertracing/jaeger-client-node
php - https://github.com/jukylin/jaeger-php
nginx - https://github.com/opentracing-contrib/nginx-opentracing

Would it be possible to have one agent per client type? If we notice that it only happens with spans from one specific client, we might have a lead...

zigmund · 2020-09-09T21:01:17Z

Yes, please

I'll try to catch memory eating profile with method I used before.

Would it be possible to have one agent per client type? If we notice that it only happens with spans from one specific client, we might have a lead...

I don't think it's possible since we're have tens of services pointed to same jaeger agent endpoint. I probably could configure nginx and node.js services by myself... but all other services' are out of my control.

jpkrohling · 2020-09-16T12:07:36Z

@zigmund, do you have any news for us? I'm really curious on how this develops :-)

zigmund · 2020-09-16T14:51:56Z

@jpkrohling sorry, I'm now in business trip for few days. Will test as soon as possible.

zigmund · 2020-09-17T05:59:00Z

@jpkrohling, checked metrics.

sum(rate(jaeger_agent_thrift_udp_t_processor_handler_errors_total[1m])) by (model, protocol):

{model="jaeger",protocol="binary"} | 0
{model="jaeger",protocol="compact"} | 0.06667950988310203
{model="zipkin",protocol="compact"} | 0

sum(rate(jaeger_agent_thrift_udp_server_packets_processed_total[1m])) by (model, protocol):

{model="jaeger",protocol="binary"} | 0.35553778133262237
{model="jaeger",protocol="compact"} | 73.39500099980003
{model="zipkin",protocol="compact"} | 0

Currently unable to test debug image v2, will try to do it later.

zigmund · 2020-09-17T11:20:18Z

@jpkrohling, I think I found two agent-breaking services, both are php-written. Will try to catch breaking traces, but don't know how at the moment. :D

jpkrohling · 2020-09-17T14:08:07Z

I think I found two agent-breaking services, both are php-written

That's actually great news. This means that the Agent might not be at fault here at all for common cases. We still need to figure out what type of incoming data is causing the problem, but I believe it does change the criticality of the issue.

Will try to catch breaking traces, but don't know how at the moment.

I can't help much here, but perhaps you can start by checking which operations are expected to exist vs. which ones actually exist. The missing one(s) might be worth double-checking.

zigmund · 2020-09-21T03:41:10Z

start by checking which operations are expected to exist vs. which ones actually exist.

These two services are pretty large monolith php applications, traces often consist of hundreds spans, so It's near impossible to catch missing.

jpkrohling · 2020-09-21T07:34:14Z

Alright, let us know once you have more info then, or if there's anything we can do on our side to help you.

jpkrohling · 2020-09-24T08:39:27Z

@zigmund, were you able to find a pattern already? I'm worried that we might have a bigger issue, as bad clients should never be able to bring down the agent...

zigmund · 2020-09-24T11:17:41Z

@jpkrohling, unfortunately, didn't found any additional info. :(

jpkrohling · 2020-09-24T14:33:38Z

Are you still experiencing those problems, or have you disabled tracing for the services that are causing problems?

zigmund · 2020-09-24T14:40:21Z

@jpkrohling I'm using your extra verbose image for debug and reverting back to 1.18.1 for regular usage. The services' tracing is enabled.

jpkrohling · 2020-09-24T14:53:10Z

Got it. We might have newer versions of the models/protobuf soon, which might help here. I'll ping you once we have it ready, in case you are able to try it out before we release it.

zigmund · 2020-09-24T15:35:56Z

OK, I'm ready to test.

jpkrohling · 2020-10-09T08:16:54Z

@zigmund We released 1.20 some days ago. Would you be able to try it out?

zigmund · 2020-10-12T04:03:28Z

@jpkrohling checked 1.20 - same behavior as 1.19.2

jpkrohling · 2020-10-12T07:50:23Z

Thanks for the confirmation. I'm leaving this open, but unless there's new information, there's not much we can do to reproduce and finally fix this problem.

quinluong · 2020-12-14T02:44:23Z

Hi @zigmund
Thank you so much because of this issue. I'm facing to the same problem. I'm also using jukylin/jaeger-php for my project. After a long day for debugging, I figure out that the reason is about my data. It isn't all data, just only malformed data - where percentage is very small. So it is really difficult to debug if you are on production environment.
You can read this comment #2638 (comment). Thank you so much @jpkrohling
Hope you can resolve this issue.

zigmund added the bug label Sep 3, 2020

ghost added the needs-triage label Sep 3, 2020

jpkrohling added needs-info The ticket does not contain sufficient information performance and removed needs-triage labels Sep 3, 2020

jpkrohling removed the needs-info The ticket does not contain sufficient information label Sep 3, 2020

jpkrohling added the needs-info The ticket does not contain sufficient information label Sep 3, 2020

jpkrohling added this to the Release 1.20.0 milestone Sep 10, 2020

jpkrohling mentioned this issue Sep 18, 2020

Debug images with delve and other base image #2493

Closed

jpkrohling modified the milestones: Release 1.20.0, Release 1.21.0 Sep 28, 2020

jpkrohling mentioned this issue Oct 9, 2020

Create debug docker images for jaeger backends #2545

Merged

joe-elliott modified the milestones: Release 1.21.0, Release 1.22.0 Nov 16, 2020

jpkrohling mentioned this issue Dec 1, 2020

Supportability: print own metrics in logs open-telemetry/opentelemetry-collector#2098

Closed

jpkrohling mentioned this issue Dec 9, 2020

jaeger-agent reproducible memory leak in 1.21.0 #2638

Closed

jpkrohling mentioned this issue Feb 3, 2021

Use fork of Apache Thrift #2780

Merged

yurishkuro closed this as completed in #2780 Feb 4, 2021

albertteoh mentioned this issue Feb 17, 2021

Remove temporary usage of a Thrift fork #2781

Closed

Memory peaks on agent v1.19.2 #2452

Memory peaks on agent v1.19.2 #2452

Comments

zigmund commented Sep 3, 2020 • edited Loading

Additional context

jpkrohling commented Sep 3, 2020

zigmund commented Sep 3, 2020

jpkrohling commented Sep 3, 2020

jpkrohling commented Sep 3, 2020 • edited Loading

zigmund commented Sep 4, 2020

zigmund commented Sep 4, 2020

jpkrohling commented Sep 4, 2020

zigmund commented Sep 4, 2020

zigmund commented Sep 4, 2020

jpkrohling commented Sep 4, 2020

pavolloffay commented Sep 4, 2020

jpkrohling commented Sep 7, 2020

zigmund commented Sep 7, 2020

jpkrohling commented Sep 8, 2020 • edited Loading

brancz commented Sep 8, 2020

zigmund commented Sep 9, 2020

jpkrohling commented Sep 9, 2020

zigmund commented Sep 9, 2020

jpkrohling commented Sep 9, 2020

zigmund commented Sep 9, 2020

jpkrohling commented Sep 9, 2020

zigmund commented Sep 9, 2020

jpkrohling commented Sep 16, 2020

zigmund commented Sep 16, 2020

zigmund commented Sep 17, 2020

zigmund commented Sep 17, 2020

jpkrohling commented Sep 17, 2020

zigmund commented Sep 21, 2020

jpkrohling commented Sep 21, 2020

jpkrohling commented Sep 24, 2020 • edited Loading

zigmund commented Sep 24, 2020

jpkrohling commented Sep 24, 2020

zigmund commented Sep 24, 2020

jpkrohling commented Sep 24, 2020

zigmund commented Sep 24, 2020

jpkrohling commented Oct 9, 2020

zigmund commented Oct 12, 2020

jpkrohling commented Oct 12, 2020

quinluong commented Dec 14, 2020

zigmund commented Sep 3, 2020 •

edited

Loading

jpkrohling commented Sep 3, 2020 •

edited

Loading

jpkrohling commented Sep 8, 2020 •

edited

Loading

jpkrohling commented Sep 24, 2020 •

edited

Loading