jaeger-agent reproducible memory leak in 1.21.0 #2638
Comments
Thanks for the detailed report, @Mario-Hofstaetter! I'll try to reproduce this, but I'm not sure I'll be able to build/run a dotnet example on Linux. Would you be able to capture the payload the opentelemetry-dotnet library is sending? This would make it easier to reproduce/fix this.
This might be a duplicate of #2459.
@jpkrohling I guess in my case, this here was the cause and not badger. The original author of #2459 (lichenglin) did not state which instrumentation library he was using.
I'm pretty sure I can provide a small solution that will run on Linux (it's .NET Core, after all), at least a docker-compose setup. Thanks.
That would be awesome!
Yes, something like that. If you are using plain HTTP for the communication, instead of gRPC, it would be even easier, as I could just use the HTTP payload with cURL.
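For reference, replaying a captured batch against the collector's HTTP endpoint would look roughly like this (a sketch; `payload.bin` is a hypothetical capture of the client's Thrift batch, and the collector is assumed to listen on its default port 14268):

```sh
curl -X POST \
  -H "Content-Type: application/vnd.apache.thrift.binary" \
  --data-binary @payload.bin \
  http://localhost:14268/api/traces
```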
I think we can make use of
@Ashmita152 Is this URL available by default or does it need to be enabled? It downloads a
It will be very useful to have that to debug more. I am new to golang, but I can give it a try. Someone will definitely be able to help.
@Mario-Hofstaetter, you can open http://localhost:14269/debug/pprof/ in your browser, and navigate over the raw data. You might also want to try this:
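A typical invocation here (assuming a local Go toolchain, plus Graphviz for the rendered graph) is:

```sh
go tool pprof -web http://localhost:14269/debug/pprof/heap
```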
This should open your web browser with the current heap as an image. You can try the same for the other profiles (see the first URL for the available profiles).
@Mario-Hofstaetter do you have any news on this? Are you able to provide a dotnet application that runs on Linux, so that I can reproduce it?
@jpkrohling Apologies for the long delay, will schedule this for the weekend to do. |
This fairly reliably reproduces it for me:

ConsoleApp1.csproj:

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net5.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="OpenTelemetry" Version="1.0.0-rc1.1" />
    <PackageReference Include="OpenTelemetry.Exporter.Jaeger" Version="1.0.0-rc1.1" />
  </ItemGroup>
</Project>
```

Program.cs:

```csharp
using System;
using System.Diagnostics;
using OpenTelemetry;
using OpenTelemetry.Trace;

namespace ConsoleApp1
{
    class Program
    {
        public static ActivitySource ActivitySource = new("foo");

        static void Main()
        {
            Console.WriteLine("Hello World!");

            using var tracerProvider = Sdk.CreateTracerProviderBuilder()
                .AddSource(ActivitySource.Name)
                .AddJaegerExporter()
                .Build();

            for (var i = 0; i < 1000; i++)
            {
                using var _ = ActivitySource.StartActivity(i.ToString());
            }
        }
    }
}
```
How can I run this on Linux?
Just copy these two files somewhere and do
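For anyone following along, the usual .NET commands (assuming the .NET 5 SDK is installed) are:

```sh
dotnet restore   # pulls the OpenTelemetry packages referenced in the csproj
dotnet run       # builds and runs Program.cs, exporting spans to the local agent on 6831
```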
@jpkrohling If you don't feel like installing .NET, I have pushed the sample app to https://github.com/Mario-Hofstaetter/jaegertracing-jaeger-issue-2638 including a Dockerfile and was able to reproduce on https://labs.play-with-docker.com . Thanks @mberginbh for the tiny example. You can point the app to any agent address/port and change the

Steps to reproduce:

```sh
$ git clone https://github.com/Mario-Hofstaetter/jaegertracing-jaeger-issue-2638.git && cd jaegertracing-jaeger-issue-2638
$ docker-compose up -d
# takes some time for elasticsearch to spin up, so collector and query-ui containers will be restarting
# c# demo app image is built with tag 'jaegerdemoclient'
```

Run demo app:

```sh
# Repeat a few times
$ docker-compose up jaegerdemoclient
Starting jaegerdemoclient ... done
Attaching to jaegerdemoclient
jaegerdemoclient    | Hello World!
jaegerdemoclient    | Using agenthost 'jaeger-agent:6831'
jaegerdemoclient    | Running '500' loops
jaegerdemoclient    | Starting loop
jaegerdemoclient    | Loop done, wait 5 seconds because it seems not everything gets flushed
jaegerdemoclient exited with code 0
```

Memory usage increases:

```sh
$ docker stats --no-stream
CONTAINER ID   NAME           CPU %   MEM USAGE / LIMIT     MEM %   NET I/O         BLOCK I/O   PIDS
c178db9e972d   jaeger-agent   0.00%   2.095GiB / 31.42GiB   6.67%   445kB / 103kB   0B / 0B     11
```

Change agent address or loop count using compose:

```sh
$ docker-compose run -e "DEMOCLIENT_LOOPCOUNT=5000" -e "JAEGER_AGENTHOST=my-dev-machine" jaegerdemoclient
Hello World!
Using agenthost 'my-dev-machine:6831'
Running '1000' loops
```

Without compose:

```sh
docker run --rm -it -e "DEMOCLIENT_LOOPCOUNT=5000" -e "JAEGER_AGENTHOST=my-dev-machine" jaegerdemoclient
```

A loop of 5000 is crashing the agent for me on labs.play-with-docker.com . To run it natively just run
Thank you to both of you! I'll give it a try soon.
Quick status update: I was able to reproduce this issue, and I'm convinced it's the same as #2452. The code where this is mostly observable is here:

I'm sure the root cause is elsewhere, and I'm still investigating, but wanted to share some findings in case someone has an idea on what's going on:
cc @yurishkuro, as you might be able to confirm that my expectation on item 4 is correct. Oh, and thank you again, @mberginbh! For the first time in about 15 years, I'm able to C# again!
Isn't this the root cause? It sounds like a malformed payload.
The payload itself was within bounds (4083). While the payload was indeed malformed, we shouldn't just explode in memory consumption when receiving bad data.
I agree, but this sounds like a bug(ish) in the Thrift library, if it just blindly allocates that much memory based on a malformed value from the payload.
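In essence, the risky pattern being discussed is sizing an allocation purely from an untrusted length field on the wire. A minimal illustration (not the actual Thrift code):

```go
package example

import (
	"encoding/binary"
	"io"
)

// readString illustrates the risky pattern: a length prefix taken straight
// from the (untrusted) wire is used to size an allocation before any payload
// bytes have been seen.
func readString(r io.Reader) (string, error) {
	var sizeBuf [4]byte
	if _, err := io.ReadFull(r, sizeBuf[:]); err != nil {
		return "", err
	}
	size := binary.BigEndian.Uint32(sizeBuf[:]) // attacker-controlled
	buf := make([]byte, size)                   // a claimed 2 GiB is allocated even for a 4 KB packet
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}
```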
Alright, I created https://issues.apache.org/jira/browse/THRIFT-5322 to track this there, but I still need to make sure it's indeed in the Thrift library.
Status update: I got a reproducer that touches far less code, so it might be easier to isolate and fix the issue:

```go
package main

import (
	"bytes"
	"context"
	"encoding/base64"
	"fmt"
	"runtime"

	"github.com/apache/thrift/lib/go/thrift"

	"github.com/jaegertracing/jaeger/thrift-gen/agent"
	"github.com/jaegertracing/jaeger/thrift-gen/jaeger"
	"github.com/jaegertracing/jaeger/thrift-gen/zipkincore"
)

func main() {
	// good payload
// payloadHex := "goECCWVtaXRCYXRjaBwcGBZPcGVuVGVsZW1ldHJ5IEV4cG9ydGVyGTwYEnRlbGVtZXRyeS5zZGsubmFtZRUAGA1vcGVudGVsZW1ldHJ5ABgWdGVsZW1ldHJ5LnNkay5sYW5ndWFnZRUAGAZkb3RuZXQAGBV0ZWxlbWV0cnkuc2RrLnZlcnNpb24VABgHMS4wLjAuMQAAGfwwFp2Ws72T8O+a/QEW/cr8o+2BjsvLARaK0c7h2b3RnjkWABgCNDklAha+7q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFvmB5vqP19eJmgEW8cbgvsX9hdSaARbpzqSkqZa5/NIBFgAYAjUwJQIWwu6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABal/YCC9qGovqMBFuXehfL70t6RLRaQ+dGm1Zu3rqMBFgAYAjUxJQIWxu6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaRneS75oHQn6QBFuu28rPG9o/zARbz+um3yvSYlX8WABgCNTIlAhbK7q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFs2A+qjvnOzepgEW59aR/KbLzszTARbznoOR3ea8kDgWABgCNTMlAhbO7q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFvHMzMTX/YupmwEW+/rdyKi0qLYjFuW2j9Sxme2fLhYAGAI1NCUCFtLur+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW+eyju9vitMfRARaEzYyJ95PE++IBFv3G7pvhooyvyQEWABgCNTUlAhbW7q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFs2I14nwp8z/wQEWlJ2jr/iUg9m8ARbp4oCKlICk2rMBFgAYAjU2JQIW2u6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbTmuDiz6Sv9K0BFpq1h8y49ODBYhaAtf6GjomjzjMWABgCNTclAhbi7q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFsPk8ru7hJLcqgEW59bYoLCksPTyARac+YT5+9DQ5OIBFgAYAjU4JQIW5u6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbn0KC++96n3d0BFpjRoubKxZuj4wEWkp3C65P0+bFAFgAYAjU5JQIW6u6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbVyuHomvLN058BFv3KqMDmyIf7CBbt0t6lt/yTl/IBFgAYAjYwJQIW7u6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbH4aD97O/S1NMBFpSRhsjtjeytSBac5aeiqK/6u/wBFgAYAjYxJQIW8u6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABb5wIPrkrqOmLgBFuWKt5WjlZGywwEW447Io4KWleivARYAGAI2MiUCFvbur+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW1Z3nzpDz+5WLARbxgvGb9Onkku8BFuGehqOB8Y2TugEWABgCNjMlAhb67q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrHMvsXLm/bKrAEWgIXptb2zxfNYFpb96fev8NiyRxYAGAI2NCUCFv7ur+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW4abO/53Wu9K/ARaEweeVgdazqeUBFoK13/LOveCtCRYAGAI2NSUCFoDvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW5ZWiyaequOu6ARbhjvDMt6LPneUBFuXCr7KvwK2OzAEWABgCNjYlAhaE76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFo3zkuOkmJiCxgEWgMnO6Of66PA5FpD1yJyusN2WVxYAGAI2NyUCFojvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWpbTcr/2UwvyWARbnmsvfptrEyNABForF44rfrPW2HhYAGAI2OCUCFozvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWgZrajsSQ7sPqARbrnqS/qYj4/N4BFpDRsqq4+JyIVBYAGAI2OSUCFpjvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW4ff57dLtg4CyARbtgsrTw4XjuVEW98L/5qL0ioNBFgAYAjcwJQIWnO+v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABazpvjE7O3b/aUBFvv6stKLp/ryRxaEnbyBgv6d4ywWABgCNzElAhag76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFsWgy/Kg/LDzuwEW64bd7J3K5+6KARb13pOzy7arlg4WABgCNzIlAhak76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFovuysuZ/YC50gEW6daPloSC+owQFvme7fO2g7773wEWABgCNzMlAhao76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFt3g4c3Z6J/96QEW8dqCrKLe16b7ARaWseX9xvXDorEBFgAYAjc0JQIWrO+v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABb/ormo/ODqzrMBFvO6hbir0ZzYUhae0caH6ZPxzA8WABgCNzUlAhaw76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFuvW+836trjdmwEWgK2EuuTUjvrQARaUkbm1h+S45SUWABgCNzYlAha076/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFvWbnuS+85quqwEWiPWUl4G13sFzFpbN1fe1mNfipgEWABgCNzclAha476/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFvOA2qHKgOCKoQEWnOWHwfffv8ORARaU0Zn3hvSEwMMBFgAYAjc4JQIWuu+v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaLkayT8+alpOcBFoqh66Hkq7yYYhb/qqH6k7DwuEAWABgCNzklAha+76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uY
W1lFQAYA2ZvbwAAFqWLoOrSl7z/vAEW467d4sXW9OgHFu/q1qj8yun/qwEWABgCODAlAhbC76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFoGn7abA/8G3+QEWmKnkm4WtjYlCFvGK9ouiuNSWahYAGAI4MSUCFsbvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW9bix6I/Y+ISxARaE8aPDkZqZh/8BFpL5x6u2xIuoggEWABgCODIlAhbQ76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFqeYvfziyfrK9QEW4baMv+qI3MFSFufG8cqf5sqELBYAGAI4MyUCFtTvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW09P84aDRx9jfARaa0eSUu4PPnZsBFv3267T0m6uMxgEWABgCODQlAhbY76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFqnLlvyinevkmAEWjJ25kOin0MHkARbppt2wlpX2m1oWABgCODUlAhba76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrPTzcjWpsX8sQEWkOHDotKZqKEPFvW6kc/z29DjAhYAGAI4NiUCFt7vr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWjdL2mb3U8/6yARb9guyShN7G0kYWkrnL9sX1wtRwFgAYAjg3JQIW4u+v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbZ6ZHQkNih6PoBFu+Ssfz1u/CrIRacxc6g6NLf8wMWABgCODglAhbm76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFqPq/Jecx9LtnAEW+/at6/Hyuc3BARaaweLSzLac8aABFgAYAjg5JQIW6u+v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbhqNzC4cmE4qQBFvum3aqDoKSsgwEWnI2Si+Wig9aCARYAGAI5MCUCFu7vr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWh6r7ooyX35OaARbz8vWayabDtIQBFvGympnuu4KSMBYAGAI5MSUCFvLvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWsYmDsY7MkPjhARbj4oXwvdn5olgW++Ko98KfkMKsARYAGAI5MiUCFvbvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWkbvG4evG9qXGARb9zqCN6LyHwOkBFu2iotOct4vgSRYAGAI5MyUCFvrvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW+9mejeGMgOv9ARblpsn4jo/S3eMBFuOyhoHFkeTgtQEWABgCOTQlAhaC8K/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFvXexNrQ9uHdhwEWkPWLrbCW1v7yARbjkrzN0Liz2PUBFgAYAjk1JQIWhvCv68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbpx9iypo7yz5QBFoqd97e/zaTujQEWjr2Mwo6is4PhARYAGAI5NiUCForwr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAAAA=="
// bad payload
payloadHex := "goEICWVtaXRCYXRjaBwcGBZPcGVuVGVsZW1ldHJ5IEV4cG9ydGVyGTwYEnRlbGVtZXRyeS5zZGsubmFtZRUAGA1vcGVudGVsZW1ldHJ5ABgWdGVsZW1ldHJ5LnNkay5sYW5ndWFnZRUAGAZkb3RuZXQAGBV0ZWxlbWV0cnkuc2RrLnZlcnNpb24VABgHMS4wLjAuMQAAGfwwuJfD9scBFpjp3uT6n/PRMRaa+au8zMLzxPMBFgAYAzM4NSUCFrql472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAABmb28AABa1l8Xms56WleYBFvPW7IHpi//RJhblptSZsIzX7yAWABgDMzM4JQIW8KPjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABblg53v9fK3ivkBFo795ZimoZvyxwEW8Ya//6GDwLTjARYAGAMzMzklAhb0o+O9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFuO854G734m7ywEWmomA2LqEksa9ARbt4uebnqGV0M0BFgAYAzM0MCUCFvyj472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWtbme6JHJvfXGARb9xvHcj4eRzysWkq3+waqVjZamARYAGAMzNDElAhaApOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFsnIjIT8yNmljgEWhtGB3dO39b2KARb1ps7S1u6llIYBFgAYAzM0MiUCFoSk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWgY2rkPmkwbzZARb9lpOO2Ir1sAYW7eLQyoWuzJlPFgAYAzM0MyUCFoik472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW9cORw43J3bPTARb5+qrOqfjl2BEWmImQrK/ert1HFgAYAzM0NCUCFoyk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWy5eJwMfolMXBARbhusWa/Zj3r+ABFoTFkZLquN+cUhYAGAMzNDUlAhaQpOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFpng3v+U9LGjtwEW/5b9ipimxM+9ARaY8bCpuJbq7ekBFgAYAzM0NiUCFpSk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWjdSj3fuMx8+4ARb/0tqq2/mfyfQBFvGmr5vg+fHPiAEWABgDMzQ3JQIWmKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaLvNnvp7j2qqUBFpbF7faAxofIERbjoqj+xOOcrWIWABgDMzQ4JQIWnKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABb3uLLloL6ej80BFpSJrvXer/DqORaW+c3ShZLA/K4BFgAYAzM0OSUCFqCk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW9+6b1Ymu+PPWARbrtqK5+cCv78IBFvGmsNuYqa6PnQEWABgDMzUwJQIWpKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbJ1KyXm5/UoPEBFvue3rSNyee+BBbpnrjAhvSIsWQWABgDMzUxJQIWqKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaH+JXn+p+O3fsBFvem7tvx5frPKxbjhpf9yomdjlAWABgDMzUyJQIWsKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABarhvC2vt20i8EBFvmStait88eBpgEW+dr5zN/nur6tARYAGAMzNTMlAha0pOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrmM4fa5ztWotwEWlI2ml/34w+hIFoS9hL7g2s3okAEWABgDMzU0JQIWuKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaNqej90+ex9d8BFunu3K6xl/26kwEWnKXUifvxzrOlARYAGAMzNTUlAha8pOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFu/IgaqDoLHtzQEWhPWagpH1j4i5ARaYuY6moZyVvkoWABgDMzU2JQIWwKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaBvrODm7mVro0BFoadhY7Kgof0aBaMvdrnq6GzxM4BFgAYAzM1NyUCFsSk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW3au2ufuO+NHKARbn6u+J2rPBwBIW9+7+ob/H440/FgAYAzM1OCUCFsik472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWs5zZoe7q3reiARaE3f3q8fGAtYEBFozF4aui8LTx+gEWABgDMzU5JQIWzKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbtkfr7g4z3oNABFpixwc6s8M7jUhaa0enBgaa2rA0WABgDMzYwJQIW0KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbLsb3i4J6vpdUBFpDxsdDvmbniGBaC6bGN2c/X9tcBFgAYAzM2MSUCFtSk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWw7H22I6MnrSVARb1zoS2t9Dx60UW+eKvuoWm+awlFgAYAzM2MiUCFtik472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWheCgv+/2hNe5ARbxhrP5tazI3yMWjO2Ah8aNmYb5ARYAGAMzNjMlAhbcpOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrW2yffPnenX4QEWhPG3sN6RuIS9ARaK5fKgtbrirkgWABgDMzY0JQIW4KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbbzcC+koSnlaYBFuW61p7nqNVCFvP23ZrizOXfrgEWABgDMzY1JQIW6KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABb5sPjOjova16oBFvGm9sSrvPqqRBb7isill5jlijkWABgDMzY2JQIW7KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbJhajy6KiEkPUBFvfWkMGzwMqkfhaC+cqD8tGZ4+8BFgAYAzM2NyUCFvCk
472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWobum9IP+h/rHARbn1pDci7iZrsoBFoaJ9K+DityvexYAGAMzNjglAhb0pOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFquaxd+M3d+f5AEW997ivY2dqoIcFvu+nvKPnqbBkgEWABgDMzY5JQIW+KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbTgPuNuNb7gqMBFuuW6InpzMzNJxaGjae19qjuwzMWABgDMzcwJQIW/KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABarsIz788LP9+cBFvW67pro7v7yqAEW4+bZj/bVksKeARYAGAMzNzElAhaApeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrW7vI+NwdHK+AEW8fa8ssfigqKAARbl5rb0kYvr9jMWABgDMzcyJQIWhKXjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaTh9nbsYGDwaEBFvG6v4G1/urXzgEW8fadi7fmtJ/0ARYAGAMzNzMlAhaIpeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFsPV6pql35e5pwEW95aloJr/ydjWARbxzpWEy/K3/CQWABgDMzc0JQIWjKXjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaF8q+Esd2R19oBFpr53vzrzOHMrwEWkqmMiKj27f/GARYAGAMzNzUlAhaQpeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFreH7Lbm2emOyQEWmPHxh+u8u8u8ARaOncq4p9PXwoABFgAYAzM3NiUCFpKl472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWkaep2YvAtJLdARaU6aWnp8/pqjQW9c7ambrCgvE6FgAYAzM3NyUCFpql472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW58StjoS9wKucARb36sTWpazvrBsWirHTtpCpkc3qARYAGAMzNzglAhagpeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFpuUlYC89tmlnAEWlsG+kMqxm8pkFueqzO29wY6FbBYAGAMzNzklAhakpeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFr/p59mV8bDbyAEW4ZqPqpKfvPe7ARbl9o2Q5Kels5YBFgAYAzM4MCUCFqal472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW8+fPjqnGpbW7ARb3ksqbj9XinEsW5eaU0PHb1+lnFgAYAzM4MSUCFqql472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWl92blYDduNjXARacjZjopsKT+zMWnr38t/Ldzd0aFgAYAzM4MiUCFq6l472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW9Y3FrdXN68maARaQpeax95ulsx8WlKWLic/x8ufBARYAGAMzODMlAhaypeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrfqnOS4pf29pwEWnNXTuLyuyqvLARb9lr6DzPP4ljEWABgDMzg0JQIWtqXjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AAAAA"

	payloadBin, err := base64.StdEncoding.DecodeString(payloadHex)
	if err != nil {
		fmt.Printf("error while decoding: %v\n", err)
		return
	}

	transport := thrift.NewStreamTransportRW(bytes.NewBuffer(payloadBin))
	protocol := thrift.NewTCompactProtocol(transport)

	_, err = transport.Write(payloadBin)
	if err != nil {
		fmt.Printf("error while processing: %v\n", err)
		return
	}

	rep := &Reporter{}
	processor := agent.NewAgentProcessor(rep)

	var before runtime.MemStats
	runtime.ReadMemStats(&before)

	_, err = processor.Process(context.Background(), protocol, protocol)
	if err != nil {
		fmt.Printf("error while processing: %v\n", err)
	}

	var after runtime.MemStats
	runtime.ReadMemStats(&after)

	fmt.Printf("memory usage before: %d MiB\n", before.Alloc/1024/1024)
	fmt.Printf("memory usage after: %d MiB\n", after.Alloc/1024/1024)

	fmt.Println("finished")
}

type Reporter struct {
}

func (r *Reporter) EmitBatch(ctx context.Context, batch *jaeger.Batch) (err error) {
	fmt.Println("sending batch to collector")
	fmt.Printf("batch: %s\n", batch)
	return nil
}

func (r *Reporter) EmitZipkinBatch(ctx context.Context, spans []*zipkincore.Span) (err error) {
	return nil
}
```

The payload is based on what the reproducer from @mberginbh generated.
Note that the thrift fix does not "prevent all memory allocations". It only prevents the blind allocation according to a malformed/malicious string header. If you have a malicious/malformed message with a large string header claiming that this string has N bytes, in the old code it allocates N bytes upfront. In the fix it uses

So say you have a string header claiming a string with a size of 100 MB, and you do have 90 MB left in this message: it will still read all the remainder of your message (90 MB) until it fails (because it doesn't have enough content to read all 100 MB), and allocate 90 MB (plus the smaller buffers before it hits 90 MB) along the way.

Another thrift change after the fix implemented
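A rough sketch of the chunked approach being described (a hypothetical helper, not the actual Thrift implementation): it only allocates for bytes that actually arrive, so a lying header fails once the real data runs out.

```go
package example

import "io"

// readClaimed reads up to `claimed` bytes, growing the buffer chunk by chunk,
// so a header claiming 100 MB against a much smaller message fails once the
// underlying reader is exhausted instead of allocating 100 MB upfront.
func readClaimed(r io.Reader, claimed int) ([]byte, error) {
	const chunk = 4096
	buf := make([]byte, 0, chunk)
	for remaining := claimed; remaining > 0; {
		n := chunk
		if remaining < n {
			n = remaining
		}
		tmp := make([]byte, n)
		read, err := io.ReadFull(r, tmp)
		buf = append(buf, tmp[:read]...)
		if err != nil {
			return nil, err // the message was shorter than the header claimed
		}
		remaining -= read
	}
	return buf, nil
}
```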
@fishy, I agree with everything you said, and I was about to argue something very similar in my previous message, but then I tried to add more pressure in the reproducer and came up with the scenario I've shown earlier: in under 10s, the memory consumption went from near zero to more than 11GB for the

Are you able to try the reproducer above, or should I attempt to come up with an example payload that triggers the problem?
Since it's UDP only, can you try to use
Also since you mentioned
It's a list: jaeger/thrift-gen/jaeger/jaeger.go, lines 1147 to 1156 (at commit 78019b6)
I'll try your suggestion with the custom TConfiguration tomorrow. Thanks for your support!
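For reference, a protocol built with such a custom TConfiguration might look roughly like this (a sketch assuming Thrift >= 0.14.0, where TConfiguration and the *Conf constructors exist; the limit values are illustrative, not the agent's actual settings):

```go
package example

import "github.com/apache/thrift/lib/go/thrift"

// newLimitedProtocol builds a compact protocol that refuses messages larger
// than a few MiB, instead of the library's 100 MB default.
func newLimitedProtocol(trans thrift.TTransport) *thrift.TCompactProtocol {
	cfg := &thrift.TConfiguration{
		MaxMessageSize: 4 * 1024 * 1024, // a UDP batch can never legitimately get near this
		MaxFrameSize:   4 * 1024 * 1024,
	}
	return thrift.NewTCompactProtocolConf(trans, cfg)
}
```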
Containers (list/set/map) are not in that fix, and

In some other thrift language packages (Java, I believe) we do have a sanity check on container size before doing the allocations. We do not have that in the go library yet. You are welcome to create a ticket and PR to do so in the go library (also via

That being said, for the container size check, the only thing you can do is set a max container size and reject requests with a container header larger than that. As long as the container header is smaller than the limit, we still do the allocation for the whole container with the given size upfront. This can be changed in the thrift compiler code, but doing that has performance implications.
@fishy not sure I understand that. Should this check be done in our business code, or should I change the generated code? I believe it shouldn't be possible at all to allocate 11GB for a payload of under 65KB, no matter what data type is used in the payload, no?
The sanity check on the container header should be done in the thrift go library (

If you want to avoid allocation of the containers upfront after it passed the header sanity check, then that needs to be done in the thrift compiler (in generated go code).
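As a sketch, the library-side guard being discussed would sit where the code reads a container header, before the slice is allocated (hypothetical code, assuming the context-aware TProtocol interface from Thrift >= 0.14, with `maxListSize` as an assumed configurable limit):

```go
package example

import (
	"context"
	"fmt"

	"github.com/apache/thrift/lib/go/thrift"
)

const maxListSize = 100000 // hypothetical configurable limit

// readSpanListHeader sketches the guard: reject an absurd claimed size before
// the generated code runs make([]*Span, 0, size).
func readSpanListHeader(ctx context.Context, iprot thrift.TProtocol) (int, error) {
	_, size, err := iprot.ReadListBegin(ctx)
	if err != nil {
		return 0, err
	}
	if size < 0 || size > maxListSize {
		return 0, fmt.Errorf("list header claims %d elements, limit is %d", size, maxListSize)
	}
	return size, nil
}
```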
Thanks for the clarification! I just opened https://issues.apache.org/jira/browse/THRIFT-5369 to address this.
I'm not sure if this is the same issue, but I hope so. I see very large allocations (using v1.22.0 of the jaeger-collector) as well, but on the

Which is an int64: jaeger/thrift-gen/jaeger/jaeger.go, lines 1060 to 1070 (at commit 78019b6)
@forestsword are you able to consistently reproduce this? I was able to get a situation or two very quickly using @Mario-Hofstaetter's reproducer, but if you have a payload or client call that reproduces it with increased frequency, I could use it to verify whether the proposed Thrift fix will address this. See apache/thrift#2359
@jpkrohling no, I can't reproduce it myself. I run it as a service and somebody is probably sending something bad. But it is consistently getting OOM killed. Any tips on how to track down the spans at a collector level? I can only think of building a fork with some more logging. Or if you can point me to a branch with apache/thrift#2359, I can build and verify that it fixes it.
@forestsword You can run

The PR actually also contains compiler changes and
If we provide you with a proposed fix, would you be able to determine whether the fix works? Are you constantly getting OOM killed (multiple times a day, for example)? I should be working soon on this one and I can prepare a container image with @fishy's and my own changes.
Thanks @fishy, that helped. @jpkrohling I was able to create a build with the thrift fix and had it run overnight to see if it fixed it, and it most certainly did. The 1.22 collectors OOM multiple times an hour; this build hasn't been killed and runs with stable memory usage, whereas the others would have 8-15 GB spikes and be killed off. I'm happy to verify again with whatever you give me.
Wonderful news! Are you able to open a PR with the changes you had to make? You might want to leave the dependency changes aside, as we'll bump it properly once @fishy's PR is merged and available in a release.
I didn't need to change any code; I only included @fishy's PR. I suppose that means it used the default of 100 MB as
Looks like it then. How's the memory usage, are you able to share some numbers?
So I made it hard for myself to get those numbers. It was previously running on a fourth of the traffic. I've now cut over to use this build for all of the traffic, which isn't much here (~300 spans per second). I previously ran multiple containers to try and offset the OOM kills. I'm back to 1 collector with this build and a 4 GB limit. Maybe not the most helpful shot, but it gives you the ranges. Red being restarts of the pods; the rest is working set bytes. The cut-over was around 13:30.
Looks nice, thanks for sharing!
Good to hear that it helped!
A release would be half a year away (we plan to do the 0.15.0 release around September, there's no plan to do a 0.14.2 release at this time, and this bug alone is unlikely to change that). You probably want to use the master branch version once that PR is merged (go mod actually makes that easy, but I also totally understand it if you only want to use tagged releases). Also, I'd still strongly suggest y'all set a lower
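For the "go mod makes that easy" part, pinning an unreleased Thrift revision is roughly the following (the module path is real; the commit shown is a placeholder):

```sh
# pull the current master of the Thrift Go library as a pseudo-version
go get github.com/apache/thrift@master

# or pin an exact commit (placeholder shown)
go get github.com/apache/thrift@<commit-sha>
```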
We can probably use what we did before 0.14.0, in that we forked the Thrift repository and tagged https://github.com/jaegertracing/thrift
Will do, thanks! In theory, this could be the same as the max size of a UDP packet, right? I would set it to a couple of MiBs just to be on the safe side.
Also, since your tests have shown that the library change alone is enough to fix your problems, I'm going to revert the compiler change from my PR, because that would have negative performance impacts for everyone. I can add that back later if it's proven necessary.
Sounds good!
I was also seeing these memory issues frequently with the 1.18 build of the agent. As per the discussion, updating to the 1.22 version went a long way toward addressing the issue, although there were still occasional spikes. Thanks for the info. The detail in this discussion was really informative.
I believe @rubenvp8510 is working on backporting @fishy's fix to the version of Thrift that Jaeger uses. After that, we'll propose releasing 1.23, which would then fix this issue once and for all (hopefully) :-)
Describe the bug

I am observing very high and rapidly increasing memory usage of `jaeger-agent`, which may be a memory leak. Eventually the agent (container) may run out of memory and crash. I am able to reproduce the behavior reliably.

It is happening at very low span rates of <= 30 or 50 spans/sec according to `jaeger_collector_spans_received_total`.

I am using a dev setup running demo ASP.NET Core web services, using opentelemetry-dotnet for instrumentation. Since these are dummy projects in a home lab environment, I am able to provide the full source code of the .NET solution if necessary.
Possible Cause & Fix

open-telemetry/opentelemetry-dotnet#1372

It looks like this problem can be fixed by using `MaxPayloadSizeInBytes = 65000;`, which was the default until mid-September.

Is this memory consumption by jaeger-agent expected behavior if a client library misbehaves? Or is this something the jaeger team would like to investigate?
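For reference, applying that workaround to the reproducer shown earlier in this thread is a one-line exporter option (a sketch; the property name comes from the linked opentelemetry-dotnet issue):

```csharp
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource(ActivitySource.Name)
    .AddJaegerExporter(o => o.MaxPayloadSizeInBytes = 65000) // the pre-September default
    .Build();
```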
I am observing this behavior running jaeger-all-in-one natively on Windows, in a Linux container on Docker Desktop WSL 2, or in a Linux Hyper-V VM. At first I was using and blaming `badger` local storage. I then switched to `elasticsearch` storage. I have now split up into running separate `agent`, `collector` and `query` containers on WSL 2, so I can pinpoint the memory usage to the agent.

The agent is currently not on `localhost` where the instrumented client application is running, but I tried that too and the issue happened as well. Will try this again now that I am no longer using the all-in-one image.

The issue does not seem to occur under very light load. I am `curl`'ing my services to generate spans. At first memory is stable and low. Then I started `curl`'ing in 10 parallel loops, increasing the span creation rate. After some minutes the agent's memory jumps from < 50 MB to > 2 GB and then > 5 GB. The container currently has a hard memory limit of `mem_limit: 8000m`. At the moment it sits "stable" at 4.6 GB, but I have seen it go beyond 8 GB as well.

A symptom, or maybe the cause, of this are log errors starting to appear in the agent logs. While running up to about 3 `curl` loops there are no log messages. A little more requests and these start trickling in:

These 5 log errors correlate with metric

I am not sure if the instrumentation library is to blame and this is a concurrency issue there. If requests are reduced, no more log errors happen and memory is stable (16 MB). When increasing the request load, the error rate increases again, and at some point memory jumps by a few gigabytes:

I took 300 log lines and de-duplicated them a bit to these 50:
To Reproduce

Steps to reproduce the behavior:
`jaeger-agent` memory increases by some gigabytes and the agent may crash.

If the logs / other steps are not enough, I should be able to provide you with a ready-to-run docker-compose setup, but it will take me some time.

Expected behavior

`jaeger-agent` should not consume this much memory with such little load.

Screenshots

See above.
Version (please complete the following information):
- OS: Windows 10 2020H2 + Docker Desktop 2.5.0.1, on WSL 2 or natively on Windows
- Jaeger: 1.20.0 and 1.20.1
- Deployment: agent currently not on `localhost` where the instrumented client application is running, but I tried this also and the issue happened too
- opentelemetry-dotnet: 1.0.0-RC1 or 0.8.0-beta
What troubleshooting steps did you try?

Try to follow https://www.jaegertracing.io/docs/latest/troubleshooting/ and describe how far you were able to progress and/or which steps did not work.

I did enable `debug` logging for agent and collector, files attached. For this run, agent memory only increased up to 1.9 GB.

jaeger-agent-debug.log
jaeger-collector-debug.log

Not sure which other steps would apply. I am no Go dev, so using debug images would be of no use.

Additional context