
jaeger-agent reproducible memory leak in 1.21.0 #2638

Closed
Mario-Hofstaetter opened this issue Nov 20, 2020 · 56 comments · Fixed by #2780 or #3050
Labels: bug · needs-info (The ticket does not contain sufficient information) · performance


@Mario-Hofstaetter

Mario-Hofstaetter commented Nov 20, 2020

Describe the bug
I am observing very high and rapidly increasing memory usage of jaeger-agent which may be a memory leak.
Eventually the agent (container) may run out of memory and crash.

I am able to reproduce the behavior reliably.
It is happening at very low span rates of <= 30 or 50 spans/sec, according to jaeger_collector_spans_received_total.

I am using a dev setup running Demo ASP.NET Core Webservices, using opentelemetry-dotnet for instrumentation. Since these are dummy projects in a home lab environment, I am able to provide the full source code of the .NET solution if necessary.

Possible Cause & Fix

open-telemetry/opentelemetry-dotnet#1372
It looks like this problem can be fixed by setting MaxPayloadSizeInBytes = 65000, which was the default until mid-September.

Is this memory consumption by jaeger-agent expected behavior if a client library misbehaves?
Or is this something the jaeger team would like to investigate?


I am observing this behavior running jaeger-all-in-one natively on Windows, in a Linux container on Docker Desktop (WSL 2), or in a Linux Hyper-V VM. At first I was using, and blaming, Badger local storage; I then switched to Elasticsearch storage.
I have now split up into separate agent, collector and query containers on WSL 2, so I can pinpoint the memory usage to the agent.

The agent is currently not on localhost where the instrumented client application is running, but I tried that as well and the issue happened too. I will try this again now that I am no longer using the all-in-one image.

The issue does not seem to occur under very light load. I am curl'ing my services to generate spans. At first memory is stable and low. Then I started curl'ing in 10 parallel loops, increasing the span creation rate.

After some minutes the agent's memory jumps from < 50 MB to > 2 GB and then > 5 GB.
The container currently has a hard memory limit of mem_limit: 8000m.

(screenshot)
At the moment it sits "stable" at 4.6 GB, but I have seen it go beyond 8 GB as well.

A symptom, or maybe the cause, of this is log errors starting to appear in the agent logs. While running up to about 3 curl loops there are no log messages. With a few more requests, these start trickling in:

jaeger-agent        | {"level":"error","ts":1605909716.7062023,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: error reading list begin: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
jaeger-agent        | {"level":"error","ts":1605909751.7633333,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 2 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
jaeger-agent        | {"level":"error","ts":1605909761.783476,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Required field TraceIdLow is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}      
jaeger-agent        | {"level":"error","ts":1605909771.80936,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading 
struct: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
jaeger-agent        | {"level":"error","ts":1605909791.8287015,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 8 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}

These 5 log errors correlate with the metric

jaeger_agent_thrift_udp_t_processor_handler_errors_total{protocol="compact",model="jaeger"}

(screenshot)

I am not sure whether the instrumentation library is to blame and this is a concurrency issue there. If the request rate is reduced, no more log errors occur and memory stays stable (16 MB).

When the request load is increased, the error rate increases again, and at some point memory jumps by a few gigabytes:

(screenshot)

I took 300 log lines and de-duplicated them a bit to these 50:

":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 6 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Required field TraceIdLow is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Log error reading struct: Required field Timestamp is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Invalid data length","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: error reading list begin: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 8 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 9 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 8 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 11 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Unknown data type 0","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 8 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 2 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 2 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 6 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 23 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 12 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 15 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: EOF","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 22 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 6 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.SpanRef error reading struct: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.SpanRef error reading struct: *jaeger.SpanRef field 6 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field -24 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 25 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 12 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 20 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 33 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 14 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 5 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: error reading list begin: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.SpanRef error reading struct: *jaeger.SpanRef field 5 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 10 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 13 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 12 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 13 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.SpanRef error reading struct: Required field RefType is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 4 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 10 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 25 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 16 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 13 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 5 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 18 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 11 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Log error reading struct: *jaeger.Log field 9 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 19 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 25 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 22 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}

To Reproduce
Steps to reproduce the behavior:

  1. Run ASP.NET Core 3.1 Demo Solution using opentelemetry instrumentation
  2. Create load to generate spans
  3. jaeger-agent memory increases by several gigabytes and the agent may crash.

If the logs / other steps are not enough I should be able to provide you with a ready to run docker-compose setup, but it will take me some time.

Expected behavior
jaeger-agent should not consume this much memory with so little load.

Screenshots
See above.

Version (please complete the following information):

  • OS: Windows 10 2020H2 + Docker Desktop 2.5.0.1 on WSL 2, or natively on Windows.
  • Jaeger version: 1.20.0 and 1.20.1
  • Deployment: agent on bare-metal Windows, in a WSL 2 Linux container, or in a Linux VM. The agent is currently not on localhost where the instrumented client application is running, but I tried that as well and the issue happened too.
  • opentelemetry-dotnet version 1.0.0-RC1 or 0.8.0-beta

What troubleshooting steps did you try?
Try to follow https://www.jaegertracing.io/docs/latest/troubleshooting/ and describe how far you were able to progress and/or which steps did not work.

I enabled debug logging for the agent and collector; the files are attached. For this run, agent memory only increased up to 1.9 GB.
jaeger-agent-debug.log
jaeger-collector-debug.log

Not sure which other steps would apply. I am not a Go dev, so using debug images would be of no use to me.

Additional context

@jpkrohling
Contributor

Thanks for the detailed report, @Mario-Hofstaetter! I'll try to reproduce this, but I'm not sure I'll be able to build/run a dotnet example on Linux. Would you be able to capture the payload the opentelemetry-dotnet library is sending? This would make it easier to reproduce/fix this.

@jpkrohling
Contributor

This might be a duplicate of #2459.

@Mario-Hofstaetter
Author

This might be a duplicate of #2459.

@jpkrohling I guess in my case this was the cause, not Badger. The original author of #2459 (lichenglin) did not state which instrumentation library he was using.
I have yet to retry Badger local storage with the opentelemetry-dotnet fix mentioned in the first post above.

I'll try to reproduce this, but I'm not sure I'll be able to build/run a dotnet example on Linux. Would you be able to capture the payload the opentelemetry-dotnet library is sending? This would make it easier to reproduce/fix this.

I'm pretty sure I can provide a small solution that will run on linux (it's .net core after all), at least a docker-compose setup.
What do you mean by capture the payload?
Wireshark / pcap?

Thanks.

@jpkrohling
Contributor

I'm pretty sure I can provide a small solution that will run on linux (it's .net core after all), at least a docker-compose setup.

That would be awesome!

What do you mean by capture the payload?
Wireshark / pcap?

Yes, something like that. If you are using plain HTTP for the communication, instead of gRPC, it would be even easier as I could just use the HTTP payload with cURL.

@Ashmita152
Contributor

I think we can make use of the /debug/pprof/heap URL on jaeger-agent's admin host-port to get a live heap profile and see what is using so much heap.

@Mario-Hofstaetter
Author

Mario-Hofstaetter commented Nov 30, 2020

I think we can make use of the /debug/pprof/heap URL on jaeger-agent's admin host-port to get a live heap profile and see what is using so much heap.

@Ashmita152 Is this URL available by default, or does it need to be enabled?
Never mind, I already had the port enabled for metrics but not mapped on the host.

It downloads a binary heap file, so should I reproduce the issue and attach the file here for you to debug?

@Ashmita152
Contributor

Hi @Mario-Hofstaetter

It would be very useful to have that for further debugging. I am new to Go, but I can give it a try. Someone will definitely be able to help.

@jpkrohling
Contributor

@Mario-Hofstaetter, you can open http://localhost:14269/debug/pprof/ in your browser, and navigate over the raw data. You might also want to try this:

$ go tool pprof -web http://localhost:14269/debug/pprof/heap

This should open your web browser with the current heap as an image. You can try the same for the other profiles (see the first URL for the available profiles).

@jpkrohling
Contributor

@Mario-Hofstaetter do you have any news on this? Are you able to provide me with a dotnet application that runs on Linux, so that I can reproduce it?

@Mario-Hofstaetter
Author

@Mario-Hofstaetter do you have any news on this? Are you able to provide me with a dotnet application that runs on Linux, so that I can reproduce it?

@jpkrohling Apologies for the long delay, I will schedule this for the weekend.

@jpkrohling added this to the Release 1.22.0 milestone Dec 8, 2020
@jpkrohling added the performance and needs-info labels Dec 8, 2020
@mberginbh

This fairly reliably reproduces it for me:

ConsoleApp1.csproj

<Project Sdk="Microsoft.NET.Sdk">

    <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net5.0</TargetFramework>
    </PropertyGroup>

    <ItemGroup>
      <PackageReference Include="OpenTelemetry" Version="1.0.0-rc1.1" />
      <PackageReference Include="OpenTelemetry.Exporter.Jaeger" Version="1.0.0-rc1.1" />
    </ItemGroup>

</Project>

Program.cs

using System;
using System.Diagnostics;
using OpenTelemetry;
using OpenTelemetry.Trace;

namespace ConsoleApp1
{
    class Program
    {
        public static ActivitySource ActivitySource = new("foo");
        static void Main()
        {
            Console.WriteLine("Hello World!");
            using var tracerProvider = Sdk.CreateTracerProviderBuilder()
                .AddSource(ActivitySource.Name)
                .AddJaegerExporter()
                .Build();

            for (var i = 0; i < 1000; i++)
            {
                using var _ = ActivitySource.StartActivity(i.ToString());
            }
            
        }
    }
}

@jpkrohling
Contributor

How can I run this on Linux?

@mberginbh

mberginbh commented Dec 8, 2020

How can I run this on Linux?

Just copy these two files somewhere and do dotnet run, after installing .NET 5: https://dotnet.microsoft.com/download

@Mario-Hofstaetter
Author

How can I run this on Linux?

@jpkrohling If you don't feel like installing .NET, I have pushed the sample app to https://github.com/Mario-Hofstaetter/jaegertracing-jaeger-issue-2638, including a Dockerfile, and was able to reproduce on https://labs.play-with-docker.com. Thanks @mberginbh for the tiny example.

You can point the app to any agent address/port and change the loopCount to change the load.

Steps to reproduce

$ git clone https://github.com/Mario-Hofstaetter/jaegertracing-jaeger-issue-2638.git && cd jaegertracing-jaeger-issue-2638
$ docker-compose up -d
# takes some time for elasticsearch to spin up, so collector and query-ui containers will be restarting
# the C# demo app image is built with the tag 'jaegerdemoclient'

Run demo app:

# Repeat a few times
$ docker-compose up jaegerdemoclient
Starting jaegerdemoclient ... done
Attaching to jaegerdemoclient
jaegerdemoclient    | Hello World!
jaegerdemoclient    | Using agenthost 'jaeger-agent:6831'
jaegerdemoclient    | Running  '500' loops
jaegerdemoclient    | Starting loop
jaegerdemoclient    | Loop done, wait 5 seconds because it seems not everything gets flushed
jaegerdemoclient exited with code 0

Memory usage increases:

$ docker stats --no-stream
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
c178db9e972d        jaeger-agent        0.00%               2.095GiB / 31.42GiB   6.67%               445kB / 103kB       0B / 0B             11

Change agent address or loop count

using compose:

$ docker-compose run -e "DEMOCLIENT_LOOPCOUNT=5000" -e "JAEGER_AGENTHOST=my-dev-machine" jaegerdemoclient
Hello World!
Using agenthost 'my-dev-machine:6831'
Running  '1000' loops

without compose:

docker run --rm -it -e "DEMOCLIENT_LOOPCOUNT=5000" -e "JAEGER_AGENTHOST=my-dev-machine" jaegerdemoclient

A loop count of 5000 crashes the agent for me on labs.play-with-docker.com.

To run it natively, just run dotnet run in the repo (with .NET 5 installed); the app will default to localhost:6831 for the agent.

@jpkrohling
Copy link
Contributor

Thank you to both of you! I'll give it a try soon.

@jpkrohling
Copy link
Contributor

jpkrohling commented Dec 9, 2020

Quick status update: I was able to reproduce this issue, and I'm convinced it's the same as #2452. The code where this is mostly observable is here:

https://github.com/apache/thrift/blob/b75e88a33d67ae05ef9b5fa001d2a63a2effe377/lib/go/thrift/compact_protocol.go#L557-L576

I'm sure the root cause is elsewhere, and I'm still investigating, but wanted to share some findings in case someone has an idea on what's going on:

  1. most of the time, this function runs fine, but given enough retries of the reproducing code, memory will just explode
  2. when the memory explodes, the usage at the beginning of this function is typically quite low. One example execution: the agent was using 3MiB of memory before the buffer allocation (lines 570-574 in the code above), and 1987MiB after that.
  3. the length (L558) was 2081129901 for that case
  4. the total UDP package payload length was 4083 (payload := readBuf.GetBytes()). Here's where I'm confused: as far as I understand, this payload is what will then be read by the protocol code, so the length (previous point) would have to be lower than the payload size
  5. I might need to read more about Go's memory management, but the heap part of pprof shows a value close to the buffer size for line 576 in compact_protocol. Does that mean the buffer is still in memory, or will Go reclaim this space?

cc @yurishkuro, as you might be able to confirm that my expectation on item 4 is correct.

Oh, and thank you again @mberginbh! For the first time in about 15 years, I'm able to C# again!

@yurishkuro
Copy link
Member

the length (L558) was 2081129901 for that case

Isn't this the root cause? It sounds like a malformed payload.

@jpkrohling
Copy link
Contributor

The payload itself was within bounds (4083). While the payload was indeed malformed, we shouldn't just explode in memory consumption when receiving bad data.

@yurishkuro
Copy link
Member

I agree, but this sounds like a bug(ish) in the Thrift library, if it just blindly allocates that much memory based on a malformed value from the payload.

@jpkrohling
Copy link
Contributor

Alright, I created https://issues.apache.org/jira/browse/THRIFT-5322 to track this there, but I still need to make sure it's indeed in the Thrift library.

@jpkrohling
Copy link
Contributor

jpkrohling commented Dec 10, 2020

Status update: I got a reproducer that touches far less code, so it might be easier to isolate and fix the issue:

package main

import (
	"bytes"
	"context"
	"encoding/base64"
	"fmt"
	"runtime"

	"github.com/apache/thrift/lib/go/thrift"
	"github.com/jaegertracing/jaeger/thrift-gen/agent"
	"github.com/jaegertracing/jaeger/thrift-gen/jaeger"
	"github.com/jaegertracing/jaeger/thrift-gen/zipkincore"
)

func main() {
	// good payload
	// payloadHex := "goECCWVtaXRCYXRjaBwcGBZPcGVuVGVsZW1ldHJ5IEV4cG9ydGVyGTwYEnRlbGVtZXRyeS5zZGsubmFtZRUAGA1vcGVudGVsZW1ldHJ5ABgWdGVsZW1ldHJ5LnNkay5sYW5ndWFnZRUAGAZkb3RuZXQAGBV0ZWxlbWV0cnkuc2RrLnZlcnNpb24VABgHMS4wLjAuMQAAGfwwFp2Ws72T8O+a/QEW/cr8o+2BjsvLARaK0c7h2b3RnjkWABgCNDklAha+7q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFvmB5vqP19eJmgEW8cbgvsX9hdSaARbpzqSkqZa5/NIBFgAYAjUwJQIWwu6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABal/YCC9qGovqMBFuXehfL70t6RLRaQ+dGm1Zu3rqMBFgAYAjUxJQIWxu6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaRneS75oHQn6QBFuu28rPG9o/zARbz+um3yvSYlX8WABgCNTIlAhbK7q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFs2A+qjvnOzepgEW59aR/KbLzszTARbznoOR3ea8kDgWABgCNTMlAhbO7q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFvHMzMTX/YupmwEW+/rdyKi0qLYjFuW2j9Sxme2fLhYAGAI1NCUCFtLur+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW+eyju9vitMfRARaEzYyJ95PE++IBFv3G7pvhooyvyQEWABgCNTUlAhbW7q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFs2I14nwp8z/wQEWlJ2jr/iUg9m8ARbp4oCKlICk2rMBFgAYAjU2JQIW2u6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbTmuDiz6Sv9K0BFpq1h8y49ODBYhaAtf6GjomjzjMWABgCNTclAhbi7q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFsPk8ru7hJLcqgEW59bYoLCksPTyARac+YT5+9DQ5OIBFgAYAjU4JQIW5u6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbn0KC++96n3d0BFpjRoubKxZuj4wEWkp3C65P0+bFAFgAYAjU5JQIW6u6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbVyuHomvLN058BFv3KqMDmyIf7CBbt0t6lt/yTl/IBFgAYAjYwJQIW7u6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbH4aD97O/S1NMBFpSRhsjtjeytSBac5aeiqK/6u/wBFgAYAjYxJQIW8u6v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABb5wIPrkrqOmLgBFuWKt5WjlZGywwEW447Io4KWleivARYAGAI2MiUCFvbur+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW1Z3nzpDz+5WLARbxgvGb9Onkku8BFuGehqOB8Y2TugEWABgCNjMlAhb67q/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrHMvsXLm/bKrAEWgIXptb2zxfNYFpb96fev8NiyRxYAGAI2NCUCFv7ur+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW4abO/53Wu9K/ARaEweeVgdazqeUBFoK13/LOveCtCRYAGAI2NSUCFoDvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW5ZWiyaequOu6ARbhjvDMt6LPneUBFuXCr7KvwK2OzAEWABgCNjYlAhaE76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFo3zkuOkmJiCxgEWgMnO6Of66PA5FpD1yJyusN2WVxYAGAI2NyUCFojvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWpbTcr/2UwvyWARbnmsvfptrEyNABForF44rfrPW2HhYAGAI2OCUCFozvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWgZrajsSQ7sPqARbrnqS/qYj4/N4BFpDRsqq4+JyIVBYAGAI2OSUCFpjvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW4ff57dLtg4CyARbtgsrTw4XjuVEW98L/5qL0ioNBFgAYAjcwJQIWnO+v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABazpvjE7O3b/aUBFvv6stKLp/ryRxaEnbyBgv6d4ywWABgCNzElAhag76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFsWgy/Kg/LDzuwEW64bd7J3K5+6KARb13pOzy7arlg4WABgCNzIlAhak76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFovuysuZ/YC50gEW6daPloSC+owQFvme7fO2g7773wEWABgCNzMlAhao76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFt3g4c3Z6J/96QEW8dqCrKLe16b7ARaWseX9xvXDorEBFgAYAjc0JQIWrO+v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABb/ormo/ODqzrMBFvO6hbir0ZzYUhae0caH6ZPxzA8WABgCNzUlAhaw76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFuvW+836trjdmwEWgK2EuuTUjvrQARaUkbm1h+S45SUWABgCNzYlAha076/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFvWbnuS+85quqwEWiPWUl4G13sFzFpbN1fe1mNfipgEWABgCNzclAha476/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFvOA2qHKgOCKoQEWnOWHwfffv8ORARaU0Zn3hvSEwMMBFgAYAjc4JQIWuu+v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaLkayT8+alpOcBFoqh66Hkq7yYYhb/qqH6k7DwuEAWABgCNzklAha+76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5u
YW1lFQAYA2ZvbwAAFqWLoOrSl7z/vAEW467d4sXW9OgHFu/q1qj8yun/qwEWABgCODAlAhbC76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFoGn7abA/8G3+QEWmKnkm4WtjYlCFvGK9ouiuNSWahYAGAI4MSUCFsbvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW9bix6I/Y+ISxARaE8aPDkZqZh/8BFpL5x6u2xIuoggEWABgCODIlAhbQ76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFqeYvfziyfrK9QEW4baMv+qI3MFSFufG8cqf5sqELBYAGAI4MyUCFtTvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW09P84aDRx9jfARaa0eSUu4PPnZsBFv3267T0m6uMxgEWABgCODQlAhbY76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFqnLlvyinevkmAEWjJ25kOin0MHkARbppt2wlpX2m1oWABgCODUlAhba76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrPTzcjWpsX8sQEWkOHDotKZqKEPFvW6kc/z29DjAhYAGAI4NiUCFt7vr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWjdL2mb3U8/6yARb9guyShN7G0kYWkrnL9sX1wtRwFgAYAjg3JQIW4u+v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbZ6ZHQkNih6PoBFu+Ssfz1u/CrIRacxc6g6NLf8wMWABgCODglAhbm76/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFqPq/Jecx9LtnAEW+/at6/Hyuc3BARaaweLSzLac8aABFgAYAjg5JQIW6u+v68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbhqNzC4cmE4qQBFvum3aqDoKSsgwEWnI2Si+Wig9aCARYAGAI5MCUCFu7vr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWh6r7ooyX35OaARbz8vWayabDtIQBFvGympnuu4KSMBYAGAI5MSUCFvLvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWsYmDsY7MkPjhARbj4oXwvdn5olgW++Ko98KfkMKsARYAGAI5MiUCFvbvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWkbvG4evG9qXGARb9zqCN6LyHwOkBFu2iotOct4vgSRYAGAI5MyUCFvrvr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW+9mejeGMgOv9ARblpsn4jo/S3eMBFuOyhoHFkeTgtQEWABgCOTQlAhaC8K/rx4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFvXexNrQ9uHdhwEWkPWLrbCW1v7yARbjkrzN0Liz2PUBFgAYAjk1JQIWhvCv68eH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbpx9iypo7yz5QBFoqd97e/zaTujQEWjr2Mwo6is4PhARYAGAI5NiUCForwr+vHh9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAAAA=="

	// bad payload
	payloadHex := "goEICWVtaXRCYXRjaBwcGBZPcGVuVGVsZW1ldHJ5IEV4cG9ydGVyGTwYEnRlbGVtZXRyeS5zZGsubmFtZRUAGA1vcGVudGVsZW1ldHJ5ABgWdGVsZW1ldHJ5LnNkay5sYW5ndWFnZRUAGAZkb3RuZXQAGBV0ZWxlbWV0cnkuc2RrLnZlcnNpb24VABgHMS4wLjAuMQAAGfwwuJfD9scBFpjp3uT6n/PRMRaa+au8zMLzxPMBFgAYAzM4NSUCFrql472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAABmb28AABa1l8Xms56WleYBFvPW7IHpi//RJhblptSZsIzX7yAWABgDMzM4JQIW8KPjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABblg53v9fK3ivkBFo795ZimoZvyxwEW8Ya//6GDwLTjARYAGAMzMzklAhb0o+O9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFuO854G734m7ywEWmomA2LqEksa9ARbt4uebnqGV0M0BFgAYAzM0MCUCFvyj472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWtbme6JHJvfXGARb9xvHcj4eRzysWkq3+waqVjZamARYAGAMzNDElAhaApOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFsnIjIT8yNmljgEWhtGB3dO39b2KARb1ps7S1u6llIYBFgAYAzM0MiUCFoSk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWgY2rkPmkwbzZARb9lpOO2Ir1sAYW7eLQyoWuzJlPFgAYAzM0MyUCFoik472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW9cORw43J3bPTARb5+qrOqfjl2BEWmImQrK/ert1HFgAYAzM0NCUCFoyk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWy5eJwMfolMXBARbhusWa/Zj3r+ABFoTFkZLquN+cUhYAGAMzNDUlAhaQpOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFpng3v+U9LGjtwEW/5b9ipimxM+9ARaY8bCpuJbq7ekBFgAYAzM0NiUCFpSk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWjdSj3fuMx8+4ARb/0tqq2/mfyfQBFvGmr5vg+fHPiAEWABgDMzQ3JQIWmKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaLvNnvp7j2qqUBFpbF7faAxofIERbjoqj+xOOcrWIWABgDMzQ4JQIWnKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABb3uLLloL6ej80BFpSJrvXer/DqORaW+c3ShZLA/K4BFgAYAzM0OSUCFqCk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW9+6b1Ymu+PPWARbrtqK5+cCv78IBFvGmsNuYqa6PnQEWABgDMzUwJQIWpKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbJ1KyXm5/UoPEBFvue3rSNyee+BBbpnrjAhvSIsWQWABgDMzUxJQIWqKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaH+JXn+p+O3fsBFvem7tvx5frPKxbjhpf9yomdjlAWABgDMzUyJQIWsKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABarhvC2vt20i8EBFvmStait88eBpgEW+dr5zN/nur6tARYAGAMzNTMlAha0pOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrmM4fa5ztWotwEWlI2ml/34w+hIFoS9hL7g2s3okAEWABgDMzU0JQIWuKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaNqej90+ex9d8BFunu3K6xl/26kwEWnKXUifvxzrOlARYAGAMzNTUlAha8pOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFu/IgaqDoLHtzQEWhPWagpH1j4i5ARaYuY6moZyVvkoWABgDMzU2JQIWwKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaBvrODm7mVro0BFoadhY7Kgof0aBaMvdrnq6GzxM4BFgAYAzM1NyUCFsSk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW3au2ufuO+NHKARbn6u+J2rPBwBIW9+7+ob/H440/FgAYAzM1OCUCFsik472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWs5zZoe7q3reiARaE3f3q8fGAtYEBFozF4aui8LTx+gEWABgDMzU5JQIWzKTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbtkfr7g4z3oNABFpixwc6s8M7jUhaa0enBgaa2rA0WABgDMzYwJQIW0KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbLsb3i4J6vpdUBFpDxsdDvmbniGBaC6bGN2c/X9tcBFgAYAzM2MSUCFtSk472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWw7H22I6MnrSVARb1zoS2t9Dx60UW+eKvuoWm+awlFgAYAzM2MiUCFtik472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWheCgv+/2hNe5ARbxhrP5tazI3yMWjO2Ah8aNmYb5ARYAGAMzNjMlAhbcpOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrW2yffPnenX4QEWhPG3sN6RuIS9ARaK5fKgtbrirkgWABgDMzY0JQIW4KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbbzcC+koSnlaYBFuW61p7nqNVCFvP23ZrizOXfrgEWABgDMzY1JQIW6KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABb5sPjOjova16oBFvGm9sSrvPqqRBb7isill5jlijkWABgDMzY2JQIW7KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbJhajy6KiEkPUBFvfWkMGzwMqkfhaC+cqD8tGZ4+8BFgAYAzM2NyUCFvC
k472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWobum9IP+h/rHARbn1pDci7iZrsoBFoaJ9K+DityvexYAGAMzNjglAhb0pOO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFquaxd+M3d+f5AEW997ivY2dqoIcFvu+nvKPnqbBkgEWABgDMzY5JQIW+KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABbTgPuNuNb7gqMBFuuW6InpzMzNJxaGjae19qjuwzMWABgDMzcwJQIW/KTjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABarsIz788LP9+cBFvW67pro7v7yqAEW4+bZj/bVksKeARYAGAMzNzElAhaApeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrW7vI+NwdHK+AEW8fa8ssfigqKAARbl5rb0kYvr9jMWABgDMzcyJQIWhKXjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaTh9nbsYGDwaEBFvG6v4G1/urXzgEW8fadi7fmtJ/0ARYAGAMzNzMlAhaIpeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFsPV6pql35e5pwEW95aloJr/ydjWARbxzpWEy/K3/CQWABgDMzc0JQIWjKXjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AABaF8q+Esd2R19oBFpr53vzrzOHMrwEWkqmMiKj27f/GARYAGAMzNzUlAhaQpeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFreH7Lbm2emOyQEWmPHxh+u8u8u8ARaOncq4p9PXwoABFgAYAzM3NiUCFpKl472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWkaep2YvAtJLdARaU6aWnp8/pqjQW9c7ambrCgvE6FgAYAzM3NyUCFpql472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW58StjoS9wKucARb36sTWpazvrBsWirHTtpCpkc3qARYAGAMzNzglAhagpeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFpuUlYC89tmlnAEWlsG+kMqxm8pkFueqzO29wY6FbBYAGAMzNzklAhakpeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFr/p59mV8bDbyAEW4ZqPqpKfvPe7ARbl9o2Q5Kels5YBFgAYAzM4MCUCFqal472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW8+fPjqnGpbW7ARb3ksqbj9XinEsW5eaU0PHb1+lnFgAYAzM4MSUCFqql472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAWl92blYDduNjXARacjZjopsKT+zMWnr38t/Ldzd0aFgAYAzM4MiUCFq6l472Th9sFFgAZHBgRb3RlbC5saWJyYXJ5Lm5hbWUVABgDZm9vAAAW9Y3FrdXN68maARaQpeax95ulsx8WlKWLic/x8ufBARYAGAMzODMlAhaypeO9k4fbBRYAGRwYEW90ZWwubGlicmFyeS5uYW1lFQAYA2ZvbwAAFrfqnOS4pf29pwEWnNXTuLyuyqvLARb9lr6DzPP4ljEWABgDMzg0JQIWtqXjvZOH2wUWABkcGBFvdGVsLmxpYnJhcnkubmFtZRUAGANmb28AAAAA"

	payloadBin, err := base64.StdEncoding.DecodeString(payloadHex)
	if err != nil {
		fmt.Printf("error while decoding: %v\n", err)
		return
	}

	transport := thrift.NewStreamTransportRW(bytes.NewBuffer(payloadBin))
	protocol := thrift.NewTCompactProtocol(transport)

	_, err = transport.Write(payloadBin)
	if err != nil {
		fmt.Printf("error while processing: %v\n", err)
		return
	}

	rep := &Reporter{}

	processor := agent.NewAgentProcessor(rep)

	var before runtime.MemStats
	runtime.ReadMemStats(&before)

	_, err = processor.Process(context.Background(), protocol, protocol)
	if err != nil {
		fmt.Printf("error while processing: %v\n", err)
	}

	var after runtime.MemStats
	runtime.ReadMemStats(&after)

	fmt.Printf("memory usage before: %d MiB\n", before.Alloc/1024/1024)
	fmt.Printf("memory usage after: %d MiB\n", after.Alloc/1024/1024)

	fmt.Println("finished")
}

// Reporter is a stub reporter used by the reproducer: it just prints the
// decoded batch instead of forwarding it to a collector.
type Reporter struct {
}

func (r *Reporter) EmitBatch(ctx context.Context, batch *jaeger.Batch) (err error) {
	fmt.Println("sending batch to collector")
	fmt.Printf("batch: %s\n", batch)
	return nil
}
func (r *Reporter) EmitZipkinBatch(ctx context.Context, spans []*zipkincore.Span) (err error) {
	return nil
}

The payload is based on what the reproducer from @mberginbh generated.
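
As a hedged companion to the reproducer (not part of the original snippet): the same decoded payload can also be fired at a running agent over UDP instead of being processed in-process. This assumes the agent's default jaeger.thrift compact port 6831 and the standard net package; the datagram must stay under the ~65KB UDP limit.

// sendToAgent is a hypothetical helper: it sends the raw compact-thrift
// emitBatch payload (payloadBin above) to a running jaeger-agent over UDP.
func sendToAgent(payload []byte) error {
	conn, err := net.Dial("udp", "127.0.0.1:6831")
	if err != nil {
		return fmt.Errorf("dial agent: %w", err)
	}
	defer conn.Close()

	if _, err := conn.Write(payload); err != nil {
		return fmt.Errorf("send payload: %w", err)
	}
	return nil
}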

@fishy

fishy commented Mar 11, 2021

Note that the thrift fix does not "prevent all memory allocations". It only prevents blind allocation based on a malformed/malicious string header.

If you have a malicious/malformed message with a large string header claiming that the string has N bytes, the old code allocates N bytes upfront. With the fix it uses a bytes.Buffer, reads at most N bytes, and grows the buffer as it reads, if needed. You can read bytes.Buffer.Grow's doc/code to learn more about how it works, but it does involve allocating a new, larger buffer and copying from the old, smaller buffer to the new one when needed.

So say you have a string header claiming a string size of 100MB and you only have 90MB left in the message: it will still read all of the remaining message (90MB) until it fails (because it doesn't have enough content to read all 100MB), and allocate a 90MB buffer (plus the smaller ones before it hits 90MB) along the way.

Another thrift change after that fix introduced TConfiguration, which allows you to change the default max message size (100MB if you don't override it through TConfiguration). We check the max message size when reading a string header and reject it directly (without any further reads) if the string header claims something larger than the max message size. So if your code has to deal with malicious/malformed messages a lot, you do want to use TConfiguration and set a smaller max message size.
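
For illustration, here is a minimal sketch of that post-fix reading strategy (illustrative only, not the actual thrift library code; it assumes the standard bytes and io packages): the claimed size only caps how much is read, while the buffer grows with the bytes that actually arrive.

// readClaimedString mimics the fixed behaviour: instead of make([]byte, claimed)
// upfront, it copies through a bytes.Buffer that grows only as real data arrives,
// so a bogus length header alone cannot force a huge allocation.
func readClaimedString(r io.Reader, claimed int) (string, error) {
	var buf bytes.Buffer
	n, err := buf.ReadFrom(io.LimitReader(r, int64(claimed)))
	if err != nil {
		return "", err
	}
	if n < int64(claimed) {
		// The message ran out before the claimed length was reached.
		return "", io.ErrUnexpectedEOF
	}
	return buf.String(), nil
}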

@jpkrohling
Contributor

@fishy, I agree with everything you said, and I was about to argue something very similar in my previous message. But then I added more pressure in the reproducer and came up with the scenario I've shown earlier: in under 10s, memory consumption went from near zero to more than 11GB for github.com/jaegertracing/jaeger/thrift-gen/jaeger.(*Span).ReadField10, in a path that accepts data only via UDP, meaning the packet size is at most about 65KB. Memory consumption then rapidly dropped back to very low levels, but a spike like this would be enough to get the process killed, depending on the process manager (systemd, Kubernetes, ...).

Are you able to try the reproducer above, or should I attempt to come up with an example payload that triggers the problem?

@fishy

fishy commented Mar 11, 2021

Since it's UDP only, can you try using &TConfiguration{MaxMessageSize: 100*1024} (I used something larger than 65KB because it could be a zipped payload) and see how the memory usage looks?
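
Applied to the reproducer above, that would look roughly like this (a minimal sketch assuming apache/thrift 0.14+, where TConfiguration and NewTCompactProtocolConf exist; payloadBin is the variable from the snippet earlier in this thread):

	// Cap the maximum message size well below the 100MB default so oversized
	// headers are rejected before any large read or allocation happens.
	cfg := &thrift.TConfiguration{
		MaxMessageSize: 100 * 1024, // ~100KB, comfortably above a 65KB UDP datagram
	}
	transport := thrift.NewStreamTransportRW(bytes.NewBuffer(payloadBin))
	protocol := thrift.NewTCompactProtocolConf(transport, cfg)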

@fishy

fishy commented Mar 11, 2021

Also, since you mentioned ReadField10: which thrift struct is it, and what's the type of the field with numeric id 10?

@jpkrohling
Contributor

It's a list:

case 10:
	if fieldTypeId == thrift.LIST {
		if err := p.ReadField10(ctx, iprot); err != nil {
			return err
		}
	} else {
		if err := iprot.Skip(ctx, fieldTypeId); err != nil {
			return err
		}
	}

I'll try your suggestion with the custom TConfiguration tomorrow. Thanks for your support!
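
For reference, the body of a generated reader for such a list field looks roughly like the sketch below (paraphrased from typical thrift-generated Go code rather than copied from Jaeger's actual thrift-gen output; the Tag element type is illustrative). The size claimed in the list header drives an upfront slice allocation before a single element has been validated:

	_, size, err := iprot.ReadListBegin(ctx)
	if err != nil {
		return thrift.PrependError("error reading list begin: ", err)
	}
	// The claimed size is trusted here: a bogus header can reserve a huge
	// backing array even though the actual payload is only a few KB.
	tSlice := make([]*Tag, 0, size)
	p.Tags = tSlice
	for i := 0; i < size; i++ {
		elem := &Tag{}
		if err := elem.Read(ctx, iprot); err != nil {
			return thrift.PrependError("error reading list element: ", err)
		}
		p.Tags = append(p.Tags, elem)
	}
	if err := iprot.ReadListEnd(ctx); err != nil {
		return thrift.PrependError("error reading list end: ", err)
	}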

@fishy

fishy commented Mar 11, 2021

Containers (list/set/map) are not covered by that fix, and TConfiguration won't help with them either (but I do think you should still apply the TConfiguration fix anyway). So this is actually a different issue (though kinda related to THRIFT-5322).

In some other thrift language packages (Java, I believe) we do have a sanity check on the container size before doing the allocation. We don't have that in the Go library yet; you are welcome to create a ticket and a PR to add it to the Go library (ideally also via TConfiguration).

That being said, for a container size check the only thing you can do is set a max container size and reject requests whose container header claims something larger. As long as the container header is smaller than the limit, we still allocate the whole container upfront with the given size. This could be changed in the thrift compiler code, but doing so has performance implications.
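
A hedged sketch of the kind of check being discussed (a hypothetical helper, not an existing thrift library API): validate the claimed container size against a configured limit before anything is allocated.

// checkedListBegin reads the list header and rejects it if the claimed element
// count exceeds maxSize, so the caller never pre-allocates a slice for an
// absurd size. Anything below the limit is still allocated upfront, as noted above.
func checkedListBegin(ctx context.Context, iprot thrift.TProtocol, maxSize int) (thrift.TType, int, error) {
	elemType, size, err := iprot.ReadListBegin(ctx)
	if err != nil {
		return elemType, 0, err
	}
	if size < 0 || size > maxSize {
		return elemType, 0, thrift.NewTProtocolExceptionWithType(thrift.SIZE_LIMIT,
			fmt.Errorf("list size %d exceeds limit %d", size, maxSize))
	}
	return elemType, size, nil
}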

@jpkrohling
Contributor

@fishy I'm not sure I understand that. Should this check be done in our business code, or should I change the generated code? I believe it shouldn't be possible at all to allocate 11GB for a payload of under 65KB, no matter what data type is used in the payload, no?

@fishy

fishy commented Mar 11, 2021

The sanity check on the container header should be done in the thrift Go library (github.com/apache/thrift/lib/go/thrift).

If you want to avoid allocating the containers upfront once the header passes the sanity check, that needs to be done in the thrift compiler (i.e., in the generated Go code).

@jpkrohling
Contributor

Thanks for the clarification! I just opened https://issues.apache.org/jira/browse/THRIFT-5369 to address this.

@forestsword

forestsword commented Mar 26, 2021

I'm not sure if this is the same issue, but I hope so. I see very large allocations (using v1.22.0 of the jaeger-collector) as well but on the TraceIdHigh here:

[pprof allocation profile screenshot]

Which is an int64:

case 2:
	if fieldTypeId == thrift.I64 {
		if err := p.ReadField2(ctx, iprot); err != nil {
			return err
		}
		issetTraceIdHigh = true
	} else {
		if err := iprot.Skip(ctx, fieldTypeId); err != nil {
			return err
		}
	}

@jpkrohling
Contributor

@forestsword are you able to consistently reproduce this? I was able to get a situation or two very quickly using @Mario-Hofstaetter's reproducer, but if you have a payload or client call that reproduces it with increased frequency, I could use it to verify whether the proposed Thrift fix will address this. See apache/thrift#2359

@forestsword

@jpkrohling no I can't reproduce it myself. I run it as a service and somebody is probably sending something bad. But it is consistently getting oom killed. Any tips on how to track down the spans at a collector level? I can only think of building a fork with some more logging.

Or if you can point me to branch with the apache/thrift#2359 I can build and verify that it fixes it.

@fishy

fishy commented Apr 6, 2021

@jpkrohling no I can't reproduce it myself. I run it as a service and somebody is probably sending something bad. But it is consistently getting oom killed. Any tips on how to track down the spans at a collector level? I can only think of building a fork with some more logging.

Or if you can point me to branch with the apache/thrift#2359 I can build and verify that it fixes it.

@forestsword You can run go mod edit --replace=github.com/apache/thrift=github.com/fishy/thrift@1bc6892 && go mod tidy to use the code from the PR. You do need to pass in a TConfiguration with a reasonable MaxMessageSize set to actually limit the memory usage.

The PR actually also contains compiler changes and go mod edit would only replace the library code, but the library code (with reasonable MaxMessageSize configuration) should be enough to improve the situation.

@jpkrohling
Contributor

But it is consistently getting oom killed.

If we provide you with a proposed fix, would you be able to determine whether the fix works? Are you constantly getting OOM Killed (multiple times a day, for example)?

I should be working on this one soon, and I can prepare a container image with @fishy's and my own changes.

@forestsword

Thanks @fishy, that helped. @jpkrohling, I was able to create a build with the thrift fix and ran it overnight to see whether it fixed the problem, and it most certainly did. The 1.22 collectors OOM multiple times an hour, whereas this build hasn't been killed and runs with stable memory usage; the others would have 8-15GB spikes and be killed off. I'm happy to verify again with whatever you give me.

@jpkrohling
Contributor

Wonderful news! Are you able to open a PR with the changes you had to make? You might want to leave the dependency changes aside, as we'll bump it properly once @fishy's PR is merged and available in a release.

@forestsword

I didn't need to change any code; I only included @fishy's PR. I suppose that means it used the default MaxMessageSize of 100MB.

@jpkrohling
Contributor

Looks like it, then. How's the memory usage? Are you able to share some numbers?

@forestsword

So I made it hard for myself to get those numbers. The patched build was previously receiving a quarter of the traffic; I've now cut over so that it handles all of the traffic here, which isn't much (~300 spans per second). I previously ran multiple containers to try to offset the OOM kills; I'm back to one collector with this build and a 4GB limit.

Maybe not the most helpful screenshot, but it gives you the ranges. Red marks pod restarts; the rest is working set bytes. The cut-over happened around 13:30.

[memory usage (working set bytes) graph screenshot]

@jpkrohling
Contributor

Looks nice, thanks for sharing!

@fishy

fishy commented Apr 7, 2021

Good to hear that it helped!

Wonderful news! Are you able to open a PR with the changes you had to make? You might want to leave the dependency changes aside, as we'll bump it properly once @fishy's PR is merged and available in a release.

A release would be half a year away (we plan to do the 0.15.0 release around September, there's no plan for a 0.14.2 release at this time, and this bug alone is unlikely to change that). You probably want to use the master-branch version once that PR is merged (go mod actually makes that easy, but I also totally understand if you only want to use tagged releases).

Also, I'd still strongly suggest y'all set a lower MaxMessageSize.

@jpkrohling
Contributor

A release would be half a year away (we plan to do the 0.15.0 release around September, there's no plan for a 0.14.2 release at this time, and this bug alone is unlikely to change that).

We can probably use what we did before 0.14.0, in that we forked the Thrift repository and tagged v1.13.0-patch1 with your first fix. References:

https://github.com/jaegertracing/thrift
#2780

Also, I'd still strongly suggest y'all set a lower MaxMessageSize.

Will do, thanks! In theory, this could be the same as the max size of a UDP packet, right? I would set it to a couple of MiBs just to be on the safe side.

@fishy

fishy commented Apr 8, 2021

Also, since your tests showed that the library change alone is enough to fix your problems, I'm going to revert the compiler change from my PR, because it would have negative performance impacts for everyone. I can add it back later if it proves necessary.

@jpkrohling
Contributor

Sounds good!

@stevedunne

I was also seeing these memory issues frequently with the 1.18 build of the agent. As per the discussion, updating to 1.22 went a long way toward addressing the issue, although there were still occasional spikes.
I have made the modification recommended by @fishy and deployed it, and we have not seen a spike in over 48 hours.

Thanks for the info. The detail in this discussion was really informative.

@jpkrohling
Contributor

I believe @rubenvp8510 is working on backporting @fishy's fix to the version of Thrift that Jaeger uses. After that, we'll propose releasing 1.23, which would then fix this issue here once and for all (hopefully) :-)
