Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch Thrift with Jaeger's fork #3050

Merged
merged 4 commits into from
Jun 3, 2021
Merged

Switch Thrift with Jaeger's fork #3050

merged 4 commits into from
Jun 3, 2021

Conversation

jpkrohling
Copy link
Contributor

Closes #2638 by using the Jaeger fork of Thrift, which contains the unreleased fix to the memory consumption issue that affects the Jaeger Agent.
Once Apache Thrift 0.15.0 or 0.14.2 is released, the replace directive should be removed.

Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@jpkrohling jpkrohling requested a review from a team as a code owner June 3, 2021 08:25
@@ -73,4 +73,4 @@ require (
honnef.co/go/tools v0.1.4
)

replace github.com/gogo/protobuf => github.com/gogo/protobuf v1.3.2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The protobuf dependency has been removed, as the one used in the main section is the same.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@@ -225,7 +226,7 @@ func TestFormatBadBody(t *testing.T) {
statusCode, resBodyStr, err := postBytes(server.URL+`/api/v1/spans`, []byte("not good"), createHeader("application/x-thrift"))
assert.NoError(t, err)
assert.EqualValues(t, http.StatusBadRequest, statusCode)
assert.EqualValues(t, "Unable to process request body: Unknown data type 111\n", resBodyStr)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was interesting, as the new error is this: Unable to process request body: size exceeded max allowed: 1869881447. This seems to indicate that the fix is being picked up, but it did raise a warning in my head: are we really allocating more than 1GiB for this? Apparently, no. I used runtime.MemStats to measure the memory consumption before and after the HTTP call, and the usage was minimal (2 Alloc vs. 3 Alloc).

Copy link

@fishy fishy Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed interesting.

The old error message "Unknown data type" comes from Skip function, which is called when either the field id is not defined in the struct, or when the type of the field id doesn't match what's defined in the struct.

The new error message "size exceeded max allowed" comes from size sanity checks, which could be called by a lot of TProtocol functions (for example, ReadString, ReadBinary, ReadMapBegin, ReadListBegin, ReadSetBegin). This means the new code actually passed the field type check and is no longer skipped (or maybe previously the first field happened to match and it's the second field failed the field type check and caused the error, while in the new code the first field passed field type check but failed the size sanity check).

runtime.MemStats might be misleading. I believe it's the same as benchmark tests' ReportAllocs, and this benchmark test reported 0 allocs/0 bytes which is certainly a lie (I have to use a much smaller size because the original one is too slow):

package main

import (
        "testing"
)

// const size = 1869881447
const size = 1869

func BenchmarkAlloc(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
                _ = make([]int, size)
        }
}
$ go test -bench . -benchmem
goos: linux
goarch: amd64
pkg: foo
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkAlloc-12       1000000000               0.2487 ns/op          0 B/op          0 allocs/op
PASS
ok      foo     0.278s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fishy, do you think this change raises flags? I wouldn't expect the payload for this test to cause such a huge message.

Copy link

@fishy fishy Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think this change raises flags?

Not necessarily.

If you dig into the code on what this does, the first ever read from the thrift payload is to ReadListBegin, which is one of the functions that would trigger the new error.

Let's first convert the payload ([]byte("not good")) into raw bytes: [6e 6f 74 20 67 6f 6f 64].

In ReadListBegin implementation, first it reads a byte, 6e, then it tries to read an int32 as the size of the list. For the reading of the int32, it reads the next 4 bytes (6f 74 20 67), decode with bigendian, which results in 1869881447.

So this is totally expected behavior.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And following ReadListBegin, jaeger's code didn't really do the pre-allocation. it just append spans to the slice and rely on append to do the allocation and grow the slice.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

y'all even already have a comment on that :)

// We don't depend on the size returned by ReadListBegin to preallocate the array because it
// sometimes returns a nil error on bad input and provides an unreasonably large int for size

@codecov
Copy link

codecov bot commented Jun 3, 2021

Codecov Report

Merging #3050 (61d1790) into master (d052d34) will increase coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3050      +/-   ##
==========================================
+ Coverage   96.00%   96.03%   +0.03%     
==========================================
  Files         229      229              
  Lines        9937     9937              
==========================================
+ Hits         9540     9543       +3     
+ Misses        327      325       -2     
+ Partials       70       69       -1     
Impacted Files Coverage Δ
pkg/config/tlscfg/cert_watcher.go 92.20% <0.00%> (-2.60%) ⬇️
cmd/query/app/static_handler.go 96.77% <0.00%> (ø)
plugin/storage/integration/integration.go 77.90% <0.00%> (+0.55%) ⬆️
...lugin/sampling/strategystore/adaptive/processor.go 100.00% <0.00%> (+0.92%) ⬆️
cmd/query/app/server.go 97.08% <0.00%> (+1.45%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d052d34...61d1790. Read the comment docs.

@jpkrohling jpkrohling changed the title Switched Thrift with Jaeger's fork Switch Thrift with Jaeger's fork Jun 3, 2021
pavolloffay
pavolloffay previously approved these changes Jun 3, 2021
go.mod Show resolved Hide resolved
cmd/collector/app/zipkin/http_handler_test.go Outdated Show resolved Hide resolved
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@yurishkuro yurishkuro enabled auto-merge (squash) June 3, 2021 14:55
@yurishkuro yurishkuro merged commit 394ec23 into jaegertracing:master Jun 3, 2021
@jpkrohling jpkrohling added this to the Release 1.23.0 milestone Jun 4, 2021
@jpkrohling jpkrohling deleted the jpkrohling/patched-thrift branch July 28, 2021 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

jaeger-agent reproducible memory leak in 1.21.0
4 participants