stability: index out of range on (*TxnMeta).MarshalTo #5998

Closed
mberhault opened this issue Apr 12, 2016 · 9 comments
@mberhault
Contributor

build sha: c382dd7

brand new beta cluster with photos and block-writer (concurrency=5).
It ran for less than an hour before the logs filled with errors along the lines of:

W160412 10:12:55.732779 storage/replica.go:1433  failed to lookup sender replica 2 in group 80: storage/replica.go:658: replica 2 not found in range 80

I stopped the block_writers to start poking around, at which point node 1 (ec2-54-209-69-52.compute-1.amazonaws.com) crashed with:

panic: runtime error: index out of range

1: running [Created by grpc.(*Server).serveStreams.func1 @ server.go:324]
              panic.go:464            panic(0x163ce80, 0xc82000c050)
    roachpb   data.pb.go:934          (*TxnMeta).MarshalTo(#10, 0xc8250144d8, 0x1c, 0x1c, 0x64, 0, 0)
    roachpb   data.pb.go:975          (*Transaction).MarshalTo(#10, 0xc8250144d6, 0x1e, 0x1e, 0xc0, 0, 0)
    roachpb   errors.pb.go:1348       (*Error).MarshalTo(#8, 0xc8250144b6, 0x3e, 0x3e, 0xf7, 0, 0)
    roachpb   api.pb.go:3847          (*BatchResponse_Header).MarshalTo(#11, 0xc8250144b3, 0x41, 0x41, 0x106, 0, 0)
    roachpb   api.pb.go:3808          (*BatchResponse).MarshalTo(#11, #9, 0x44, 0x44, 0x44, 0, 0)
    roachpb   api.pb.go:3793          (*BatchResponse).Marshal(#11, #9, 0x44, 0x44, 0, 0)
    proto     encode.go:225           Marshal(0x7f2d22317380, #11, 0, 0, 0, 0, 0)
    grpc      rpc_util.go:70          protoCodec.Marshal(#1, #11, 0, 0, 0, 0, 0)
    grpc      <autogenerated>:18      (*protoCodec).Marshal(#3, #1, #11, 0, 0, 0, 0, 0)
    grpc      rpc_util.go:248         encode(#14, #3, #1, #11, 0, 0, 0, 0, 0, 0, ...)
    grpc      server.go:412           (*Server).sendResponse(#6, #13, #4, #12, #1, #11, 0, 0, 0xc833ea7f08, 0, ...)
    grpc      server.go:526           (*Server).processUnaryRPC(#6, #13, #4, #12, #5, #2, 0, 0, 0)
    grpc      server.go:646           (*Server).handleStream(#6, #13, #4, #12, 0)
    grpc      server.go:323           (*Server).serveStreams.func1.1(#7, #6, #13, #4, #12)

Node log:
node1.log.parse.txt (https://github.com/cockroachdb/cockroach/files/214779/node1.log.parse.txt)

@tbg
Member

tbg commented Apr 12, 2016

Without having looked closely, that looks like sharing transaction protos.
This happens in a bunch of places in ./storage at the moment; I've been
meaning to give this another pass. Feel free to assign it to me.
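
For context, and without claiming this is the exact roachpb code path: gogo/protobuf-generated marshaling is two-phase. Marshal calls Size() to allocate a buffer, then MarshalTo writes the fields into it, trusting that the size still holds. If another goroutine mutates a shared proto between the two phases, MarshalTo can index past the end of the buffer. A deliberately racy, self-contained sketch with made-up types (txnMeta, key, size, marshalTo are all hypothetical, for illustration only):

```go
// Sketch of the two-phase Size()/MarshalTo race; not CockroachDB code.
package main

type txnMeta struct {
	key []byte // stands in for a proto bytes field
}

// size mimics a generated Size() method.
func (t *txnMeta) size() int { return len(t.key) }

// marshalTo mimics a generated MarshalTo(): it trusts that buf was sized
// for the message's current contents.
func (t *txnMeta) marshalTo(buf []byte) {
	n := t.size()
	copy(buf, t.key)
	buf[n-1] = 0 // index out of range if key grew after buf was allocated
}

func main() {
	t := &txnMeta{key: make([]byte, 8)}

	// Concurrent writer still holding the same "proto".
	go func() {
		for {
			t.key = append(t.key, 0)
		}
	}()

	for {
		buf := make([]byte, t.size()) // phase 1: compute the size
		t.marshalTo(buf)              // phase 2: encode; the size may be stale
	}
}
```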


@mberhault
Contributor Author

ooh, even better:

fatal error: concurrent map read and map write

goroutine 205715 [running]:
runtime.throw(0x18d59e0, 0x21)
        /usr/local/go/src/runtime/panic.go:530 +0x90 fp=0xc824b6d040 sp=0xc824b6d028
runtime.mapaccess1_fast32(0x14054c0, 0xc820da04e0, 0x5, 0xc820a4d388)
        /usr/local/go/src/runtime/hashmap_fast.go:22 +0x5a fp=0xc824b6d060 sp=0xc824b6d040
github.com/cockroachdb/cockroach/roachpb.(*Transaction).MarshalTo(0xc829209ec0, 0xc82199c385, 0xf4, 0xf4, 0xd2, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/roachpb/data.pb.go:1022 +0x978 fp=0xc824b6d288 sp=0xc824b6d060
github.com/cockroachdb/cockroach/roachpb.(*Error).MarshalTo(0xc820e05360, 0xc82199c366, 0x113, 0x113, 0x107, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/roachpb/errors.pb.go:1348 +0x2ea fp=0xc824b6d350 sp=0xc824b6d288
github.com/cockroachdb/cockroach/roachpb.(*BatchResponse_Header).MarshalTo(0xc8294f3620, 0xc82199c363, 0x116, 0x116, 0x116, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:3847 +0x13d fp=0xc824b6d468 sp=0xc824b6d350
github.com/cockroachdb/cockroach/roachpb.(*BatchResponse).MarshalTo(0xc8294f3620, 0xc82199c360, 0x119, 0x119, 0x119, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:3808 +0x13c fp=0xc824b6d6a8 sp=0xc824b6d468
github.com/cockroachdb/cockroach/roachpb.(*BatchResponse).Marshal(0xc8294f3620, 0xc82199c360, 0x119, 0x119, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:3793 +0xa0 fp=0xc824b6d6e8 sp=0xc824b6d6a8
github.com/golang/protobuf/proto.Marshal(0x7f88f51ad890, 0xc8294f3620, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/golang/protobuf/proto/encode.go:225 +0xca fp=0xc824b6d7e0 sp=0xc824b6d6e8
google.golang.org/grpc.protoCodec.Marshal(0x1749220, 0xc8294f3620, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/google.golang.org/grpc/rpc_util.go:70 +0x89 fp=0xc824b6d830 sp=0xc824b6d7e0
google.golang.org/grpc.(*protoCodec).Marshal(0x2583050, 0x1749220, 0xc8294f3620, 0x0, 0x0, 0x0, 0x0, 0x0)
        <autogenerated>:18 +0xbe fp=0xc824b6d870 sp=0xc824b6d830
google.golang.org/grpc.encode(0x7f88f5223048, 0x2583050, 0x1749220, 0xc8294f3620, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/google.golang.org/grpc/rpc_util.go:248 +0xb4 fp=0xc824b6d9e0 sp=0xc824b6d870
google.golang.org/grpc.(*Server).sendResponse(0xc8203ea000, 0x7f88f515da20, 0xc8204018c0, 0xc8211ffb20, 0x1749220, 0xc8294f3620, 0x0, 0x0, 0xc820a4d380, 0x0, ...)
        /go/src/google.golang.org/grpc/server.go:412 +0xba fp=0xc824b6da98 sp=0xc824b6d9e0
google.golang.org/grpc.(*Server).processUnaryRPC(0xc8203ea000, 0x7f88f515da20, 0xc8204018c0, 0xc8211ffb20, 0xc820432240, 0x225ce30, 0x0, 0x0, 0x0)
        /go/src/google.golang.org/grpc/server.go:526 +0x13e4 fp=0xc824b6dde0 sp=0xc824b6da98
google.golang.org/grpc.(*Server).handleStream(0xc8203ea000, 0x7f88f515da20, 0xc8204018c0, 0xc8211ffb20, 0x0)
        /go/src/google.golang.org/grpc/server.go:646 +0x109d fp=0xc824b6df38 sp=0xc824b6dde0
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc8209d56f0, 0xc8203ea000, 0x7f88f515da20, 0xc8204018c0, 0xc8211ffb20)
        /go/src/google.golang.org/grpc/server.go:323 +0xa0 fp=0xc824b6df68 sp=0xc824b6df38
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc824b6df70 sp=0xc824b6df68
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /go/src/google.golang.org/grpc/server.go:324 +0x9a

This one is on 104.196.41.218 on the gamma cluster.
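
The mapaccess1_fast32 frame points at MarshalTo iterating a map-typed field of the Transaction proto while another goroutine writes to the same map; the Go runtime treats that as an unrecoverable fatal error rather than a panic. A self-contained sketch of the same shape, with a made-up map standing in for the proto field:

```go
// Sketch of the concurrent-map crash; not CockroachDB code. Depending on the
// Go version the runtime reports "concurrent map read and map write" or
// "concurrent map iteration and map write".
package main

func main() {
	observed := map[int32]int64{} // stands in for a map field on a shared Transaction

	// Writer: e.g. code updating the shared transaction while a response
	// containing it is being serialized.
	go func() {
		for i := int32(0); ; i++ {
			observed[i%8] = int64(i)
		}
	}()

	// Reader: what generated MarshalTo does when it encodes the map field.
	for {
		for k, v := range observed {
			_, _ = k, v
		}
	}
}
```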

@tbg
Member

tbg commented Apr 12, 2016

Thanks! I don't think the data will be useful for this one. Feel free to
restart the nodes when this sort of thing happens; I'll work on eliminating
the (many) possible culprits.


@mberhault
Contributor Author

I'm running into concurrent proto accesses quite a bit (e.g. #6020, #6052). I'll see if I can get beefy-enough machines to run race-enabled builds. I tried before, but the race detector gobbles up memory way too quickly to catch anything. It may even be worth doing this in the long term; I think 3 nodes with load would be a good start.
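
For reference, race-enabled builds just use the standard Go race detector (go build -race / go test -race); it adds substantial CPU and memory overhead, often several-fold, which matches the memory pressure described above. A hypothetical example, not part of the repo, of a test that passes under a plain go test but reports a data race under go test -race:

```go
// racecheck_test.go: hypothetical example. The shared struct is written and
// read concurrently without synchronization, the same shape as sharing a
// transaction proto; only the race detector flags it.
package racecheck

import (
	"sync"
	"testing"
)

type sharedTxn struct {
	epoch int32
}

func TestSharedTxnRace(t *testing.T) {
	txn := &sharedTxn{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); txn.epoch++ }()   // writer
	go func() { defer wg.Done(); _ = txn.epoch }() // reader, e.g. marshaling
	wg.Wait()
}
```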

@tbg
Member

tbg commented Apr 16, 2016

Yeah, I think that's a good idea in general. I'm also going to get started on reducing these races.

mberhault modified the milestone: Q2 (Apr 17, 2016)
@tbg
Member

tbg commented Apr 18, 2016

This could also have been fixed by #6111, but I'm a little less certain here. Now that you're running race-enabled builds, we should close the race-disabled race issues, since they're essentially going to be very hard to reason about.

@mberhault
Contributor Author

probably fixed by #6111; will re-open or file a new issue if it recurs.

@bdarnell
Contributor

I don't see how this one could have been fixed by #6111: that fix related to errors returned by LeaderLease requests, and LeaderLease should never return an error containing a TxnMeta.

@mberhault
Contributor Author

yeah, I was a bit aggressive in closing race-related things.

The plan is to re-open what we find on the rho cluster. Actual data race dumps make it much simpler to debug too.
