Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: grpc unimplemented panic while running workload import #49842

Closed
mwang1026 opened this issue Jun 3, 2020 · 8 comments · Fixed by #49869
Closed

server: grpc unimplemented panic while running workload import #49842

mwang1026 opened this issue Jun 3, 2020 · 8 comments · Fixed by #49869
Labels
A-kv-server Relating to the KV-level RPC server C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

@mwang1026
Copy link

Describe the problem
I had a roachprod cluster running CockroachDB CCL v20.2.0-alpha.00000000-1972-g69bc0192b3 (x86_64-unknown-linux-gnu, built 2020/06/03 13:38:23, go1.13.9) on 3 node c5d.4xlarge machines. I kicked off a workload run and it caused a fatal error and the node shut down (logs attached below)

To Reproduce
Not sure exactly how to repro. This is high level steps I took:

  • Setup the cluster
  • Ran workload init ./cockroach workload init tpcc --warehouses=1000 --db=tpcc --data-loader=IMPORT
  • Set up a changefeed on the orders table. Let backfill complete.
  • Start workload run "./cockroach workload run tpcc --warehouses=1000 --duration=15m
  • Node I was running workload on crashed

Additional data / screenshots
If the problem is SQL-related, include a copy of the SQL query and the schema
of the supporting tables.

If a node in your cluster encountered a fatal error, supply the contents of the
log directories (at minimum of the affected node(s), but preferably all nodes).

Note that log files can contain confidential information. Please continue
creating this issue, but contact support@cockroachlabs.com to submit the log
files in private.

If applicable, add screenshots to help explain your problem.

Environment:

  • CockroachDB CCL v20.2.0-alpha.00000000-1972-g69bc0192b3 (x86_64-unknown-linux-gnu, built 2020/06/03 13:38:23, go1.13.9)
  • 3 nodes of c5d.4xlarge
@mwang1026 mwang1026 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jun 3, 2020
@mwang1026
Copy link
Author

@mwang1026
Copy link
Author

I restarted the dead node, ran the same workload run command from above, and it was fine. I canceled the in-progress changefeed and started a new one (s3 sink WITH resolved='10s', compression='gzip', no_initial_scan) and data started flowing to the sink then ALL of the nodes crashed.

Logs from nodes 1 and 2 here
cockroach.ip-10-12-24-122.ubuntu.2020-06-03T16_12_45Z.003889.log
cockroach.ip-10-12-31-36.ubuntu.2020-06-03T16_12_44Z.004039.log

@knz
Copy link
Contributor

knz commented Jun 4, 2020

The first crash is odd, it's an "unimplemented" panic during a protobuf decode deep inside grpc:

F200603 19:01:16.499820 363844 util/log/log.go:35  unexpected panic: not implemented
goroutine 363844 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x7ee2b01, 0xed669ea7c, 0x0, 0xc00001d041)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x7edf320, 0xc000000004, 0x7426632, 0xf, 0x23, 0xc0531e6a20, 0x21)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:212 +0xbcd
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x5235ce0, 0xc000078120, 0xc000000004, 0x2, 0x49052b3, 0x14, 0xc007f391b8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:68 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x5235ce0, 0xc000078120, 0x1, 0xc000000004, 0x49052b3, 0x14, 0xc007f391b8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:148
github.com/cockroachdb/cockroach/pkg/util/log.FatalOnPanic()
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:35 +0xb8
panic(0x4220820, 0x5149b20)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/cockroachdb/cockroach/vendor/github.com/golang/protobuf/proto.(*InternalMessageInfo).Size(...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/golang/protobuf/proto/deprecated.go:91
github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-gateway/internal.(*Error).XXX_Size(0xc059e72360, 0x7f60392b7658)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-gateway/internal/errors.pb.go:55 +0x39
github.com/cockroachdb/cockroach/vendor/github.com/gogo/protobuf/proto.Marshal(0x7f60392b7658, 0xc059e72360, 0xc059e72360, 0x7f60392b7658, 0xc059e72360, 0xc057e25801, 0xc007f393b0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/gogo/protobuf/proto/table_marshal.go:2937 +0x278
github.com/cockroachdb/cockroach/pkg/util/protoutil.(*ProtoPb).Marshal(0x7f31cb8, 0x4729740, 0xc059e72360, 0xc02723aaf0, 0xc00be1c8f8, 0xc004af6200, 0x4, 0x20300a)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/protoutil/marshaler.go:40 +0x129
github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-gateway/runtime.DefaultHTTPError(0x5235d60, 0xc02891a2a0, 0xc000751b20, 0x525f5a0, 0x7f31cb8, 0x520b7a0, 0xc00485a300, 0xc0071a3c00, 0x51a9240, 0xc02723aaa0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-gateway/runtime/errors.go:139 +0x3e2
github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-gateway/runtime.MuxOrGlobalHTTPError(0x5235d60, 0xc02891a2a0, 0xc000751b20, 0x525f5a0, 0x7f31cb8, 0x520b7a0, 0xc00485a300, 0xc0071a3c00, 0x51a9240, 0xc02723aaa0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-gateway/runtime/errors.go:102 +0x12a
github.com/cockroachdb/cockroach/pkg/ts/tspb.RegisterTimeSeriesHandlerClient.func1(0x520b7a0, 0xc00485a300, 0xc0071a3c00, 0xc03f05f8c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/ts/tspb/timeseries.pb.gw.go:146 +0x32c
github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-gateway/runtime.(*ServeMux).ServeHTTP(0xc000751b20, 0x520b7a0, 0xc00485a300, 0xc0071a3c00)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-gateway/runtime/mux.go:240 +0xbd9
net/http.(*ServeMux).ServeHTTP(0xc0009f3cf8, 0x520b7a0, 0xc00485a300, 0xc0071a3c00)
	/usr/local/go/src/net/http/server.go:2387 +0x1bd
github.com/cockroachdb/cockroach/pkg/server.(*safeServeMux).ServeHTTP(0xc0009f3ce0, 0x520b7a0, 0xc00485a300, 0xc0071a3c00)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:128 +0x61
github.com/cockroachdb/cockroach/pkg/server.(*Server).ServeHTTP(0xc0009f3800, 0x521d620, 0xc0030962a0, 0xc0071a3c00)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1964 +0x1e1
net/http.serverHandler.ServeHTTP(0xc00014a460, 0x521d620, 0xc0030962a0, 0xc0071a3c00)
	/usr/local/go/src/net/http/server.go:2802 +0xa4
net/http.(*conn).serve(0xc053e46c80, 0x5235ca0, 0xc001a27900)
	/usr/local/go/src/net/http/server.go:1890 +0x875
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2928 +0x384

@knz
Copy link
Contributor

knz commented Jun 4, 2020

The other crashes are also with the same panic.

@tbg I think you updated grpc recently? Could this be related? (Does the version ran by Michael contain the upgraded grpc?)

@knz knz added A-kv-server Relating to the KV-level RPC server S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. labels Jun 4, 2020
@knz knz changed the title Fatal error when running workload server: grpc unimplemented panic while running workload import Jun 4, 2020
@tbg
Copy link
Member

tbg commented Jun 4, 2020

Yes, this almost certainly is caused by the protobuf bump. The grpc-gateway proto generated code uses InternalMessageInfo for XXX_Size() (and others):

https://github.com/grpc-ecosystem/grpc-gateway/blob/6dd8bd6afbea6dc2c68128d17a04a276e07970b2/internal/errors.pb.go#L45-L61

I don't know why it does that. InternalMessageInfo is, as far as I can tell, not something that should show up here. The upstream proto package we are now using aggressively deprecated this by inserting the panics.

However, they have since backpedaled and made this functional again:
https://github.com/golang/protobuf/pull/1129/files

Picking up that fix should give us temporary relief.

I'll write up a unit test that repros the panic and make sure it is fixed by a dep bump. But this is certainly going to come back and haunt us in a future upgrade, so I will also file something against grpc-gateway.

@tbg
Copy link
Member

tbg commented Jun 4, 2020

errors.pb.go is generated here:

https://github.com/grpc-ecosystem/grpc-gateway/blob/master/Makefile#L150

This invocation looks vanilla. I wonder why it results in deprecated code. Perhaps it just hasn't been updated in a long time, I will take a look at that.

@tbg
Copy link
Member

tbg commented Jun 4, 2020

grpc-gateway has proto 1.3.2 pinned:

docker run -v $(pwd):/src/grpc-gateway --rm docker.pkg.github.com/grpc-ecosystem/grpc-gateway/build-env:1.14     /bin/bash -c 'cd /src/grpc-gateway && make internal/errors.pb.go'
go build -o bin/protoc-gen-go github.com/golang/protobuf/protoc-gen-go
go: downloading github.com/golang/protobuf v1.3.2
protoc -I /usr/local/bin//../include --plugin=bin/protoc-gen-go -I. --go_out=Mgoogle/protobuf/field_mask.proto=google.golang.org/genproto/protobuf/field_mask,Mgoogle/protobuf/descriptor.proto=github.com/golang/protobuf/protoc-gen-go/descriptor,Mexamples/internal/proto/sub/message.proto=github.com/grpc-ecosystem/grpc-gateway/examples/internal/proto/sub,paths=source_relative:. internal/errors.proto

It is evidently creating that code that was then deprecated in 1.4.x.

I think we can sit this one out then. This problem will affect everyone mixing proto-1.3 generated code with proto 1.4, which is why proto added the PR that makes it work again. The legacy behavior will stick around, and once grpc-gateway decides to use proto 1.4.x their generated code will not have that problem any more.

I'll send a PR to bump golang/protobuf to pick up the PR mentioned above.

@nvanbenschoten
Copy link
Member

This was also being tracked in #49612, so I'll close that. Thanks for tracking this down @tbg!

craig bot pushed a commit that referenced this issue Jun 4, 2020
49827: geo/geomfn: Implements ST_Segmentize for geometry r=otan a=abhishek20123g

Fixes #49029

This PR implements ST_Segmentize({geometry, float8}) builtin
function, which allows modify given geometry such that no
segment longer than the given max_segment_length.

Also this PR refactors and add extra test cases for
ST_Segmentize for geography.

Release note (sql change): This PR implements ST_Segmentize({geometry,
float8}) builtin function.

49833: geo/geomfn: implement Intersection, PointOnSurface, Union r=sumeerbhola a=otan

The last of the topology operators up to Chapter 20.

Resolves #48951
Resolves #49832 
Resolves #49064

Release note (sql change): Implements the ST_Intersection,
ST_PointOnSurface and ST_Union builtin functions.

49869: vendor: bump golang/protobuf to 1.4.2 r=knz a=tbg

v1.4.1 aggressively deprecated something (by inserting panics) that was
reachable via gogoproto's marshaler. Luckily, v1.4.2 has this "fixed";
it caused enough trouble for others as well.

Closes #49842.

Release note: None

49870: schemachange: unskip TestDropWhileBackfill r=spaskob a=spaskob

Disabling the GC job was preventing this test from completing.
Tested with `test stress`: 1000 successful runs.

Fixes #44944.

Release note: none.

49871: kvserver: fixup test failure message r=andreimatei a=andreimatei

Expected and real err were reversed.

Release note: None

Co-authored-by: abhishek20123g <abhishek20123g@gmail.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig craig bot closed this as completed in 415d614 Jun 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-server Relating to the KV-level RPC server C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants