
Panic when Ingester is starting #1195

Closed
andersosthus opened this issue Jan 3, 2022 · 4 comments · Fixed by #1197
Comments

@andersosthus

Describe the bug
When one of my Ingesters is starting, it crashloops with panic: runtime error: slice bounds out of range [2610010201:824]

To Reproduce
Not sure how this started or how to reproduce it.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Kustomize
  • Tempo version: 1.2.1

Additional Context
Full panic output:

level=info ts=2022-01-03T12:49:23.234802886Z caller=main.go:189 msg="initialising OpenTracing tracer"
level=info ts=2022-01-03T12:49:23.256529595Z caller=main.go:108 msg="Starting Tempo" version="(version=, branch=HEAD, revision=0eaca8a01)"
level=info ts=2022-01-03T12:49:23.364022024Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2022-01-03T12:49:23.364817735Z caller=memberlist_client.go:394 msg="Using memberlist cluster node name" name=tempo-ingester-norwayeast-3-0-d7d51877
level=info ts=2022-01-03T12:49:23.364933437Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2022-01-03T12:49:23.365034438Z caller=module_service.go:64 msg=initialising module=overrides
level=info ts=2022-01-03T12:49:23.364857736Z caller=module_service.go:64 msg=initialising module=store
level=info ts=2022-01-03T12:49:23.365201941Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2022-01-03T12:49:23.365511945Z caller=module_service.go:64 msg=initialising module=ingester
level=info ts=2022-01-03T12:49:23.365535045Z caller=ingester.go:330 msg="beginning wal replay"
level=info ts=2022-01-03T12:49:23.365595546Z caller=wal.go:101 msg="beginning replay" file=57716e16-32a0-4e83-9b78-e7b161669887:single-tenant:v2:snappy:v1 size=6422528
level=info ts=2022-01-03T12:49:23.396842291Z caller=memberlist_client.go:506 msg="joined memberlist cluster" reached_nodes=7
level=info ts=2022-01-03T12:49:23.434710229Z caller=wal.go:128 msg="replay complete" file=57716e16-32a0-4e83-9b78-e7b161669887:single-tenant:v2:snappy:v1 duration=69.121083ms
level=info ts=2022-01-03T12:49:23.434804131Z caller=rescan_blocks.go:34 msg="beginning replay" file=57716e16-32a0-4e83-9b78-e7b161669887:single-tenant:v2:none: size=12275712
panic: runtime error: slice bounds out of range [2610010201:824]

goroutine 403 [running]:
github.com/google/flatbuffers/go.(*Table).GetVOffsetT(...)
        /drone/src/vendor/github.com/google/flatbuffers/go/table.go:134
github.com/google/flatbuffers/go.(*Table).Offset(0x0, 0x0)
        /drone/src/vendor/github.com/google/flatbuffers/go/table.go:16 +0xdf
github.com/grafana/tempo/pkg/tempofb.(*KeyValues).Key(0xc00024d848)
        /drone/src/pkg/tempofb/KeyValues.go:37 +0x25
github.com/grafana/tempo/pkg/tempofb.(*SearchBlockHeaderMutable).AddEntry(0xc04125dba0, 0xc00024d8a8)
        /drone/src/pkg/tempofb/SearchBlockHeader_util.go:30 +0xa9
github.com/grafana/tempo/tempodb/search.newStreamingSearchBlockFromWALReplay.func1({0xc0456bea90, 0x262e840, 0xc0456c0000})
        /drone/src/tempodb/search/rescan_blocks.go:92 +0x7c
github.com/grafana/tempo/tempodb/wal.ReplayWALAndGetRecords(0xc0005fa878, {0x269aad0, 0x3926a50}, 0x3b, 0xc00024dab0)
        /drone/src/tempodb/wal/replay.go:52 +0x335
github.com/grafana/tempo/tempodb/search.newStreamingSearchBlockFromWALReplay({0xc0001319f8, 0xc041296a80}, {0xc041290676, 0x3b})
        /drone/src/tempodb/search/rescan_blocks.go:90 +0x211
github.com/grafana/tempo/tempodb/search.RescanBlocks({0xc00011e380, 0x2630660})
        /drone/src/tempodb/search/rescan_blocks.go:38 +0x4dd
github.com/grafana/tempo/modules/ingester.(*Ingester).replayWal(0xc000806300)
        /drone/src/modules/ingester/ingester.go:337 +0x1c6
github.com/grafana/tempo/modules/ingester.(*Ingester).starting(0xc000806300, {0x2677ce0, 0xc040bbcd40})
        /drone/src/modules/ingester/ingester.go:101 +0x2c
github.com/grafana/dskit/services.(*BasicService).main(0xc0003e8820)
        /drone/src/vendor/github.com/grafana/dskit/services/basic_service.go:157 +0x78
created by github.com/grafana/dskit/services.(*BasicService).StartAsync.func1
        /drone/src/vendor/github.com/grafana/dskit/services/basic_service.go:119 +0xbe
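The message slice bounds out of range [2610010201:824] means a slice was indexed with a start offset of 2,610,010,201 against a buffer only 824 bytes long, which is what happens when a corrupted length or vtable offset is read from the file and used without validation. A minimal sketch of the kind of bounds check that prevents this (the readRecord helper is hypothetical, not Tempo's actual WAL code):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readRecord reads one length-prefixed record from buf starting at off and
// validates the claimed length before slicing. Without the bounds check, a
// corrupted length field produces exactly the kind of
// "slice bounds out of range" panic shown in the stack trace above.
func readRecord(buf []byte, off int) (rec []byte, next int, err error) {
	if off < 0 || off+4 > len(buf) {
		return nil, 0, errors.New("truncated length prefix")
	}
	n := int(binary.LittleEndian.Uint32(buf[off : off+4]))
	end := off + 4 + n
	if n < 0 || end > len(buf) {
		return nil, 0, fmt.Errorf("record length %d exceeds remaining %d bytes", n, len(buf)-off-4)
	}
	return buf[off+4 : end], end, nil
}

func main() {
	// Well-formed record: 3-byte payload "abc".
	rec, _, err := readRecord([]byte{3, 0, 0, 0, 'a', 'b', 'c'}, 0)
	fmt.Println(string(rec), err) // abc <nil>

	// Corrupted record: the length field decodes to 2610010201 (the same
	// bogus offset seen in the panic) against a 6-byte buffer.
	_, _, err = readRecord([]byte{0x59, 0x98, 0x91, 0x9B, 1, 2}, 0)
	fmt.Println(err != nil) // true
}
```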
@andersosthus
Author

Let me know if you need more info from me, or if you need me to do some debugging

@mdisibio
Contributor

mdisibio commented Jan 3, 2022

Thanks for reporting this. At first glance, a panic in the flatbuffers area is generally caused by a corrupted file. Did the ingester crash with a different error prior to this panic? If there is still a copy of the file 57716e16-32a0-4e83-9b78-e7b161669887:single-tenant:v2:none: we can debug the replay. WAL replay should tolerate corrupted files by just replaying what it can, so there is definitely room for improvement here.

Occasionally this type of panic can also be caused by a bug in flatbuffer struct access, but I'm not seeing that so far.
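One way to get the "replay what it can" behaviour is to recover from a decode panic per record and stop at the first corrupted entry. A minimal sketch under that assumption (replayEntry is a hypothetical stand-in for the flatbuffer decoding done during WAL replay, not Tempo's actual implementation):

```go
package main

import "fmt"

// replayEntry stands in for the per-record decoding done during WAL replay;
// on a corrupted record it panics, much like an unchecked flatbuffer read
// with a bogus offset would.
func replayEntry(rec []byte) {
	_ = rec[10] // hypothetical decode: panics on records shorter than 11 bytes
}

// decodeSafely converts a panic while decoding one record into an error.
func decodeSafely(rec []byte) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("corrupted record: %v", r)
		}
	}()
	replayEntry(rec)
	return nil
}

// safeReplay replays records until the first corrupted one, returning how
// many entries were successfully replayed instead of crashing the process.
func safeReplay(records [][]byte) (int, error) {
	for i, rec := range records {
		if err := decodeSafely(rec); err != nil {
			return i, err
		}
	}
	return len(records), nil
}

func main() {
	// Two good records followed by a truncated (corrupted) one.
	records := [][]byte{make([]byte, 32), make([]byte, 32), {1, 2}}
	n, err := safeReplay(records)
	fmt.Printf("replayed %d records, err: %v\n", n, err)
}
```

The key design point is that the recover sits around a single record, so one bad entry costs only the tail of the file rather than the whole ingester.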

@andersosthus
Author

Hi, I'm not sure what happened before, but if my suspicions are correct, the node it was running on might have been terminated without giving the ingester proper time to shut down.

I'll see if the file is still there and get it out if you want to debug it.

@andersosthus
Author

The file was still there, in the wal/search directory. I have a copy of it (12 MB) if you would like to debug it. I also grabbed the snappy file in case it is needed.

For my part, I can just wipe the PVC and restart the ingester; since the two other ingesters are running fine, there shouldn't have been any data loss. But I can hold off for a day in case you need something else from the volume as well.

I can upload it to Google Drive and send you a link on Slack if that would be ok?
