compactor is in infinite loop when broken block #621

Closed

dmitriy-lukyanchikov opened this issue Nov 8, 2018 · 7 comments
@dmitriy-lukyanchikov

Thanos, Prometheus and Golang version used

thanos, version 0.1.0 (branch: master, revision: 3050831bec12684398ce6deb613788714b7924d9)
  build user:       circleci@a8c441c7e82a
  build date:       20181026-11:11:12
  go version:       go1.10.4

What happened
I tried to reproduce the situation where the compactor is shut down in the middle of uploading newly compacted blocks. The problem is that if a block is only partially uploaded, the compactor gets stuck syncing metas when it starts again. I think it should be possible to detect that a block is corrupted and remove it from the list of queryable blocks, or in this particular case simply skip it.
What you expected to happen
I expected broken blocks to be skipped, with an error or warning printed that one of the blocks is broken.
How to reproduce it (as minimally and precisely as possible):
Start the compactor, wait until it finishes compacting and begins uploading, and kill it in the middle of the upload.
Full logs to relevant components

Logs

level=debug ts=2018-11-08T12:36:52.954662741Z caller=compact.go:174 msg="download meta" block=01CVSHT550PVJBPVKW7905KTAP
level=debug ts=2018-11-08T12:36:53.07216329Z caller=compact.go:174 msg="download meta" block=01CVSJ9TQ6XFP4VBDST5HS47NJ
level=debug ts=2018-11-08T12:36:53.181458976Z caller=compact.go:174 msg="download meta" block=01CVSJYK2V0EQNPYTS5A88WF81
level=debug ts=2018-11-08T12:36:53.295023922Z caller=compact.go:174 msg="download meta" block=01CVSKPHPSQ5YEGDX0NNWY710H
level=debug ts=2018-11-08T12:36:53.430717664Z caller=compact.go:174 msg="download meta" block=01CVSMG4JS1KXZPBA976FD4ZZT
level=error ts=2018-11-08T12:36:53.524348802Z caller=compact.go:207 msg="retriable error" err="compaction failed: sync: retrieve bucket block metas: downloading meta.json for 01CVSMG4JS1KXZPBA976FD4ZZT: meta.json bkt get for 01CVSMG4JS1KXZPBA976FD4ZZT: The specified key does not exist."

Anything else we need to know
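
To make the skip suggestion above concrete, here is a minimal Go sketch, assuming the meta sync can check per block whether meta.json exists; the Bucket interface, downloadMeta, and syncMetas names are hypothetical stand-ins, not the actual Thanos code:

```go
// Minimal sketch (not the actual Thanos code): skip blocks whose meta.json is
// missing instead of failing the whole meta sync.
package main

import (
	"context"
	"fmt"
	"log"
)

// Bucket is a tiny stand-in for an object-storage client.
type Bucket interface {
	Exists(ctx context.Context, name string) (bool, error)
}

type Meta struct{ ULID string }

// downloadMeta is a placeholder for the real meta.json download.
func downloadMeta(ctx context.Context, bkt Bucket, id string) (*Meta, error) {
	return &Meta{ULID: id}, nil
}

// syncMetas downloads meta.json for every block ID, but skips blocks where the
// file is absent (a likely sign of a partial upload) instead of aborting.
func syncMetas(ctx context.Context, bkt Bucket, ids []string) ([]*Meta, error) {
	var metas []*Meta
	for _, id := range ids {
		ok, err := bkt.Exists(ctx, id+"/meta.json")
		if err != nil {
			return nil, fmt.Errorf("check meta.json for %s: %v", id, err)
		}
		if !ok {
			log.Printf("warning: block %s has no meta.json, skipping (possibly a partial upload)", id)
			continue
		}
		m, err := downloadMeta(ctx, bkt, id)
		if err != nil {
			return nil, fmt.Errorf("download meta.json for %s: %v", id, err)
		}
		metas = append(metas, m)
	}
	return metas, nil
}

func main() {}
```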

@bwplotka bwplotka added the bug label Nov 8, 2018
@bwplotka
Member

bwplotka commented Nov 8, 2018

Yup, valid issue. But there was something related regarding partial blocks: #377

@dmitriy-lukyanchikov
Author

Hello, I read #377. @bwplotka, what if we upload the file to S3 under a temporary *.tmp name and rename it only once it has been uploaded successfully? Does that make sense?

@dmitriy-lukyanchikov
Author

Hm, it looks like moving or renaming is not possible in S3. I think it would work if all components skipped partially uploaded blocks, but I'm not sure.

@ebedarev

@bwplotka,
I'm planning to work on a PR with a fix for this issue. Before doing that, I'd like to know your opinion on what solution would be acceptable here. I think the simplest way to go is to delete a corrupted block (one that is missing its metadata) that was previously created by the compactor, and let the compactor re-create it. The main trick is to identify such a block as one previously created by the compactor. I currently see 3 options:

  1. Use the debug/metas information. It has information about all the blocks that were written to a bucket. Theoretically this metadata would even be enough to recover a block if the only thing missing is meta.json. But I'm not sure how correct that would be, since it's debugging functionality and I suspect it might be disabled in the future.
  2. Write temporary metadata for each block before uploading. It would be another copy of the same metadata as the block's meta.json or debug/metas, used to identify the block as created by the compactor. After a block is successfully uploaded, this temporary metadata would be deleted (a rough sketch of this option follows after this comment).
  3. Use storage-specific tags to mark objects (files) as created by the compactor. This approach would require an implementation for each cloud provider.

Maybe you have some plan for how to fix it. Please let me know.
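
For illustration, a rough sketch of option 2 above, assuming a hypothetical object-store client; markerPath, uploadWithMarker, and cleanupPartial are made-up names, not an existing Thanos API:

```go
// Sketch of option 2: write a temporary marker before uploading a block and
// delete it once the upload completes, so a later run can tell a half-uploaded
// compactor block from one produced by something else.
package main

import (
	"bytes"
	"context"
	"fmt"
)

// ObjStore is a minimal stand-in for an object-storage client.
type ObjStore interface {
	Upload(ctx context.Context, name string, r *bytes.Reader) error
	Delete(ctx context.Context, name string) error
	Exists(ctx context.Context, name string) (bool, error)
}

func markerPath(blockID string) string { return "compactor-uploads/" + blockID }

// uploadWithMarker marks a block as "upload in progress" before writing it.
func uploadWithMarker(ctx context.Context, store ObjStore, blockID string, uploadBlock func() error) error {
	if err := store.Upload(ctx, markerPath(blockID), bytes.NewReader([]byte("uploading"))); err != nil {
		return fmt.Errorf("write upload marker: %v", err)
	}
	if err := uploadBlock(); err != nil {
		return err // marker stays behind, signalling a partial upload
	}
	return store.Delete(ctx, markerPath(blockID))
}

// cleanupPartial deletes a block that has a leftover marker but no meta.json:
// by construction it was produced by the compactor and never fully uploaded.
func cleanupPartial(ctx context.Context, store ObjStore, blockID string, deleteBlock func(string) error) error {
	marked, err := store.Exists(ctx, markerPath(blockID))
	if err != nil || !marked {
		return err
	}
	hasMeta, err := store.Exists(ctx, blockID+"/meta.json")
	if err != nil || hasMeta {
		return err
	}
	if err := deleteBlock(blockID); err != nil {
		return err
	}
	return store.Delete(ctx, markerPath(blockID))
}

func main() {}
```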

@bwplotka
Member

bwplotka commented Mar 4, 2019

Hey, wow, quite a long time since the initial response, sorry for the delay.

The way we want to solve this is specified here: https://github.com/improbable-eng/thanos/blob/master/docs/proposals/approved/201901-read-write-operations-bucket.md

@bwplotka
Member

...there is no timeline on the above, so we need a faster fix for partial blocks...

The root cause of this issue is that the compactor crashed/restarted in the middle of an upload and did not have time to finish it. We need to handle this case.
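
One way such a faster fix could look, as a hedged sketch only (the grace period and ULID-age check here are assumptions, not necessarily what the eventual fix implements): treat a block with no meta.json as partial and delete it once it is older than a grace period, so recent in-flight uploads are left alone.

```go
// Sketch: remove blocks that are missing meta.json and are old enough that an
// upload can no longer plausibly be in progress.
package main

import (
	"context"
	"log"
	"time"

	"github.com/oklog/ulid"
)

// partialUploadGracePeriod is an assumed value, not taken from Thanos.
const partialUploadGracePeriod = 2 * time.Hour

// maybeDeletePartial removes a block that has no meta.json and whose ULID
// timestamp is older than the grace period; deleteBlock is a hypothetical hook.
func maybeDeletePartial(ctx context.Context, id ulid.ULID, hasMeta bool, deleteBlock func(context.Context, ulid.ULID) error) error {
	if hasMeta {
		return nil
	}
	age := time.Since(ulid.Time(id.Time()))
	if age < partialUploadGracePeriod {
		log.Printf("block %s has no meta.json but is only %s old; assuming upload in progress", id, age)
		return nil
	}
	log.Printf("deleting partially uploaded block %s (age %s)", id, age)
	return deleteBlock(ctx, id)
}

func main() {}
```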

@bwplotka
Member

bwplotka commented Apr 18, 2019

Fix: #1053

@bwplotka bwplotka closed this as completed May 4, 2019