Ingest Crashes Due To Zero Byte WAL File #1603
Comments
Hi @tonychoe, thanks for your report. This is indeed strange behaviour because we have guard code to ensure zero-length WAL files are never created. The following line ensures that if there is no ingested data, we never create a WAL file: tempo/modules/ingester/instance.go Line 254 in 7791a36
I ran the following test on my local machine to simulate "crashing a container" and see whether we write zero-length files in that case, but could not reproduce your issue -
However, from your terminal output, it appears that it's not actually a zero-length WAL file but rather a corrupt block created while replaying a legitimate WAL file. The bloom filters, meta.json, index, etc., are only created on completion of WAL replay. Here is where that happens: tempo/tempodb/encoding/v2/streaming_block.go Line 145 in 7791a36
Are you able to reproduce this crash reliably? If so, could you please attach the entire (recursive) directory structure of the
@annanay25 thanks for reviewing and sharing the insights. This has been a repeated problem in one of our production regions, so we had to implement a temporary fix while writing this issue - an init container for the ingester that So I have only this path copied in my notes: To get the entire dir structure, we need to revert the fix and let the issue come back. We'll get back.
@tonychoe I see. Yes, in that case it looks like there was a bug while replaying the WAL. Is it possible to retrieve the logs from when the ingester pod crashed? Was this done manually (under "steps to reproduce" it's mentioned "Crash the ingester pod")? I'm not sure how an empty
We have now seen this internally along with another possibly related error:
The situation was the same: a crash loop until the block was removed (in our case, with a whole new disk). Although we haven't determined the root cause yet, we have a few ideas on error-handling improvements:
Internally we determined that the ingester pod was unexpectedly terminated due to a hardware fault on the k8s node, which led to the improperly flushed meta.json. Therefore this should be considered an expected, though rare, occurrence, and the changes to continue starting up seem sufficient. Perhaps in the future Tempo could automatically delete the bad block if it could determine there would be no data loss.
Describe the bug
We have the environment ready for go-live, but there is no traffic currently. When the ingester crashes for whatever reason, it goes into a crash loop complaining about zero-byte WAL files. We assume the empty WAL files are legitimate because there was no traffic in this environment. We could revive the ingester by deleting those empty WAL files.
Could this be related to some handling of empty WAL files?
This is one tenant's WAL directory on the PV attached to the crashed ingester.
var/tempo/wal/blocks/fa/f592df09-58aa-434f-9674-f39810d7352f
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The ingester boots up
Environment: