runtime error: slice bounds out of range #14345
Comments
Hi @ohtyap can you provide some more information for the team around your deployment:
That should help us out. |
@Jayclifford345 thanks a lot, sure, here is the info:
Version: We encountered the problem with loki
Deployment Type: Simple Scalable (on AWS with S3 storage)
The query and range of data: It happens every time loki wants to retrieve the affected log lines/chunks 🤷 Basically a
The loki config:
|
I can deterministically reproduce this, and I get the following error in all queriers right before the crash loop: goroutine 12813 [running]: sum(count_over_time( on a 16-hour range. It happens on an 8-hour range as well, but not on a 4-hour range. Maybe it is not the range that counts, but the data in the extra range. The version we use: b4f7181 (HEAD, tag: v3.0.0) |
Thank you all for the extra information. I will raise this at the Loki engineer call tomorrow. |
Thank You! |
@Jayclifford345 Is there anything we can help with - like providing additional information or something specific to check? |
Hi @ohtyap, sorry for the late reply. We sadly didn't have the engineer call this week since it's focus week. I will make sure the team is aware on Monday to take a look. |
@Jayclifford345 No worries; I suspected it would be hard to reproduce. So please ping me in case we can check or try something, or help out on this one in any other way. I hope there is a solution or a fix, as otherwise loki is sadly not usable for us (this bug makes the usage quite unreliable - broken dashboards, alarms, etc.) 🤞 |
Hey @ohtyap This is a bug in the chunk decoder. Either the chunk is corrupted, or it was wrongly encoded.
|
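To illustrate how a corrupted or wrongly encoded chunk can surface as `slice bounds out of range`, here is a minimal Go sketch. It is not Loki's chunk decoder: it assumes a generic record format with a uvarint length prefix followed by the payload, and the names `decodeUnchecked`, `decodeChecked`, and `errCorrupt` are hypothetical. A decoder that trusts the declared length panics when the prefix is larger than the remaining bytes, while a bounds check turns the same input into an ordinary error.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// errCorrupt is a hypothetical error for records whose length prefix is invalid.
var errCorrupt = errors.New("corrupt record: declared length exceeds buffer")

// decodeUnchecked trusts the length prefix. If the prefix claims more bytes
// than the buffer holds, the slice expression panics at runtime.
func decodeUnchecked(buf []byte) []byte {
	n, w := binary.Uvarint(buf)
	return buf[w : w+int(n)]
}

// decodeChecked validates the prefix before slicing and returns an error
// instead of panicking.
func decodeChecked(buf []byte) ([]byte, error) {
	n, w := binary.Uvarint(buf)
	if w <= 0 || n > uint64(len(buf)-w) {
		return nil, errCorrupt
	}
	return buf[w : w+int(n)], nil
}

func main() {
	good := []byte{3, 'a', 'b', 'c'} // declares 3 bytes, 3 follow
	bad := []byte{9, 'a', 'b', 'c'}  // declares 9 bytes, only 3 follow

	payload, err := decodeChecked(good)
	fmt.Println(string(payload), err)

	_, err = decodeChecked(bad)
	fmt.Println("checked decode of bad record:", err)

	// Recover only to show the panic message without ending the program.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("unchecked decode panicked:", r)
		}
	}()
	decodeUnchecked(bad)
}
```

A process that does not recover from such a panic exits, which matches the crash loop described in this thread. |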
Thanks for your help. I will try to upgrade Yes, we are using structured metadata. Is there a way to check the chunk "manually" to see if the chunk is corrupted or wrongly encoded? As mentioned above, we were able to narrow it down to a timeframe a few seconds long. If there is a reasonable way to locate the files on S3 and check them manually, I would also be willing to debug them in this direction. |
Ok, that can narrow down the possible
There is a tool I think the trickier part will be to figure out what chunk is causing that. We may need some extra logging there to get the filename of the chunk that fails to process. Without warranty, I already have a suspicion that #13720 could have introduced this bug, because that's more or less the only thing that changed "recently". |
I will try loki 3.2 as soon as it is available via helm chart 👍 Meanwhile, I will try my best to narrow it down via |
Thanks @ohtyap for testing. Feel free to close the issue once you feel confident that the bug is fixed with 3.2. |
It has now been running for one week without issues - timeframes that had problems before can be queried again (without crashes). So I consider this one solved. Thanks again everyone for your help! |
Describe the bug
When querying certain timeframes within loki, loki-read is crashing (see stacktrace output below). Sadly, I do not have more info than the stacktrace - I was not able to see any pattern in when this happens. It happens for different labels (e.g. completely different types of logs), different times, etc. Any idea how to debug this further?
To Reproduce
I am not able to deliberately reproduce the issue ... it just happens from time to time (and basically completely breaks the usage of loki)
Expected behavior
loki not crashing 😅
Environment:
Screenshots, Promtail config, or terminal output