runtime error: slice bounds out of range #14345

ohtyap · 2024-10-01T17:12:46Z

Describe the bug
When querying certain timeframes in Loki, loki-read crashes (see the stack trace below). Sadly, I have no more information than the stack trace; I was not able to see any pattern in when this happens. It happens for different labels (e.g. completely different types of logs), at different times, etc.
Any idea how to debug this further?

To Reproduce
I am not able to deliberately reproduce the issue ... it just happens from time to time (and basically completely breaks the usage of Loki).

Expected behavior
loki not crashing 😅

Environment:

Infrastructure: Kubernetes
Deployment tool: helm

Screenshots, Promtail config, or terminal output


2024-10-01 16:12:17.196 | panic: runtime error: slice bounds out of range [:1360462] with capacity 1319365
2024-10-01 16:12:17.196 | 
2024-10-01 16:12:17.196 | goroutine 11486 [running]:
2024-10-01 16:12:17.196 | github.com/grafana/loki/v3/pkg/chunkenc.newByteChunk({0xc039b8019a, 0x141fc5, 0x1421c5}, 0x0, 0x0, 0x0)
2024-10-01 16:12:17.196 | /src/loki/pkg/chunkenc/memchunk.go:479 +0xd9e
2024-10-01 16:12:17.196 | github.com/grafana/loki/v3/pkg/chunkenc.NewByteChunk(...)
2024-10-01 16:12:17.196 | /src/loki/pkg/chunkenc/memchunk.go:392
2024-10-01 16:12:17.196 | github.com/grafana/loki/v3/pkg/chunkenc.(*Facade).UnmarshalFromBuf(0xc00e86f4a0, {0xc039b8019a?, 0x35a2f00?, 0x4da0e40?})
2024-10-01 16:12:17.196 | /src/loki/pkg/chunkenc/facade.go:64 +0x34
2024-10-01 16:12:17.196 | github.com/grafana/loki/v3/pkg/storage/chunk.(*Chunk).Decode(0xc00f757b10, 0xc00ab8dd38, {0xc039b80000, 0x14215f, 0x14235f})
2024-10-01 16:12:17.196 | /src/loki/pkg/storage/chunk/chunk.go:359 +0x596
2024-10-01 16:12:17.196 | github.com/grafana/loki/v3/pkg/storage/chunk/client.(*client).getChunk(_, {_, _}, _, {{0x33f196330e3b21ab, {0xc00b1cf338, 0x4}, 0x19247502928, 0x1924761e382, 0x2e112952}, ...})
2024-10-01 16:12:17.196 | /src/loki/pkg/storage/chunk/client/object_client.go:187 +0x5ca
2024-10-01 16:12:17.196 | github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks.func2()
2024-10-01 16:12:17.196 | /src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:48 +0x1de
2024-10-01 16:12:17.196 | created by github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks in goroutine 11537
2024-10-01 16:12:17.196 | /src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:45 +0x485
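
For readers of the trace: Go prints a `[]byte` argument as three words (pointer, length, capacity). The small sketch below decodes the words from the `newByteChunk` frame and compares them with the panic message. The helper name `parseSliceHeader` is made up for illustration and is not part of Loki.

```go
package main

import (
	"fmt"
	"strconv"
)

// sliceHeader mirrors the three words Go prints for a []byte argument
// in a stack trace: data pointer, length, capacity.
type sliceHeader struct {
	Ptr, Len, Cap uint64
}

// parseSliceHeader (hypothetical helper) converts the three hex words of a
// trace argument such as {0xc039b8019a, 0x141fc5, 0x1421c5} into numbers.
// Base 0 lets strconv honor the "0x" prefix.
func parseSliceHeader(ptr, length, capacity string) (sliceHeader, error) {
	var h sliceHeader
	var err error
	if h.Ptr, err = strconv.ParseUint(ptr, 0, 64); err != nil {
		return h, err
	}
	if h.Len, err = strconv.ParseUint(length, 0, 64); err != nil {
		return h, err
	}
	if h.Cap, err = strconv.ParseUint(capacity, 0, 64); err != nil {
		return h, err
	}
	return h, nil
}

func main() {
	// Words copied from the newByteChunk frame in the trace above.
	h, err := parseSliceHeader("0xc039b8019a", "0x141fc5", "0x1421c5")
	if err != nil {
		panic(err)
	}
	const requested = 1360462 // upper bound from the panic message
	fmt.Printf("len=%d cap=%d requested=%d overshoot=%d\n",
		h.Len, h.Cap, requested, requested-h.Cap)
	// prints: len=1318853 cap=1319365 requested=1360462 overshoot=41097
}
```

The capacity word (0x1421c5 = 1319365) matches the capacity in the panic message, so the decoder apparently tried to slice 41097 bytes past the end of the buffer it was given. That fits an offset read from the chunk pointing outside the chunk, but the trace alone does not say whether the stored bytes or the decoder are at fault.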


Jayclifford345 · 2024-10-02T08:52:36Z

Hi @ohtyap can you provide some more information for the team around your deployment:

Version
Deployment type
The query
Range of data
Loki config

That should help us out.

ohtyap · 2024-10-02T09:18:15Z

@Jayclifford345 thanks a lot. Sure, here is the info:

Version: We encountered the problem with Loki 3.1.0. We have upgraded to 3.1.1 in the meantime (it happens there too). But I am not sure whether it is really a 3.1.x problem.

Deployment Type: Simple Scalable (on AWS with S3 storage)

The query and range of data: It happens every time Loki tries to retrieve the affected log lines/chunks 🤷 Basically, a {pod="xyz"} is enough to trigger the error if an "infected" timeframe is selected. In some cases we were able to narrow the affected timeframe down to a few seconds. If someone selects such a timeframe (even just 1 second of it), the crash happens.

The loki config:

loki:
  server:
    grpc_server_max_recv_msg_size: 1.048576e+08
    grpc_server_max_send_msg_size: 1.048576e+08
  storage:
    type: 's3'
    s3:
      region: XYZ
    bucketNames:
      chunks: XYZ
      ruler: XYZ
      admin: XYZ
  limits_config:
    max_line_size: 512KB
    max_query_length: 0
    ingestion_rate_mb: 24
    ingestion_burst_size_mb: 32
    reject_old_samples_max_age: 30d
    split_queries_by_interval: 1h
    query_timeout: 10m
    tsdb_max_query_parallelism: 100
    retention_period: 90d
    retention_stream:
      - selector: '{channel="XYZ"}'
        period: 372d
        priority: 100

  schemaConfig:
    configs:
      - from: 2024-04-30
        store: tsdb
        object_store: aws
        schema: v13
        index:
          prefix: index_
          period: 24h
  querier:
    max_concurrent: 16

  query_scheduler:
    max_outstanding_requests_per_tenant: 3276

  compactor:
    retention_enabled: true
    working_directory: /retention
    delete_request_store: s3
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    compaction_interval: 10m

  ingester:
    chunk_encoding: snappy
    autoforget_unhealthy: true

axelbodo · 2024-10-04T20:25:07Z

I can deterministically reproduce this, and I get the following error in all querier right before the crash loop:
panic: runtime error: slice bounds out of range [:837727] with capacity 760512

goroutine 12813 [running]:
github.com/grafana/loki/v3/pkg/chunkenc.newByteChunk({0xc02f7001f8, 0xb98c0, 0xb9ac0}, 0x0, 0x0, 0x0)
/src/loki/pkg/chunkenc/memchunk.go:479 +0xdbe
github.com/grafana/loki/v3/pkg/chunkenc.NewByteChunk(...)
/src/loki/pkg/chunkenc/memchunk.go:392
github.com/grafana/loki/v3/pkg/chunkenc.(*Facade).UnmarshalFromBuf(0xc02f04a2d0, {0xc02f7001f8?, 0x3260580?, 0x4944280?>
/src/loki/pkg/chunkenc/facade.go:64 +0x34
github.com/grafana/loki/v3/pkg/storage/chunk.(*Chunk).Decode(0xc02d1b7b10, 0xc00079d570, {0xc02f700000, 0xb9ab8, 0xb9cb>
/src/loki/pkg/storage/chunk/chunk.go:359 +0x5a2
github.com/grafana/loki/v3/pkg/storage/chunk/client.(*client).getChunk(_, {_, _}, _, {{0x4246a33a92bd0a8b, {0xc02f04c96>
/src/loki/pkg/storage/chunk/client/object_client.go:187 +0x34f
github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks.func2()
/src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:48 +0x1c4
created by github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks in goroutine 12716
/src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:45 +0x485

sum(count_over_time({service_name="logstash-loki-pipeline", loki_log_group="k8s-pod"} | logfmt [1m])) by(namespace)

This happens on a 16-hour range and on an 8-hour range as well, but not on a 4-hour range. Maybe it is not the length of the range that counts, but the data in the extra range.

The version we use: b4f7181 (HEAD, tag: v3.0.0)

Jayclifford345 · 2024-10-07T08:38:23Z

Thank you all for the extra information. I will raise this at the Loki engineer call tomorrow.

axelbodo · 2024-10-09T15:34:56Z

Thank You!
I narrowed the query range down to 1s and the over_time interval to 10ms, and some of the queriers still produced this panic. However, when only 1 or 2 of them panic, Grafana doesn't show "context cancelled" in the UI. So it surely relates to some chunks in that interval.

ohtyap · 2024-10-11T11:50:24Z

@Jayclifford345 Is there anything we can help with - like providing additional information or something specific to check?

Jayclifford345 · 2024-10-11T12:27:04Z

Hi @ohtyap, sorry for the late reply. We sadly didn't have the engineer call this week since it's focus week. I will make sure the team is aware on Monday so they can take a look.

ohtyap · 2024-10-11T12:46:36Z

@Jayclifford345 No worries; I suspected it would be hard to reproduce. So please ping me if we can check or try something, or help out on this one in any other way. I hope there is a solution or a fix, as otherwise Loki is sadly not usable for us (this bug makes it quite unreliable: broken dashboards, alarms, etc.) 🤞

chaudum · 2024-10-14T15:24:33Z

Hey @ohtyap

This is a bug in the chunk decoder. Either the chunk is corrupted, or it was wrongly encoded.
To help us reproduce the issue,

would you be able to test with 3.2.0 as well, so we could narrow down to possible commits?
do you ingest structured metadata?

ohtyap · 2024-10-14T16:25:07Z

@chaudum

Thanks for your help. I will try to upgrade loki this week, but I have to check the Helm chart first, etc...

Yes, we are using structured metadata.

Is there a way to check the chunk "manually" to see if the chunk is corrupted or wrongly encoded? As mentioned above, we were able to narrow it down to a timeframe a few seconds long. If there is a reasonable way to locate the files on S3 and check them manually, I would also be willing to debug them in this direction.

chaudum · 2024-10-15T06:18:06Z

> @chaudum
>
> Thanks for your help. I will try to upgrade loki this week, but I have to check the Helm chart first, etc...
>
> Yes, we are using structured metadata.

Ok, that can narrow down the possible

> Is there a way to check the chunk "manually" to see if the chunk is corrupted or wrongly encoded? As mentioned above, we were able to narrow it down to a timeframe a few seconds long. If there is a reasonable way to locate the files on S3 and check them manually, I would also be willing to debug them in this direction.

There is a tool chunks-inspect which you can find here https://github.com/grafana/loki/blob/main/cmd/chunks-inspect/

I think the trickier part will be to figure out what chunk is causing that. We may need some extra logging there to get the filename of the chunk that fails to process.

Without warranty, I already have a suspicion that #13720 could have introduced this bug, because that's more or less the only thing that changed "recently".
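
As a rough idea of what the extra logging mentioned above could look like: wrap the decode call so that a panic is recovered and reported together with the chunk key. This is a sketch only. `safeDecode` and `decodeFunc` are hypothetical names, not existing Loki functions, and the key in the example is a placeholder.

```go
package main

import "fmt"

// decodeFunc stands in for whatever decodes the raw chunk bytes
// (hypothetical signature for this sketch).
type decodeFunc func(buf []byte) error

// safeDecode runs decode and converts a panic into an error that
// names the chunk key, so the failing object can be located in storage.
func safeDecode(key string, buf []byte, decode decodeFunc) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("decoding chunk %q (%d bytes) panicked: %v", key, len(buf), r)
		}
	}()
	return decode(buf)
}

func main() {
	// A stand-in decoder that slices past the buffer's capacity,
	// which triggers a "slice bounds out of range" runtime panic.
	bad := func(buf []byte) error {
		n := len(buf) + 10
		_ = buf[:n]
		return nil
	}
	err := safeDecode("placeholder/chunk-key", make([]byte, 4), bad)
	fmt.Println(err)
}
```

With something like this in place, the querier would log the key of the offending chunk instead of crashing, and that object could then be fetched and passed to chunks-inspect.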

ohtyap · 2024-10-15T07:51:24Z

@chaudum

I will try Loki 3.2 as soon as it is available via the Helm chart 👍

Meanwhile, I will try my best to narrow it down via chunks-inspect. Thanks for your help; it's appreciated a lot.

ohtyap · 2024-10-16T09:48:32Z

@chaudum
We rolled out Loki 3.2.0 today. We haven't been able to reproduce the error in this version yet. We can query the timeframes that caused the errors in 3.1.x, and the results seem legit. As the referenced PR #13720 was merged in 3.2.0, I suspect this fixed our issue.

Thanks for your help!

chaudum · 2024-10-17T20:22:52Z

Thanks @ohtyap for testing. Feel free to close the issue once you feel confident that the bug is fixed in 3.2.

ohtyap · 2024-10-22T04:26:39Z

It has now been running for one week without issues. Timeframes that had problems before can be queried again (without crashes). So I consider this one solved.

Thanks again everyone for your help!

ohtyap closed this as completed Oct 22, 2024
