
Loki silently rejects data when chunks dir is too large #362

Closed
sed-i opened this issue Mar 8, 2024 · 6 comments · Fixed by #393

Comments

@sed-i
Contributor

sed-i commented Mar 8, 2024

Bug Description

When the chunks dir contains too many files, Loki stops ingesting new data. (Setting tune2fs -O large_dir /dev/vda2 resolved it.)

  • We should fire an alert in that scenario. If Loki doesn't expose metrics for this, consider Pebble checks/notices or node exporter.
  • We need to think about whether there's anything generic we could modify in the chunks config.

To Reproduce

See canonical/cos-proxy-operator#130.

Environment

See canonical/cos-proxy-operator#130.

Relevant log output

See canonical/cos-proxy-operator#130.

Additional context

Observed by @dnegreira.

@dnegreira

@sed-i maybe relevant: grafana/loki#364

@lathiat

lathiat commented Mar 12, 2024

Some context from grafana/loki#1502

Firstly, we need to be careful not to enable large_dir as a workaround on a /boot device (including where /boot is stored on /).

As @wschoot points out, EXERCISE CAUTION before you run tune2fs -O large_dir /dev/...: if the device uses GRUB to boot, it will fail to boot on the next reboot. This is true for any system using a GRUB older than 2.12, which was only released on 20 Dec 2023 and will take a while to roll out via the distros.

Secondly, upstream moved away from this single large directory in v2.5; see grafana/loki#364.

However, to take advantage of that we need to use schema v12, and looking at the charm we are possibly still using v11:

    @property
    def _schema_config(self) -> dict:
        return {
            "configs": [
                {
                    "from": "2020-10-24",
                    "index": {"period": "24h", "prefix": "index_"},
                    "object_store": "filesystem",
                    "schema": "v11",
                    "store": "boltdb",
                }
            ]
        }
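
For illustration, here is a minimal sketch (not the charm's actual code) of what extending this schema config with a v12/tsdb period might look like; the function name, cutover date, and tsdb index prefix are assumptions:

def _schema_config_with_tsdb() -> dict:
    return {
        "configs": [
            {   # existing data stays on the old schema/store
                "from": "2020-10-24",
                "index": {"period": "24h", "prefix": "index_"},
                "object_store": "filesystem",
                "schema": "v11",
                "store": "boltdb",
            },
            {   # new data is written with v12/tsdb from a future date onwards
                "from": "2024-05-01",
                "index": {"period": "24h", "prefix": "tsdb_index_"},
                "object_store": "filesystem",
                "schema": "v12",
                "store": "tsdb",
            },
        ]
    }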

@mmkay
Contributor

mmkay commented Apr 22, 2024

We need to check whether we can still reproduce this once we have compaction. If so, we should likely do tune2fs ... in our charm.

This happens with boltdb-shipper, which is no longer the option recommended by upstream; tsdb is the new recommendation. See #380 for more context.

@IbraAoad
Contributor

To help narrow down our options on this issue, here are some potential approaches we could explore.

Approach 1: Tuning the ingester config

We might consider tuning the ingester config to optimize chunk size and idle period. Increasing the chunk size and extending the idle period can reduce the number of chunks and so mitigate the large number of files, but it could also mean increased memory usage, longer query latency, more not-yet-flushed data lost on a crash, and less efficient storage if chunks are significantly larger than the data they contain.
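
For concreteness, here is a rough sketch of the relevant ingester knobs, rendered in the same dict style the charm uses for its Loki config; the values are illustrative assumptions rather than recommendations:

def _ingester_config() -> dict:
    return {
        "ingester": {
            # Flush a chunk only after it has been idle this long
            # (fewer, larger chunks means fewer files on disk).
            "chunk_idle_period": "2h",
            # Target chunk size in bytes; Loki's default is 1572864.
            "chunk_target_size": 3145728,
            # Upper bound on how long a chunk may stay in memory, which also
            # bounds how much not-yet-flushed data could be lost.
            "max_chunk_age": "2h",
        }
    }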

Approach 2: Using tune2fs large_dir

Enable large_dir. This comes with its own implications: in a nutshell, if the device uses a GRUB older than 2.12 to boot, it will fail to boot on the next reboot.
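
If we went this route, a hedged sketch (placeholder device path, hypothetical helper name) of how the charm might first check whether large_dir is already enabled before touching anything:

import subprocess

def large_dir_enabled(device: str = "/dev/vda2") -> bool:
    # tune2fs -l prints a "Filesystem features:" line listing the enabled ext features.
    out = subprocess.run(
        ["tune2fs", "-l", device], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if line.startswith("Filesystem features:"):
            return "large_dir" in line.split()
    return False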

Approach 3: TSDB with v12 schema

  • In v12, chunks are no longer all stored in one flat dir as in older schemas; see upstream#5291.

    • We could benefit from this, but migration is a bit tricky: Loki requires setting a future date for when to start using the new schema. To see why, imagine we set that date to 25/04/2024 and a customer only upgrades the charm on 25/05/2024; Loki would then assume all data in that period was written with v12 and would use v12 in queries, resulting in data corruption for data written in that period.
  • Since TSDB won't replace boltdb-shipper, adding a new PVC for it might disrupt the upgrade path, as Juju doesn't currently support this (maybe let the data live inside /loki/boltdb-shipper-active?):

ERROR Juju on containers does not support updating storage on a statefulset.
The new charm's metadata contains updated storage declarations.
You'll need to deploy a new charm rather than upgrading if you need this change.

@sed-i
Contributor Author

sed-i commented Apr 24, 2024

TSDB with the v12 schema seems like the right path forward.
I wonder if we could decide what to render in the "from": config section on install/upgrade.
What if we dump a copy of the Loki config to persistent storage in the "remove" hook? Next time we go through startup/upgrade, we check whether the file is there and render the v12 "from" date to be, e.g., tomorrow.
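
A rough sketch of that idea, with hypothetical paths and helper names (none of this is existing charm code):

import json
from datetime import date, timedelta
from pathlib import Path

PERSISTED_CONFIG = Path("/loki/persisted-config.json")  # hypothetical location

def v12_from_date() -> str:
    # Fresh install: nothing persisted yet, so the v12 period can start right away.
    if not PERSISTED_CONFIG.exists():
        return date.today().isoformat()
    # An older deployment left a config behind: keep its v12 cutover if it already
    # has one, otherwise start v12 tomorrow so already-ingested data stays on v11.
    previous = json.loads(PERSISTED_CONFIG.read_text())
    for period in previous.get("schema_config", {}).get("configs", []):
        if period.get("schema") == "v12":
            return period["from"]
    return (date.today() + timedelta(days=1)).isoformat()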

@frittentheke

If you dive into the already referenced grafana/loki#1502, see in particular my comment about using a hashing approach to avoid having insane amounts of files in one directory: grafana/loki#1502 (comment)

Hashing file paths into nested subdirectories is what services such as email servers, which store large numbers of files, have used forever, and it should not be much of a problem to implement for Loki's regular file storage.
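
To illustrate the general technique (this is not Loki's actual layout), a minimal sketch of hash-based fan-out, where a couple of directory levels are derived from a hash of the chunk key so that no single directory grows unbounded:

import hashlib
from pathlib import Path

def fanned_out_path(root: Path, chunk_key: str, levels: int = 2) -> Path:
    digest = hashlib.sha256(chunk_key.encode()).hexdigest()
    # e.g. <root>/ab/cd/<chunk_key> for levels=2
    subdirs = [digest[2 * i : 2 * i + 2] for i in range(levels)]
    return root.joinpath(*subdirs, chunk_key)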
