Loki crashes when the storage is full #2314

Closed
Kristian-ZH opened this issue Jul 8, 2020 · 17 comments
Labels: stale (A stale issue or PR that will automatically be closed.)

Comments

@Kristian-ZH

Describe the bug
We have a PVC with 1GB storage mounted to Loki's data folder.

          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: loki
              mountPath: "/data"
  volumeClaimTemplates:
    - metadata:
        name: loki
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi

The retention period of Loki is 14 days:

    auth_enabled: false
    ingester:
      chunk_idle_period: 3m
      chunk_block_size: 262144
      chunk_retain_period: 3m
      max_transfer_retries: 3
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h
    schema_config:
      configs:
      - from: 2018-04-15
        store: boltdb
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h
    server:
      http_listen_port: 3100
    storage_config:
      boltdb:
        directory: /data/loki/index
      filesystem:
        directory: /data/loki/chunks
    chunk_store_config: 
      max_look_back_period: 360h
    table_manager:
      retention_deletes_enabled: true
      retention_period: 360h

We filled the storage with logs over the course of a week, and after that Loki started constantly emitting this error and could not accept more logs:

level=error ts=2020-07-08T07:37:33.472897526Z caller=flush.go:198 org_id=fake msg="failed to flush user" err="write /data/loki/chunks/ZmFrZS8zMDM4ZDQ2ZmU2NDRmN2FmOjE3MzJjNGU2YjAwOjE3MzJjNGU2YjAxOjk0Y2Q5NmZh: no space left on device"

We are also unable to run queries from Grafana, because it reports:

No labels found

To Reproduce
Steps to reproduce the behaviour:

  1. Loki:1.5.0
  2. Fluent-bit:1.4.6

Expected behaviour
I expect Loki to trigger deletion of the oldest chunks and indices (as the Elasticsearch Curator does) when its storage is full and it is unable to accept more logs. Otherwise, once maximum storage capacity is hit, Loki dies.

@cyriltovena (Contributor)

We don’t support that yet, but it’s in our plans.

@wardbekker (Member) commented Jul 14, 2020

Similar to #162 - time-based and volume-based retention

@stale (bot) commented Aug 14, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Aug 14, 2020
stale bot closed this as completed on Aug 21, 2020
@kaflake commented May 21, 2021

I have the same issue. Is there a workaround for this? I'm using the loki-stack Helm chart with the default settings.

@Kristian-ZH (Author)

No. You can deploy a sidecar container that monitors Loki's disk and cleans up some of the chunks once usage exceeds 90%.

@kaflake commented May 21, 2021

Hm, okay. Thanks for the quick answer. It doesn't seem to be the disk space in my case; the problem is the inodes.

@Kristian-ZH (Author)

You can get the inode usage with df -hi, and if it exceeds the desired percentage, trigger a manual cleanup procedure.
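
For example, a minimal check along those lines (the path and the 90% threshold here are just placeholders):

# Extract the inode usage of Loki's chunk directory as a bare number and compare it to a threshold
used=$(df -Pi /data/loki/chunks | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$used" -ge 90 ]; then
    echo "inode usage is ${used}% - trigger a cleanup"
fi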

@NawiLan commented Jan 18, 2022

@Kristian-ZH could you please share the cleanup script? Thanks in advance

@kaflake commented Jan 18, 2022

I use this. I created a sidecar container which calls this script with cron:

#!/bin/sh
# The chunk files are all of roughly similar size, so a cleanup simply deletes
# the oldest fraction of them (default: 1/10).
folder="."
if [ -n "$SPACEMONITORING_FOLDER" ]
	then
		folder="$SPACEMONITORING_FOLDER"
fi

maxusedPercent=90
if [ -n "$SPACEMONITORING_MAXUSEDPERCENTE" ]
	then
		maxusedPercent="$SPACEMONITORING_MAXUSEDPERCENTE"
fi

deletingIteration=10
if [ -n "$SPACEMONITORING_DELETINGITERATION" ]
	then
		deletingIteration="$SPACEMONITORING_DELETINGITERATION"
fi

echo "Monitoring folder $folder; cleanup triggers above $maxusedPercent% usage and deletes the oldest 1/${deletingIteration} of the files."

usedInodesPercent=$(df -i "$folder" | sed -n '2p' | awk '{print $5}' | cut -d% -f1)
usedSpacePercent=$(df "$folder" | sed -n '2p' | awk '{print $5}' | cut -d% -f1)
fileCount=$(ls "$folder" | wc -l)
deletingFileCount=$((fileCount / deletingIteration))

# debug
# usedInodesPercent=89
# usedSpacePercent=89
# echo $folder
# echo $usedInodesPercent
# echo $usedSpacePercent
# echo $fileCount
# echo $deletingFileCount

if [ "$usedInodesPercent" -ge "$maxusedPercent" -o "$usedSpacePercent" -ge "$maxusedPercent" ]
	then
		echo "More than ${maxusedPercent}% of space or inodes are used -> removing $deletingFileCount files"
		ls -1t "$folder" | tail -n "$deletingFileCount" | awk -v prefix="$folder/" '{print prefix $0}' | tr '\n' '\0' | xargs -0 rm
	else
		echo "No cleanup needed"
fi
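
If needed, the script can also be run once by hand; assuming it is saved as delete_files_if_low_memory.sh, something like:

SPACEMONITORING_FOLDER=/data/loki/chunks \
SPACEMONITORING_MAXUSEDPERCENTE=90 \
SPACEMONITORING_DELETINGITERATION=10 \
sh ./delete_files_if_low_memory.sh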

@Kristian-ZH (Author)

@Kristian-ZH could you please share the cleanup script? Thanks in advance

We have a Golang application which is containerised and deployed in the cluster.
If you are interested in it, you can follow the code here: https://github.com/gardener/logging/tree/master/pkg/loki/curator

@NawiLan commented Jan 31, 2022

(quoted: kaflake's cleanup script from the comment above)

Thanks a lot for your script. Can you please tell me how to set up the sidecar container? I tried with https://gist.github.com/AntonFriberg/692eb1a95d61aa001dbb4ab5ce00d291, but for some reason the task does not complete.

@kaflake commented Feb 3, 2022

delete_files_if_low_memory.sh is the script from above.

This is the Dockerfile for the image.

FROM alpine:3.13

# Use http, otherwise apk add does not work behind a proxy
RUN sed -i 's/https/http/g' /etc/apk/repositories
RUN apk add --no-cache tini

# Configure cron
COPY crontab /var/spool/cron/crontabs/root
COPY delete_files_if_low_memory.sh /delete_files_if_low_memory.sh
RUN chmod 755 /delete_files_if_low_memory.sh

ENV SPACEMONITORING_FOLDER="."
ENV SPACEMONITORING_MAXUSEDPERCENTE=90
ENV SPACEMONITORING_DELETINGITERATION=10

ENTRYPOINT ["/sbin/tini", "--"]
CMD ["crond", "-f",  "-l", "2"]

crontab

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name command to be executed
* * * * * /delete_files_if_low_memory.sh
# crontab requires an empty line at the end of the file

In the values.yaml for loki I have this part:

loki:
  extraContainers:
  - name: pvcleanup
    image: "[ownRegistry]/pvcleanup:latest"
    env:
      - name: SPACEMONITORING_FOLDER
        value: "/data/loki/chunks"
    volumeMounts:
      - name: storage
        mountPath: "/data"

@aseychell

(quoted: kaflake's cleanup script from the comment above)

Thanks for sharing this script! Just what we needed. The one small modification we had to make was using df -P to get the inode usage on a single line without wrapping; with that, the rest of the command works perfectly.
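
For reference, the adjusted extraction lines then look roughly like this (assuming both df calls get the flag):

usedInodesPercent=$(df -Pi "$folder" | sed -n '2p' | awk '{print $5}' | cut -d% -f1)
usedSpacePercent=$(df -P "$folder" | sed -n '2p' | awk '{print $5}' | cut -d% -f1)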

@noamApps

(quoted: kaflake's Dockerfile, crontab, and values.yaml from the comment above)

The script itself works great; however, this container configuration didn't work for me. The problem was that I have Loki's PSP enabled (enforcing non-root execution), which caused crond to fail (it must elevate privileges to run the job).
I ended up using a simple while loop that wraps the script execution with a one-minute sleep, and that did the trick.
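
A minimal sketch of that wrapper, used as the container command in place of crond (script path as in the Dockerfile above):

#!/bin/sh
# Run the cleanup script once a minute without crond (and therefore without root)
while true; do
    /delete_files_if_low_memory.sh
    sleep 60
done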

Thank you for sharing this solution!

@Abuelodelanada

Hi mates!

While working on the Loki Charmed Operator, we noticed this ugly bug. 😱

We thought about tackling this issue with some external solution; however, wouldn't it be better to implement functionality in Loki itself that allows setting a storage limit, like Prometheus' --storage.tsdb.retention.size, and thus prevent Loki from crashing?

The purpose of the following document is to draft a solution to this. If we come up with a workable solution, we could contribute it to the Loki project.

Jen Villa (Grafana Product Manager) told us:

If you are interested in developing the feature yourselves and contributing it back to the project, we'd be happy to review. If you want to move forward with this, I suggest starting with a description of what your proposed implementation would look like.

Comments are welcome!!

https://docs.google.com/document/d/15V42tcDlZR46hLq8o-2MsN1BRWhiRGwF0r8rgkV2Mwk/edit

@Kristian-ZH @noamApps @aseychell @kaflake @NawiLan @cyriltovena

@juliagomezi commented Dec 14, 2022

I had this same issue and applied the sidecar container workaround proposed here: #2314 (comment)
It worked, and the pod went from crashing to Running; however, after a week I got the following error and the pod started crashing again:

level=error ts=2022-12-12T17:14:17.695026462Z caller=table.go:491 msg="failed to open file /data/loki/boltdb-shipper-active/index_19326/1669895100. Please fix or remove this file." err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2022-12-12T17:14:17.695081821Z caller=table.go:491 msg="failed to open file /data/loki/boltdb-shipper-active/index_19326/1669896000. Please fix or remove this file." err="file size too small"

It looks like the index files somehow got corrupted. I'm not sure how to work around this without losing all of my logs. The only alternative I can think of is deleting the PV and attaching a new one to the pod, but this would mean losing all of my logs.

This is my Loki configuration:

config:
  auth_enabled: false
  chunk_store_config:
    chunk_cache_config:
      fifocache:
        max_size_bytes: 300MB
    max_look_back_period: 672h
  compactor:
    retention_enabled: true
    shared_store: filesystem
    working_directory: /data/loki/boltdb-shipper-compactor
  ingester:
    chunk_block_size: 262144
    chunk_idle_period: 3m
    chunk_retain_period: 1m
    lifecycler:
      ring:
        kvstore:
          store: inmemory
        replication_factor: 1
    max_transfer_retries: 0
    wal:
      dir: /data/loki/wal
  limits_config:
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    retention_period: 672h
  query_range:
    results_cache:
      cache:
        fifocache:
          max_size_bytes: 300MB
  schema_config:
    configs:
    - from: "2020-10-24"
      index:
        period: 24h
        prefix: index_
      object_store: filesystem
      schema: v11
      store: boltdb-shipper
  server:
    http_listen_port: 3100
  storage_config:
    boltdb_shipper:
      active_index_directory: /data/loki/boltdb-shipper-active
      cache_location: /data/loki/boltdb-shipper-cache
      cache_ttl: 24h
      shared_store: filesystem
    filesystem:
      directory: /data/loki/chunks
  table_manager:
    retention_deletes_enabled: false
    retention_period: 0s
image:
  pullPolicy: Always
  pullSecrets:
  - docregcred
  repository: localhost:30000/grafana/loki
  tag: 2.4.2
ingress:
  annotations: {}
  enabled: false
  hosts:
  - host: chart-example.local
    paths: []
  tls: []
initContainers:
- env:
  - name: SPACEMONITORING_FOLDER
    value: /data/loki/chunks
  - name: SPACEMONITORING_MAXUSEDPERCENT
    value: "75"
  - name: SPACEMONITORING_DELETINGPERCENT
    value: "25"
  image: localhost:30000/pvcleanup:1.0.0
  name: pvcleanup
  volumeMounts:
  - mountPath: /data
    name: storage-csf
persistence:
  accessModes:
  - ReadWriteOnce
  annotations: {}
  enabled: true
  size: 10Gi
  storageClassName: longhorn
podAnnotations:
  prometheus.io/port: http-metrics
  prometheus.io/scrape: "true"
  sidecar.istio.io/inject: "false"
replicas: 1
resources:
  limits:
    cpu: 200m
    memory: 500Mi
  requests:
    cpu: 50m
    memory: 200Mi

@VenkateswaranJ

(quoted: kaflake's cleanup script from the comment above)

After deleting chunks, how do I recreate the index? #4755
