Loki crashes when the storage is full #2314

Closed
Kristian-ZH opened this issue Jul 8, 2020 · 17 comments
Labels: stale (A stale issue or PR that will automatically be closed.)

Comments

@Kristian-ZH

Describe the bug
We have a PVC with 1GB storage mounted to Loki's data folder.

          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: loki
              mountPath: "/data"
  volumeClaimTemplates:
    - metadata:
        name: loki
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi

The retention period of Loki is 14 days:

    auth_enabled: false
    ingester:
      chunk_idle_period: 3m
      chunk_block_size: 262144
      chunk_retain_period: 3m
      max_transfer_retries: 3
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h
    schema_config:
      configs:
      - from: 2018-04-15
        store: boltdb
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h
    server:
      http_listen_port: 3100
    storage_config:
      boltdb:
        directory: /data/loki/index
      filesystem:
        directory: /data/loki/chunks
    chunk_store_config: 
      max_look_back_period: 360h
    table_manager:
      retention_deletes_enabled: true
      retention_period: 360h

We filled the storage with logs over the course of a week, and after that Loki started constantly emitting this error and could not accept more logs:

level=error ts=2020-07-08T07:37:33.472897526Z caller=flush.go:198 org_id=fake msg="failed to flush user" err="write /data/loki/chunks/ZmFrZS8zMDM4ZDQ2ZmU2NDRmN2FmOjE3MzJjNGU2YjAwOjE3MzJjNGU2YjAxOjk0Y2Q5NmZh: no space left on device"

We are also unable to run queries from Grafana, because it reports:

No labels found

To Reproduce
Steps to reproduce the behaviour:

  1. Loki:1.5.0
  2. Fluent-bit:1.4.6

Expected behaviour
I expect Loki to trigger deletion of the oldest chunks and indices (as the Elasticsearch Curator does) when its storage is full and it is unable to accept more logs. Otherwise, once maximum storage capacity is hit, Loki dies.

@cyriltovena (Contributor)

We don’t support that yet, but it’s in our plans.

@wardbekker (Member) commented Jul 14, 2020

Similar to #162 - time-based and volume-based retention

@stale (bot) commented Aug 14, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Aug 14, 2020
stale bot closed this as completed on Aug 21, 2020
@kaflake commented May 21, 2021

I have the same issue. Is there a workaround for this? I'm using the loki-stack Helm chart with the default settings.

@Kristian-ZH (Author)

No. You can deploy a sidecar container that monitors Loki's disk and cleans up some of the chunks once usage exceeds 90%.

@kaflake commented May 21, 2021

Hm, okay. Thanks for the quick answer. It doesn't seem to be the disk space in my case; the problem is the inodes.

@Kristian-ZH (Author)

You can get the inode usage with df -hi, and if it exceeds the desired percentage, trigger a manual cleanup procedure.
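
For example, a minimal check along those lines (the path and the 90% threshold here are just placeholders):

# Extract the inode usage of Loki's chunk directory as a bare number and compare it to a threshold
used=$(df -Pi /data/loki/chunks | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$used" -ge 90 ]; then
    echo "inode usage is ${used}% - trigger a cleanup"
fi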

@NawiLan commented Jan 18, 2022

@Kristian-ZH could you please share the cleanup script? Thanks in advance

@kaflake commented Jan 18, 2022

I use this. I created a sidecar container which calls this script with cron:

#!/bin/sh
# The chunk files are all of roughly similar size, so a cleanup simply deletes
# the oldest fraction of them (default: 1/10).
folder="."
if [ -n "$SPACEMONITORING_FOLDER" ]
	then
		folder="$SPACEMONITORING_FOLDER"
fi

maxusedPercent=90
if [ -n "$SPACEMONITORING_MAXUSEDPERCENTE" ]
	then
		maxusedPercent="$SPACEMONITORING_MAXUSEDPERCENTE"
fi

deletingIteration=10
if [ -n "$SPACEMONITORING_DELETINGITERATION" ]
	then
		deletingIteration="$SPACEMONITORING_DELETINGITERATION"
fi

echo "Monitoring folder $folder; cleanup triggers above $maxusedPercent% usage and deletes the oldest 1/${deletingIteration} of the files."

usedInodesPercent=$(df -i "$folder" | sed -n '2p' | awk '{print $5}' | cut -d% -f1)
usedSpacePercent=$(df "$folder" | sed -n '2p' | awk '{print $5}' | cut -d% -f1)
fileCount=$(ls "$folder" | wc -l)
deletingFileCount=$((fileCount / deletingIteration))

# debug
# usedInodesPercent=89
# usedSpacePercent=89
# echo $folder
# echo $usedInodesPercent
# echo $usedSpacePercent
# echo $fileCount
# echo $deletingFileCount

if [ "$usedInodesPercent" -ge "$maxusedPercent" -o "$usedSpacePercent" -ge "$maxusedPercent" ]
	then
		echo "More than ${maxusedPercent}% of space or inodes are used -> removing $deletingFileCount files"
		ls -1t "$folder" | tail -n "$deletingFileCount" | awk -v prefix="$folder/" '{print prefix $0}' | tr '\n' '\0' | xargs -0 rm
	else
		echo "No cleanup needed"
fi
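
If needed, the script can also be run once by hand; assuming it is saved as delete_files_if_low_memory.sh, something like:

SPACEMONITORING_FOLDER=/data/loki/chunks \
SPACEMONITORING_MAXUSEDPERCENTE=90 \
SPACEMONITORING_DELETINGITERATION=10 \
sh ./delete_files_if_low_memory.sh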

@Kristian-ZH (Author)

@Kristian-ZH could you please share the cleanup script? Thanks in advance

We have a Golang application which is containerised and deployed in the cluster.
If you are interested in it, you can follow the code here: https://github.com/gardener/logging/tree/master/pkg/loki/curator

@NawiLan commented Jan 31, 2022

(quoted: kaflake's cleanup script from the comment above)

Thanks a lot for your script. Can you please tell me how to set up the sidecar container? I tried with https://gist.github.com/AntonFriberg/692eb1a95d61aa001dbb4ab5ce00d291, but for some reason the task does not complete.

@kaflake commented Feb 3, 2022

delete_files_if_low_memory.sh is the script from above.

This is the Dockerfile for the image.

FROM alpine:3.13

# Use http, otherwise apk add does not work behind a proxy
RUN sed -i 's/https/http/g' /etc/apk/repositories
RUN apk add --no-cache tini

# Configure cron
COPY crontab /var/spool/cron/crontabs/root
COPY delete_files_if_low_memory.sh /delete_files_if_low_memory.sh
RUN chmod 755 /delete_files_if_low_memory.sh

ENV SPACEMONITORING_FOLDER="."
ENV SPACEMONITORING_MAXUSEDPERCENTE=90
ENV SPACEMONITORING_DELETINGITERATION=10

ENTRYPOINT ["/sbin/tini", "--"]
CMD ["crond", "-f",  "-l", "2"]

crontab

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name command to be executed
* * * * * /delete_files_if_low_memory.sh
# crontab requires an empty line at the end of the file

In the values.yaml for loki I have this part:

loki:
  extraContainers:
  - name: pvcleanup
    image: "[ownRegistry]/pvcleanup:latest"
    env:
      - name: SPACEMONITORING_FOLDER
        value: "/data/loki/chunks"
    volumeMounts:
      - name: storage
        mountPath: "/data"

@aseychell

(quoted: kaflake's cleanup script from the comment above)

Thanks for sharing this script! Just what we needed. The one small modification we had to make was using df -P to get the inode usage on a single line without wrapping; with that, the rest of the command works perfectly.
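
For reference, the adjusted extraction lines then look roughly like this (assuming both df calls get the flag):

usedInodesPercent=$(df -Pi "$folder" | sed -n '2p' | awk '{print $5}' | cut -d% -f1)
usedSpacePercent=$(df -P "$folder" | sed -n '2p' | awk '{print $5}' | cut -d% -f1)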

@noamApps

(quoted: kaflake's Dockerfile, crontab, and values.yaml from the comment above)

The script itself works great; however, this container configuration didn't work for me. The problem was that I have Loki's PSP enabled (enforcing non-root execution), which caused crond to fail (it must elevate privileges to run the job).
I ended up using a simple while loop that wraps the script execution with a one-minute sleep, and that did the trick.
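
A minimal sketch of that wrapper, used as the container command in place of crond (script path as in the Dockerfile above):

#!/bin/sh
# Run the cleanup script once a minute without crond (and therefore without root)
while true; do
    /delete_files_if_low_memory.sh
    sleep 60
done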

Thank you for sharing this solution!

@Abuelodelanada

Hi mates!

While working on the Loki Charmed Operator, we noticed this ugly bug. 😱

We thought about tackling this issue with some external solution; however, wouldn't it be better to implement functionality in Loki itself that allows setting a storage limit, like Prometheus' --storage.tsdb.retention.size, and thus prevent Loki from crashing?

The purpose of the following document is to draft a solution to this. If we come up with a workable solution, we could contribute it to the Loki project.

Jen Villa (Grafana Product Manager) told us:

If you are interested in developing the feature yourselves and contributing it back to the project, we'd be happy to review. If you want to move forward with this, I suggest starting with a description of what your proposed implementation would look like.

Comments are welcome!!

https://docs.google.com/document/d/15V42tcDlZR46hLq8o-2MsN1BRWhiRGwF0r8rgkV2Mwk/edit

@Kristian-ZH @noamApps @aseychell @kaflake @NawiLan @cyriltovena

@juliagomezi commented Dec 14, 2022

I had this same issue and applied the sidecar container workaround proposed here: #2314 (comment)
It worked, and the pod went from crashing to Running; however, after a week I got the following error and the pod started crashing again:

level=error ts=2022-12-12T17:14:17.695026462Z caller=table.go:491 msg="failed to open file /data/loki/boltdb-shipper-active/index_19326/1669895100. Please fix or remove this file." err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2022-12-12T17:14:17.695081821Z caller=table.go:491 msg="failed to open file /data/loki/boltdb-shipper-active/index_19326/1669896000. Please fix or remove this file." err="file size too small"

It looks like the index files somehow got corrupted. I'm not sure how to work around this without losing all of my logs. The only alternative I can think of is deleting the PV and attaching a new one to the pod, but this would mean losing all of my logs.

This is my Loki configuration:

config:
  auth_enabled: false
  chunk_store_config:
    chunk_cache_config:
      fifocache:
        max_size_bytes: 300MB
    max_look_back_period: 672h
  compactor:
    retention_enabled: true
    shared_store: filesystem
    working_directory: /data/loki/boltdb-shipper-compactor
  ingester:
    chunk_block_size: 262144
    chunk_idle_period: 3m
    chunk_retain_period: 1m
    lifecycler:
      ring:
        kvstore:
          store: inmemory
        replication_factor: 1
    max_transfer_retries: 0
    wal:
      dir: /data/loki/wal
  limits_config:
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    retention_period: 672h
  query_range:
    results_cache:
      cache:
        fifocache:
          max_size_bytes: 300MB
  schema_config:
    configs:
    - from: "2020-10-24"
      index:
        period: 24h
        prefix: index_
      object_store: filesystem
      schema: v11
      store: boltdb-shipper
  server:
    http_listen_port: 3100
  storage_config:
    boltdb_shipper:
      active_index_directory: /data/loki/boltdb-shipper-active
      cache_location: /data/loki/boltdb-shipper-cache
      cache_ttl: 24h
      shared_store: filesystem
    filesystem:
      directory: /data/loki/chunks
  table_manager:
    retention_deletes_enabled: false
    retention_period: 0s
image:
  pullPolicy: Always
  pullSecrets:
  - docregcred
  repository: localhost:30000/grafana/loki
  tag: 2.4.2
ingress:
  annotations: {}
  enabled: false
  hosts:
  - host: chart-example.local
    paths: []
  tls: []
initContainers:
- env:
  - name: SPACEMONITORING_FOLDER
    value: /data/loki/chunks
  - name: SPACEMONITORING_MAXUSEDPERCENT
    value: "75"
  - name: SPACEMONITORING_DELETINGPERCENT
    value: "25"
  image: localhost:30000/pvcleanup:1.0.0
  name: pvcleanup
  volumeMounts:
  - mountPath: /data
    name: storage-csf
persistence:
  accessModes:
  - ReadWriteOnce
  annotations: {}
  enabled: true
  size: 10Gi
  storageClassName: longhorn
podAnnotations:
  prometheus.io/port: http-metrics
  prometheus.io/scrape: "true"
  sidecar.istio.io/inject: "false"
replicas: 1
resources:
  limits:
    cpu: 200m
    memory: 500Mi
  requests:
    cpu: 50m
    memory: 200Mi

@VenkateswaranJ

(quoted: kaflake's cleanup script from the comment above)

After deleting chunks, how do I recreate the index? #4755
