Compactor retention removing all streams when retention_stream is set #4409

Closed
pauljohnston2009 opened this issue Oct 5, 2021 · 10 comments
Labels
feature/retention, type/bug (Something is not working as expected)

Comments

@pauljohnston2009

Describe the bug

When retention_stream is set in the limits config, all streams are truncated, even when they do not match the selector.

A similar issue was reported in #3881.

If I leave retention_stream in the config, logs are only stored for 24h. Without retention_stream, logs last for several days: left over the weekend, Loki had logs for all 4 days from Thursday to Tuesday.

For testing I run the Loki query {app=~".+"} over progressively older time ranges until I find the earliest log (the last 24 hours always returns results for several apps).

So far I have tested:

  • Config with retention_stream for several days
    • No pod restarts
    • Resulted in only 24 hours of logs for all apps (even apps not matching the selector)
    retention_stream:
          - period: 24h
            priority: 1
            selector: '{app="my-app", level="DEBUG"}'
    
  • Config without retention_stream key for 4 days
    • No pod restarts
    • Resulted in 4 days of logs for all apps (the full test period)
    • All logs for all apps were kept
  • Added retention_stream back
    • A few pods restarted due to the config change (ingester, compactor, distributor, index-gateway, querier, query-frontend)
    • Resulted in only 24 hours of logs
    • The selector does not seem to matter: app="app-that-does-not-exist"
    NAME                              READY   STATUS    RESTARTS   AGE
    compactor-0                       1/1     Running   0          26m
    consul-65f46787df-mk946           4/4     Running   0          7d
    distributor-785854f874-9hlkk      1/1     Running   0          26m
    gateway-7dc77fcc46-vzcrh          1/1     Running   0          6d22h
    index-gateway-0                   1/1     Running   0          26m
    ingester-0                        1/1     Running   0          26m
    memcached-0                       2/2     Running   0          7d
    memcached-frontend-0              2/2     Running   0          7d
    memcached-index-queries-0         2/2     Running   0          7d
    promtail-58ww2                    0/1     Running   0          3d18h
    promtail-dk7rr                    1/1     Running   0          6d22h
    promtail-qr4rd                    1/1     Running   0          6d22h
    promtail-tf7qk                    1/1     Running   0          6d22h
    promtail-zmnbx                    1/1     Running   0          6d22h
    querier-656bdc8c46-spbc4          1/1     Running   0          26m
    query-frontend-5f79d87945-5qs5b   1/1     Running   0          26m
    
        retention_stream:
          - period: 24h
            priority: 1
            selector: '{app="app-that-does-not-exist"}'
    

Compactor logs after the restart seemed to indicate that it was only using the smallest period:

level=info ts=2021-10-05T00:01:32.748229077Z caller=expiration.go:62 msg="smallest retention period 24h0m0s"

To Reproduce

Steps to reproduce the behavior:

Note: I originally started with grafana/loki:2.3.0 and could not get retention through the compactor to work at all (running the same config), so I switched to grafana/loki:main-fbfc8ab.

Install Loki with Tanka as per https://grafana.com/docs/loki/latest/installation/tanka/

Customizations:

  • Lowered all replicas to 1
  • Reduced memory/CPU limits for testing (but not seeing any crashes)
  • Removed persistent storage from all pods (since we don't care about losing in-memory logs on restarts)

Expected behavior

I expected retention to work on a per-stream basis and fall back to retention_period when the selector does not match the stream, as described in the docs: https://grafana.com/docs/loki/latest/operations/storage/retention/
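
For reference, a minimal sketch of that documented intent (the selector below is only illustrative): streams matching it should be kept for 24h, while every other stream should fall back to the global retention_period of 336h.

limits_config:
    retention_period: 336h
    retention_stream:
      - period: 24h
        priority: 1
        selector: '{app="my-app", level="DEBUG"}'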

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: tanka

Screenshots, Promtail config, or terminal output

Loki config

chunk_store_config:
    chunk_cache_config:
        memcached:
            batch_size: 100
            expiration: 2h
            parallelism: 100
        memcached_client:
            consistent_hash: true
            host: memcached.loki.svc.cluster.local
            service: memcached-client
compactor:
    compaction_interval: 10m
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    retention_enabled: true
    shared_store: s3
    working_directory: /data/retention
distributor:
    ring:
        kvstore:
            consul:
                consistent_reads: false
                host: consul.loki.svc.cluster.local:8500
                http_client_timeout: 20s
                watch_burst_size: 1
                watch_rate_limit: 1
            store: consul
frontend:
    compress_responses: true
    log_queries_longer_than: 5s
    max_outstanding_per_tenant: 256
frontend_worker:
    frontend_address: query-frontend.loki.svc.cluster.local:9095
    grpc_client_config:
        max_send_msg_size: 1.048576e+08
    match_max_concurrent: true
ingester:
    autoforget_unhealthy: true
    chunk_block_size: 262144
    chunk_idle_period: 15m
    lifecycler:
        heartbeat_period: 5s
        interface_names:
          - eth0
        join_after: 30s
        num_tokens: 512
        ring:
            heartbeat_timeout: 1m
            kvstore:
                consul:
                    consistent_reads: true
                    host: consul.loki.svc.cluster.local:8500
                    http_client_timeout: 20s
                store: consul
            replication_factor: 1
    max_transfer_retries: 60
ingester_client:
    grpc_client_config:
        max_recv_msg_size: 6.7108864e+07
    remote_timeout: 1s
limits_config:
    enforce_metric_name: false
    ingestion_burst_size_mb: 20
    ingestion_rate_mb: 10
    ingestion_rate_strategy: global
    max_cache_freshness_per_query: 10m
    max_global_streams_per_user: 100000
    max_query_length: 12000h
    max_query_parallelism: 16
    max_streams_per_user: 0
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    retention_period: 336h
    retention_stream:
      - period: 24h
        priority: 1
        selector: '{app="app-that-does-not-exist"}'
querier:
    max_concurrent: 4
    query_ingesters_within: 2h
query_range:
    align_queries_with_step: true
    cache_results: true
    max_retries: 5
    results_cache:
        cache:
            memcached:
                expiration: 2h
            memcached_client:
                consistent_hash: true
                host: memcached-frontend.loki.svc.cluster.local
                max_idle_conns: 16
                service: memcached-client
                timeout: 500ms
                update_interval: 1m
    split_queries_by_interval: 30m
ruler: {}
schema_config:
    configs:
      - from: "2021-09-22"
        index:
            period: 24h
            prefix: loki_index_
        object_store: s3
        schema: v11
        store: boltdb-shipper
server:
    graceful_shutdown_timeout: 5s
    grpc_server_max_concurrent_streams: 1000
    grpc_server_max_recv_msg_size: 1.048576e+08
    grpc_server_max_send_msg_size: 1.048576e+08
    grpc_server_min_time_between_pings: 10s
    grpc_server_ping_without_stream_allowed: true
    http_listen_port: 3100
    http_server_idle_timeout: 120s
    http_server_write_timeout: 1m
storage_config:
    aws:
        access_key_id: key
        bucketnames: loki
        endpoint: http://s3.loki.svc.cluster.local:9090
        http_config:
            idle_conn_timeout: 90s
            insecure_skip_verify: false
            response_header_timeout: 0s
        insecure: false
        region: ""
        s3forcepathstyle: true
        secret_access_key: secret-access-key
        sse_encryption: false
    boltdb_shipper:
        active_index_directory: /data/index
        cache_location: /data/boltdb-cache
        index_gateway_client:
            server_address: dns:///index-gateway.loki.svc.cluster.local:9095
        shared_store: s3
    index_queries_cache_config:
        memcached:
            batch_size: 100
            expiration: 2h
            parallelism: 100
        memcached_client:
            consistent_hash: true
            host: memcached-index-queries.loki.svc.cluster.local
            service: memcached-client
table_manager:
    creation_grace_period: 3h
    poll_interval: 10m
    retention_deletes_enabled: false
    retention_period: 0

Full compactor logs

2021-10-05 00:01:32.437465 I | proto: duplicate proto type registered: purgeplan.DeletePlan
2021-10-05 00:01:32.437512 I | proto: duplicate proto type registered: purgeplan.ChunksGroup
2021-10-05 00:01:32.437516 I | proto: duplicate proto type registered: purgeplan.ChunkDetails
2021-10-05 00:01:32.437519 I | proto: duplicate proto type registered: purgeplan.Interval
2021-10-05 00:01:32.535944 I | proto: duplicate proto type registered: grpc.PutChunksRequest
2021-10-05 00:01:32.535968 I | proto: duplicate proto type registered: grpc.GetChunksRequest
2021-10-05 00:01:32.535972 I | proto: duplicate proto type registered: grpc.GetChunksResponse
2021-10-05 00:01:32.535978 I | proto: duplicate proto type registered: grpc.Chunk
2021-10-05 00:01:32.535986 I | proto: duplicate proto type registered: grpc.ChunkID
2021-10-05 00:01:32.535989 I | proto: duplicate proto type registered: grpc.DeleteTableRequest
2021-10-05 00:01:32.535994 I | proto: duplicate proto type registered: grpc.DescribeTableRequest
2021-10-05 00:01:32.535998 I | proto: duplicate proto type registered: grpc.WriteBatch
2021-10-05 00:01:32.536002 I | proto: duplicate proto type registered: grpc.WriteIndexRequest
2021-10-05 00:01:32.536005 I | proto: duplicate proto type registered: grpc.DeleteIndexRequest
2021-10-05 00:01:32.536007 I | proto: duplicate proto type registered: grpc.QueryIndexResponse
2021-10-05 00:01:32.536009 I | proto: duplicate proto type registered: grpc.Row
2021-10-05 00:01:32.536012 I | proto: duplicate proto type registered: grpc.IndexEntry
2021-10-05 00:01:32.536014 I | proto: duplicate proto type registered: grpc.QueryIndexRequest
2021-10-05 00:01:32.536016 I | proto: duplicate proto type registered: grpc.UpdateTableRequest
2021-10-05 00:01:32.536020 I | proto: duplicate proto type registered: grpc.DescribeTableResponse
2021-10-05 00:01:32.536023 I | proto: duplicate proto type registered: grpc.CreateTableRequest
2021-10-05 00:01:32.536025 I | proto: duplicate proto type registered: grpc.TableDesc
2021-10-05 00:01:32.536029 I | proto: duplicate proto type registered: grpc.TableDesc.TagsEntry
2021-10-05 00:01:32.536032 I | proto: duplicate proto type registered: grpc.ListTablesResponse
2021-10-05 00:01:32.536035 I | proto: duplicate proto type registered: grpc.Labels
2021-10-05 00:01:32.536242 I | proto: duplicate proto type registered: storage.Entry
2021-10-05 00:01:32.536251 I | proto: duplicate proto type registered: storage.ReadBatch
level=info ts=2021-10-05T00:01:32.539774229Z caller=main.go:129 msg="Starting Loki" version="(version=main-fbfc8ab, branch=main, revision=fbfc8abda)"
level=info ts=2021-10-05T00:01:32.54001961Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2021-10-05T00:01:32.73336007Z caller=util.go:109 msg="downloaded file delete_requests.gz from table delete_requests"
level=info ts=2021-10-05T00:01:32.747570184Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2021-10-05T00:01:32.747938216Z caller=module_service.go:64 msg=initialising module=compactor
level=info ts=2021-10-05T00:01:32.748006186Z caller=marker.go:178 msg="mark processor started" workers=150 delay=2h0m0s
level=info ts=2021-10-05T00:01:32.748169787Z caller=loki.go:280 msg="Loki started"
level=info ts=2021-10-05T00:01:32.748228427Z caller=marker.go:203 msg="no marks file found"
level=info ts=2021-10-05T00:01:32.748229077Z caller=expiration.go:62 msg="smallest retention period 24h0m0s"
level=info ts=2021-10-05T00:01:37.897322673Z caller=compactor.go:261 msg="compacting table" table-name=loki_index_18900
level=info ts=2021-10-05T00:01:41.629143226Z caller=table.go:78 table-name=loki_index_18900 msg="listed files" count=1
level=info ts=2021-10-05T00:01:42.037022825Z caller=util.go:109 msg="downloaded file compactor-1633068732.gz from table loki_index_18900"
level=info ts=2021-10-05T00:01:42.038851189Z caller=marker.go:78 msg="mark file created" file=/data/compactor/retention/markers/1633392102037286105
level=info ts=2021-10-05T00:01:42.342294827Z caller=table.go:399 table-name=loki_index_18900 msg="removing source db files from storage" count=1
level=info ts=2021-10-05T00:01:42.347361519Z caller=compactor.go:266 msg="finished compacting table" table-name=loki_index_18900
level=info ts=2021-10-05T00:01:42.347421939Z caller=compactor.go:261 msg="compacting table" table-name=loki_index_18901
level=info ts=2021-10-05T00:01:47.499463878Z caller=table.go:78 table-name=loki_index_18901 msg="listed files" count=2
level=info ts=2021-10-05T00:01:47.499521658Z caller=table.go:161 table-name=loki_index_18901 msg="starting compaction of dbs"
level=info ts=2021-10-05T00:01:47.499532898Z caller=table.go:166 table-name=loki_index_18901 msg="using compactor-1633378929.gz as seed file"
level=info ts=2021-10-05T00:01:47.734456691Z caller=util.go:109 msg="downloaded file compactor-1633378929.gz from table loki_index_18901"
level=info ts=2021-10-05T00:01:47.73811675Z caller=util.go:109 msg="downloaded file ingester-0-1633069945525416694-1633384800.gz from table loki_index_18901"
level=info ts=2021-10-05T00:01:47.833555319Z caller=table.go:261 table-name=loki_index_18901 msg="finished compacting the dbs"
level=info ts=2021-10-05T00:01:47.834461131Z caller=marker.go:78 msg="mark file created" file=/data/compactor/retention/markers/1633392107833631489
level=info ts=2021-10-05T00:01:48.056888914Z caller=table.go:399 table-name=loki_index_18901 msg="removing source db files from storage" count=2
level=info ts=2021-10-05T00:01:48.133271557Z caller=compactor.go:266 msg="finished compacting table" table-name=loki_index_18901
level=info ts=2021-10-05T00:01:48.133332677Z caller=compactor.go:261 msg="compacting table" table-name=loki_index_18902
level=info ts=2021-10-05T00:01:52.595105605Z caller=table.go:78 table-name=loki_index_18902 msg="listed files" count=3
level=info ts=2021-10-05T00:01:52.595169335Z caller=table.go:161 table-name=loki_index_18902 msg="starting compaction of dbs"
level=info ts=2021-10-05T00:01:52.595198725Z caller=table.go:166 table-name=loki_index_18902 msg="using compactor-1633377133.gz as seed file"
level=info ts=2021-10-05T00:01:52.83529031Z caller=util.go:109 msg="downloaded file compactor-1633377133.gz from table loki_index_18902"
level=info ts=2021-10-05T00:01:52.838532047Z caller=util.go:109 msg="downloaded file ingester-0-1633069945525416694-1633383900.gz from table loki_index_18902"
level=info ts=2021-10-05T00:01:52.934913868Z caller=util.go:109 msg="downloaded file ingester-0-1633069945525416694-1633391100.gz from table loki_index_18902"
level=info ts=2021-10-05T00:01:52.945884895Z caller=table.go:261 table-name=loki_index_18902 msg="finished compacting the dbs"
level=info ts=2021-10-05T00:01:52.946909897Z caller=marker.go:78 msg="mark file created" file=/data/compactor/retention/markers/1633392112945968835
level=info ts=2021-10-05T00:01:53.2572477Z caller=table.go:399 table-name=loki_index_18902 msg="removing source db files from storage" count=3
level=info ts=2021-10-05T00:01:53.335705278Z caller=compactor.go:266 msg="finished compacting table" table-name=loki_index_18902
level=info ts=2021-10-05T00:01:53.335779358Z caller=compactor.go:261 msg="compacting table" table-name=loki_index_18903
level=info ts=2021-10-05T00:01:58.092962026Z caller=table.go:78 table-name=loki_index_18903 msg="listed files" count=1
level=info ts=2021-10-05T00:01:58.338791514Z caller=util.go:109 msg="downloaded file compactor-1633309333.gz from table loki_index_18903"
level=info ts=2021-10-05T00:01:58.339843737Z caller=marker.go:78 msg="mark file created" file=/data/compactor/retention/markers/1633392118339077545
level=info ts=2021-10-05T00:01:58.656234014Z caller=table.go:399 table-name=loki_index_18903 msg="removing source db files from storage" count=1
level=info ts=2021-10-05T00:01:58.662159298Z caller=compactor.go:266 msg="finished compacting table" table-name=loki_index_18903
level=info ts=2021-10-05T00:01:58.662228428Z caller=compactor.go:261 msg="compacting table" table-name=loki_index_18904
level=info ts=2021-10-05T00:02:03.132161128Z caller=table.go:78 table-name=loki_index_18904 msg="listed files" count=4
level=info ts=2021-10-05T00:02:03.132217188Z caller=table.go:161 table-name=loki_index_18904 msg="starting compaction of dbs"
level=info ts=2021-10-05T00:02:03.132235018Z caller=table.go:166 table-name=loki_index_18904 msg="using compactor-1633390338.gz as seed file"
level=info ts=2021-10-05T00:02:03.340305711Z caller=util.go:109 msg="downloaded file compactor-1633390338.gz from table loki_index_18904"
level=info ts=2021-10-05T00:02:03.535289411Z caller=util.go:109 msg="downloaded file ingester-0-1633069945525416694-1633390200.gz from table loki_index_18904"
level=info ts=2021-10-05T00:02:03.535349551Z caller=util.go:109 msg="downloaded file ingester-0-1633069945525416694-1633391100.gz from table loki_index_18904"
level=info ts=2021-10-05T00:02:03.538508809Z caller=util.go:109 msg="downloaded file ingester-0-1633069945525416694-1633392000.gz from table loki_index_18904"
level=info ts=2021-10-05T00:02:03.751049063Z caller=table.go:261 table-name=loki_index_18904 msg="finished compacting the dbs"
level=info ts=2021-10-05T00:02:03.833763981Z caller=marker.go:78 msg="mark file created" file=/data/compactor/retention/markers/1633392123751146663
level=info ts=2021-10-05T00:02:03.955734658Z caller=util.go:129 msg="compressing the file" src=/data/compactor/loki_index_18904/1633392123 dest=/data/compactor/loki_index_18904/1633392123.gz
level=info ts=2021-10-05T00:02:04.934291511Z caller=table.go:392 table-name=loki_index_18904 msg="uploading the compacted file" fileName=compactor-1633392124.gz
level=info ts=2021-10-05T00:02:05.218679535Z caller=table.go:399 table-name=loki_index_18904 msg="removing source db files from storage" count=4
level=info ts=2021-10-05T00:02:05.228443759Z caller=compactor.go:266 msg="finished compacting table" table-name=loki_index_18904
level=info ts=2021-10-05T00:02:32.749091126Z caller=marker.go:203 msg="no marks file found"
level=info ts=2021-10-05T00:03:32.748597846Z caller=marker.go:203 msg="no marks file found"
level=info ts=2021-10-05T00:04:32.748405705Z caller=marker.go:203 msg="no marks file found"
@jeremy-smith-maco

+1

@tezhm

tezhm commented Oct 22, 2021

+1

@userar

userar commented Oct 22, 2021

+1

@kavirajk
Contributor

/cc @cyriltovena

@weiyu2015

I hit a similar issue, but when I check the chunk folder the data is still there; queries just return no output, even for the last 5 minutes of data. However, it works again after restarting.

@stale

stale bot commented Jan 9, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.

The stale bot added the stale label on Jan 9, 2022.
@LinTechSo
Contributor

LinTechSo commented Jan 30, 2022

I hit a similar issue with the same configuration; any updates?
I expected retention to work on a per-stream basis and fall back to retention_period when the selector does not match the stream, as described in the docs: https://grafana.com/docs/loki/latest/operations/storage/retention/

Note: I originally started with grafana/loki:2.4.0 and could not get retention through the compactor to work at all.

The stale bot removed the stale label on Jan 30, 2022.
@weiyu2015

Update on my fix:
In v2.4 some default values changed, notably max_global_streams_per_user in my case.
Changing it back to no limit (0) in the config fixed my issue.

config                      | new default | old default
max_global_streams_per_user | 5000        | 0 (no limit)
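
A minimal sketch of that override in limits_config (assuming the rest of the config stays unchanged):

limits_config:
    max_global_streams_per_user: 0   # 0 restores the old "no limit" default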

@LinTechSo
Contributor

LinTechSo commented Feb 2, 2022

Hi, thanks for your feedback. My question: if I change the retention time in the Loki YAML config to a non-default value in a Kubernetes cluster (Helm chart + Loki 2.4), does Loki apply it immediately? Is there a reload setting for retention in the Loki config file?

@sandeepsukhani
Contributor

PR #4573 has fixed the issue of default stream retention not being applied properly.
You can verify this using the latest master build or release v2.4.2.
Please feel free to reopen this issue if it persists.

chaudum added the type/bug label on Jun 14, 2023.