Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series API endpoint unreliable for sparse logs, breaks label_values in Grafana #8034

Open
DieterDP-ng opened this issue Jan 3, 2023 · 7 comments
Labels
type/bug Somehing is not working as expected

Comments

@DieterDP-ng
Copy link

I have a Grafana dashboard that displays logs for several pods in a kubernetes environment. We ingest logs from both console and log files, and apply a source label to differentiate these. In the dashboard we allow selecting the Kubernetes namespace, Kubernetes pod and source using dropdowns (Grafana variables) so the user can select which logs he wants to see.

One of these pods only outputs a single log line every +- 15 minutes. For this pod, I noticed that the source dropdown will not get filled in, i.e. the Grafana label_values call fails for Loki data . This causes it to default to None, and applied to the LogQL query, results in no logs being found. In the end, it seems that there are no logs, while in fact there are.

For reference, the source variable is a Grafana configured as: label_values({namespace="$namespace", pod="$pod"}, source)

I tracked this problem down to the Loki series API, as this is the one used by label_values.

When using the /api/v1/series API, Loki returns no data.

curl -g 'http://localhost:42135/loki/api/v1/series?end=1672767401037029205&start=1672763401037029205' --data-urlencode 'match[]={namespace="devic1-shared",pod="yarn-resourcemanager-0-0"}'
{"status":"success","data":[]}

However, when using the /api/v1/query_range over the same time range, data is returned:

curl -G -s 'http://localhost:42135/loki/api/v1/query_range?' --data-urlencode 'end=1672767401037029205' --data-urlencode 'start=1672763401037029205' --data-urlencode 'query={namespace="devic1-shared",pod="yarn-resourcemanager-0-0"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "streams",
    "result": [
      {
        "stream": {
          "component": "resourcemanager",
          "filename": "/var/lib/kubelet/pods/d59a01f9-8733-4560-b7cc-78cd0514c375/volumes/kubernetes.io~empty-dir/logs/gc.log",
          "job": "devic1-shared/yarn",
          "namespace": "devic1-shared",
          "node_name": "10.178.0.95",
          "pod": "yarn-resourcemanager-0-0",
          "source": "gc.log",
          "app": "yarn"
        },
        "values": [
          [
            "1672766971030608973",
            "[2023-01-03T17:29:30.927+0000][94592.430s][1] GC(152) Pause Young (Normal) (G1 Evacuation Pause) 125M->39M(145M) 1.925ms"
          ],
          [
            "1672766250984737472",
            "[2023-01-03T17:17:30.950+0000][93872.452s][1] GC(151) Pause Young (Normal) (G1 Evacuation Pause) 125M->39M(145M) 26.382ms"
          ],
          [
            "1672765471022969603",
            "[2023-01-03T17:04:30.927+0000][93092.430s][1] GC(150) Pause Young (Normal) (G1 Evacuation Pause) 125M->39M(145M) 1.888ms"
          ],
          [
            "1672764691019264949",
            "[2023-01-03T16:51:30.954+0000][92312.457s][1] GC(149) Pause Young (Normal) (G1 Evacuation Pause) 125M->39M(145M) 28.135ms"
          ],
          [
            "1672763970962902691",
            "[2023-01-03T16:39:30.944+0000][91592.446s][1] GC(148) Pause Young (Normal) (G1 Evacuation Pause) 125M->39M(145M) 19.950ms"
          ]
        ]
      }
    ],
    "stats": {
      "summary": {
        "bytesProcessedPerSecond": 231286,
        "linesProcessedPerSecond": 1644,
        "totalBytesProcessed": 703,
        "totalLinesProcessed": 5,
        "execTime": 0.00303952,
        "queueTime": 6.1705e-05,
        "subqueries": 1,
        "totalEntriesReturned": 5
      },
      "querier": {
        "store": {
          "totalChunksRef": 0,
          "totalChunksDownloaded": 0,
          "chunksDownloadTime": 0,
          "chunk": {
            "headChunkBytes": 0,
            "headChunkLines": 0,
            "decompressedBytes": 0,
            "decompressedLines": 0,
            "compressedBytes": 0,
            "totalDuplicates": 0
          }
        }
      },
      "ingester": {
        "totalReached": 1,
        "totalChunksMatched": 0,
        "totalBatches": 1,
        "totalLinesSent": 5,
        "store": {
          "totalChunksRef": 5,
          "totalChunksDownloaded": 5,
          "chunksDownloadTime": 521213,
          "chunk": {
            "headChunkBytes": 0,
            "headChunkLines": 0,
            "decompressedBytes": 703,
            "decompressedLines": 5,
            "compressedBytes": 778,
            "totalDuplicates": 0
          }
        }
      }
    }
  }
}

The strange thing is, the behavior of the first query depends on when exactly you execute it:

  • < 5 mins after a log entry was ingested: the query will return the series
  • > 5 mins, < X: the query will return empty (as above)
  • > X (at some point in the future, I am unsure what triggers this): the query will return the series

I am using Loki v2.6.1.

@KovalDS
Copy link

KovalDS commented Feb 16, 2023

Have the same issue using loki-stack helm chart on Loki 2.6.1 as well as 2.7.13 in single binary mode with filesystem storage. Series api doesn't return all label values that match query, even though those definitely exist - Logs query with those label values return some logs.

Personal observation - this happens for pods that doesn't write a lot of logs. For me period after which query returns series again from last log ingestion is about 2 and a half hours

@DieterDP-ng Did you come up with some solution or workaround for this?

@DieterDP-ng
Copy link
Author

@KovalDS We already spent some time on it, but didn't find the specific cause or a workaround yet unfortunately.

@KovalDS
Copy link

KovalDS commented Feb 16, 2023

@DieterDP-ng Well, that's unfortunate. Going to spend some time investigating as well. Thanks for your reply!

@jiayi-1994
Copy link

Create a cronjob and create a pod every minute to reproduce the "series" interface. A large number of Pods cannot be queried, but log "queryRange" can search for the corresponding pod logs @DieterDP-ng

@chaudum
Copy link
Contributor

chaudum commented Jun 19, 2023

@DieterDP-ng Thanks for the detailed report. Do you have a chance to reproduce it with the latest Loki version 2.8.2? Nevertheless, we will have a look at the issue.

@DieterDP-ng
Copy link
Author

Tested on Loki 2.8.2. The pod from my opening post was no longer doing sparse output, but I reproduced on another pod. This pod executed a job that only logged a few lines, and nothing else once it was completed.

It seems there is still a difference between both API endpoints.

curl -g 'http://localhost:37883/loki/api/v1/series?end=1687181123&start=1687170323' --data-urlencode 'match[]={namespace="minikube-shared",pod="oozie-upload-sharelib-zx2kv"}' | jq
{
  "status": "success",
  "data": []
}


curl -G -s 'http://localhost:37883/loki/api/v1/query_range?' --data-urlencode 'end=1687181123037029205' --data-urlencode 'start=1687170323037029205' --data-urlencode 'query={namespace="minikube-shared",pod="oozie-upload-sharelib-zx2kv"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "streams",
    "result": [
      {
        "stream": {
          "app": "oozie",
          "container": "pre-install-job",
          "filename": "/var/log/pods/minikube-shared_oozie-upload-sharelib-zx2kv_32529b27-cedb-4eda-9c9e-504c61230ffb/pre-install-job/0.log",
          "instance": "oozie",
          "job": "minikube-shared/oozie",
          "namespace": "minikube-shared",
          "node_name": "minikube",
          "pod": "oozie-upload-sharelib-zx2kv",
          "source": "console",
          "stream": "stdout"
        },
        "values": [
          [
            "1687177122247220308",
            "12:18:42,247  INFO Services:520 - Shutdown\n"
          ],
          [
            "1687177120156263988",
            "Copy tasks are done\n"
          ],
          [
            "1687177104615391461",
            "Running 453 copy tasks on 8 threads\n"
          ],
          ... (abbreviated the output)
          [
            "1687177103876489632",
            "12:18:23,875  INFO XLogService:520 - Log4j configuration file [oozie-log4j.properties]\n"
          ],
          [
            "1687177103875454822",
            "*******************************************************************************\n"
          ],
          [
            "1687177103875419866",
            "12:18:23,874  INFO XLogService:520 - \n ******************************************************************************* \n  STARTUP MSG: Oozie BUILD_VERSION [5.2.1-8-lily] compiled by [benny] on [${build.time}]\n  STARTUP MSG:       revision [${vc.revision}]@[${vc.url}]\n"
          ]
        ]
      },
      {
        "stream": {
          "app": "oozie",
          "container": "pre-install-job",
          "filename": "/var/log/pods/minikube-shared_oozie-upload-sharelib-zx2kv_32529b27-cedb-4eda-9c9e-504c61230ffb/pre-install-job/0.log",
          "instance": "oozie",
          "job": "minikube-shared/oozie",
          "namespace": "minikube-shared",
          "node_name": "minikube",
          "pod": "oozie-upload-sharelib-zx2kv",
          "source": "console",
          "stream": "stderr"
        },
        "values": [
          [
            "1687177101618548331",
            "SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]\n"
          ],
          [
            "1687177101535446586",
            "SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.\n"
          ],
          [
            "1687177101535444826",
            "SLF4J: Found binding in [jar:file:/opt/oozie/embedded-oozie-server/webapp/WEB-INF/lib/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]\n"
          ],
          [
            "1687177101535441180",
            "SLF4J: Found binding in [jar:file:/opt/oozie/embedded-oozie-server/webapp/WEB-INF/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]\n"
          ],
          [
            "1687177101535408733",
            "SLF4J: Class path contains multiple SLF4J bindings.\n"
          ]
        ]
      }
    ],
    "stats": {
      "summary": {
        "bytesProcessedPerSecond": 226788,
        "linesProcessedPerSecond": 1881,
        "totalBytesProcessed": 5184,
        "totalLinesProcessed": 43,
        "execTime": 0.022858,
        "queueTime": 0.051464,
        "subqueries": 0,
        "totalEntriesReturned": 43,
        "splits": 13,
        "shards": 208
      },
      "querier": {
        "store": {
          "totalChunksRef": 0,
          "totalChunksDownloaded": 0,
          "chunksDownloadTime": 0,
          "chunk": {
            "headChunkBytes": 0,
            "headChunkLines": 0,
            "decompressedBytes": 0,
            "decompressedLines": 0,
            "compressedBytes": 0,
            "totalDuplicates": 0
          }
        }
      },
      "ingester": {
        "totalReached": 160,
        "totalChunksMatched": 0,
        "totalBatches": 2,
        "totalLinesSent": 43,
        "store": {
          "totalChunksRef": 2,
          "totalChunksDownloaded": 2,
          "chunksDownloadTime": 140806,
          "chunk": {
            "headChunkBytes": 0,
            "headChunkLines": 0,
            "decompressedBytes": 5184,
            "decompressedLines": 43,
            "compressedBytes": 1980,
            "totalDuplicates": 0
          }
        }
      },
      "cache": {
        "chunk": {
          "entriesFound": 0,
          "entriesRequested": 0,
          "entriesStored": 0,
          "bytesReceived": 0,
          "bytesSent": 0,
          "requests": 0,
          "downloadTime": 0
        },
        "index": {
          "entriesFound": 0,
          "entriesRequested": 0,
          "entriesStored": 0,
          "bytesReceived": 0,
          "bytesSent": 0,
          "requests": 0,
          "downloadTime": 0
        },
        "result": {
          "entriesFound": 0,
          "entriesRequested": 0,
          "entriesStored": 0,
          "bytesReceived": 0,
          "bytesSent": 0,
          "requests": 0,
          "downloadTime": 0
        }
      }
    }
  }
}

@joao-aguilera-c
Copy link

Is there some news on this? I'm still experiencing this in loki:2.9.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Somehing is not working as expected
Projects
None yet
Development

No branches or pull requests

5 participants