Loki helm chart version upgrade from 5.44.4 to 6.5.0 issues in Single Binary deployment mode. #12912
Comments
Currently we are without monitoring due to the issue mentioned above; it would be really great if someone could assist with this.
Was able to fix all the issues. Thanks.
Care to share how you solved the issue?
@krptg0: The issue was related to the "parallelise_shardable_queries: true" setting. It used to live under "loki.query_range" in the chart version we used, 5.44.4, but after the upgrade to 6.5.0 it has to be moved to loki.structuredConfig.query_range, which also needs to be updated in the Grafana documentation page for now until there is a permanent fix. This seems to be a bug in the latest chart, and I saw another user had already raised a case for the same thing a few weeks back.
5.44.4 loki:
6.5.0 loki:
Thanks.
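(For reference, a minimal sketch of the placement described in the comment above; loki.structuredConfig is the chart's passthrough into the rendered Loki config, and the exact keys should be verified against the 6.x chart's values.yaml:)

    # 5.44.4 chart: the setting lived under loki.query_range
    loki:
      query_range:
        parallelise_shardable_queries: true

    # 6.x chart: per the comment above, move it under loki.structuredConfig
    loki:
      structuredConfig:
        query_range:
          parallelise_shardable_queries: true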
@krptg0 I don't see that solving the issue. Would you be able to share the config that worked for you? I am currently on 5.47.2.
The Helm file attached was suitable for upgrading, but a couple of pods still encountered errors.
In the gateway pod:
Fixed this by making a change to the Helm values: https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml#L337-L345
@sslny57 How does that fix the segmentation error?
Loki helm chart version upgrade from 5.44.4 to 6.5.0 issues in Single Binary deployment mode.
We are using Azure Kubernetes Service, consisting of 1 system node in a system node pool and 3 user nodes in a user node pool, for deploying Loki.
Kubernetes version: 1.29.2
Error
coalesce.go:286: warning: cannot overwrite table with non table for loki.singleBinary.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:single-binary]] topologyKey:kubernetes.io/hostname]]]])
May 7th 2024 10:59:51 Error
coalesce.go:286: warning: cannot overwrite table with non table for loki.singleBinary.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:single-binary]] topologyKey:kubernetes.io/hostname]]]])
May 7th 2024 10:59:51 Error
Error: UPGRADE FAILED: execution error at (loki/templates/validate.yaml:31:4): You have more than zero replicas configured for both the single binary and simple scalable targets. If this was intentional change the deploymentMode to the transitional 'SingleBinary<->SimpleScalable' mode
May 7th 2024 10:59:51 Error
Helm Upgrade returned non-zero exit code: 1. Deployment terminated.
May 7th 2024 10:59:51 Fatal
The remote script failed with exit code 1
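The UPGRADE FAILED message above is the 6.x chart's validation refusing a mix of deployment targets; a minimal sketch of values that keep only the single binary target active (using the same deploymentMode/read/write/backend keys that appear in the 6.5.0 values further below):

    deploymentMode: SingleBinary
    singleBinary:
      replicas: 1
    # zero out the simple scalable targets so the validation sees only one mode
    backend:
      replicas: 0
    read:
      replicas: 0
    write:
      replicas: 0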
ubuntu@NARU-Pr5530:~$ kubectl describe pod loki-chunks-cache-0 -n loki|tail -5
Type Reason Age From Message
Warning FailedScheduling 2m43s default-scheduler 0/4 nodes are available: 1 Insufficient memory, 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
Warning FailedScheduling 2m42s default-scheduler 0/4 nodes are available: 1 Insufficient memory, 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
Normal NotTriggerScaleUp 2m40s cluster-autoscaler pod didn't trigger scale-up: 1 max node group size reached
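The chunks-cache pod stays Pending because its memcached container requests more CPU/memory than the remaining nodes can provide. A minimal sketch of shrinking it, using the chunksCache.allocatedMemory key already present in the 6.5.0 values below (chunksCache.enabled is an assumed 6.x chart key for disabling the cache entirely and should be checked against the chart's values.yaml):

    chunksCache:
      # lower memcached memory so the pod fits on the user nodes
      allocatedMemory: 1024
      writebackSizeLimit: 10MB
      # assumption: disable the cache entirely if it is not needed
      # enabled: false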
If we do not use affinity in version 6.5.0, a few pods land on the system node, run into resource issues, and fail; we also do not want pods to land on the system node.
Is there any way to fix this?
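One option, suggested by the coalesce warnings above: in the 6.x chart, singleBinary.affinity appears to default to a map (the podAntiAffinity block printed in the warning), so passing it as a multi-line string no longer merges cleanly. A minimal sketch of the same node preference from the 5.44.4 values expressed as a map, assuming the 6.x chart accepts a structured affinity value here:

    singleBinary:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              preference:
                matchExpressions:
                  - key: kubernetes.azure.com/mode
                    operator: In
                    values:
                      - user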
Values.yaml (used in 5.44.4)
---
# https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml
loki:
  auth_enabled: false
  query_scheduler:
    max_outstanding_requests_per_tenant: 2048
  query_range:
    parallelise_shardable_queries: false
    split_queries_by_interval: 0
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
singleBinary:
  replicas: 1
  persistence:
    size: 50Gi
    enableStatefulSetAutoDeletePVC: true
  affinity: |
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
              - key: kubernetes.azure.com/mode
                operator: In
                values:
                  - user
          weight: 50
Values.yaml (used in 6.5.0)
---
# https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml
deploymentMode: SingleBinary
loki:
  auth_enabled: false
  query_scheduler:
    max_outstanding_requests_per_tenant: 2048
  query_range:
    parallelise_shardable_queries: false
  limits_config:
    split_queries_by_interval: 0
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: 2024-04-01
        object_store: filesystem
        store: tsdb
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
  tracing:
    enabled: true
  querier:
    max_concurrent: 1
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
singleBinary:
  replicas: 1
  persistence:
    size: 50Gi
    enableStatefulSetAutoDeletePVC: true
    enabled: true
  extraArgs:
    - -config.expand-env=true
chunksCache:
  allocatedMemory: 1024
  writebackSizeLimit: 10MB
Error
ubuntu@NARU-Pr5530:~$ kubectl logs loki-0 -n loki
failed parsing config: /etc/loki/config/config.yaml: yaml: unmarshal errors:
  line 2: field Error not found in type loki.ConfigWrapper
ubuntu@NARU-Pr5530:~$
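To see what line 2 of the rendered config actually contains, it can help to inspect the generated config.yaml; a minimal sketch, assuming the chart repo is added as grafana and the 6.x default of storing the config in a ConfigMap named loki under the key config.yaml (loki.configStorageType can change this):

    # render the chart locally with the same values and check the generated config.yaml
    helm template loki grafana/loki -f values.yaml

    # or read the config that is already deployed in the cluster
    kubectl get configmap loki -n loki -o jsonpath='{.data.config\.yaml}'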
ubuntu@NARU-Pr5530:~$ kubectl get pods -n loki -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
loki-0 0/1 CrashLoopBackOff 1 (11s ago) 74s 10.101.80.28 aks-npu2-21504394-vmss00000f
loki-canary-6qngw 1/1 Running 0 74s 10.101.80.158 aks-npsystem01-10976478-vmss000000
loki-canary-6v6bz 1/1 Running 0 75s 10.101.80.136 aks-npu2-21504394-vmss000000
loki-canary-krnqv 1/1 Running 0 75s 10.101.80.240 aks-npu2-21504394-vmss00000f
loki-canary-twcl5 1/1 Running 0 75s 10.101.80.213 aks-npu2-21504394-vmss00000h
loki-chunks-cache-0 0/2 Pending 0 74s
loki-gateway-668c5dff6c-l7hd5 1/1 Running 0 74s 10.101.80.173 aks-npsystem01-10976478-vmss000000
loki-results-cache-0 2/2 Running 0 74s 10.101.80.175 aks-npsystem01-10976478-vmss000000
Any help in resolving this would be appreciated.