"no values found for nginx metric request-success-rate" with Prometheus Operator and nginx provider #421

Closed · mkorejo opened this issue Feb 1, 2020 · 10 comments · Label: question

mkorejo commented Feb 1, 2020

I installed Flagger with Flux as follows:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: flagger
  namespace: nginx-ingress
spec:
  releaseName: flagger
  chart:
    repository: https://flagger.app
    name: flagger
    version: 0.22.0
  values:
    crd:
      create: true
    meshProvider: nginx
    metricsServer: http://prometheus-operator-prometheus.prometheus-operator:9090

nginx-ingress is installed as follows:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: nginx-ingress
  namespace: nginx-ingress
spec:
  releaseName: nginx-ingress
  chart:
    repository: https://kubernetes-charts.storage.googleapis.com/
    name: nginx-ingress
    version: 1.24.4
  values:
    controller:
      extraArgs:
        publish-service: nginx-ingress/nginx-ingress-controller
        default-ssl-certificate: k1analyzer/k1analyzer-clusterwide-letsencrypt-secret
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true
          additionalLabels:
            release: prometheus-operator
      service:
        externalTrafficPolicy: "Local"
  valuesFrom:
  - configMapKeyRef:
      name: nginx-ingress-config
      optional: true

We are also using the prometheus-operator. I can confirm from the Prometheus dashboard that nginx metrics are being collected, and I have also confirmed that Flagger can connect to the metricsServer endpoint specified in the HelmRelease.

I made some changes to the podinfo Helm chart to support creating an Ingress and referencing it from the canary spec. My canary spec:

> kg canary -o yaml 
apiVersion: v1
items:
- apiVersion: flagger.app/v1alpha3
  kind: Canary
  metadata:
    annotations:
      flux.weave.works/antecedent: flagger:helmrelease/podinfo-frontend
    creationTimestamp: "2020-01-30T22:13:32Z"
    generation: 8
    labels:
      app: frontend
      chart: podinfo-3.1.0
      heritage: Tiller
      release: podinfo-frontend
    name: podinfo-frontend
    namespace: flagger
    resourceVersion: "4526699"
    selfLink: /apis/flagger.app/v1alpha3/namespaces/flagger/canaries/podinfo-frontend
    uid: c09d2cd6-43ad-11ea-b8b2-7222c6d53b77
  spec:
    canaryAnalysis:
      interval: 15s
      maxWeight: 50
      metrics:
      - interval: 1m
        name: request-success-rate
        threshold: 99
      - interval: 1m
        name: request-duration
        threshold: 500
      stepWeight: 5
      threshold: 10
      webhooks:
      - metadata:
          cmd: curl -sd 'test' http://podinfo-frontend-canary.flagger:9898/token |
            grep token
          type: bash
        name: acceptance-test
        timeout: 30s
        type: pre-rollout
        url: http://flagger-loadtester.flagger/
      - metadata:
          cmd: hey -z 1m -q 5 -c 2 http://podinfo-frontend.flagger:9898
        name: load-test-get
        timeout: 5s
        url: http://flagger-loadtester.flagger/
      - metadata:
          cmd: 'hey -z 1m -q 5 -c 2 -m POST -d ''{"test": true}'' http://podinfo-frontend.flagger:9898/echo'
        name: load-test-post
        timeout: 5s
        url: http://flagger-loadtester.flagger/
    ingressRef:
      apiVersion: extensions/v1beta1
      kind: Ingress
      name: podinfo-frontend
    progressDeadlineSeconds: 60
    provider: nginx
    service:
      port: 9898
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: podinfo-frontend
  status:
    canaryWeight: 0
    conditions:
    - lastTransitionTime: "2020-02-01T01:42:13Z"
      lastUpdateTime: "2020-02-01T01:42:13Z"
      message: Canary analysis failed, deployment scaled to zero.
      reason: Failed
      status: "False"
      type: Promoted
    failedChecks: 0
    iterations: 0
    lastAppliedSpec: "3118295861456058183"
    lastTransitionTime: "2020-02-01T01:42:13Z"
    phase: Failed
    trackedConfigs:
      configmap/podinfo-frontend: 270c8d855a0c1374
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The podinfo-frontend HelmRelease:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: podinfo-frontend
  namespace: flagger
spec:
  forceUpgrade: true
  rollback:
    enable: true
    force: true
    wait: true
  releaseName: podinfo-frontend
  chart:
    # repository: https://flagger.app
    # name: podinfo
    # version: 3.1.0
    git: https://github.com/mkorejo/flagger.git
    path: charts/podinfo
    ref: master
  values:
    # backend: http://podinfo-backend:9898/echo
    canary:
      enabled: true
      provider: nginx
      acceptancetest:
        enabled: true
        url: http://flagger-loadtester.flagger/
      loadtest:
        enabled: true
        url: http://flagger-loadtester.flagger/
    hpa:
      enabled: false
      minReplicas: 2
      maxReplicas: 4
      cpu: 80
      memory: 512Mi
    image:
      tag: 3.1.1
    ingress:
      enabled: true
      hostname: podinfo.poc.k1analyzer-nonprod.com
      annotations:
        kubernetes.io/ingress.class: "nginx"
        cert-manager.io/cluster-issuer: "az-60e-letsencrypt"
      tls:
        - hosts:
            - podinfo.poc.k1analyzer-nonprod.com
          secretName: podinfo.poc.k1analyzer-nonprod.com-tls
    nameOverride: frontend

My issue: every canary progression fails with:
Halt advancement no values found for nginx metric request-success-rate probably podinfo-frontend.flagger is not receiving traffic

I confirmed that the hey load testing works from the flagger-loadtester pod. Any thoughts as to what's going on? Thanks very much.

stefanprodan (Member) commented:

You need to run the load test against the public address so that traffic goes through nginx; see https://docs.flagger.app/usage/nginx-progressive-delivery
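
For example, the load-test webhooks could point hey at the public URL instead of the in-cluster service. A minimal sketch, reusing the ingress hostname from the podinfo-frontend values above:

  canaryAnalysis:
    webhooks:
    - name: load-test-get
      timeout: 5s
      url: http://flagger-loadtester.flagger/
      metadata:
        # hit the public hostname so requests pass through the nginx ingress controller
        cmd: "hey -z 1m -q 5 -c 2 http://podinfo.poc.k1analyzer-nonprod.com/"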

stefanprodan (Member) commented:

Or use the ClusterIP address of your nginx ingress and set the Host header in hey.
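
Roughly like this, assuming the controller Service created by the nginx-ingress release above (nginx-ingress-controller in the nginx-ingress namespace) and hey's -host flag to set the Host header:

  canaryAnalysis:
    webhooks:
    - name: load-test-get
      timeout: 5s
      url: http://flagger-loadtester.flagger/
      metadata:
        # send traffic to the ingress controller's ClusterIP service and
        # set the Host header so nginx routes it to the podinfo ingress
        cmd: "hey -z 1m -q 5 -c 2 -host podinfo.poc.k1analyzer-nonprod.com http://nginx-ingress-controller.nginx-ingress/"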


mkorejo commented Feb 1, 2020

Hi @stefanprodan, thanks for the quick reply. Also great work on Flagger!

Unfortunately, I'm still having issues even after switching the load tests to hit the public IP/hostname:

> kd canary
Name:         podinfo-frontend
Namespace:    flagger
Labels:       app=frontend
              chart=podinfo-3.1.0
              heritage=Tiller
              release=podinfo-frontend
Annotations:  flux.weave.works/antecedent: flagger:helmrelease/podinfo-frontend
API Version:  flagger.app/v1alpha3
Kind:         Canary
Metadata:
  Creation Timestamp:  2020-01-30T22:13:32Z
  Generation:          10
  Resource Version:    4708962
  Self Link:           /apis/flagger.app/v1alpha3/namespaces/flagger/canaries/podinfo-frontend
  UID:                 c09d2cd6-43ad-11ea-b8b2-7222c6d53b77
Spec:
  Canary Analysis:
    Interval:    15s
    Max Weight:  50
    Metrics:
      Interval:   1m
      Name:       request-success-rate
      Threshold:  99
      Interval:   1m
      Name:       request-duration
      Threshold:  500
    Step Weight:  5
    Threshold:    10
    Webhooks:
      Metadata:
        Cmd:    curl -sd 'test' http://podinfo-frontend-canary.flagger:9898/token | grep token
        Type:   bash
      Name:     acceptance-test
      Timeout:  30s
      Type:     pre-rollout
      URL:      http://flagger-loadtester.flagger/
      Metadata:
        Cmd:    hey -z 1m -q 5 -c 2 http://podinfo.poc.k1analyzer-nonprod.com
      Name:     load-test-get
      Timeout:  5s
      URL:      http://flagger-loadtester.flagger/
      Metadata:
        Cmd:    hey -z 1m -q 5 -c 2 -m POST -d '{"test": true}' http://podinfo.poc.k1analyzer-nonprod.com/echo
      Name:     load-test-post
      Timeout:  5s
      URL:      http://flagger-loadtester.flagger/
  Ingress Ref:
    API Version:              extensions/v1beta1
    Kind:                     Ingress
    Name:                     podinfo-frontend
  Progress Deadline Seconds:  60
  Provider:                   nginx
  Service:
    Port:  9898
    Traffic Policy:
      Tls:
        Mode:  DISABLE
  Target Ref:
    API Version:  apps/v1
    Kind:         Deployment
    Name:         podinfo-frontend
Status:
  Canary Weight:  0
  Conditions:
    Last Transition Time:  2020-02-01T18:10:43Z
    Last Update Time:      2020-02-01T18:10:43Z
    Message:               Canary analysis failed, deployment scaled to zero.
    Reason:                Failed
    Status:                False
    Type:                  Promoted
  Failed Checks:           0
  Iterations:              0
  Last Applied Spec:       3118295861456058183
  Last Transition Time:    2020-02-01T18:10:43Z
  Phase:                   Failed
  Tracked Configs:
    configmap/podinfo-frontend:  270c8d855a0c1374
Events:
  Type     Reason  Age                   From     Message
  ----     ------  ----                  ----     -------
  Warning  Synced  11m (x3 over 16h)     flagger  Rolling back podinfo-frontend.flagger failed checks threshold reached 10
  Warning  Synced  11m (x3 over 16h)     flagger  Canary failed! Scaling down podinfo-frontend.flagger
  Normal   Synced  3m59s (x4 over 16h)   flagger  New revision detected! Scaling up podinfo-frontend.flagger
  Normal   Synced  3m44s (x13 over 16h)  flagger  Starting canary analysis for podinfo-frontend.flagger
  Normal   Synced  3m44s (x3 over 16h)   flagger  Pre-rollout check acceptance-test passed
  Normal   Synced  3m44s (x3 over 16h)   flagger  Advance podinfo-frontend.flagger canary weight 5
  Warning  Synced  119s (x27 over 16h)   flagger  Halt advancement no values found for nginx metric request-success-rate probably podinfo-frontend.flagger is not receiving traffic

I updated the Flagger HelmRelease to install another Prometheus (prometheus.install=true) and this seems to be working. I need to dig into how to get this to work with Prometheus Operator.
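
For reference, a sketch of that change against the Flagger HelmRelease at the top of this issue; the external metricsServer value is dropped here on the assumption that the chart-bundled Prometheus takes over:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: flagger
  namespace: nginx-ingress
spec:
  releaseName: flagger
  chart:
    repository: https://flagger.app
    name: flagger
    version: 0.22.0
  values:
    crd:
      create: true
    meshProvider: nginx
    # Assumption: with the bundled Prometheus installed, the external
    # prometheus-operator endpoint is no longer referenced here.
    prometheus:
      install: true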

stefanprodan (Member) commented:

Any news on this?

stefanprodan added the question label on Feb 22, 2020

grzegdl commented Mar 11, 2020

I have the same issue when using prometheus-operator and a recent nginx-ingress. After a quick look, it seems that Flagger uses a different namespace label than the one exposed by prometheus-operator.

i.e., in the case of the podinfo test, Flagger queries:

sum(rate(nginx_ingress_controller_requests{namespace="test",ingress="podinfo",status!~"5.*"}[1m]))/sum(rate(nginx_ingress_controller_requests{namespace="test",ingress="podinfo"}[1m]))*100

instead of:

sum(rate(nginx_ingress_controller_requests{exported_namespace="test",ingress="podinfo",status!~"5.*"}[1m]))/sum(rate(nginx_ingress_controller_requests{exported_namespace="test",ingress="podinfo"}[1m]))*100

In short: namespace -> exported_namespace


stefanprodan commented Mar 12, 2020

I guess prometheus-operator changes that label, since the Flagger e2e tests for NGINX are passing (#489).

The solution is to use a metric template, e.g.:

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: ingress-nginx
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    100 - sum(
      rate(
        nginx_ingress_controller_requests{
          exported_namespace="{{ namespace }}",
          ingress="{{ ingress }}",
          status!~"5.*"
        }[{{ interval }}]
      )
    ) 
    / 
    sum(
      rate(
        nginx_ingress_controller_requests{
          exported_namespace="{{ namespace }}",
          ingress="{{ ingress }}"
        }[{{ interval }}]
      )
    ) 
    * 100

Replace request-success-rate with:

    metrics:
    - name: error-rate
      templateRef:
        name: error-rate
        namespace: ingress-nginx
      thresholdRange:
        max: 1
      interval: 1m


grzegdl commented Mar 13, 2020

Yeah, that did the trick. Running Flagger alongside prometheus-operator is probably a common use case, so maybe this should be documented somewhere.

stefanprodan (Member) commented:

This has been documented here: https://docs.flagger.app/v/master/tutorials/prometheus-operator

davidriskified commented:

Link is broken :-)

L3o-pold (Contributor) commented:

https://docs.flagger.app/tutorials/prometheus-operator
