feat(analysis): Add Dry-Run Mode #1627

agrawroh · 2021-11-06T21:52:24Z

Background

We are now using Argo Rollouts in production and have a few additional requirements to extend it to all our backend services. One of the top ask/requirements in the checklist is to have an ability to evaluate new analyses for their real behavior, prior to actually rolling back based on those analyses. One possible solution would be to add a Dry-Run mode to the AnalysisTemplate so that after analyzing the metrics which are marked as Dry-Run the controller does not result in the rollback/failures.

Use Cases

When we add a new M3/PROM metric to the AnalysisTemplate, users . Adding a Dry-Run mode can help us identify the issues (if any) in our metrics and resolve them before we start evaluating them to make the actual rollout decisions.

For those metrics which are marked to be in the Dry-Run mode, we need a mechanism to understand/answer the question: ”How would Argo have evaluated this query?”, but explicitly take no action.

Changes

In this PR, we are adding a Dry-Run mode at the per-metric level.

dryRun can be used on a metric to control whether or not to evaluate that metric in a dry-run mode. A metric running
in the dry-run mode won't impact the final state of the rollout or experiment even if it fails or the evaluation comes
out as inconclusive.

The following example queries prometheus every 5 minutes to get the total number of 4XX and 5XX errors, and even if the
evaluation of the metric to monitor the 5XX error-rate fail, the analysis run will pass.

  dryRun:
  - metricName: total-5xx-errors
  metrics:
  - name: total-5xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"5.*"}[5m]
          ))
  - name: total-4xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"4.*"}[5m]
          ))

A wildcard '*' can be used to make all the metrics run in the dry-run mode. In the following example, even if one or
both metrics fail, the analysis run will pass.

  dryRun:
  - metricName: .*
  metrics:
  - name: total-5xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"5.*"}[5m]
          ))
  - name: total-4xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"4.*"}[5m]
          ))

Dry-Run Summary

If one or more metrics are running in the dry-run mode, the summary of the dry-run results gets appended to the analysis
run message. Assuming that the total-4xx-errors metric fails in the above example but, the total-5xx-errors
succeeds, the final dry-run summary will look like this.

Message: Run Terminated
Dry Run Summary: 
  Count: 4
  Successful: 1
  Failed: 1
  Inconclusive: 1
  Error: 1
Dry Run Summary: 
  Count: 2
  Successful: 1
  Failed: 1
  Inconclusive: 0
  Error: 0
Metric Results:
...

Dry-Run Rollouts

If a rollout wants to dry run its analysis, it simply needs to specify the dryRun field to its analysis stanza. If a
rollout wants to dry run its analysis, it simply needs to specify the dryRun field to its analysis stanza. In the
following example, all the metrics from random-fail and always-pass get merged and executed in the dry-run mode.

kind: Rollout
spec:
...
  steps:
  - analysis:
      templates:
      - templateName: random-fail
      - templateName: always-pass
      dryRun:
      - metricName: .*

Metrics

The analysis_run_metric_phase metrics will have an additional "dry_run" label to indicate whether or not the metric is running in a Dry-Run mode.

Testing

Checklist

Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
I've signed my commits with DCO
I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
My builds are green. Try syncing with master if they are not.
My organization is added to USERS.md.

Signed-off-by: Rohit Agrawal rohit.agrawal@databricks.com

codecov · 2021-11-06T22:37:30Z

Codecov Report

Merging #1627 (6a600b7) into master (9d32c13) will increase coverage by 0.09%.
The diff coverage is 89.95%.

@@            Coverage Diff             @@
##           master    #1627      +/-   ##
==========================================
+ Coverage   81.97%   82.07%   +0.09%     
==========================================
  Files         116      116              
  Lines       15929    16064     +135     
==========================================
+ Hits        13058    13184     +126     
- Misses       2201     2208       +7     
- Partials      670      672       +2

Impacted Files	Coverage Δ
controller/metrics/prommetrics.go	`100.00% <ø> (ø)`
utils/analysis/helpers.go	`78.18% <85.48%> (+1.18%)`	⬆️
analysis/analysis.go	`85.74% <91.17%> (+0.99%)`	⬆️
controller/metrics/analysis.go	`87.67% <100.00%> (+0.17%)`	⬆️
.../apis/rollouts/validation/validation_references.go	`85.14% <100.00%> (ø)`
rollout/analysis.go	`79.59% <100.00%> (ø)`
utils/analysis/factory.go	`94.44% <100.00%> (ø)`
rollout/trafficrouting/istio/controller.go	`52.43% <0.00%> (+1.62%)`	⬆️
rollout/temlateref.go	`82.16% <0.00%> (+1.62%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d32c13...6a600b7. Read the comment docs.