Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(analysis): Add Dry-Run Mode #1627

Merged
merged 13 commits into from
Dec 2, 2021
Merged

Conversation

agrawroh
Copy link
Member

@agrawroh agrawroh commented Nov 6, 2021

Background

We are now using Argo Rollouts in production and have a few additional requirements to extend it to all our backend services. One of the top ask/requirements in the checklist is to have an ability to evaluate new analyses for their real behavior, prior to actually rolling back based on those analyses. One possible solution would be to add a Dry-Run mode to the AnalysisTemplate so that after analyzing the metrics which are marked as Dry-Run the controller does not result in the rollback/failures.

Use Cases

When we add a new M3/PROM metric to the AnalysisTemplate, users . Adding a Dry-Run mode can help us identify the issues (if any) in our metrics and resolve them before we start evaluating them to make the actual rollout decisions.

For those metrics which are marked to be in the Dry-Run mode, we need a mechanism to understand/answer the question: ”How would Argo have evaluated this query?”, but explicitly take no action.

Changes

In this PR, we are adding a Dry-Run mode at the per-metric level.

dryRun can be used on a metric to control whether or not to evaluate that metric in a dry-run mode. A metric running
in the dry-run mode won't impact the final state of the rollout or experiment even if it fails or the evaluation comes
out as inconclusive.

The following example queries prometheus every 5 minutes to get the total number of 4XX and 5XX errors, and even if the
evaluation of the metric to monitor the 5XX error-rate fail, the analysis run will pass.

  dryRun:
  - metricName: total-5xx-errors
  metrics:
  - name: total-5xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"5.*"}[5m]
          ))
  - name: total-4xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"4.*"}[5m]
          ))

A wildcard '*' can be used to make all the metrics run in the dry-run mode. In the following example, even if one or
both metrics fail, the analysis run will pass.

  dryRun:
  - metricName: .*
  metrics:
  - name: total-5xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"5.*"}[5m]
          ))
  - name: total-4xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"4.*"}[5m]
          ))

Dry-Run Summary

If one or more metrics are running in the dry-run mode, the summary of the dry-run results gets appended to the analysis
run message. Assuming that the total-4xx-errors metric fails in the above example but, the total-5xx-errors
succeeds, the final dry-run summary will look like this.

Message: Run Terminated
Dry Run Summary: 
  Count: 4
  Successful: 1
  Failed: 1
  Inconclusive: 1
  Error: 1
Dry Run Summary: 
  Count: 2
  Successful: 1
  Failed: 1
  Inconclusive: 0
  Error: 0
Metric Results:
...

Dry-Run Rollouts

If a rollout wants to dry run its analysis, it simply needs to specify the dryRun field to its analysis stanza. If a
rollout wants to dry run its analysis, it simply needs to specify the dryRun field to its analysis stanza. In the
following example, all the metrics from random-fail and always-pass get merged and executed in the dry-run mode.

kind: Rollout
spec:
...
  steps:
  - analysis:
      templates:
      - templateName: random-fail
      - templateName: always-pass
      dryRun:
      - metricName: .*

Metrics

The analysis_run_metric_phase metrics will have an additional "dry_run" label to indicate whether or not the metric is running in a Dry-Run mode.

Testing

Screen Shot 2021-11-07 at 8 18 54 PM

Checklist

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

Signed-off-by: Rohit Agrawal rohit.agrawal@databricks.com

@agrawroh agrawroh force-pushed the dry-run-mode branch 3 times, most recently from 242ff08 to b758b4b Compare November 6, 2021 22:28
@codecov
Copy link

codecov bot commented Nov 6, 2021

Codecov Report

Merging #1627 (6a600b7) into master (9d32c13) will increase coverage by 0.09%.
The diff coverage is 89.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1627      +/-   ##
==========================================
+ Coverage   81.97%   82.07%   +0.09%     
==========================================
  Files         116      116              
  Lines       15929    16064     +135     
==========================================
+ Hits        13058    13184     +126     
- Misses       2201     2208       +7     
- Partials      670      672       +2     
Impacted Files Coverage Δ
controller/metrics/prommetrics.go 100.00% <ø> (ø)
utils/analysis/helpers.go 78.18% <85.48%> (+1.18%) ⬆️
analysis/analysis.go 85.74% <91.17%> (+0.99%) ⬆️
controller/metrics/analysis.go 87.67% <100.00%> (+0.17%) ⬆️
.../apis/rollouts/validation/validation_references.go 85.14% <100.00%> (ø)
rollout/analysis.go 79.59% <100.00%> (ø)
utils/analysis/factory.go 94.44% <100.00%> (ø)
rollout/trafficrouting/istio/controller.go 52.43% <0.00%> (+1.62%) ⬆️
rollout/temlateref.go 82.16% <0.00%> (+1.62%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d32c13...6a600b7. Read the comment docs.

@agrawroh agrawroh force-pushed the dry-run-mode branch 4 times, most recently from e10124b to 0287e05 Compare November 7, 2021 03:11
@agrawroh agrawroh changed the title [WORK IN PROGRESS] Dry-Run feat(analysis): Add Dry-Run Mode Nov 7, 2021
@agrawroh agrawroh force-pushed the dry-run-mode branch 5 times, most recently from f4bc268 to 4e93fce Compare November 7, 2021 04:57
Signed-off-by: Rohit Agrawal <rohit.agrawal@databricks.com>
@agrawroh agrawroh force-pushed the dry-run-mode branch 4 times, most recently from 5418b07 to 76caceb Compare November 22, 2021 08:23
@agrawroh agrawroh force-pushed the dry-run-mode branch 2 times, most recently from 9183978 to bd3a75c Compare November 22, 2021 17:01
Signed-off-by: Rohit Agrawal <rohit.agrawal@databricks.com>
@agrawroh agrawroh requested a review from jessesuen November 23, 2021 07:11
@agrawroh agrawroh added the ready-for-review Ready for final review label Nov 23, 2021
pkg/apis/rollouts/v1alpha1/analysis_types.go Outdated Show resolved Hide resolved
controller/metrics/prommetrics.go Outdated Show resolved Hide resolved
controller/metrics/prommetrics.go Outdated Show resolved Hide resolved
utils/analysis/helpers.go Outdated Show resolved Hide resolved
Signed-off-by: Rohit Agrawal <rohit.agrawal@databricks.com>
Copy link
Member

@jessesuen jessesuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think last change request about the status type structure and I think this will be good to go.

pkg/apis/rollouts/v1alpha1/analysis_types.go Outdated Show resolved Hide resolved
Signed-off-by: Rohit Agrawal <rohit.agrawal@databricks.com>
…un-mode

Signed-off-by: Rohit Agrawal <rohit.agrawal@databricks.com>
@agrawroh agrawroh requested a review from jessesuen December 1, 2021 02:31
Signed-off-by: Rohit Agrawal <rohit.agrawal@databricks.com>
Signed-off-by: Rohit Agrawal <rohit.agrawal@databricks.com>
@sonarqubecloud
Copy link

sonarqubecloud bot commented Dec 1, 2021

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 6 Code Smells

No Coverage information No Coverage information
7.2% 7.2% Duplication

Copy link
Member

@jessesuen jessesuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great feature. Thank you!

@jessesuen jessesuen merged commit ec3d383 into argoproj:master Dec 2, 2021
@eilonmonday
Copy link

@jessesuen amazing feature!
when is the release?

@agrawroh agrawroh removed the ready-for-review Ready for final review label Dec 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants