
[COST-3758] retry failed operator queries up to 5 times #195

Merged
merged 10 commits into main from COST-3758-retry-failed-query
Aug 8, 2023

Conversation

maskarb
Member

@maskarb maskarb commented Aug 3, 2023

  • retry a given failed query up to 5 times
  • retry an entire time range up to 5 times (a rough sketch of this retry/skip tracking follows the list)
    • track how many times a particular start time has been attempted. Once a range has been tried 5 times, we skip it.
    • this tracking is kept until a query succeeds. If a success only comes after the operator has been upgraded, we retry as far back as possible based on the retention period and the last query success time.
    • if a success comes without updating the operator, the missed data is not collected.
  • update the txt_replace.py script to use argparse and accept an optional namespace argument, so a CSV can be generated for that namespace and deployed without manually editing the namespace
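
The retry bookkeeping described above boils down to a counter per range start time. Here is a minimal Go sketch of that idea (illustrative only, not the operator's actual code; maxRetries, retryTracker, and runQueries are hypothetical names):

package main

import (
	"errors"
	"fmt"
	"time"
)

const maxRetries = 5 // a query, or a whole time range, is retried at most 5 times

// retryTracker counts how many times a particular range start time has been attempted.
type retryTracker map[time.Time]int

// runQueries stands in for the operator's collection step for one time range.
// Here it always fails so the skip behavior is visible.
var runQueries = func(start time.Time) error { return errors.New("query timed out") }

// collectRange skips a range once it has failed maxRetries times and clears the
// tracking for that start time after a success.
func collectRange(tracker retryTracker, start time.Time) {
	if tracker[start] >= maxRetries {
		fmt.Printf("skipping %s: already tried %d times\n", start, tracker[start])
		return
	}
	if err := runQueries(start); err != nil {
		tracker[start]++ // remember the failure; the range is retried on the next pass
		fmt.Printf("query for %s failed (%d/%d): %v\n", start, tracker[start], maxRetries, err)
		return
	}
	delete(tracker, start) // success: stop tracking this start time
}

func main() {
	tracker := retryTracker{}
	start := time.Now().Truncate(time.Hour)
	for i := 0; i < 6; i++ {
		collectRange(tracker, start) // the 6th attempt is skipped
	}
}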

With the changes to the script, I've been using this series of make commands in order to deploy the operator to a cluster:

make docker-build-no-test IMG=quay.io/$USERNAME/koku-metrics-operator:v$VERSION
make docker-push IMG=quay.io/$USERNAME/koku-metrics-operator:v$VERSION
make bundle IMG=quay.io/$USERNAME/koku-metrics-operator:v$VERSION NAMESPACE=koku-metrics-operator
oc apply -f koku-metrics-operator/2.0.1/manifests/koku-metrics-operator.clusterserviceversion.yaml

To deploy an operator from a CSV like this, a Namespace and an OperatorGroup need to be created first:

apiVersion: v1
kind: Namespace
metadata:
  labels:
    app: koku-metrics-operator
    control-plane: controller-manager
  name: koku-metrics-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: koku-metrics-operator
  namespace: koku-metrics-operator
spec:
  targetNamespaces:
  - koku-metrics-operator

I've built this little "proxy" server so that we can simulate slow queries:
https://github.com/maskarb/devfile-sample-go-basic
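
The proxy's job is just to forward Prometheus traffic and occasionally stall longer than the configured context_timeout. A rough Go sketch of the idea (assumptions: the upstream Thanos querier URL and the 0-15 second sleep range; the actual repo linked above may differ):

package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// assumption: forward to the in-cluster Thanos querier used by the operator
	target, err := url.Parse("https://thanos-querier.openshift-monitoring.svc:9091")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// randomly sleep 0-15s so some requests exceed a 10s context_timeout
		delay := time.Duration(rand.Intn(16)) * time.Second
		log.Printf("delaying %s for %s", r.URL.Path, delay)
		time.Sleep(delay)
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}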

In the developer view of an OpenShift cluster, that repo can be imported and deployed. Then the KokuMetricsConfig can be updated like this:

  prometheus_config:
    collect_previous_data: true
    context_timeout: 10
    disable_metrics_collection_cost_management: false
    disable_metrics_collection_resource_optimization: false
    # this route is found in the cluster UI
    service_address: 'http://prom-test-server-koku-metrics-operator.apps-crc.testing'

With this service address, the Prometheus traffic flows through the "proxy" server, which randomly sleeps for longer than 10 seconds. This causes a timeout, which is visible in the operator logs. Once that query timeout occurs, the individual query is requeued and tried again.
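
For context, the 10-second limit comes from a context deadline on each query; when the proxy sleeps past it, the request returns a deadline-exceeded error and the operator requeues that query. A simplified sketch of the pattern (not the operator's actual query code; the route is the one shown above):

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// context_timeout: 10 translates to a 10-second deadline on each query
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://prom-test-server-koku-metrics-operator.apps-crc.testing/api/v1/query?query=up", nil)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// a proxy sleep longer than 10s surfaces here as context.DeadlineExceeded,
		// and the operator requeues that individual query
		fmt.Println("query failed, will be retried:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("query succeeded:", resp.Status)
}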

@maskarb maskarb changed the title from "retry up to 5 times" to "[COST-3758] retry failed operator queries up to 5 times" on Aug 3, 2023
@codecov

codecov bot commented Aug 3, 2023

Codecov Report

Merging #195 (1138bcc) into main (cb4a24a) will increase coverage by 0.09%.
The diff coverage is 93.82%.


@@            Coverage Diff             @@
##             main     #195      +/-   ##
==========================================
+ Coverage   89.07%   89.17%   +0.09%     
==========================================
  Files          11       11              
  Lines        2390     2457      +67     
==========================================
+ Hits         2129     2191      +62     
- Misses        183      187       +4     
- Partials       78       79       +1     
Flag Coverage Δ
unittests 89.17% <93.82%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
controllers/prometheus.go 93.52% <87.50%> (-2.52%) ⬇️
collector/collector.go 86.56% <100.00%> (ø)
collector/prometheus.go 96.64% <100.00%> (+0.64%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cb4a24a...1138bcc.

@maskarb maskarb force-pushed the COST-3758-retry-failed-query branch from d3b7484 to 0c1fdf5 on August 3, 2023 16:23
@maskarb maskarb self-assigned this Aug 4, 2023
@maskarb maskarb requested a review from a team August 4, 2023 18:04
chambridge
chambridge previously approved these changes Aug 7, 2023
Contributor

@chambridge chambridge left a comment


LGTM

collector/prometheus.go — review thread (outdated, resolved)
samdoran
samdoran previously approved these changes Aug 7, 2023
Contributor

@samdoran samdoran left a comment


A handful of comments/suggestions but nothing major.

scripts/txt_replace.py — 4 review threads (outdated, resolved)
@maskarb maskarb dismissed stale reviews from samdoran and chambridge via f990676 August 7, 2023 17:00
collector/prometheus.go — review thread (outdated, resolved)
@sonarqubecloud

sonarqubecloud bot commented Aug 8, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 0 (A)

Coverage: 95.2%
Duplication: 0.0%

@maskarb maskarb merged commit d05c07b into main Aug 8, 2023
@maskarb maskarb deleted the COST-3758-retry-failed-query branch August 8, 2023 13:24