
[COST-3758] retry failed operator queries up to 5 times #195

Merged
merged 10 commits into main from COST-3758-retry-failed-query
Aug 8, 2023

Conversation

maskarb
Member

@maskarb maskarb commented Aug 3, 2023

  • retry a given failed query up to 5 times
  • retry an entire time range up to 5 times (a rough sketch of this retry/skip tracking follows the list)
    • track how many times a particular start time has been attempted. Once a range has been tried 5 times, we skip it.
    • this tracking is kept until a query succeeds. If a success only comes after the operator has been upgraded, we retry as far back as possible based on the retention period and the last query success time.
    • if a success comes without updating the operator, the missed data is not collected.
  • update the txt_replace.py script to use argparse and accept an optional namespace argument, so a CSV can be generated for that namespace and deployed without manually editing the namespace
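
The retry bookkeeping described above boils down to a counter per range start time. Here is a minimal Go sketch of that idea (illustrative only, not the operator's actual code; maxRetries, retryTracker, and runQueries are hypothetical names):

package main

import (
	"errors"
	"fmt"
	"time"
)

const maxRetries = 5 // a query, or a whole time range, is retried at most 5 times

// retryTracker counts how many times a particular range start time has been attempted.
type retryTracker map[time.Time]int

// runQueries stands in for the operator's collection step for one time range.
// Here it always fails so the skip behavior is visible.
var runQueries = func(start time.Time) error { return errors.New("query timed out") }

// collectRange skips a range once it has failed maxRetries times and clears the
// tracking for that start time after a success.
func collectRange(tracker retryTracker, start time.Time) {
	if tracker[start] >= maxRetries {
		fmt.Printf("skipping %s: already tried %d times\n", start, tracker[start])
		return
	}
	if err := runQueries(start); err != nil {
		tracker[start]++ // remember the failure; the range is retried on the next pass
		fmt.Printf("query for %s failed (%d/%d): %v\n", start, tracker[start], maxRetries, err)
		return
	}
	delete(tracker, start) // success: stop tracking this start time
}

func main() {
	tracker := retryTracker{}
	start := time.Now().Truncate(time.Hour)
	for i := 0; i < 6; i++ {
		collectRange(tracker, start) // the 6th attempt is skipped
	}
}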

With the changes to the script, I've been using this series of make commands in order to deploy the operator to a cluster:

make docker-build-no-test IMG=quay.io/$USERNAME/koku-metrics-operator:v$VERSION
make docker-push IMG=quay.io/$USERNAME/koku-metrics-operator:v$VERSION
make bundle IMG=quay.io/$USERNAME/koku-metrics-operator:v$VERSION NAMESPACE=koku-metrics-operator
oc apply -f koku-metrics-operator/2.0.1/manifests/koku-metrics-operator.clusterserviceversion.yaml

To deploy an operator from a CSV like this, a Namespace and an OperatorGroup need to be created first:

apiVersion: v1
kind: Namespace
metadata:
  labels:
    app: koku-metrics-operator
    control-plane: controller-manager
  name: koku-metrics-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: koku-metrics-operator
  namespace: koku-metrics-operator
spec:
  targetNamespaces:
  - koku-metrics-operator

I've built this little "proxy" server so that we can simulate slow queries:
https://github.com/maskarb/devfile-sample-go-basic
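
The proxy's job is just to forward Prometheus traffic and occasionally stall longer than the configured context_timeout. A rough Go sketch of the idea (assumptions: the upstream Thanos querier URL and the 0-15 second sleep range; the actual repo linked above may differ):

package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// assumption: forward to the in-cluster Thanos querier used by the operator
	target, err := url.Parse("https://thanos-querier.openshift-monitoring.svc:9091")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// randomly sleep 0-15s so some requests exceed a 10s context_timeout
		delay := time.Duration(rand.Intn(16)) * time.Second
		log.Printf("delaying %s for %s", r.URL.Path, delay)
		time.Sleep(delay)
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}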

In the developer view of an OpenShift cluster, that repo can be imported and deployed. Then the KokuMetricsConfig can be updated like this:

  prometheus_config:
    collect_previous_data: true
    context_timeout: 10
    disable_metrics_collection_cost_management: false
    disable_metrics_collection_resource_optimization: false
    # this route is found in the cluster UI
    service_address: 'http://prom-test-server-koku-metrics-operator.apps-crc.testing'

With this service address, the Prometheus traffic flows through the "proxy" server, which randomly sleeps for longer than 10 seconds. This causes a timeout, which is visible in the operator logs. Once that query timeout occurs, the individual query is requeued and tried again.
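
For context, the 10-second limit comes from a context deadline on each query; when the proxy sleeps past it, the request returns a deadline-exceeded error and the operator requeues that query. A simplified sketch of the pattern (not the operator's actual query code; the route is the one shown above):

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// context_timeout: 10 translates to a 10-second deadline on each query
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://prom-test-server-koku-metrics-operator.apps-crc.testing/api/v1/query?query=up", nil)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// a proxy sleep longer than 10s surfaces here as context.DeadlineExceeded,
		// and the operator requeues that individual query
		fmt.Println("query failed, will be retried:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("query succeeded:", resp.Status)
}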

@maskarb maskarb changed the title from "retry up to 5 times" to "[COST-3758] retry failed operator queries up to 5 times" on Aug 3, 2023
@codecov

codecov bot commented Aug 3, 2023

Codecov Report

Merging #195 (1138bcc) into main (cb4a24a) will increase coverage by 0.09%.
The diff coverage is 93.82%.


@@            Coverage Diff             @@
##             main     #195      +/-   ##
==========================================
+ Coverage   89.07%   89.17%   +0.09%     
==========================================
  Files          11       11              
  Lines        2390     2457      +67     
==========================================
+ Hits         2129     2191      +62     
- Misses        183      187       +4     
- Partials       78       79       +1     
Flag Coverage Δ
unittests 89.17% <93.82%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
controllers/prometheus.go 93.52% <87.50%> (-2.52%) ⬇️
collector/collector.go 86.56% <100.00%> (ø)
collector/prometheus.go 96.64% <100.00%> (+0.64%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cb4a24a...1138bcc.

@maskarb maskarb force-pushed the COST-3758-retry-failed-query branch from d3b7484 to 0c1fdf5 on August 3, 2023 16:23
@maskarb maskarb self-assigned this Aug 4, 2023
@maskarb maskarb requested a review from a team August 4, 2023 18:04
chambridge
chambridge previously approved these changes Aug 7, 2023
Contributor

@chambridge chambridge left a comment


LGTM

collector/prometheus.go — review thread (outdated, resolved)
samdoran
samdoran previously approved these changes Aug 7, 2023
Contributor

@samdoran samdoran left a comment


A handful of comments/suggestions but nothing major.

scripts/txt_replace.py — 4 review threads (outdated, resolved)
@maskarb maskarb dismissed stale reviews from samdoran and chambridge via f990676 August 7, 2023 17:00
collector/prometheus.go — review thread (outdated, resolved)
@sonarqubecloud

sonarqubecloud bot commented Aug 8, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 0 (A)

Coverage: 95.2%
Duplication: 0.0%

@maskarb maskarb merged commit d05c07b into main Aug 8, 2023
@maskarb maskarb deleted the COST-3758-retry-failed-query branch August 8, 2023 13:24