Inconsistent data when grabbing the same historic metric multiple times #701

Open
simonhfot opened this issue Dec 5, 2024 · 1 comment

simonhfot commented Dec 5, 2024

Hello!

I'm grabbing the same CPU metrics for the same historical period nine times:

$ for i in 1 2 3 4 5 6 7 8 9; do
    promql --timeout 300 --no-headers --host 'https://prometheus.example.com' --start '2024-12-01T22:00:00Z' --end '2024-12-01T23:00:00Z' --step '1m' --output json 'sort_desc(sum(increase(container_cpu_usage_seconds_total{container!=""}[1m])) by(node, instance, namespace, pod, container))' | perl -lane 's~\.(\d\d\d)\d+~.$1~gs; s~(\{"metric")~\n$1~gs; print;' | xz -c > "/tmp/bar$i.txt.xz" &
  done

However, I get two sets of results:

$ sha256sum /tmp/bar*.txt.xz
3b1b75f13790ba192ff2a75ac63dcdfdfc9d11f8f9542fbc473d3d12f1a28cc4  /tmp/bar1.txt.xz
19de8b31f135ae167069f8c745738380f163b73cb512c287c83d77642bc3dd97  /tmp/bar2.txt.xz
3b1b75f13790ba192ff2a75ac63dcdfdfc9d11f8f9542fbc473d3d12f1a28cc4  /tmp/bar3.txt.xz
19de8b31f135ae167069f8c745738380f163b73cb512c287c83d77642bc3dd97  /tmp/bar4.txt.xz
3b1b75f13790ba192ff2a75ac63dcdfdfc9d11f8f9542fbc473d3d12f1a28cc4  /tmp/bar5.txt.xz
19de8b31f135ae167069f8c745738380f163b73cb512c287c83d77642bc3dd97  /tmp/bar6.txt.xz
3b1b75f13790ba192ff2a75ac63dcdfdfc9d11f8f9542fbc473d3d12f1a28cc4  /tmp/bar7.txt.xz
19de8b31f135ae167069f8c745738380f163b73cb512c287c83d77642bc3dd97  /tmp/bar8.txt.xz
3b1b75f13790ba192ff2a75ac63dcdfdfc9d11f8f9542fbc473d3d12f1a28cc4  /tmp/bar9.txt.xz

Looking at a couple of differing result lines:

$ xzdiff /tmp/bar1.txt.xz /tmp/bar2.txt.xz | grep -E "example.*example-786bb57f8c-lczss"
< {"metric":{"container":"example","instance":"10.33.169.25:10250","namespace":"example","node":"k8s-node-example.example.com","pod":"example-786bb57f8c-lczss"},"values":[[1733090400,"1.221"],[1733090460,"1.067"],[1733090520,"1.165"],[1733090580,"2.093"],[1733090640,"1.086"],[1733090700,"1.260"],[1733090760,"1.424"],[1733090820,"1.085"],[1733090880,"0.879"],[1733090940,"1.464"],[1733091000,"1.255"],[1733091060,"0.930"],[1733091120,"0.800"],[1733091180,"0.983"],[1733091240,"1.806"],[1733091300,"1.226"],[1733091360,"1.212"],[1733091420,"0.936"],[1733091480,"0.938"],[1733091540,"1.151"],[1733091600,"0.913"],[1733091660,"0.870"],[1733091720,"9.322"],[1733091780,"0.929"],[1733091840,"2.868"],[1733091900,"1.054"],[1733091960,"1.132"],[1733092020,"15.254"],[1733092080,"2.275"],[1733092140,"0.965"],[1733092200,"1.108"],[1733092260,"1.183"],[1733092320,"13.217"],[1733092380,"3.062"],[1733092440,"0.378"],[1733092500,"1.163"],[1733092560,"0.846"],[1733092620,"3.220"],[1733092680,"0.946"],[1733092740,"0.864"],[1733092800,"1.084"],[1733092860,"1.052"],[1733092920,"1.037"],[1733092980,"0.900"],[1733093040,"2.639"],[1733093100,"0.637"],[1733093160,"1.016"],[1733093220,"6.778"],[1733093280,"0.788"],[1733093340,"1.094"],[1733093400,"0.567"],[1733093460,"0.831"],[1733093520,"0.808"],[1733093580,"0.898"],[1733093640,"1.462"],[1733093700,"1.283"],[1733093760,"1.075"],[1733093820,"0.812"],[1733093880,"1.049"],[1733093940,"0.680"],[1733094000,"2.660"]]},
> {"metric":{"container":"example","instance":"10.33.169.25:10250","namespace":"example","node":"k8s-node-example.example.com","pod":"example-786bb57f8c-lczss"},"values":[[1733090400,"1.038"],[1733090460,"0.916"],[1733090520,"10.579"],[1733090580,"2.163"],[1733090640,"0.686"],[1733090700,"1.260"],[1733090760,"1.424"],[1733090820,"12.166"],[1733090880,"0.642"],[1733090940,"1.464"],[1733091000,"1.149"],[1733091060,"1.114"],[1733091120,"0.538"],[1733091180,"0.964"],[1733091240,"1.943"],[1733091300,"1.064"],[1733091360,"1.102"],[1733091420,"11.507"],[1733091480,"1.398"],[1733091540,"1.078"],[1733091600,"0.706"],[1733091660,"0.870"],[1733091720,"13.516"],[1733091780,"0.532"],[1733091840,"2.257"],[1733091900,"0.868"],[1733091960,"1.046"],[1733092020,"15.254"],[1733092080,"2.275"],[1733092140,"1.014"],[1733092200,"1.161"],[1733092260,"1.183"],[1733092320,"13.217"],[1733092380,"2.239"],[1733092440,"1.235"],[1733092500,"1.023"],[1733092560,"1.129"],[1733092620,"13.110"],[1733092680,"0.916"],[1733092740,"1.531"],[1733092800,"1.020"],[1733092860,"0.856"],[1733092920,"0.615"],[1733092980,"0.925"],[1733093040,"2.345"],[1733093100,"1.097"],[1733093160,"1.045"],[1733093220,"14.319"],[1733093280,"2.728"],[1733093340,"1.094"],[1733093400,"0.911"],[1733093460,"0.587"],[1733093520,"12.624"],[1733093580,"0.936"],[1733093640,"1.462"],[1733093700,"1.283"],[1733093760,"1.075"],[1733093820,"11.700"],[1733093880,"1.018"],[1733093940,"0.947"],[1733094000,"2.336"]]},

Example of differing CPU values for the same one-minute step:

[1733090520,"1.165"]
[1733090520,"10.579"]
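To list every step where two such result lines disagree, rather than spotting them by eye, something like the following rough sketch might help. It assumes the one-series-per-line JSON shape produced by the perl step above; the function names `parse_series` and `differing_samples` are made up for illustration, and the two sample lines are shortened to two steps each.

```python
import json


def parse_series(line: str) -> dict:
    """Parse one '{"metric": ..., "values": [...]}' line into {timestamp: value}."""
    # Drop the diff marker ("< " or "> ") and the trailing comma left by the perl step.
    text = line.strip().lstrip("<> ").rstrip(",")
    return {int(ts): float(value) for ts, value in json.loads(text)["values"]}


def differing_samples(line_a: str, line_b: str) -> list:
    """Return (timestamp, value_a, value_b) for every step where the two lines disagree."""
    a, b = parse_series(line_a), parse_series(line_b)
    return [
        (ts, a.get(ts), b.get(ts))
        for ts in sorted(a.keys() | b.keys())
        if a.get(ts) != b.get(ts)
    ]


# Shortened copies of the two xzdiff lines above (two steps each).
left = '< {"metric":{"pod":"example-786bb57f8c-lczss"},"values":[[1733090520,"1.165"],[1733090700,"1.260"]]},'
right = '> {"metric":{"pod":"example-786bb57f8c-lczss"},"values":[[1733090520,"10.579"],[1733090700,"1.260"]]},'

for ts, value_a, value_b in differing_samples(left, right):
    print(ts, value_a, value_b)
```

Steps missing from one side show up with `None` for that side, so gaps are reported as differences too.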

Question 1: Why are the results not consistent for each promql command?
Question 2: If the results are intentionally different due to some kind of business logic, how (in)accurate are they?

Thanks in advance!

jacksontj (Owner) commented

Question 1: Why are the results not consistent for each promql command?

This can happen for a variety of reasons, but at a high level it generally means the downstreams don't agree on the data. For example, if you have two servers with slightly different data, the merge logic in promxy doesn't guarantee the "same" result each time, because by default it merges on a first-responder basis. There is an option to configure it to prefer max, but that has tradeoffs of its own.
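As a toy illustration of why a first-responder merge can flip between two results, here is a minimal sketch. It is not promxy's actual merge code: the functions `merge_first_responder` and `merge_prefer_max` and the two "servers" are hypothetical, and the sample values are borrowed from the diff above purely for flavor.

```python
from typing import Dict, List

Series = Dict[int, float]  # timestamp -> value


def merge_first_responder(responses: List[Series]) -> Series:
    """Keep each timestamp's value from whichever response arrived first."""
    merged: Series = {}
    for series in responses:  # list order stands in for arrival order
        for ts, value in series.items():
            merged.setdefault(ts, value)
    return merged


def merge_prefer_max(responses: List[Series]) -> Series:
    """Keep the largest value seen for each timestamp, regardless of arrival order."""
    merged: Series = {}
    for series in responses:
        for ts, value in series.items():
            merged[ts] = max(value, merged.get(ts, value))
    return merged


# Two hypothetical downstreams that disagree on both samples.
server_a = {1733090400: 1.221, 1733090520: 1.165}
server_b = {1733090400: 1.038, 1733090520: 10.579}

print(merge_first_responder([server_a, server_b]))  # follows server_a
print(merge_first_responder([server_b, server_a]))  # follows server_b
print(merge_prefer_max([server_a, server_b]))
print(merge_prefer_max([server_b, server_a]))       # same as the line above
```

In this model the first-responder result depends only on which server answers first, which would be consistent with seeing exactly two distinct checksums across repeated runs. The max-based merge is order-independent, but it always reports the higher value, which is one of the tradeoffs.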

Question 2: If the results are intentionally different due to some kind of business logic, how (in)accurate are they?

In general, if the downstream data is consistent, promxy should return the same result. I'd suggest investigating the trace log output to see what the downstream data is and whether there is a reason for it varying.
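Besides the trace log, one way to check whether the downstreams agree is to send the same range query to each Prometheus server directly, bypassing promxy, and compare the answers. The sketch below uses the standard Prometheus HTTP API endpoint `/api/v1/query_range`. The downstream URLs are placeholders, and `query_range_url` and `fetch_matrix` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request


def query_range_url(base_url: str, query: str, start: str, end: str, step: str) -> str:
    """Build a Prometheus /api/v1/query_range URL for one downstream."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def fetch_matrix(url: str, timeout: float = 300) -> list:
    """Fetch a range query and return the list of series under data.result."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["data"]["result"]


# Placeholder downstream addresses; replace with the servers behind promxy.
DOWNSTREAMS = [
    "http://prometheus-a.example.com:9090",
    "http://prometheus-b.example.com:9090",
]
QUERY = 'increase(container_cpu_usage_seconds_total{container!=""}[1m])'

urls = [
    query_range_url(base, QUERY, "2024-12-01T22:00:00Z", "2024-12-01T23:00:00Z", "1m")
    for base in DOWNSTREAMS
]
for url in urls:
    print(url)
    # result = fetch_matrix(url)  # uncomment once the URLs point at real servers
```

If the per-server results differ for the same series and timestamps, the inconsistency originates in the downstream data rather than in promxy's merge.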
