Inconsistent data when grabbing the same historic metric multiple times #701

simonhfot · 2024-12-05T02:15:28Z

Hello!

I'm grabbing the same CPU metrics for the same historical period 9 times:

$ promql --timeout 300 --no-headers --host 'https://prometheus.example.com' --start '2024-12-01T22:00:00Z' --end '2024-12-01T23:00:00Z' --step '1m' --output json 'sort_desc(sum(       increase(container_cpu_usage_seconds_total{container!=""}[1m])) by(node, instance, namespace, pod, container))'             | perl -lane 's~\.(\d\d\d)\d+~.$1~gs    ; s~(\{"metric")~\n$1~gs; print;' | xz -c > /tmp/bar1.txt.xz &
  promql --timeout 300 --no-headers --host 'https://prometheus.example.com' --start '2024-12-01T22:00:00Z' --end '2024-12-01T23:00:00Z' --step '1m' --output json 'sort_desc(sum(       increase(container_cpu_usage_seconds_total{container!=""}[1m])) by(node, instance, namespace, pod, container))'             | perl -lane 's~\.(\d\d\d)\d+~.$1~gs    ; s~(\{"metric")~\n$1~gs; print;' | xz -c > /tmp/bar2.txt.xz &
  promql --timeout 300 --no-headers --host 'https://prometheus.example.com' --start '2024-12-01T22:00:00Z' --end '2024-12-01T23:00:00Z' --step '1m' --output json 'sort_desc(sum(       increase(container_cpu_usage_seconds_total{container!=""}[1m])) by(node, instance, namespace, pod, container))'             | perl -lane 's~\.(\d\d\d)\d+~.$1~gs    ; s~(\{"metric")~\n$1~gs; print;' | xz -c > /tmp/bar3.txt.xz &
  promql --timeout 300 --no-headers --host 'https://prometheus.example.com' --start '2024-12-01T22:00:00Z' --end '2024-12-01T23:00:00Z' --step '1m' --output json 'sort_desc(sum(       increase(container_cpu_usage_seconds_total{container!=""}[1m])) by(node, instance, namespace, pod, container))'             | perl -lane 's~\.(\d\d\d)\d+~.$1~gs    ; s~(\{"metric")~\n$1~gs; print;' | xz -c > /tmp/bar4.txt.xz &
  promql --timeout 300 --no-headers --host 'https://prometheus.example.com' --start '2024-12-01T22:00:00Z' --end '2024-12-01T23:00:00Z' --step '1m' --output json 'sort_desc(sum(       increase(container_cpu_usage_seconds_total{container!=""}[1m])) by(node, instance, namespace, pod, container))'             | perl -lane 's~\.(\d\d\d)\d+~.$1~gs    ; s~(\{"metric")~\n$1~gs; print;' | xz -c > /tmp/bar5.txt.xz &
  promql --timeout 300 --no-headers --host 'https://prometheus.example.com' --start '2024-12-01T22:00:00Z' --end '2024-12-01T23:00:00Z' --step '1m' --output json 'sort_desc(sum(       increase(container_cpu_usage_seconds_total{container!=""}[1m])) by(node, instance, namespace, pod, container))'             | perl -lane 's~\.(\d\d\d)\d+~.$1~gs    ; s~(\{"metric")~\n$1~gs; print;' | xz -c > /tmp/bar6.txt.xz &
  promql --timeout 300 --no-headers --host 'https://prometheus.example.com' --start '2024-12-01T22:00:00Z' --end '2024-12-01T23:00:00Z' --step '1m' --output json 'sort_desc(sum(       increase(container_cpu_usage_seconds_total{container!=""}[1m])) by(node, instance, namespace, pod, container))'             | perl -lane 's~\.(\d\d\d)\d+~.$1~gs    ; s~(\{"metric")~\n$1~gs; print;' | xz -c > /tmp/bar7.txt.xz &
  promql --timeout 300 --no-headers --host 'https://prometheus.example.com' --start '2024-12-01T22:00:00Z' --end '2024-12-01T23:00:00Z' --step '1m' --output json 'sort_desc(sum(       increase(container_cpu_usage_seconds_total{container!=""}[1m])) by(node, instance, namespace, pod, container))'             | perl -lane 's~\.(\d\d\d)\d+~.$1~gs    ; s~(\{"metric")~\n$1~gs; print;' | xz -c > /tmp/bar8.txt.xz &
  promql --timeout 300 --no-headers --host 'https://prometheus.example.com' --start '2024-12-01T22:00:00Z' --end '2024-12-01T23:00:00Z' --step '1m' --output json 'sort_desc(sum(       increase(container_cpu_usage_seconds_total{container!=""}[1m])) by(node, instance, namespace, pod, container))'             | perl -lane 's~\.(\d\d\d)\d+~.$1~gs    ; s~(\{"metric")~\n$1~gs; print;' | xz -c > /tmp/bar9.txt.xz &

However, I get two sets of results:

$ sha256sum /tmp/bar*.txt.xz
3b1b75f13790ba192ff2a75ac63dcdfdfc9d11f8f9542fbc473d3d12f1a28cc4  /tmp/bar1.txt.xz
19de8b31f135ae167069f8c745738380f163b73cb512c287c83d77642bc3dd97  /tmp/bar2.txt.xz
3b1b75f13790ba192ff2a75ac63dcdfdfc9d11f8f9542fbc473d3d12f1a28cc4  /tmp/bar3.txt.xz
19de8b31f135ae167069f8c745738380f163b73cb512c287c83d77642bc3dd97  /tmp/bar4.txt.xz
3b1b75f13790ba192ff2a75ac63dcdfdfc9d11f8f9542fbc473d3d12f1a28cc4  /tmp/bar5.txt.xz
19de8b31f135ae167069f8c745738380f163b73cb512c287c83d77642bc3dd97  /tmp/bar6.txt.xz
3b1b75f13790ba192ff2a75ac63dcdfdfc9d11f8f9542fbc473d3d12f1a28cc4  /tmp/bar7.txt.xz
19de8b31f135ae167069f8c745738380f163b73cb512c287c83d77642bc3dd97  /tmp/bar8.txt.xz
3b1b75f13790ba192ff2a75ac63dcdfdfc9d11f8f9542fbc473d3d12f1a28cc4  /tmp/bar9.txt.xz

Looking at a couple of differing result lines:

$ xzdiff /tmp/bar1.txt.xz /tmp/bar2.txt.xz | egrep "example.*example-786bb57f8c-lczss"
< {"metric":{"container":"example","instance":"10.33.169.25:10250","namespace":"example","node":"k8s-node-example.example.com","pod":"example-786bb57f8c-lczss"},"values":[[1733090400,"1.221"],[1733090460,"1.067"],[1733090520,"1.165"],[1733090580,"2.093"],[1733090640,"1.086"],[1733090700,"1.260"],[1733090760,"1.424"],[1733090820,"1.085"],[1733090880,"0.879"],[1733090940,"1.464"],[1733091000,"1.255"],[1733091060,"0.930"],[1733091120,"0.800"],[1733091180,"0.983"],[1733091240,"1.806"],[1733091300,"1.226"],[1733091360,"1.212"],[1733091420,"0.936"],[1733091480,"0.938"],[1733091540,"1.151"],[1733091600,"0.913"],[1733091660,"0.870"],[1733091720,"9.322"],[1733091780,"0.929"],[1733091840,"2.868"],[1733091900,"1.054"],[1733091960,"1.132"],[1733092020,"15.254"],[1733092080,"2.275"],[1733092140,"0.965"],[1733092200,"1.108"],[1733092260,"1.183"],[1733092320,"13.217"],[1733092380,"3.062"],[1733092440,"0.378"],[1733092500,"1.163"],[1733092560,"0.846"],[1733092620,"3.220"],[1733092680,"0.946"],[1733092740,"0.864"],[1733092800,"1.084"],[1733092860,"1.052"],[1733092920,"1.037"],[1733092980,"0.900"],[1733093040,"2.639"],[1733093100,"0.637"],[1733093160,"1.016"],[1733093220,"6.778"],[1733093280,"0.788"],[1733093340,"1.094"],[1733093400,"0.567"],[1733093460,"0.831"],[1733093520,"0.808"],[1733093580,"0.898"],[1733093640,"1.462"],[1733093700,"1.283"],[1733093760,"1.075"],[1733093820,"0.812"],[1733093880,"1.049"],[1733093940,"0.680"],[1733094000,"2.660"]]},
> {"metric":{"container":"example","instance":"10.33.169.25:10250","namespace":"example","node":"k8s-node-example.example.com","pod":"example-786bb57f8c-lczss"},"values":[[1733090400,"1.038"],[1733090460,"0.916"],[1733090520,"10.579"],[1733090580,"2.163"],[1733090640,"0.686"],[1733090700,"1.260"],[1733090760,"1.424"],[1733090820,"12.166"],[1733090880,"0.642"],[1733090940,"1.464"],[1733091000,"1.149"],[1733091060,"1.114"],[1733091120,"0.538"],[1733091180,"0.964"],[1733091240,"1.943"],[1733091300,"1.064"],[1733091360,"1.102"],[1733091420,"11.507"],[1733091480,"1.398"],[1733091540,"1.078"],[1733091600,"0.706"],[1733091660,"0.870"],[1733091720,"13.516"],[1733091780,"0.532"],[1733091840,"2.257"],[1733091900,"0.868"],[1733091960,"1.046"],[1733092020,"15.254"],[1733092080,"2.275"],[1733092140,"1.014"],[1733092200,"1.161"],[1733092260,"1.183"],[1733092320,"13.217"],[1733092380,"2.239"],[1733092440,"1.235"],[1733092500,"1.023"],[1733092560,"1.129"],[1733092620,"13.110"],[1733092680,"0.916"],[1733092740,"1.531"],[1733092800,"1.020"],[1733092860,"0.856"],[1733092920,"0.615"],[1733092980,"0.925"],[1733093040,"2.345"],[1733093100,"1.097"],[1733093160,"1.045"],[1733093220,"14.319"],[1733093280,"2.728"],[1733093340,"1.094"],[1733093400,"0.911"],[1733093460,"0.587"],[1733093520,"12.624"],[1733093580,"0.936"],[1733093640,"1.462"],[1733093700,"1.283"],[1733093760,"1.075"],[1733093820,"11.700"],[1733093880,"1.018"],[1733093940,"0.947"],[1733094000,"2.336"]]},

Example of CPU differences during the same 1 minute:

[1733090520,"1.165"]
[1733090520,"10.579"]

Question 1: Why are the results not consistent for each promql command?
Question 2: If the results are intentionally different due to some kind of business logic, how (in)accurate are they?

Thanks in advance!


jacksontj · 2024-12-15T06:10:23Z

Question 1: Why are the results not consistent for each promql command?

This can happen for a variety of reasons, but at a high level it generally means the downstreams don't agree on the data. For example, if you have 2 servers with slightly different data, promxy doesn't guarantee that it will return the same result each time, because by default the merge logic works on a first-responder basis. There is an option to configure it to prefer max, but that has tradeoffs of its own.
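
To make the difference between the two strategies concrete, here is a toy model in shell and awk. It is not promxy's actual merge code: the function names `merge_first` and `merge_max` and the "timestamp value" input format are assumptions made purely for illustration. Each input file stands for the samples returned by one downstream, and the order of the file arguments stands for the order in which the downstreams responded.

```shell
# Toy model only: NOT promxy's real merge code.
# Each input file holds "timestamp value" lines from one hypothetical downstream.

# merge_first FILE...: for each timestamp, keep the value from the first file
# that contains it. The chosen value therefore depends on the file order,
# which stands in for the order in which downstreams respond.
merge_first() {
  awk '!($1 in v) { v[$1] = $2; ts[++n] = $1 }
       END { for (i = 1; i <= n; i++) print ts[i], v[ts[i]] }' "$@"
}

# merge_max FILE...: for each timestamp, keep the numerically largest value.
# The chosen value is the same whatever the file order.
merge_max() {
  awk '!($1 in v) { v[$1] = $2; ts[++n] = $1; next }
       ($2 + 0) > (v[$1] + 0) { v[$1] = $2 }
       END { for (i = 1; i <= n; i++) print ts[i], v[ts[i]] }' "$@"
}
```

With one file containing `1733090520 1.165` and another containing `1733090520 10.579` (the two values reported above for that timestamp; which downstream returned which is not known), `merge_first` returns the value from whichever file is listed first, while `merge_max` returns `10.579` in either order.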

Question 2: If the results are intentionally different due to some kind of business logic, how (in)accurate are they?

In general, if the downstream data is consistent, promxy should return the same result. I'd suggest investigating the trace log output to see what the downstream data is and whether there is a reason for it varying.
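
Once the result for one series has been saved from two runs (or from two downstreams queried directly), the samples that disagree can be listed mechanically. The following is a minimal sketch, assuming each file holds a single series in the JSON shape shown in the xzdiff output above and that `grep -o` is available. `samples` and `diff_samples` are hypothetical helper names, and the comparison is keyed on timestamp only, so files containing several series would need to be split first.

```shell
# samples FILE: print "timestamp value" for every [ts,"value"] pair in FILE.
samples() {
  grep -o '\[[0-9][0-9]*,"[^"]*"]' "$1" |
    sed 's/^\[//; s/"]$//; s/,"/ /'
}

# diff_samples FILE_A FILE_B: print "timestamp value_a value_b" for every
# timestamp that appears in both files with different values.
diff_samples() {
  {
    samples "$1" | sed 's/^/A /'   # tag samples from the first file
    samples "$2" | sed 's/^/B /'   # tag samples from the second file
  } | awk '$1 == "A" { a[$2] = $3; next }
           ($2 in a) && a[$2] != $3 { print $2, a[$2], $3 }'
}
```

Running `diff_samples` on the two lines from the xzdiff output would report, among others, timestamp `1733090520` with `1.165` on one side and `10.579` on the other. Timestamps where every run agrees, such as `1733090700` (`1.260` in both), are omitted.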
