[prometheus] "Error on ingesting out-of-order exemplars" message in logs #1795

Open
puckpuck opened this issue Nov 26, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@puckpuck
Contributor

puckpuck commented Nov 26, 2024

Bug Report

The following error continues to show up in the Prometheus logs:

prometheus  | ts=2024-11-26T04:44:06.414Z caller=write_handler.go:279 level=warn component=web msg="Error on ingesting out-of-order exemplars" num_dropped=44

This error started happening after upgrading flagd to version 0.11.4 in this PR

@puckpuck puckpuck added the bug Something isn't working label Nov 26, 2024
@puckpuck
Contributor Author

puckpuck commented Nov 26, 2024

@beeme1mr This issue in particular did not exist with prior versions of flagd.

I made an incremental change to 0.11.3 and saw the error stop, so I suspect this is specific to the 0.11.3 -> 0.11.4 upgrade. The flagd release notes don't make it obvious why this error would occur, aside from a Prometheus client upgrade.

@beeme1mr
Contributor

beeme1mr commented Nov 26, 2024

We didn't make any telemetry-related changes in the last release but I'll look into it.

Are you sure this issue is caused by flagd? It's not clear from the attached logs.

Sorry, reread your comment.

@beeme1mr
Contributor

I can't get the demo app running locally due to an unrelated error.

 ⠋ Container otel-col  Creating  0.1s
Error response from daemon: invalid mount config: must use either propagation mode "rslave" or "rshared" when mount source is within the daemon root, daemon root: "/var/lib/docker", bind mount source: "/", propagation: "rprivate"
make: *** [Makefile:138: start] Error 1

Another user reported this already, but the fix has been reverted.

@beeme1mr
Contributor

Possibly related.

prometheus/prometheus#13933

@julianocosta89
Member

> I can't get the demo app running locally due to an unrelated error.
>
>  ⠋ Container otel-col  Creating  0.1s
> Error response from daemon: invalid mount config: must use either propagation mode "rslave" or "rshared" when mount source is within the daemon root, daemon root: "/var/lib/docker", bind mount source: "/", propagation: "rprivate"
> make: *** [Makefile:138: start] Error 1
>
> Another user reported this already, but the fix has been reverted.

@beeme1mr the latest release doesn't have that anymore.
Have you pulled the latest?

@beeme1mr
Contributor

Yeah, I'm trying to run the latest version of the demo in Ubuntu on WSL.

@julianocosta89
Member

Ah, I see. Re-reading your message, it actually makes sense.
We removed the rslave param because it is no longer required in the latest Docker version.
It seemed to be an issue that happened in one single version.

Multiple users reported that they had trouble running with the rslave param.

Could you check if updating Docker solves it for you?
If not, maybe you could edit the docker-compose.yaml file locally.

But ideally the demo would run in all setups.

@beeme1mr
Contributor

I'm running the latest version available through apt-get. However, it's not the latest version according to the Docker release notes. I'll check again tomorrow.

@dyladan
Member

dyladan commented Dec 18, 2024

Looks like flagd 0.11.4 contained an update of flagd/core to 0.10.3, which itself bumped the opentelemetry-go monorepo from 1.28.0 to 1.30.0. This definitely seems like a possible cause of the issue. Not sure what all is in that change, but it doesn't look like it is flagd's fault, since they're just using basic APIs and not doing anything overly fancy. Indeed, they're not doing anything specific to exemplars at all.

@open-telemetry/go-maintainers is there any chance there is a known issue which may have caused this?

@dashpole

Exemplars were enabled by default in 1.31.0, so that wouldn't have changed between 1.28.0 and 1.30.0. But if it was 1.31, that would possibly explain it.

From https://github.com/prometheus/prometheus/blob/4a6f8704efcabfe9ee0f74eab58d4c11579547be/tsdb/exemplar.go#L257:

> Since during the scrape the exemplars are sorted first by timestamp, then value, then labels,
> if any of these conditions are true, we know that the exemplar is either a duplicate
> of a previous one (but not the most recent one as that is checked above) or out of order.

So it sounds like this could be an out-of-order or a duplicate issue.

Are we sending OTLP to Prometheus? Or are we exporting Prometheus or PRW from the collector?
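To make the ordering rule in that comment concrete, here is a minimal Go sketch of a check of that shape. It is not the Prometheus implementation: the `Exemplar` struct, the `compare`, `classify`, and `sortExemplars` helpers, and the flattened `Labels` string are all hypothetical simplifications chosen for illustration. It only shows how comparing by timestamp, then value, then labels separates a duplicate from an out-of-order exemplar.

```go
package main

import "sort"

// Exemplar is a simplified, hypothetical stand-in for a stored exemplar.
// Labels is a flattened label set; a real store would use structured labels.
type Exemplar struct {
	TimestampMs int64
	Value       float64
	Labels      string
}

// compare orders exemplars by timestamp, then value, then labels.
// It returns -1 if a sorts before b, 1 if after, and 0 if they are equal.
func compare(a, b Exemplar) int {
	switch {
	case a.TimestampMs != b.TimestampMs:
		if a.TimestampMs < b.TimestampMs {
			return -1
		}
		return 1
	case a.Value != b.Value:
		if a.Value < b.Value {
			return -1
		}
		return 1
	case a.Labels != b.Labels:
		if a.Labels < b.Labels {
			return -1
		}
		return 1
	}
	return 0
}

// Verdict is the outcome of checking one incoming exemplar.
type Verdict string

const (
	Accept     Verdict = "accept"
	Duplicate  Verdict = "duplicate"
	OutOfOrder Verdict = "out-of-order"
)

// classify compares an incoming exemplar with the newest one already
// stored for the same series (assumption: one newest entry per series).
func classify(newest, incoming Exemplar) Verdict {
	c := compare(incoming, newest)
	switch {
	case c == 0:
		return Duplicate
	case c < 0:
		return OutOfOrder
	default:
		return Accept
	}
}

// sortExemplars puts a batch into the order the check above expects.
func sortExemplars(es []Exemplar) {
	sort.Slice(es, func(i, j int) bool { return compare(es[i], es[j]) < 0 })
}
```

Under these assumptions, a sender that delivers exemplars for one series in arrival order rather than in this sorted order would see the later, older entries classified as out of order.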

@puckpuck
Contributor Author

> Are we sending OTLP to Prometheus? Or are we exporting prometheus or PRW from the collector?

We are sending OTLP to Prometheus.

@dashpole

Got it. So it is probably an issue with the implementation of exemplar translation in the OTLP receiver of the Prometheus server. The exemplar validation code probably makes assumptions about exemplars that don't hold for OTel exemplars.

@dashpole

@puckpuck if you can add details of our setup to prometheus/prometheus#13933, that would be helpful. Some hypotheses to check:

  • Is the issue triggered by counters with multiple exemplars (possibly with out-of-order timestamps)?
  • Is the issue triggered by histograms with exemplars that don't align with histogram bucket boundaries?
  • Is the issue triggered by exponential histograms with exemplars that aren't in timestamp order?
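One way to probe the timestamp-ordering hypotheses is to dump the exemplar timestamps attached to a single data point and check whether they arrive in order. The Go sketch below is a hypothetical diagnostic, not code from Prometheus, the collector, or flagd. The function names are made up for illustration, and it assumes an ingester that rejects any exemplar older than the newest one it has already seen, with equal timestamps treated as in order.

```go
package main

import "sort"

// outOfOrderIndexes returns the positions whose timestamp is older than
// the newest timestamp seen earlier in the slice. Under the assumption
// stated above, these are the entries an ingester would drop.
func outOfOrderIndexes(timestampsMs []int64) []int {
	var bad []int
	var newest int64
	for i, ts := range timestampsMs {
		if i > 0 && ts < newest {
			bad = append(bad, i)
			continue
		}
		newest = ts
	}
	return bad
}

// sortedCopy returns the timestamps in ascending order and leaves the
// input untouched, mimicking a sender that sorts before exporting.
func sortedCopy(timestampsMs []int64) []int64 {
	out := make([]int64, len(timestampsMs))
	copy(out, timestampsMs)
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}
```

If `outOfOrderIndexes` reports entries for the raw exemplar list but none for `sortedCopy` of the same list, that would point toward ordering on the sending side rather than duplicates.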

5 participants