Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flow: use Prometheus directly instead of abstraction #2037

Closed
rfratto opened this issue Aug 14, 2022 · 3 comments · Fixed by #2431
Closed

Flow: use Prometheus directly instead of abstraction #2037

rfratto opened this issue Aug 14, 2022 · 3 comments · Fixed by #2431
Labels
frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. proposal Proposal or RFC
Milestone

Comments

@rfratto
Copy link
Member

rfratto commented Aug 14, 2022

Our Prometheus components (prometheus.scrape, prometheus.mutate, prometheus.remote_write) should behave exactly as Prometheus does.

Currently, Flow uses an abstraction for its metrics pipeline:

This abstraction is insufficient in its current state:

  • It does yet not support writing exemplars
  • It will not immediately support the upcoming work to write metadata to the WAL
  • It will not immediately support the upcoming sparse histogram work

It also has behavioral differences:

The list of behavioral differences may or may not be complete: abstractions introduce uncertainty, where it is harder to know if there is any other documented or undocumented behaviors of the underlying code which the abstraction does not satisfy.

For all of the issues above to be resolved, the abstraction would have to be changed to become increasingly Prometheus-like. Instead of maintaining an abstraction over time to match the behavior of a moving target, it would be easier to remove the abstraction and use Prometheus code directly.

As we've seen with operators and Helm charts, good abstractions are hard to make, and should only be attempted after accumulating a ton of experience with all the things it is abstracting over. I would recommend we write components to use their native APIs instead of trying to introduce abstractions. This means that Prometheus components should use Prometheus code, and OpenTelemetry Collector components should use OpenTelemetry Collector code.

NOTE: I don't mean to say that we can't wrap around Prometheus code to reduce boilerplate. Rather, we should not create an intermediate representation for metrics through a pipeline which has to be translated to/from native APIs.

If components use their native APIs, then a shim is needed to move between APIs. This has similar problems to an abstraction, but it reduces the surface area of where bugs can be introduced, allowing the native code path to always work fully as intended. How these shims would be exposed to users is out of scope of this proposal.

@mattdurham
Copy link
Collaborator

I am more concerned with the behavior of the datatypes than the actual implementation. I agree that I haven't pointed out explicitly where the current behavior deviates from the prometheus scraping path. I believe as we do more components and functionality, the pipeline behavior will diverge. This divergence should be explicit instead of implicit though.

For instance, reusing the fanout appender is likely not a good appender to reuse. It has the concept of returning early and primary vs secondary. We would likely need to create our own that would run all the children and then return errors. I also believe the calling of children should in general be asynchronous deeply nested, or wide fan outs have a significant performance penalty.

I do think there is value in having abstractions to prevent mistakes and if we need to convert between formats. But this can be solved in other ways.

@mattdurham
Copy link
Collaborator

Discussion around this topic leads towards removing the FlowMetric concept and replacing it with stock prometheus, but transforming the metrics.receiver into an appendable/appender interface. Noting the non-obvious change of returning an error to the scrape components append/commit/rollback only if all descendants return an error.

The reason for this is if you have two remote_writes and one is succeeding, and one is failing, you want the metrics still flowing to the good remote_write. The current scrape would break the scrape loop on the first error.

@mattdurham
Copy link
Collaborator

I started implementing this today, and I encountered one of the reasons why we went with the payload approach. Each component gets more complex because it has to handle multiple concurrent appenders. One alternative is components are a shim over the structs that are appenders themselves. Ie mutate.metrics's only job is to spawn a sub-mutate appenders and each sub-mutate does the actual work.

Each component also needs to implement storage.appendable interface also. I will continue the change and once done we can evaluate the differences.

@rfratto rfratto added area/signals and removed flow labels Oct 3, 2022
@rfratto rfratto added this to the v0.30.0 milestone Oct 4, 2022
@rfratto rfratto modified the milestones: v0.30.0, v0.29.0 Nov 3, 2022
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 22, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 22, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. proposal Proposal or RFC
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

2 participants