-
Notifications
You must be signed in to change notification settings - Fork 487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flow: use Prometheus directly instead of abstraction #2037
Comments
I am more concerned with the behavior of the datatypes than the actual implementation. I agree that I haven't pointed out explicitly where the current behavior deviates from the prometheus scraping path. I believe as we do more components and functionality, the pipeline behavior will diverge. This divergence should be explicit instead of implicit though. For instance, reusing the fanout appender is likely not a good appender to reuse. It has the concept of returning early and primary vs secondary. We would likely need to create our own that would run all the children and then return errors. I also believe the calling of children should in general be asynchronous deeply nested, or wide fan outs have a significant performance penalty. I do think there is value in having abstractions to prevent mistakes and if we need to convert between formats. But this can be solved in other ways. |
Discussion around this topic leads towards removing the FlowMetric concept and replacing it with stock prometheus, but transforming the metrics.receiver into an appendable/appender interface. Noting the non-obvious change of returning an error to the scrape components append/commit/rollback only if all descendants return an error. The reason for this is if you have two remote_writes and one is succeeding, and one is failing, you want the metrics still flowing to the good remote_write. The current scrape would break the scrape loop on the first error. |
I started implementing this today, and I encountered one of the reasons why we went with the payload approach. Each component gets more complex because it has to handle multiple concurrent appenders. One alternative is components are a shim over the structs that are appenders themselves. Ie mutate.metrics's only job is to spawn a sub-mutate appenders and each sub-mutate does the actual work. Each component also needs to implement storage.appendable interface also. I will continue the change and once done we can evaluate the differences. |
Our Prometheus components (
prometheus.scrape
,prometheus.mutate
,prometheus.remote_write
) should behave exactly as Prometheus does.Currently, Flow uses an abstraction for its metrics pipeline:
prometheus.Receiver
receives samples at a given timestampprometheus.FlowMetric
represents an individual sampleThis abstraction is insufficient in its current state:
It also has behavioral differences:
up != 1
alerts to no longer fire.The list of behavioral differences may or may not be complete: abstractions introduce uncertainty, where it is harder to know if there is any other documented or undocumented behaviors of the underlying code which the abstraction does not satisfy.
For all of the issues above to be resolved, the abstraction would have to be changed to become increasingly Prometheus-like. Instead of maintaining an abstraction over time to match the behavior of a moving target, it would be easier to remove the abstraction and use Prometheus code directly.
As we've seen with operators and Helm charts, good abstractions are hard to make, and should only be attempted after accumulating a ton of experience with all the things it is abstracting over. I would recommend we write components to use their native APIs instead of trying to introduce abstractions. This means that Prometheus components should use Prometheus code, and OpenTelemetry Collector components should use OpenTelemetry Collector code.
If components use their native APIs, then a shim is needed to move between APIs. This has similar problems to an abstraction, but it reduces the surface area of where bugs can be introduced, allowing the native code path to always work fully as intended. How these shims would be exposed to users is out of scope of this proposal.
The text was updated successfully, but these errors were encountered: