Support external queue system for exporter via extensions #31682
Comments
Pinging code owners for extension/storage: @dmitryax @atoulme @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.
So to be clear, the ask is support for "backends" for the sending queue/buffer, for all exporters? Overall I agree with the motivation. I've seen use cases in the past where a network outage/partition has occurred and one or more exporters overflowed as a result. That data could be useful for post-incident analysis or backfill.

That said, we'd likely want to add the ability to lower-bound the staleness (oldest watermark/epoch). Some vendor backends, whether OTLP "compliant" or via bespoke exporters, don't allow ingestion of telemetry with an observed (metric) timestamp earlier than a (vendor-specific) relative epoch. While we could theoretically grow an alternative "backend" for the exporter queue without bound, in practice that would require customers to either manually configure their distributed queue for autoscaling, or for us to hook into OpAMP.

I'm also concerned with correlated failures: if there's a network issue exporting to FooVendor, there's a chance that network issue may also extend to the distributed queue backing the exporter (even if on the same node).

We could consider implementing a "dead letter queue" configuration instead. Given some configuration, we could even re-use existing exporters and route metrics matching the DLQ configuration to them. A (poorly thought) example follows:
The disadvantage of a DLQ is that you don't introduce the durability you'd get from your "sending_queue" example prior to the exporter. On the other hand, a non-local sending queue is (imo) conceptually similar to an exporter to begin with. Taking kafka as a sending_queue backend as an example, you could break up your pipeline into two pipelines, the first exporting to kafka, the second consuming from the same kafka stream and exporting to otlp. Then again, the question of backpressure becomes a bit murky in such a scenario, as (off the top of my head) I don't believe the collector durably dequeues reads from a receiver if and only if an exporter has accepted the data... Then again, this would also be an implementation concern for a backing "sending_queue". Some sort of n-phase commit would need to be spec'd out for this use case. Regardless of the implementation, we should (continue to) come up with a list of design considerations. I'd love to have a working session or collaborate on a google doc etc with you to flesh this out a bit before bringing it to a collector SIG.
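A rough sketch of the two-pipeline split described above, assuming the contrib `kafka` exporter and `kafka` receiver. The broker address, topic name, consumer group, and pipeline names are placeholders for illustration, not values from this thread:

```yaml
# Pipeline "logs/buffer": accept logs over OTLP and park them in Kafka.
# Pipeline "logs/forward": drain the same topic and forward to the upstream OTLP endpoint.
receivers:
  otlp:
    protocols:
      grpc:
  kafka:
    brokers: ["localhost:9092"]   # placeholder broker
    topic: otel-logs-buffer       # placeholder topic
    group_id: otel-forwarder      # placeholder consumer group
exporters:
  kafka:
    brokers: ["localhost:9092"]   # same placeholder broker
    topic: otel-logs-buffer       # same placeholder topic
  otlp:
    endpoint: "localhost:4217"
    tls:
      insecure: true
service:
  pipelines:
    logs/buffer:
      receivers: [otlp]
      exporters: [kafka]
    logs/forward:
      receivers: [kafka]
      exporters: [otlp]
```

This only shows the wiring. It does not settle the backpressure question raised above, namely whether a read from Kafka is acknowledged only once the OTLP exporter has accepted the data.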
I may be mistaken but isn't the sending queue designed to work with any storage extension? If so, all that is needed is another storage extension. Then, rather than configure redis as part of the exporter, you configure it as an extension and reference it in the exporter. Adapting the example:

extensions:
  redisstorage:
    address: "localhost:6379"
    backlog_check_interval: 30
    process_key_expiration: 120
    scan_key_size: 10
exporters:
  logging:
  otlp:
    endpoint: "localhost:4217"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 5s
      max_elapsed_time: 20s
    sending_queue:
      enabled: true
      storage: redisstorage # refers to component name of extension configured above
      requeue_enabled: true
      queue_backend: redis
      num_consumers: 10
      queue_size: 10000
    tls:
      insecure: true
So @pepperkick, could the ask "implement a redis storage extension for sending queues" satisfy your needs?
Yes, implementation via extension is the approach I would like to take, as that will help with creating additional backends later down the line. For Redis I have created the following PR which is implemented via extension. Currently I am evaluating using Pulsar for this due to the recent license situation of Redis.
I will sponsor this new component.
Component(s)
No response
Is your feature request related to a problem? Please describe.
No
Describe the solution you'd like
I want the ability to use Redis as the exporter queue so that the exporter can ride out longer upstream outages. The two current exporter queue options do not fulfill this requirement or have issues with longer outages.
This use case came up because I needed near 0% data loss on logs over long periods of time, and the upstream is known to be down for many minutes at a time. The in-memory queue was ruled out because it would lose data on pod restarts. The persistent queue option was promising, but it starts refusing new data once the queue is full, and based on the code it is difficult to disable that refusal, so new data is lost whenever the queue is full.
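For reference, the persistent queue option mentioned above is typically wired up roughly as follows, assuming the contrib `file_storage` extension. The directory path is a placeholder, and `queue_size` is the limit past which new data is refused:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # placeholder path
exporters:
  otlp:
    endpoint: "localhost:4217"
    sending_queue:
      enabled: true
      storage: file_storage   # component ID of the extension above
      queue_size: 10000       # once this many batches are queued, new data is refused
service:
  extensions: [file_storage]
```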
Since I had access to a Redis cluster, I decided to modify the exporter to support Redis as the queue. I was able to test this and get stable data flow, but I have not yet been able to do thorough, long-running testing.
Can this feature be implemented via extensions so that other queue systems could be added in the future as well? I believe the existing queue systems would need to be moved to extensions in that case.
Describe alternatives you've considered
Alternatives considered
Additional context
Config example
While queue_size is set to 10000, it is effectively unlimited because Redis can scale up based on usage.