Support external queue system for exporter via extensions #31682

Closed
pepperkick opened this issue Mar 11, 2024 · 6 comments · Fixed by #33224
Labels
Accepted Component (New component has been sponsored), enhancement (New feature or request), extension/storage

Comments

@pepperkick
Contributor

pepperkick commented Mar 11, 2024

Component(s)

No response

Is your feature request related to a problem? Please describe.

No

Describe the solution you'd like

I want the ability to use Redis as the exporter queue so the collector can ride out longer upstream outages. The two existing exporter queue options either do not fulfill the requirements or have issues with longer outages.

This use case came up because I need near-zero data loss on logs over long periods, and the upstream is known to be down for many minutes at a time. The in-memory queue was ruled out because it loses data on pod restarts. The persistent queue option was promising, but it starts refusing new data once the queue is full, and based on the code it is difficult to disable that refusal, so new data is lost once the queue fills up.

Since I had access to a Redis cluster, I modified the exporter to support Redis as the queue. I was able to test this and get a stable data flow, although I have not been able to do thorough, long-running testing yet.

Can this feature be implemented via extensions so that other queue systems can be added in the future as well? I believe the existing queue systems would need to be moved to extensions in that case.

Describe alternatives you've considered

Alternatives considered

  • In-memory Queue: As mentioned, data is lost if the pod restarts due to a crash.
  • Persistent Queue: Starts refusing new data when the queue is full (a configuration sketch follows this list).
  • Export to Pulsar, receive from Pulsar: This does work, but the second collector still needs one of the above queues when exporting, ending up in the same situation.
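
For reference, the persistent queue option mentioned above works by pointing the sending queue at a storage extension. A minimal sketch using the file_storage extension (paths and exact field names are illustrative, not copied from my setup):

extensions:
  file_storage:
    directory: /var/lib/otelcol/storage  # queued data survives pod restarts

exporters:
  otlp:
    endpoint: "localhost:4217"
    sending_queue:
      enabled: true
      storage: file_storage  # refers to the extension above
      queue_size: 10000      # once full, new data is refused

service:
  extensions: [file_storage]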

Additional context

Config example

exporters:
  logging:
  otlp:
    endpoint: "localhost:4217"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 5s
      max_elapsed_time: 20s
    sending_queue:
      enabled: true
      requeue_enabled: true
      queue_backend: redis
      num_consumers: 10
      queue_size: 10000
      redis:
        address: "localhost:6379"
        backlog_check_interval: 30
        process_key_expiration: 120
        scan_key_size: 10
    tls:
      insecure: true

Although queue_size is set to 10000 here, the queue is effectively unlimited because Redis can scale up based on usage.

@pepperkick added the enhancement (New feature or request) and needs triage (New item requiring triage) labels on Mar 11, 2024
@atoulme added the extension/storage label and removed the needs triage label on Mar 12, 2024

Pinging code owners for extension/storage: @dmitryax @atoulme @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@hughesjj
Contributor

So to be clear, the ask is support for "backends" for the sending queue/buffer, for all exporters?

Overall I agree with the motivation. I've seen use cases in the past where a network outage/partition has occurred and one or more exporters overflowed as a result. That data could be useful for post-incident analysis or backfill.

That said, we'd likely want the ability to put a lower bound on staleness (an oldest watermark/epoch). Some vendor backends, whether OTLP "compliant" or via bespoke exporters, don't allow ingestion of telemetry with an observed (metric) timestamp earlier than a (vendor-specific) relative epoch.

While we could theoretically grow an alternative "backend" for the exporter queue without bound, in practice that would require customers either to manually configure their distributed queue for autoscaling, or for us to hook into OpAMP. I'm also concerned about correlated failures -- if there's a network issue exporting to FooVendor, there's a chance that network issue also extends to the distributed queue backing the exporter (even if it's on the same node).

We could consider implementing a "dead letter queue" (DLQ) configuration instead. Given some configuration, we could even re-use existing exporters and route metrics matching the DLQ configuration to them. A rough (not fully thought-out) example follows:

exporters:
   otlp/experiencing_networkoutage:
      # Happy path, write data to this exporter normally
      dlq:
         max_latency: 15m # Anything older than 15m goes to the DLQ
         min_latency: -5m # you can even reject data with timestamps in the future
         max_buffer_size: 100000 # once full, start dropping newer data
         FIFO: false # ordering is an important design consideration regardless of impl
         exporters: # if empty then just drop
          - otlp/dlq1
   otlp/dlq1:
      # try writing to redis or kafka or something else if the main sink is "out of order"
      dlq:
         max_latency: 1h # Anything older than 1h goes to the DLQ
         FIFO: false # ordering is an important design consideration regardless of impl
         exporters:
           - fileexporter/dlq2
   fileexporter/dlq2:
      # If all else fails, try writing to a local file

The disadvantage of a DLQ is that you don't get the durability your "sending_queue" example introduces ahead of the exporter. On the other hand, a non-local sending queue is (imo) conceptually similar to an exporter to begin with. Taking Kafka as a sending_queue backend as an example, you could break your pipeline into two pipelines: the first exporting to Kafka, the second consuming from the same Kafka stream and exporting via OTLP. Then again, the question of backpressure becomes a bit murky in such a scenario, as (off the top of my head) I don't believe the collector durably dequeues data from a receiver only once an exporter has accepted it... Then again, this would also be an implementation concern for a backing "sending_queue". Some sort of n-phase commit would need to be spec'd out for this use case.
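
To illustrate the two-pipeline split described above, the setup would look roughly like the following (a sketch only; the kafka exporter/receiver settings and the topic name are illustrative, not a verified configuration):

# Collector 1: receive OTLP, buffer into Kafka
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  kafka:
    brokers: ["localhost:9092"]
    topic: otlp_logs   # hypothetical topic name
service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [kafka]

# Collector 2: drain Kafka and export upstream
receivers:
  kafka:
    brokers: ["localhost:9092"]
    topic: otlp_logs
exporters:
  otlp:
    endpoint: "localhost:4217"
service:
  pipelines:
    logs:
      receivers: [kafka]
      exporters: [otlp]

Durability then lives in Kafka rather than in either collector, but the backpressure and acknowledgement questions above still apply to the second collector.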

Regardless of the implementation, we should (continue to) come up with a list of design considerations. I'd love to have a working session or collaborate on a google doc etc with you to flesh this out a bit before bringing it to a collector SIG.

@djaglowski
Member

I may be mistaken but isn't the sending queue designed to work with any storage extension? If so, all that is needed is another storage extension. Then, rather than configure redis as part of the exporter, you configure it as an extension and reference it in the exporter. Adapting the example:

extensions:
  redisstorage:
    address: "localhost:6379"
    backlog_check_interval: 30
    process_key_expiration: 120
    scan_key_size: 10

exporters:
  logging:
  otlp:
    endpoint: "localhost:4217"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 5s
      max_elapsed_time: 20s
    sending_queue:
      enabled: true
      storage: redisstorage # refers to component name of extension configured above
      requeue_enabled: true
      queue_backend: redis
      num_consumers: 10
      queue_size: 10000
    tls:
      insecure: true
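
For what it's worth, a Redis-backed storage extension would mainly need to map the storage client's get/set/delete operations onto Redis commands. A minimal sketch of that key/value piece, assuming the go-redis client and method names mirroring the collector's experimental storage client (both may differ in detail from the real interfaces):

// Hypothetical sketch of a Redis-backed key/value store for the sending queue.
// Method names mirror the collector's experimental storage client
// (Get/Set/Delete); exact interfaces and wiring are omitted.
package redisstorage

import (
	"context"

	"github.com/redis/go-redis/v9"
)

type redisClient struct {
	rdb    *redis.Client
	prefix string // namespaces keys per component, e.g. "otlp/sending_queue/"
}

func newRedisClient(addr, prefix string) *redisClient {
	return &redisClient{
		rdb:    redis.NewClient(&redis.Options{Addr: addr}),
		prefix: prefix,
	}
}

// Get returns nil (not an error) when the key does not exist,
// which is the behavior a persistent queue typically expects.
func (c *redisClient) Get(ctx context.Context, key string) ([]byte, error) {
	val, err := c.rdb.Get(ctx, c.prefix+key).Bytes()
	if err == redis.Nil {
		return nil, nil
	}
	return val, err
}

func (c *redisClient) Set(ctx context.Context, key string, value []byte) error {
	return c.rdb.Set(ctx, c.prefix+key, value, 0).Err() // 0 = no expiration
}

func (c *redisClient) Delete(ctx context.Context, key string) error {
	return c.rdb.Del(ctx, c.prefix+key).Err()
}

func (c *redisClient) Close(ctx context.Context) error {
	return c.rdb.Close()
}

A real implementation would also need batching support and the extension plumbing that hands out a client per requesting component; those parts are left out of this sketch.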

@hughesjj
Contributor

hughesjj commented Apr 1, 2024

@djaglowski I believe so, yes

So @pepperkick could the ask "implement a redis storage extension for sending queues" satisfy your needs?

@pepperkick
Contributor Author

Yes, implementation via extension is the approach I would like to take, as that will make it easier to add other backends down the line.

For Redis I have created the following PR, which is implemented via an extension:
#31731

Currently I am evaluating Pulsar for this instead, due to the recent Redis licensing change.

@atoulme
Contributor

atoulme commented May 24, 2024

I will sponsor this new component.
