Distributing Load behind ALB/NLBs #33453
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
I'm afraid this is more about Prometheus than load balancing itself, but why are you getting those out-of-order errors only when receiving data from multiple clusters? Are you having the same metric stream (metric name + attributes) coming from different clusters? Are you dropping the service.instance.id along the way?
The data coming from each cluster has unique stream labels attached - so one cluster can not really impact another cluster in terms of out-of-order samples. I believe the out-of-order issue has to do with timing .. For a single given stream, imagine this scenario:
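For example, say two collector pods behind the load balancer both end up holding samples for that one stream (the timestamps and pod names here are made up, just to show the shape of the failure):

- t=0s: a sample with timestamp 0s arrives at collector pod A
- t=1s: a sample with timestamp 1s for the same stream arrives at collector pod B
- t=2s: pod B flushes its batch first, so the backend ingests the 1s sample
- t=3s: pod A flushes its batch, and the backend now sees the 0s sample after the 1s one and rejects it as out-of-order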
This is a contrived example ... but the point I am trying to demonstrate is that the timing of when the samples are flushed out to Prometheus matters. If two different OTEL collector pods end up with samples for the same stream, but they flush out of order, then you create the out-of-order error situation. We tried using IP-based session stickiness ... but that didn't really work at all, and is problematic for a lot of reasons. Session stickiness based on some cookie would be useful, if the OTEL client supported it.
No - definitely unique streams from different clusters
No, we are not.
I got it. Thank you for the example; I understand the scenario better now. I'm even more convinced that the load balancing exporter is doing the right thing and that this should be handled on the Prometheus side of this story. In fact, this scenario should work on any Prometheus-compatible backend that supports out-of-order samples, like Mimir: they'd hold data in memory for a period of time and order it based on the timestamp before writing to disk. I think it shouldn't be the responsibility of the load balancer to track the timestamps for each data point in case the backend receiving the data can't ingest out-of-order samples.
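For reference, recent vanilla Prometheus versions (2.39+) make out-of-order ingestion an opt-in TSDB setting, and Mimir has an equivalent per-tenant out_of_order_time_window limit; whether AMP exposes anything similar is a separate question. A minimal sketch for vanilla Prometheus:

```yaml
# prometheus.yml - accept samples up to 30 minutes behind the newest ingested sample
storage:
  tsdb:
    out_of_order_time_window: 30m
```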
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners, add a component label; see Adding Labels via Comments if you do not have permissions to add labels yourself.
Sorry for just getting to this now. I agree with @jpkrohling's comments above. I'm not sure there is anything we can do in the exporter to make this better.
Component(s)
exporter/loadbalancing, exporter/prometheusremotewrite
Describe the issue you're reporting
TLDR: How do people accept ordered-writes of OTLP DataPoints behind L4/L7 Load Balancers?
We're trying to collect our metric data in both a global and local environment - the local environment can write directly to Prometheus and thus leverage the `prometheusremotewritereceiver` with the `loadbalancing/oltpexporter` exporter and a `routing_key` setting to send data in an ordered fashion.

However .. how do we also collect data globally when we have to run the `otel-collector` behind a third-party L4 or L7 load balancer, and maintain ordered writes?

Full Situation
Today we pass ~200+k/sec `DataPoint` records from our `otel-collector-agent` (DaemonSet) pods by pushing them to `otel-metrics-processor` (StatefulSet) pods in each of our clusters. The `metrics-processor` pods perform filtering and then use the `prometheusremotewriteexporter` to write the data to an AWS Managed Prometheus (AMP) endpoint "in cluster" (tied to the cluster), as well as a global secondary AMP endpoint that is in another AWS account and region.

Our `otel-collector-agents` flow into the `metrics-processors`; once the data hits the `metrics-processors`, the data is duplicated to the in-cluster and remote-cluster AMP endpoints, roughly along the lines of the sketch below.

This system works as is right now... see below though for the change we want to make and why it's not working.
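To make the shape of that concrete, a rough sketch of the `metrics-processor` config (the endpoints, workspace IDs and filter rule here are illustrative placeholders, not our actual values):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}
  filter/noise:                      # stand-in for our real filtering rules
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ^some_noisy_metric_.*$

exporters:
  prometheusremotewrite/in-cluster:  # AMP workspace tied to this cluster
    endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-LOCAL/api/v1/remote_write
    # SigV4 auth against AMP omitted for brevity
  prometheusremotewrite/global:      # secondary AMP workspace in another account/region
    endpoint: https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-GLOBAL/api/v1/remote_write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [filter/noise, batch]
      exporters: [prometheusremotewrite/in-cluster, prometheusremotewrite/global]
```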
Working:
otel-collector-agents -> Loki (native lokiexporter)
Even at high volume (~500k `LogRecords/s`), we can send the Loki data just fine because Loki accepts out of order writes ... so going through an NLB or an ALB to get to the Loki service works just fine. (We are currently using the `lokiexporter` due to grafana/loki#13185, an unrelated issue with Loki.)

Working:
otel-collector-agents -> loadbalancing/oltpexporter -> metrics-processor -> prometheusremotewrite
When we use the current setup - where the `metrics-processor` pods do all of the `PUSH /api/v1/remote_write` calls - things work OK because the load balancing happens in each cluster and we can use a consistent routing key so that each `metrics-processor` is processing consistent resources (roughly the config sketched below).

The problem is that we want to move the processing to our central clusters... see below.
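As a sketch, the agent side of that looks something like this (the exporter instance name and resolver service name are placeholders):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing/oltp:
    routing_key: metric        # consistent routing: the same metric always lands on the same backend
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        # placeholder: the headless service in front of the otel-metrics-processor StatefulSet
        service: otel-metrics-processor.observability

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [loadbalancing/oltp]
```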
Not Working:
otel-metrics-processors -- OTLP --> otlpexporter (grpc) --> prometheusremotewriteexporter -> AMP
We would like to standardize on using the `otel-collector` across wide swaths of our infrastructure (not just Kubernetes) to collect metric and log data.. so naturally our first thought was that we could stop direct-writing from the `metrics-processor` pods to the remote Prometheus endpoint, and introduce a "central otel collection" endpoint. This would give us a clean place to do metric filtering, validation and even throttling/batch-tuning.

What happens when we do this, though, is that there is no real good way to distribute the load across our `central-metrics-processor` pods.. first we tried using sticky sessions, but on an NLB that can only be done via the source IP, and with global NATing that led to extreme hot-spots. We tried without sticky sessions, and due to the requirement of ordered writes, we saw a significant percentage of metrics dropped by AMP.

Is anyone doing this?
The question here is .. is anyone doing this at scale and with an ordered-write endpoint like Prometheus? Do we have to go as far as having two layers of OTLP collectors in our central cluster - a stateless distributed layer that collects the metrics and then handles re-routing of the datapoints with a specific `routing_key` to an internal `statefulset` that then writes to Prometheus?
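Concretely, that two-layer idea would look something like this (a sketch only; the layer and service names are made up):

```
            all clusters / other infra
                      |
                 NLB or ALB
                      |
                      v
  front layer: stateless central collectors (Deployment)
    otlp receiver -> loadbalancing exporter (routing_key,
                     k8s resolver over the back layer's headless service)
                      |
                      v
  back layer: central-metrics-processor (StatefulSet)
    otlp receiver -> filter/batch -> prometheusremotewrite -> AMP
```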
Even with stickiness disabled we still had hotspots using gRPC.. I couldn't find a good way to tell the `otel-collector` to create XX number of outbound gRPC connection pools; instead it seems like it creates one connection and just streams the data.. so that still led to hot spots, AND out-of-order writes.

What are we missing?