Distribution of traces/spans among collectors #1678

Closed
prana24 opened this issue Jul 23, 2019 · 47 comments · Fixed by #3422

@prana24

prana24 commented Jul 23, 2019

Requirement - what kind of business use case are you trying to solve?

Are collectors load balanced?

Problem - what in Jaeger blocks you from solving the requirement?

We have a Jaeger setup with Elasticsearch as the storage backend. We currently run two collector replicas, and 5-10 services send traces to them (the number of services keeps changing). The collectors are not evenly loaded: one reaches its maximum queue usage while the other uses barely 20-30% of its capacity, so the collector that is at capacity drops spans.
Can we load balance the traffic (spans) across both collectors? I am not sure whether there is a config option that I am missing.
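
To make the effect concrete, here is a minimal Python sketch of why an uneven split causes drops even when the total capacity would be enough. The per-collector capacity and the span counts are made-up numbers for illustration, not measurements from this setup, and the function name dropped is hypothetical.

```python
def dropped(incoming_per_collector, capacity):
    """Spans dropped per collector when each one can absorb `capacity` spans per interval."""
    return [max(0, n - capacity) for n in incoming_per_collector]


# Hypothetical interval: 1550 spans in total, two collectors with room for 1000 each.
print(dropped([1300, 250], capacity=1000))  # uneven split: [300, 0]
print(dropped([775, 775], capacity=1000))   # even split:   [0, 0]
```

The same total load fits without drops once it is spread evenly, which is what the rest of this thread tries to achieve.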

Proposal - what do you suggest to solve the problem or improve the existing situation?

Any open questions to address

@jpkrohling
Contributor

If you are using the Jaeger Agent, you can configure it to use gRPC instead of Thrift (--reporter.type=grpc). You can then either pass a static list of collectors, or use gRPC's notation for discovering the servers (--reporter.grpc.host-port=dns:///service-name:14250).

If your tracers are connecting directly to the collector, only TChannel is supported at the moment, and it's not possible to load balance individual requests.

@prana24
Author

prana24 commented Jul 23, 2019

We are using the agent; I will check the configuration. We manage the Jaeger setup, and several different teams are bombarding it with traces and spans. Is there anything we can do at the collector level?

@yurishkuro
Member

@prana24 at Uber we recommend that all teams use an internal wrapper for the Jaeger client libraries, which makes sure that production services always use the remote sampler that pulls sampling strategies from the backend. This way you can create a configuration for the collectors that controls how much each service should sample.

If you have no control over the clients, the brute-force solution is to implement downsampling in the collector (which we do at Uber, but at this point as more of a safety measure). Downsampling is consistently based on trace ID hash, so you don't get partial traces, but downsampling affects all users equally, not just the offending service.
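
As an illustration of hash-based downsampling, the sketch below keeps or drops a whole trace based on a hash of its trace ID, so every span of a trace gets the same decision. It is not Jaeger's implementation: the function name keep_trace, the use of SHA-256, and the sample IDs are assumptions made for the example.

```python
import hashlib


def keep_trace(trace_id: str, ratio: float) -> bool:
    """Decide from the trace ID alone whether a trace is kept (ratio in 0.0..1.0)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio


# Hypothetical (trace ID, span ID) pairs arriving at a collector.
spans = [("trace-a", "span-1"), ("trace-a", "span-2"), ("trace-b", "span-1")]
kept = [span for span in spans if keep_trace(span[0], ratio=0.5)]
print(kept)
```

Because the decision depends only on the trace ID, two collectors that see different spans of the same trace reach the same verdict, which is why no partial traces appear. The ratio applies to every service alike, matching the caveat above.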

Another approach is throttling clients doing sampling, but it's not currently implemented (#1676).

The best solution imo is tail-based sampling, which Jaeger does not support yet directly, but you can get it with OpenCensus Service.

@prana24
Author

prana24 commented Aug 2, 2019

We were using jaeger-agent 1.8.x, and I see gRPC was probably not enabled in that version. I am upgrading the agent to the latest (1.13.x). My collector is still on 1.9.x; is that version OK, or should I upgrade it as well?

@jpkrohling
Contributor

If you can, keep both the collector and the agent at the same version.

@prana24
Author

prana24 commented Aug 2, 2019

Thank you @jpkrohling, I have done that. I have a basic question about dns:///<service_name>:14250: what is <service_name>? Is it the name that kubectl get service shows for the collector service?

@jpkrohling
Contributor

It's the DNS name under which the service can be reached. In Kubernetes, this is typically service_name.namespace.svc.cluster.local, but depending on the cluster configuration, you might be able to use only the service name as the hostname, if both the client and the agent/collector are in the same namespace.
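
A small Python sketch of how such a target string could be assembled. The helper name grpc_dns_target and its defaults are hypothetical; the cluster domain cluster.local is the common default but can differ per cluster.

```python
def grpc_dns_target(service: str, namespace: str, port: int = 14250,
                    cluster_domain: str = "cluster.local") -> str:
    """Build a gRPC `dns:///` target from a Kubernetes Service name and namespace."""
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    return f"dns:///{host}:{port}"


# Service and namespace names as they appear later in this thread.
print(grpc_dns_target("jaeger-collector-dev", "sampling"))
# dns:///jaeger-collector-dev.sampling.svc.cluster.local:14250
```

The result is the value that would be passed to --reporter.grpc.host-port.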

If you are using Kubernetes, I recommend taking a look at the jaeger-operator. Even if you decide not to use it for production, you might benefit from seeing how it deploys Jaeger.

@prana24
Author

prana24 commented Aug 5, 2019

Sure, thank you. I am taking a look.

@prana24
Author

prana24 commented Aug 6, 2019

Hi,
I have made changes to my agent.yaml, but it still sends traffic to only one of the collectors. It looks like the DNS lookup is failing. I am adding my agent.yaml and agent log here for reference.

2019/08/06 11:19:50 maxprocs: Leaving GOMAXPROCS=24: CPU quota undefined
{"level":"info","ts":1565090391.0357707,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1565090391.0362382,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1565090391.0362995,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14271}
{"level":"info","ts":1565090391.0363176,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14271,"health-status":"unavailable"}
{"level":"info","ts":1565090391.0373657,"caller":"grpc/builder.go:75","msg":"Agent requested insecure grpc connection to collector(s)"}
{"level":"info","ts":1565090391.041124,"caller":"grpc/clientconn.go:242","msg":"parsed scheme: \"dns\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.067472,"caller":"agent/main.go:74","msg":"Starting agent"}
{"level":"info","ts":1565090391.0675416,"caller":"healthcheck/handler.go:129","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1565090391.0675752,"caller":"app/agent.go:68","msg":"Starting jaeger-agent HTTP server","http-port":5778}
{"level":"info","ts":1565090391.0754118,"caller":"dns/dns_resolver.go:264","msg":"grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.jaeger-collector-dev.sampling.svc.cluster.local on 192.168.0.3:53: no such host.\n","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.119492,"caller":"dns/dns_resolver.go:289","msg":"grpc: failed dns TXT record lookup due to lookup _grpc_config.jaeger-collector-dev.sampling.svc.cluster.local on 192.168.0.3:53: no such host.\n","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1195319,"caller":"grpc/resolver_conn_wrapper.go:140","msg":"ccResolverWrapper: got new service config: ","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.119683,"caller":"grpc/resolver_conn_wrapper.go:126","msg":"ccResolverWrapper: sending new addresses to cc: [{192.168.172.54:14250 0  <nil>}]","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1197627,"caller":"base/balancer.go:76","msg":"base.baseBalancer: got new resolver state: {[{192.168.172.54:14250 0  <nil>}] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1197968,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1241517,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, READY","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1241896,"caller":"roundrobin/roundrobin.go:50","msg":"roundrobinPicker: newPicker called with readySCs: map[{192.168.172.54:14250 0  <nil>}:0xc00018d560]","system":"grpc","grpc_log":true}

Here is the agent.yaml as well:

# Source: jaeger-client-mon/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: jaeger-app-1
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: jaeger-app-1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: jaeger-app-1
    spec:
      containers:
      - image: docker.artifactory.prod.adnxs.net/jaeger_client_1
        name: jaeger-app-1
        ports:
        - containerPort: 8080
      - image: docker.artifactory.prod.mycompany.net/jaegertracing/jaeger-agent:1.13.1-1-b8a6d4ea680063ab03575e864f233841cfcb45cb58a9c5ddde2e287844c1b679
        name: jaeger-agent-1
        #args: ["--collector.host-port=jaeger-collector-dev.sampling.svc:14267"]
        args: ["--reporter.grpc.host-port=dns:///jaeger-collector-dev.sampling.svc.cluster.local:14250"]
        ports:
        - containerPort: 5775
          protocol: UDP
        - containerPort: 6831
          protocol: UDP
        - containerPort: 6832
          protocol: UDP
        - containerPort: 5778
          protocol: TCP

Any idea what is wrong here?

@jpkrohling
Contributor

Nothing seems wrong there: gRPC tried to load some extra configuration via DNS but couldn't find anything "extra". As you can see in the following log entries, the connection with the collector was established and is ready:

{"level":"info","ts":1565090391.1197968,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1241517,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, READY","system":"grpc","grpc_log":true}

So, looks like it's working ;-)

@prana24
Author

prana24 commented Aug 6, 2019

It is working, but I was expecting the agent to send traces/spans to both collectors; currently it sends to only one. In other words, it is not load balanced. Am I missing something here?

@jpkrohling
Contributor

You might not see round-robin load balancing, as gRPC will reuse the same pipe for multiple requests, but one easy way to check that it's working as expected is by killing one of the collectors. If the agent switches over to the remaining collector, the load balancing is working.

@prana24
Author

prana24 commented Aug 6, 2019

Oops! That is failover, right? That is not load balancing. I want to avoid a situation like this one: I have added Grafana images here, where collector1 reaches its maximum capacity while collector2 sits idle, and because of this we see span drops (of course the implementation contains tchannel communication between agent and collector). So, as advised above in this issue, I am adding gRPC, but I still do not see spans being load balanced between the two collectors.
[Image: collector_load]
[Image: collector_span_drop]

Let me know if I am doing anything wrong here.

@jpkrohling
Contributor

I just checked the gRPC docs, and it seems that it should indeed be doing round-robin balancing:

> It is worth noting that load-balancing within gRPC happens on a per-call basis, not a per-connection basis. In other words, even if all requests come from a single client, we still want them to be load-balanced across all servers.

Source: https://github.com/grpc/grpc/blob/master/doc/load-balancing.md

> of course the implementation contains tchannel communication between agent and collector

What do you mean here? The communication between Agent and Collector should be via gRPC, not via TChannel.

@jpkrohling
Contributor

@jkandasa, @kevinearls I think one of you ran some tests for this behavior in the past. Can you spot if there's anything missing here?

@prana24
Author

prana24 commented Aug 6, 2019

Just to clear up the confusion: the Grafana images I posted show the production problem that I want to solve (agent and collector running 1.8.x with TChannel).
Since it was recommended that using gRPC with the latest version (1.13.1) would load balance traces/spans, I am trying the same in our dev environment to see whether the traffic is really load balanced. But somehow all the spans go to one collector.
I am just concerned with how I can get my traffic load balanced, so that all the collectors do the work and drops are minimal.

@jpkrohling
Contributor

I'm confused now: you are seeing load balanced traffic in production, but not on your dev environment?

@prana24
Author

prana24 commented Aug 6, 2019

My production version is 1.8, communicating over TChannel, and the Grafana images are from the production environment. They show that the load is unbalanced and that spans are dropped.

I want to check whether moving to 1.13.x with gRPC solves the problem in production, which is why I am trying 1.13.1 + gRPC in dev (the agent.yaml and log that I shared).
But in the dev environment I do not see load-balanced traffic either.

@jkandasa
Member

jkandasa commented Aug 6, 2019

> @jkandasa, @kevinearls I think one of you ran some tests for this behavior in the past. Can you spot if there's anything missing here?

@jpkrohling In OpenShift, we create an additional service (jaegerqe-collector-headless) on the collector for gRPC load balancing, with clusterIP: None.
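
Why the headless service matters can be sketched in a few lines of Python. With a regular Service, the DNS lookup returns a single virtual IP, so a per-call round-robin picker has only one address to choose from (the agent log earlier in this thread shows exactly one resolved address). With clusterIP: None, DNS returns one record per collector pod. The IP addresses and the function name distribute below are invented for the illustration, and the picker is a simplification of what a gRPC client does.

```python
from collections import Counter
from itertools import cycle


def distribute(resolved_addresses, calls):
    """Spread `calls` round-robin over the addresses returned by the DNS lookup."""
    picker = cycle(resolved_addresses)
    return Counter(next(picker) for _ in range(calls))


# Hypothetical lookup results for the two kinds of Service.
regular_service = ["10.96.12.34"]             # one virtual IP
headless_service = ["10.1.0.11", "10.1.0.12"]  # one record per collector pod

print(dict(distribute(regular_service, 1000)))   # {'10.96.12.34': 1000}
print(dict(distribute(headless_service, 1000)))  # {'10.1.0.11': 500, '10.1.0.12': 500}
```

In the first case all calls travel over one long-lived connection to whichever pod the virtual IP was routed to, which matches the single busy collector reported above.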

@jpkrohling
Contributor

That's probably the trick that @prana24 is missing! Thanks @jkandasa!

@prana24
Author

prana24 commented Aug 6, 2019

Wow! I can't wait. @jkandasa, can you give me more information? Where do I get it? Is there any reference or topology?

@jkandasa
Member

jkandasa commented Aug 6, 2019

@prana24 AFAIK, there is no specific example of creating a collector headless service. @objectiser can guide you better here.
jaeger-operator creates a headless service by default. Reference in jaeger-operator code

I just copied and modified the collector service YAML from the service file generated by jaeger-operator.
I hope this will work (not tested).
The important line is spec.clusterIP: None.
You may add it to your existing service and test. If you create a new service named jaeger-collector-headless, do not forget to point your agent at it.

apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector-headless
  labels:
    app: jaeger
    jaeger-infra: collector-service
spec:
  clusterIP: None  # headless: DNS returns the pod IPs instead of one virtual IP
  ports:
    - name: jaeger-collector-grpc
      port: 14250
      protocol: TCP
      targetPort: 14250
  selector:
    jaeger-infra: collector-pod
  type: ClusterIP
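
One way to check from the client's side whether a Service name behaves as headless is to count the addresses it resolves to: a headless Service should return one address per collector pod, a regular one a single virtual IP. The Python sketch below uses only the standard library. The function name resolve_ipv4 is made up for the example, and the loopback address stands in for a real Service name such as jaeger-collector-headless, which would only resolve from inside the cluster.

```python
import socket


def resolve_ipv4(host: str, port: int) -> list:
    """Return the distinct IPv4 addresses that `host` resolves to, sorted."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


# Inside a pod, replace the loopback address with the Service's DNS name.
addresses = resolve_ipv4("127.0.0.1", 14250)
print(len(addresses), addresses)
```

If the count stays at one while several collector pods are running, the name most likely points at a regular Service rather than a headless one.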

@prana24
Author

prana24 commented Aug 7, 2019

Thanks @jkandasa , i will give it a shot today

@piwenzi

piwenzi commented Nov 20, 2019

I have the same problem!

The error is:
root@ubuntu-165:~/jaeger# kubectl logs productpage-v1-787bcf4b68-j88qj jaeger-agent | grep addrConn.createTransport
{"level":"info","ts":1574242860.4926956,"caller":"grpc/clientconn.go:1191","msg":"grpc: addrConn.createTransport failed to connect to {10.33.36.204:14250 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.33.36.204:14250: connect: connection refused". Reconnecting...","system":"grpc","grpc_log":true}
{"level":"info","ts":1574242861.493872,"caller":"grpc/clientconn.go:1191","msg":"grpc: addrConn.createTransport failed to connect to {10.33.36.204:14250 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.33.36.204:14250: connect: connection refused". Reconnecting...","system":"grpc","grpc_log":true}

But the network looks right:
root@ubuntu-165:~/jaeger# kubectl exec box2 -- nslookup my-jaeger-collector-headless.kube-system
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: my-jaeger-collector-headless.kube-system
Address 1: 10.33.36.204 10-33-36-204.my-jaeger-collector.kube-system.svc.cluster.local
root@ubuntu-165:~/jaeger# kubectl exec box2 -- nslookup my-jaeger-collector.kube-system.svc.cluster.local

@jpkrohling
Contributor

@pujunYang could you please share what's your my-jaeger-collector-headless definition? kubectl get service my-jaeger-collector-headless -o yaml should do the trick. How are you setting it up? Is it via the Operator?

@piwenzi

piwenzi commented Nov 20, 2019

@jpkrohling yes,
I use the Operator to start Jaeger.

kubectl get service my-jaeger-collector-headless -n kube-system  -o yaml 
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "false"
  creationTimestamp: "2019-11-20T09:40:13Z"
  labels:
    app: jaeger
    app.kubernetes.io/component: service-collector
    app.kubernetes.io/instance: my-jaeger
    app.kubernetes.io/managed-by: jaeger-operator
    app.kubernetes.io/name: my-jaeger-collector
    app.kubernetes.io/part-of: jaeger
  name: my-jaeger-collector-headless
  namespace: kube-system
  ownerReferences:
  - apiVersion: jaegertracing.io/v1
    controller: true
    kind: Jaeger
    name: my-jaeger
    uid: c00e3485-0b79-11ea-ab62-5254006535e0
  resourceVersion: "1933"
  selfLink: /api/v1/namespaces/kube-system/services/my-jaeger-collector-headless
  uid: c0869851-0b79-11ea-ab62-5254006535e0
spec:
  clusterIP: None
  ports:
  - name: zipkin
    port: 9411
    protocol: TCP
    targetPort: 9411
  - name: grpc
    port: 14250
    protocol: TCP
    targetPort: 14250
  - name: c-tchan-trft
    port: 14267
    protocol: TCP
    targetPort: 14267
  - name: c-binary-trft
    port: 14268
    protocol: TCP
    targetPort: 14268
  selector:
    app: jaeger
    app.kubernetes.io/component: collector
    app.kubernetes.io/instance: my-jaeger
    app.kubernetes.io/managed-by: jaeger-operator
    app.kubernetes.io/name: my-jaeger-collector
    app.kubernetes.io/part-of: jaeger
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

@piwenzi

piwenzi commented Nov 20, 2019

@jpkrohling this is my Jaeger.yaml:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-jaeger
  namespace: kube-system
spec:
  strategy: production # <1>
  allInOne:
    image: jaegertracing/all-in-one:latest # <2>
    options: # <3>
      log-level: debug # <4>
  storage:
    type: elasticsearch # <5>
    options: # <6>
      es: # <7>
        server-urls: http://elasticsearch-logging:9200
        tls:
          skip-host-verify: true
  ingress:
    enabled: false # <8>
  agent:
    strategy: DaemonSet # <9>
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: "" # <10>

@parberge

parberge commented Oct 2, 2020

I have another concern about load balancing.

We use "--reporter.grpc.host-port=dns:///jaeger-collector-gRPC.service.consul:14250" to get the list of collectors, which works fine. All collectors receive spans.

The problem:
If we scale out the collectors, the agent never gets a new list.
This also means that if one or more collectors are removed or go offline, the list of collectors on the agents remains the same.

It seems the agent only resolves the list when it starts?
Or am I missing something?

@jpkrohling
Contributor

Which version of Jaeger are you using, @parberge? We've bumped the gRPC client version in v1.20.0 which was recently released, and I know it has some improvements in this area, although I'm not 100% sure this case is covered. When fixing #2443, I remember reading that the gRPC client will get a new list of clients only when it runs out of healthy connections, but hopefully this newest gRPC client is smarter.

@jpkrohling
Contributor

I'm closing, as I think this has been answered some time ago, but feel free to reopen if there are still questions.

@parberge

parberge commented Oct 2, 2020

> Which version of Jaeger are you using, @parberge? We've bumped the gRPC client version in v1.20.0 which was recently released, and I know it has some improvements in this area, although I'm not 100% sure this case is covered. When fixing #2443, I remember reading that the gRPC client will get a new list of clients only when it runs out of healthy connections, but hopefully this newest gRPC client is smarter.

Not 1.20 that's for sure.

Will test and create an issue if the problem remains. Thanks.

@Kerptastic

Hey guys - I am seeing some of the same issues. I am on v1.19 right now, so I will try the upgrade. Much like what some of the other folks are seeing: I am using the jaeger-operator with the HPA set up for a min/max of 2/10, and when CPU gets hammered during our bot/soak tests the collectors scale up as expected, but the agents continue to fire spans down their already existing connections. So effectively, it feels like more of a fault-tolerance setup than a high-availability one. @parberge before I go super deep, did you see any positive changes with v1.20.0?

@parberge

parberge commented Oct 15, 2020 via email

@jpkrohling
Contributor

Feel free to reopen this issue if you see the same problem happening on v1.20.

@Kerptastic

Hi folks, unfortunately I see the same behavior as before. I am running the 1.20.0 collector and agent (see below) via jaeger-operator.

We execute our bots firing off spans/traces and observe the following in our graphs. You can see below that we scaled to 4 collector instances, but the agents have no knowledge that they should reconnect, and they continue to saturate the collectors they are already connected to. The situation makes sense, but am I missing a configuration that lets the collectors notify the agents to reconnect when they are dropping spans?

[Image]

Containers:
  jaeger-collector:
    Container ID:  docker://98c391498dc8cdb605f5d001350c2451d16800470a7e03c096cab7b808ff7b95
    Image:         jaegertracing/jaeger-collector:1.20.0

...

Containers:
  jaeger-agent-daemonset:
    Container ID:  docker://fa6d25021bc5531053b998789d34fcdb0520d2a8f7907125951c44feeb10ffa4
    Image:         jaegertracing/jaeger-agent:1.20.0

@jpkrohling will try to reopen, need to figure out how =)

@jpkrohling jpkrohling reopened this Oct 15, 2020
@jpkrohling
Contributor

I'll check what we can do, but I think the gRPC client might need some time to update the list of backends. In earlier versions, it would update only if all known backends were failing.

@Kerptastic

Kerptastic commented Oct 15, 2020

OK - I am letting this soak. This may also be something unique to how our bots run: they are spun up asynchronously in a single service, so it would make sense that they send traffic to a single agent and thus overload the collector it's connected to. Is the agent designed to have a single connection to a collector at a given point in time? If that's the case, this MAY be OK for us in production, when the bots are replaced with real traffic that is load balanced across our edge service and thus distributed across the agents more naturally.

@jpkrohling
Contributor

> Is the agent designed to have a single connection to a collector at a given point in time?

I'd have to double-check with the gRPC client load balancer documentation, but I think that's indeed the case. The agent has a list of backends, but will only fail over once its "current" backend fails.

@jpkrohling
Contributor

@jkandasa do you remember from your load-tests what's the expected behavior here?

@JMCFTW

JMCFTW commented Oct 27, 2021

I encountered the same issue when using the OpenTelemetry Collector with a headless Jaeger gRPC collector service in Kubernetes.
The OpenTelemetry Collector never gets a new list of Jaeger collectors after the Jaeger collectors scale out or in.

Here are my configurations:

Opentelemetry collector:

exporters:
  jaeger:
    endpoint: "dns:///jaeger-collector-svc.monitoring.svc.cluster.local:14250"
    balancer_name: "round_robin"
    insecure: true
  logging:
    loglevel: info
extensions:
  health_check: {}
processors:
  batch: {}
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
  jaeger:
    protocols:
      thrift_compact: {}
      thrift_http: {}
service:
  extensions:
    - health_check
  pipelines:
    traces:
      exporters:
        - logging
        - jaeger
      processors:
        - batch
      receivers:
        - otlp
        - jaeger

Jaeger collector headless service in Kubernetes:

apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    app: jaeger-collector
  name: jaeger-collector-svc
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: admin
    port: 14269
    protocol: TCP
    targetPort: 14269
  - name: receive-span-from-jaeger-agent
    port: 14250
    protocol: TCP
    targetPort: 14250
  - name: receive-span-from-jaeger-client
    port: 14268
    protocol: TCP
    targetPort: 14268
  selector:
    app: jaeger-collector
  sessionAffinity: None
  type: ClusterIP

@jpkrohling
Contributor

@JMCFTW, this is something to be checked and handled at the OpenTelemetry Collector side of things. I just created an issue there (open-telemetry/opentelemetry-collector#4274) and assigned it to myself.

@JMCFTW

JMCFTW commented Oct 27, 2021

> @JMCFTW, this is something to be checked and handled at the OpenTelemetry Collector side of things. I just created an issue there (open-telemetry/opentelemetry-collector#4274) and assigned it to myself.

Hi @jpkrohling,

Thanks for referencing this issue in the OpenTelemetry Collector.

I'm not sure whether this can be handled in the OpenTelemetry Collector, because the gRPC client parameters don't seem to have an option that lets the client (OpenTelemetry Collector) re-resolve the DNS name after the server (Jaeger collector) scales out or in.

So in my opinion, a possible workaround is to make the MaxConnectionAge gRPC server parameter configurable in the Jaeger collector, but I don't know whether that's a good or a bad idea.

I haven't been investigating this issue for very long, so please feel free to correct me if I'm wrong or have misunderstood something.
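
The idea behind the MaxConnectionAge workaround can be sketched with a toy model in Python. The assumption, as described in this comment, is that the client only looks up the DNS name again when its connections are closed, so a server-side maximum connection age forces periodic reconnects and therefore fresh lookups. The tick-based timing, the scaling event from 2 to 4 collectors, and the function names are all invented for the illustration; this is not how gRPC or Jaeger implement it.

```python
def backends_at(tick, scale_at=10):
    """Hypothetical scaling event: 2 collectors before `scale_at`, 4 afterwards."""
    return 2 if tick < scale_at else 4


def known_backends(max_connection_age, ticks=60):
    """Number of collectors the client knows about at each tick."""
    known = backends_at(0)  # initial DNS lookup at startup
    age = 0
    history = []
    for tick in range(ticks):
        if max_connection_age is not None and age >= max_connection_age:
            known = backends_at(tick)  # reconnect triggers a fresh DNS lookup
            age = 0
        history.append(known)
        age += 1
    return history


print(known_backends(None)[-1])  # no connection age limit: still 2
print(known_backends(15)[-1])    # connections recycled every 15 ticks: 4
```

In this model the new collectors stay invisible without a connection age limit, while with one they are picked up at the first reconnect after the scaling event. The cost is a short reconnect every interval, which is the trade-off when choosing the value.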

@jpkrohling
Contributor

That's a good hint, thanks! I think I faced a similar issue before, and if a fix is needed here on the Jaeger side of things, I'll fix it here.

@jpkrohling
Contributor

Just as a status update, I'm able to reproduce this. Reading some source code from gRPC Go, I was expecting the DNS resolution to happen every 30s, adding the new backends to the list and making them available as subchannels, but looks like it's not happening. I'll check a couple of things, and if they don't work, I'll give the MaxConnectionAge suggestion a try.

The following screenshot shows a situation that started with 10 replicas and later scaled to 20 replicas, expecting the new ones to eventually start receiving traffic.

The disparity of two of the numbers is because I had a wrong configuration. The remaining 8 similar numbers are after adjusting the config to take advantage of both. This is the config used:

receivers:
  otlp:
    protocols: 
      grpc:

exporters:
  jaeger:
    tls:
      insecure: true
    endpoint: dns:///simple-prod-collector-headless:14250
    balancer_name: round_robin

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger]

[Image]

@jpkrohling jpkrohling self-assigned this Dec 2, 2021
@jpkrohling
Contributor

I got some time to do some extra experiments, and I agree that setting the MaxConnectionAge would be a good solution. The OpenTelemetry Collector was able to send all the spans to Jaeger Collector, despite the scaling events:

otelcol_receiver_accepted_spans{receiver="otlp",service_instance_id="2582827e-a982-4020-8cb0-fba74c309060",service_version="latest",transport="grpc"} 2.009878e+06

otelcol_exporter_send_failed_requests{service_instance_id="2582827e-a982-4020-8cb0-fba74c309060",service_version="latest"} 4293
otelcol_exporter_send_failed_spans{exporter="jaeger",service_instance_id="2582827e-a982-4020-8cb0-fba74c309060",service_version="latest"} 0
otelcol_exporter_sent_spans{exporter="jaeger",service_instance_id="2582827e-a982-4020-8cb0-fba74c309060",service_version="latest"} 2.009878e+06
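
The claim that every span arrived can be checked mechanically by comparing the accepted and sent counters. The Python sketch below parses a few lines in the Prometheus text format; the label sets are abbreviated from the output above, and the function name parse_metrics is hypothetical.

```python
import re

# Values taken from the metrics above, with the label sets shortened.
METRICS = """\
otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 2.009878e+06
otelcol_exporter_send_failed_spans{exporter="jaeger"} 0
otelcol_exporter_sent_spans{exporter="jaeger"} 2.009878e+06
"""


def parse_metrics(text):
    """Map each metric name to its value, ignoring the labels."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+)(?:\{.*\})?\s+(\S+)$", line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values


metrics = parse_metrics(METRICS)
lost = metrics["otelcol_receiver_accepted_spans"] - metrics["otelcol_exporter_sent_spans"]
print(lost, metrics["otelcol_exporter_send_failed_spans"])  # 0.0 0.0
```

A difference of zero together with zero failed spans is consistent with the failed requests shown above having been retried successfully, although the counters alone do not prove that retries were the mechanism.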

With the MaxConnectionAge, here's what the rate of spans per instance looks like:
[Image]

In the image above, you can see that we had a few nodes at first, ingesting around 20 spans per minute. Then, new nodes appeared, so that each node now takes care of 10 spans per minute.

I'll create a PR adding both MaxConnectionAge and MaxConnectionAgeGrace as CLI options.

@JMCFTW

JMCFTW commented Dec 3, 2021

Hi @jpkrohling , thanks for adding flags!

So according to RELEASE.md, this change will be released on 5 January 2022, right?

@jpkrohling
Contributor

Correct. If you want to test this change before that, I can tag and generate a container image based on the current main.
