Configuring tracing on specific Mongoid client leads to clients not reconnecting on a stepdown #1211

Closed · kitop opened this issue on Oct 19, 2020 · 9 comments
Labels: bug (Involves a bug) · community (Was opened by a community member) · integrations (Involves tracing integrations)

@kitop (Contributor) commented on Oct 19, 2020

When using Mongoid (v7.1.2) and overriding the global configuration for a specific client instance, as detailed in the docs, by calling Datadog.configure(client, options), the Mongo clients don't reconnect after a stepdown happens in MongoDB. This leads to downtime in the application, since it can no longer communicate with the database until the process is restarted.

We have the following configuration with Mongoid:

Datadog.configure do |c|
  c.use :mongo
end

Mongoid.configure do |config|
  default = config.clients[:default]

  [:primary_preferred, :secondary_preferred].each do |read_mode|
    config.clients[read_mode] = default.deep_merge(options: { read: { mode: read_mode } })
  end
end

Datadog.configure(Mongoid.clients(:secondary_preferred), service_name: "mongodb-secondary")
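
For reference, in dd-trace-rb 0.x this per-client Datadog.configure call ends up attaching a Datadog::Pin to the client object. A rough sketch, assuming the 0.x Pin API, for checking that the override took effect:

client = Mongoid.client(:secondary_preferred)   # same client object passed to Datadog.configure above (assumed)
pin = Datadog::Pin.get_from(client)             # nil when no per-client configuration was applied
puts pin && pin.service_name                    # expected: "mongodb-secondary"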

The tracing works fine this way, but it causes reconnection issues when there's a stepdown. It looks like all clients have this issue, not just the one configured with Datadog.configure; this might be due to how Mongoid manages its connections, but tracing should not affect that at all.

This was observed with dd-trace-rb version 0.37.0, Mongoid 7.1.2, Rails 6.0.3.3 and Ruby 2.6.6.

This looks somewhat related to #729.

@ericmustin (Contributor) commented on Oct 20, 2020

👋 Hey @kitop, thanks for the report here. Sorry to hear this is causing issues. A few quick questions to clarify things so I can try to reproduce this on my end:

  • "overriding the global configuration for a specific client instance, as detailed..." — are you saying that you only see this behavior when both c.use :mongo AND Datadog.configure(Mongoid.clients(:secondary_preferred), ...) get called, and that it doesn't occur when only using c.use :mongo? Any additional info here helps greatly.

  • "causing reconnection issues" — would you mind explaining the issues a bit more? I.e., does an error get thrown when trying to reconnect, does it just hang, or is it some other behavior? If there are any stack traces, those would also help tremendously.

  • Lastly, a minimal reproduction would also be useful if you have one; if not, that's OK. One small thing: 0.38.0 had some fixes for propagating fatal exceptions (see https://github.com/DataDog/dd-trace-rb/releases/tag/v0.38.0). I don't think that's necessarily the root cause here, but it could be worth quickly ruling out by testing a reproduction against the latest version, if you happen to have one.

@ericmustin (Contributor) commented

Just following up here. I'm having a bit of trouble reproducing this, so any of the above info you can provide would be helpful. It's also occurred to me that if there are any error messages or stack traces, there may be some sensitive info involved, so it may be best to open a support ticket through support@datadoghq.com to ensure we're not leaking anything sensitive. Feel free to mention that I requested you open the ticket so that it gets escalated to me properly, or drop the ticket ID in here so I can grab the relevant info off it.

@kitop (Contributor, Author) commented on Oct 22, 2020

Hi @ericmustin, thanks for the comments!

  • "overriding the global configuration for a specific client instance, as detailed..." — are you saying that you only see this behavior when both c.use :mongo AND Datadog.configure(Mongoid.clients(:secondary_preferred), ...) get called, and that it doesn't occur when only using c.use :mongo? Any additional info here helps greatly.

Correct. It only happens when both are enabled. We've used just c.use :mongo for a long time now and it worked perfectly. It was when we introduced the Datadog.configure(...) call that we started seeing problems. We were able to reproduce this behavior change consistently by tweaking only that part of the code.

  • "causing reconnection issues" — would you mind explaining the issues a bit more? I.e., does an error get thrown when trying to reconnect, does it just hang, or is it some other behavior? If there are any stack traces, those would also help tremendously.

Sure! When a stepdown happens we usually see some connection errors for any MongoDB queries that are in flight at that moment, but the client reconnects on error and subsequent queries succeed.
With this change, we observed that the clients couldn't reconnect. What ended up happening was that web requests timed out, because the Mongo client kept trying to reconnect and couldn't within 30 seconds.

Some of the errors we saw (hostnames and replica set names changed):

#<Mongo::Error::NoServerAvailable: No primary server is available in cluster: #<Cluster topology=ReplicaSetNoPrimary[<list of hosts>,name=rs0,v=6,e=7fffffff000000000000002c] servers=[#<Server address=host1:27017 SECONDARY replica_set=rs0 pool=#<ConnectionPool size=0 (0-5) used=0 avail=0 pending=0>>,#<Server address=host2:27017 UNKNOWN pool=#<ConnectionPool size=0 (0-5) used=0 avail=0 pending=0>>,#<Server address=host3:27017 ARBITER replica_set=rs0>]> with timeout=10, LT=0.015. The following servers have dead monitor threads: #<Server address=host1:27017 SECONDARY replica_set=rs0 pool=#<ConnectionPool size=0 (0-5) used=0 avail=0 pending=0>>, #<Server address=host2:27017 UNKNOWN pool=#<ConnectionPool size=0 (0-5) used=0 avail=0 pending=0>>, #<Server address=host3:27017 ARBITER replica_set=rs0>>

and

2020-10-08T18:52:27.26614615Z WARN[2020-10-08T18:52:27.265922869+00:00] Retrying ismaster in monitor for host1:27017 due to: Mongo::Error::SocketTimeoutError Errno::ETIMEDOUT: Connection timed out - Took more than 10 seconds to receive data (for 10.0.0.18:27017 (host1:27017, TLS))

The stack traces we have are from the web server timeout and not so much from ddtrace/mongo. The MongoDB client mostly logged these kinds of errors.

  • Lastly, a minimal reproduction would also be useful if you have one; if not, that's OK. One small thing: 0.38.0 had some fixes for propagating fatal exceptions (see https://github.com/DataDog/dd-trace-rb/releases/tag/v0.38.0). I don't think that's necessarily the root cause here, but it could be worth quickly ruling out by testing a reproduction against the latest version, if you happen to have one.

I'm working on getting a minimal reproduction case, but I'm having trouble doing so. We can consistently reproduce it in our staging and production environments, but not locally. I'll keep trying and post an update if I make progress.

We're going to try 0.38.0 and report back. We had upgraded to 0.41 this week but had to roll back due to memory leaks (I'll report that in a separate issue).

@ericmustin (Contributor) commented

@kitop OK, that's very helpful. Thanks for the patience here. I'll try again with the above and see if I can narrow down what's causing this for Mongo. In the meantime, if you want to give a specific client a different service name, I'd suggest as a workaround using a span processor: https://github.com/DataDog/dd-trace-rb/blob/master/docs/GettingStarted.md#processing. It will allow you to modify a span before it's flushed to the agent.
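
As a rough illustration, a processor along these lines could rename only the Mongo spans that hit the secondary. This is a minimal sketch against the 0.x pipeline API; matching on the out.host tag (and the hostname itself) is an assumption about how the secondary client would be identified:

Datadog::Pipeline.before_flush(
  Datadog::Pipeline::SpanProcessor.new do |span|
    # Rename only Mongo command spans that target the (hypothetical) secondary host.
    if span.name == 'mongo.cmd' && span.get_tag('out.host') == 'secondary.example.internal'
      span.service = 'mongodb-secondary'
    end
  end
)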

Regarding the memory leak, that's definitely not good and something we want to fix. The same suggestion as above applies here:

it may be best to open a support ticket through support@datadoghq.com to ensure we're not leaking anything sensitive. Feel free to mention that I requested you open the ticket so that it gets escalated to me properly, and drop the ticket ID in here so I can grab the relevant info off it.

Additionally, any info about the leak, as well as your dd-trace-rb configure block and the libraries your application uses, would help here. (Again, providing this via a support ticket is usually easier, since it's possible to accidentally share private info.)

@kitop (Contributor, Author) commented on Oct 22, 2020

In the meantime, if you want to give a specific client a different service name, I'd suggest as a workaround using a span processor: https://github.com/DataDog/dd-trace-rb/blob/master/docs/GettingStarted.md#processing. It will allow you to modify a span before it's flushed to the agent.

Oh that's interesting! Thanks for the pointer! Will share with the team.

@marcotc (Member) commented on Oct 27, 2020

👋 @kitop, I did some investigation on this issue today, and I couldn't find any particular difference between Datadog.configure and c.use :mongo that could affect the connection process.

The only thing I can think of is that our instrumentation is throwing errors, which could cause commands to fail.
But if that were the case, you'd see exceptions, or at least error log lines in your application, pointing back to our library.

I believe you already confirmed above that there are no log lines that trace back to ddtrace, nor Ruby exceptions with a suspicious stack trace.

In that case, we can try one thing: could you tell me what version of the tracer you have deployed? I can create a custom pre-release build with additional logging or changes to the Mongo instrumentation. If you were able to install it, at least in your staging environment, that would be very helpful.

@kitop (Contributor, Author) commented on Nov 7, 2020

Hi @marcotc, thanks for looking at this! I can't commit to doing that right now, as we don't have the time. The way we've fixed it was by doing something like this:

  begin
    # Route reads to the secondary-preferred client for this block of work.
    Mongoid.override_client(:secondary_preferred)
    client = Mongoid.client(:secondary_preferred)

    # Re-tag the client's existing Datadog pin with the secondary service name.
    pin = Datadog::Pin.get_from(client)
    pin.service_name = "mongodb-secondary"
    client.datadog_pin = pin

    # do the work
  ensure
    # Always restore the default client, even if the work above raises.
    Mongoid.override_client(nil)
  end

It's quite hacky, but it works. We suspected it could have something to do with how the pin is set for replicas, but couldn't really come to a conclusion.
I hope this helps the investigation. As I can't commit time to investigating this further right now, feel free to close the issue. I'll reach out to support to debug it when we have some time (an engineer is already looking into the memory leak issue with them).

Thanks!

@skcc321 (Contributor) commented on Mar 22, 2021

Hi, can this approach work? #1423

@delner added the bug, community, and integrations labels on Apr 1, 2021
@ivoanjo (Member) commented on Dec 9, 2024

I believe this was fixed by #1423; please do let us know if there's still something to be improved or fixed :)

@ivoanjo closed this as completed on Dec 9, 2024