Configuring tracing on specific Mongoid client leads to clients not reconnecting on a stepdown #1211

Closed · kitop opened this issue on Oct 19, 2020 · 9 comments
Labels: bug (Involves a bug) · community (Was opened by a community member) · integrations (Involves tracing integrations)

@kitop (Contributor) commented on Oct 19, 2020

When using Mongoid (v7.1.2) and overriding the global configuration for a specific client instance, as detailed in the docs, by calling Datadog.configure(client, options), the Mongo clients don't reconnect after a stepdown happens in MongoDB. This leads to downtime in the application, since it can no longer communicate with the database until the process is restarted.

We have the following configuration with Mongoid:

Datadog.configure do |c|
  c.use :mongo
end

Mongoid.configure do |config|
  default = config.clients[:default]

  [:primary_preferred, :secondary_preferred].each do |read_mode|
    config.clients[read_mode] = default.deep_merge(options: { read: { mode: read_mode } })
  end
end

Datadog.configure(Mongoid.clients(:secondary_preferred), service_name: "mongodb-secondary")
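
For reference, in dd-trace-rb 0.x this per-client Datadog.configure call ends up attaching a Datadog::Pin to the client object. A rough sketch, assuming the 0.x Pin API, for checking that the override took effect:

client = Mongoid.client(:secondary_preferred)   # same client object passed to Datadog.configure above (assumed)
pin = Datadog::Pin.get_from(client)             # nil when no per-client configuration was applied
puts pin && pin.service_name                    # expected: "mongodb-secondary"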

The tracing works fine this way, but it causes reconnection issues when there's a stepdown. It looks like all clients have this issue, not just the one configured with Datadog.configure; this might be due to how Mongoid manages its connections, but tracing should not affect that at all.

This was observed with dd-trace-rb version 0.37.0, Mongoid 7.1.2, Rails 6.0.3.3 and Ruby 2.6.6.

This looks somewhat related to #729.

@ericmustin (Contributor) commented on Oct 20, 2020

👋 Hey @kitop, thanks for the report here. Sorry to hear this is causing issues. A few quick questions to clarify things so I can try to reproduce this on my end:

  • "overriding the global configuration for a specific client instance, as detailed..." — are you saying that you only see this behavior when both c.use :mongo AND Datadog.configure(Mongoid.clients(:secondary_preferred), ...) get called, and that it doesn't occur when only using c.use :mongo? Any additional info here helps greatly.

  • "causing reconnection issues" — would you mind explaining the issues a bit more? I.e., does an error get thrown when trying to reconnect, does it just hang, or is it some other behavior? If there are any stack traces, those would also help tremendously.

  • Lastly, a minimal reproduction would also be useful if you have one; if not, that's OK. One small thing: 0.38.0 had some fixes for propagating fatal exceptions (see https://github.com/DataDog/dd-trace-rb/releases/tag/v0.38.0). I don't think that's necessarily the root cause here, but it could be worth quickly ruling out by testing a reproduction against the latest version, if you happen to have one.

@ericmustin (Contributor) commented

Just following up here. I'm having a bit of trouble reproducing this, so any of the above info you can provide would be helpful. It's also occurred to me that if there are any error messages or stack traces, there may be some sensitive info involved, so it may be best to open a support ticket through support@datadoghq.com to ensure we're not leaking anything sensitive. Feel free to mention that I requested you open the ticket so that it gets escalated to me properly, or drop the ticket ID in here so I can grab the relevant info off it.

@kitop (Contributor, Author) commented on Oct 22, 2020

Hi @ericmustin, thanks for the comments!

  • "overriding the global configuration for a specific client instance, as detailed..." — are you saying that you only see this behavior when both c.use :mongo AND Datadog.configure(Mongoid.clients(:secondary_preferred), ...) get called, and that it doesn't occur when only using c.use :mongo? Any additional info here helps greatly.

Correct. It only happens when both are enabled. We've used just c.use :mongo for a long time now and it worked perfectly. It was when we introduced the Datadog.configure(...) call that we started seeing problems. We were able to reproduce this behavior change consistently by tweaking only that part of the code.

  • "causing reconnection issues" — would you mind explaining the issues a bit more? I.e., does an error get thrown when trying to reconnect, does it just hang, or is it some other behavior? If there are any stack traces, those would also help tremendously.

Sure! When a stepdown happens we usually see some connection errors for any MongoDB queries that are in flight at that moment, but the client reconnects on error and subsequent queries succeed.
With this change, we observed that the clients couldn't reconnect. What ended up happening was that web requests timed out, because the Mongo client kept trying to reconnect and couldn't within 30 seconds.

Some of the errors we saw (hostnames and replica set names changed):

#<Mongo::Error::NoServerAvailable: No primary server is available in cluster: #<Cluster topology=ReplicaSetNoPrimary[<list of hosts>,name=rs0,v=6,e=7fffffff000000000000002c] servers=[#<Server address=host1:27017 SECONDARY replica_set=rs0 pool=#<ConnectionPool size=0 (0-5) used=0 avail=0 pending=0>>,#<Server address=host2:27017 UNKNOWN pool=#<ConnectionPool size=0 (0-5) used=0 avail=0 pending=0>>,#<Server address=host3:27017 ARBITER replica_set=rs0>]> with timeout=10, LT=0.015. The following servers have dead monitor threads: #<Server address=host1:27017 SECONDARY replica_set=rs0 pool=#<ConnectionPool size=0 (0-5) used=0 avail=0 pending=0>>, #<Server address=host2:27017 UNKNOWN pool=#<ConnectionPool size=0 (0-5) used=0 avail=0 pending=0>>, #<Server address=host3:27017 ARBITER replica_set=rs0>>

and

2020-10-08T18:52:27.26614615Z WARN[2020-10-08T18:52:27.265922869+00:00] Retrying ismaster in monitor for host1:27017 due to: Mongo::Error::SocketTimeoutError Errno::ETIMEDOUT: Connection timed out - Took more than 10 seconds to receive data (for 10.0.0.18:27017 (host1:27017, TLS))

The stack traces we have are from the web server timeout and not so much from ddtrace/mongo. The MongoDB client mostly logged these kinds of errors.

  • Lastly, a minimal reproduction would also be useful if you have one; if not, that's OK. One small thing: 0.38.0 had some fixes for propagating fatal exceptions (see https://github.com/DataDog/dd-trace-rb/releases/tag/v0.38.0). I don't think that's necessarily the root cause here, but it could be worth quickly ruling out by testing a reproduction against the latest version, if you happen to have one.

I'm working on getting a minimal reproduction case, but I'm having trouble doing so. We can consistently reproduce it in our staging and production environments, but not locally. I'll keep trying and post an update if I make progress.

We're going to try 0.38.0 and report back. We had upgraded to 0.41 this week but had to roll back due to memory leaks (I'll report that in a separate issue).

@ericmustin (Contributor) commented

@kitop OK, that's very helpful. Thanks for the patience here. I'll try again with the above and see if I can narrow down what's causing this for Mongo. In the meantime, if you want to give a specific client a different service name, I'd suggest as a workaround using a span processor: https://github.com/DataDog/dd-trace-rb/blob/master/docs/GettingStarted.md#processing. It will allow you to modify a span before it's flushed to the agent.
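
As a rough illustration, a processor along these lines could rename only the Mongo spans that hit the secondary. This is a minimal sketch against the 0.x pipeline API; matching on the out.host tag (and the hostname itself) is an assumption about how the secondary client would be identified:

Datadog::Pipeline.before_flush(
  Datadog::Pipeline::SpanProcessor.new do |span|
    # Rename only Mongo command spans that target the (hypothetical) secondary host.
    if span.name == 'mongo.cmd' && span.get_tag('out.host') == 'secondary.example.internal'
      span.service = 'mongodb-secondary'
    end
  end
)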

Regarding the memory leak, that's definitely not good and something we want to fix. The same suggestion as above applies here:

it may be best to open a support ticket through support@datadoghq.com to ensure we're not leaking anything sensitive. Feel free to mention that I requested you open the ticket so that it gets escalated to me properly, and drop the ticket ID in here so I can grab the relevant info off it.

Additionally, any info about the leak, as well as your dd-trace-rb configure block and the libraries your application uses, would help here. (Again, providing this via a support ticket is usually easier, since it's possible to accidentally share private info.)

@kitop (Contributor, Author) commented on Oct 22, 2020

In the meantime, if you want to give a specific client a different service name, I'd suggest as a workaround using a span processor: https://github.com/DataDog/dd-trace-rb/blob/master/docs/GettingStarted.md#processing. It will allow you to modify a span before it's flushed to the agent.

Oh that's interesting! Thanks for the pointer! Will share with the team.

@marcotc (Member) commented on Oct 27, 2020

👋 @kitop, I did some investigation on this issue today, and I couldn't find any particular difference between Datadog.configure and c.use :mongo that could affect the connection process.

The only thing I can think of is that our instrumentation is throwing errors, which could cause commands to fail.
But if that were the case, you'd see exceptions, or at least error log lines in your application, pointing back to our library.

I believe you already confirmed above that there are no log lines that trace back to ddtrace, nor Ruby exceptions with a suspicious stack trace.

In that case, we can try one thing: could you tell me what version of the tracer you have deployed? I can create a custom pre-release build with additional logging or changes to the Mongo instrumentation. If you were able to install it, at least in your staging environment, that would be very helpful.

@kitop (Contributor, Author) commented on Nov 7, 2020

Hi @marcotc, thanks for looking at this! I can't commit to doing that right now, as we don't have the time. The way we've fixed it was by doing something like this:

  begin
    # Route reads to the secondary-preferred client for this block of work.
    Mongoid.override_client(:secondary_preferred)
    client = Mongoid.client(:secondary_preferred)

    # Re-tag the client's existing Datadog pin with the secondary service name.
    pin = Datadog::Pin.get_from(client)
    pin.service_name = "mongodb-secondary"
    client.datadog_pin = pin

    # do the work
  ensure
    # Always restore the default client, even if the work above raises.
    Mongoid.override_client(nil)
  end

It's quite hacky, but it works. We suspected it could have something to do with how the pin is set for replicas, but couldn't really come to a conclusion.
I hope this helps the investigation. As I can't commit time to investigating this further right now, feel free to close the issue. I'll reach out to support to debug it when we have some time (an engineer is already looking into the memory leak issue with them).

Thanks!

@skcc321 (Contributor) commented on Mar 22, 2021

Hi, can this approach work? #1423

@delner added the bug, community, and integrations labels on Apr 1, 2021
@ivoanjo (Member) commented on Dec 9, 2024

I believe this was fixed by #1423; please do let us know if there's still something to be improved or fixed :)

@ivoanjo closed this as completed on Dec 9, 2024