This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Poor/bad replica set failover handling #373

Open
rpechayr opened this issue May 12, 2015 · 0 comments

Comments

@rpechayr

Hello,

I have a dozen Rails applications using Mongoid 4, all connecting to the same MongoDB replica set (2 servers and 1 arbiter, all running MongoDB 2.6.x). These apps run on Unicorn, in case it matters.
We had an outage yesterday in which the server that was primary at the time stopped (these servers are AWS EC2 instances). We had to stop and restart the failing server, then reconnect it to the replica set because its IP address had changed. Here is what happened:

  • On the MongoDB side everything went as expected, 👏 👏, since a new primary was elected within a few seconds.

  • On the application side, all the apps became really slow, 👎 👎. They were all still configured to connect to the 2 servers of the replica set, including the one that was no longer responding.

  • We fixed the issue by changing the Mongoid configuration of all the apps so that they use the right 2 servers.

    I spent a long time trying to understand what was happening, and I am reporting it here because this is not the first time we have observed this behaviour. Moped tries to connect to the dead host, blocks for 10 seconds, and only then connects to the other one:

      D, [2015-05-12T08:10:05.803642 #640] DEBUG -- :   MOPED: 10.120.131.142:27017 COMMAND      database=admin command={:ismaster=>1} runtime: 10003.8121ms
    

After the first attempt, the application seems to remember that the host is down, but only for a few minutes. So the 10-second lag happens once per Unicorn worker process every X minutes.
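To put a number on "X minutes", one option is to scan the application log for slow Moped commands and measure the gap between them per worker process. The following is a minimal sketch that assumes log lines shaped like the one quoted above; the names `MOPED_COMMAND`, `slow_commands`, and `lag_intervals` are hypothetical helpers, not part of Moped or Mongoid.

```ruby
require "time"

# Pattern for Moped debug lines shaped like the one quoted above.
# Assumption: timestamp and pid appear as "[<time> #<pid>]" and the
# line ends with "runtime: <milliseconds>ms".
MOPED_COMMAND = /\[(?<time>\S+) #(?<pid>\d+)\]\s+DEBUG -- :\s+MOPED: (?<host>\S+)\s+COMMAND\s+.*runtime: (?<ms>[\d.]+)ms/

SlowCommand = Struct.new(:time, :pid, :host, :runtime_ms)

# Returns the commands whose runtime is at or above threshold_ms.
def slow_commands(lines, threshold_ms: 5_000)
  lines.map do |line|
    m = MOPED_COMMAND.match(line)
    next unless m && m[:ms].to_f >= threshold_ms

    SlowCommand.new(Time.parse(m[:time]), m[:pid].to_i, m[:host], m[:ms].to_f)
  end.compact
end

# For each worker pid, returns the seconds elapsed between
# consecutive slow commands.
def lag_intervals(commands)
  commands.group_by(&:pid).transform_values do |cmds|
    cmds.map(&:time).sort.each_cons(2).map { |earlier, later| later - earlier }
  end
end
```

Feeding it the lines of a log file, for example `lag_intervals(slow_commands(File.foreach("log/production.log")))` (the path is an assumption), should give the spacing of the stalls for each Unicorn worker.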

I first tried to reproduce this by spawning a replica set on localhost, but did not observe the problem. I finally managed to reproduce the same behaviour by using the following 2 hosts:

  • 8.8.8.8:27017 (this is a Google DNS server, not a MongoDB server)
  • 127.0.0.1:27017

The Rails application hangs for 10 seconds while trying to connect to the 8.8.8.8 server.
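The reproduction can be separated from Rails and Moped with a plain TCP probe, which shows how long a connection attempt blocks and how a shorter connect timeout bounds that wait. This is a minimal sketch using only Ruby's standard library; `probe` is a hypothetical helper and the 1-second timeout is an arbitrary choice, not a Moped setting.

```ruby
require "socket"

# Tries to open a TCP connection and returns [reachable, elapsed_seconds].
# connect_timeout bounds how long an unresponsive host can block us.
def probe(host, port, connect_timeout: 1.0)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  reachable =
    begin
      # With a block, Socket.tcp closes the socket and returns the block value.
      Socket.tcp(host, port, connect_timeout: connect_timeout) { true }
    rescue SystemCallError, SocketError, IOError
      # Refused, unreachable, unresolvable, or timed out.
      false
    end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [reachable, elapsed]
end
```

Calling `probe("8.8.8.8", 27017)` and `probe("127.0.0.1", 27017)` covers the two hosts listed above. A host that silently drops packets should make the call return after roughly the timeout, while a host that refuses the connection fails almost immediately, which may be why the localhost-only replica set did not show the problem.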

I would expect Moped (or any other driver) to be a lot smarter than that:

  • Take less than 10 seconds to realize that a host is not even running MongoDB, or not even listening on the port.
  • Remember this state for more than a few minutes. During the failover, the repeated connection attempts made my applications really slow. Moped could, for example, flag the host as down and check it again asynchronously after some time (30 seconds could be fine), without blocking the whole application.
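The second expectation above can be sketched as a small tracker that flags a host as down and re-checks it from a background thread, so request-handling code never waits on a dead host. This is only an illustration of the idea, not how Moped works internally: `HostTracker` and all of its methods are hypothetical, and the `prober` argument is any callable that returns true when a host answers.

```ruby
# Hypothetical helper illustrating "flag as down, re-check asynchronously".
class HostTracker
  def initialize(hosts, prober:, recheck_every: 30)
    @hosts = hosts
    @prober = prober                 # callable: host -> true/false
    @recheck_every = recheck_every   # seconds between background checks
    @down = {}                       # host => true while flagged down
    @lock = Mutex.new
  end

  # Hosts worth trying right now. Never probes, so it never blocks.
  def available
    @lock.synchronize { @hosts.reject { |host| @down[host] } }
  end

  # Called by the application when a connection attempt fails.
  def mark_down(host)
    @lock.synchronize { @down[host] = true }
  end

  # Re-probes only the flagged hosts. Probing happens outside the lock
  # so that callers of #available are not held up by a slow probe.
  def recheck
    flagged = @lock.synchronize { @down.keys }
    flagged.each do |host|
      next unless @prober.call(host)

      @lock.synchronize { @down.delete(host) }
    end
  end

  # Starts the background re-check loop.
  def start
    @thread = Thread.new do
      loop do
        sleep @recheck_every
        recheck
      end
    end
  end

  def stop
    @thread&.kill
  end
end
```

Because Unicorn forks its workers and background threads do not survive a fork, a tracker like this would presumably have to be started in each worker (for example from Unicorn's `after_fork` hook) rather than once in the master process.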

This is a serious issue (at least for me), since it causes my MongoDB setup to handle failover poorly. I don't know whether there is a workaround, or whether other drivers have the same flaws. Maybe @durran or @estolfo, who are working on replacing Moped with the native MongoDB driver, could be interested in this behaviour, at least to make sure that the official MongoDB driver handles it better.

Thank you in advance for your input.
