Poor/bad replica set failover handling #373

rpechayr · 2015-05-12T09:57:19Z

Hello,

I have a dozen Rails applications using Mongoid 4, all connecting to the same MongoDB replica set (2 servers and 1 arbiter, all running MongoDB 2.6.x). These apps run on Unicorn, in case it matters.
We had an outage yesterday that caused the server that was primary at the time to stop (these servers are AWS EC2 instances). We had to stop and then start the failing server and re-add it to the replica set, since its IP had changed. Here is what happened:

- On the MongoDB side, everything went as expected, 👏 👏, since a new primary was elected within a few seconds.
- On the application side, all the apps became really slow, 👎 👎. They were all still configured to connect to the 2 servers of the replica set, including the one that was no longer responding.
- We fixed the issue by changing the Mongoid configuration of all the apps so that they use the right 2 servers.

I spent a long time trying to understand what was happening, and here it is, because this is not the first time we have observed this behaviour. Moped tries to connect to the dead host and blocks for 10 seconds, and then it connects to the other one:
```
  D, [2015-05-12T08:10:05.803642 #640] DEBUG -- :   MOPED: 10.120.131.142:27017 COMMAND      database=admin command={:ismaster=>1} runtime: 10003.8121ms
```

After the first attempt, the application seems to remember that the host is down, but only for a few minutes. So the 10-second lag happens once per Unicorn worker process every X minutes.

I first tried to reproduce this by spawning a replica set on localhost and did not observe anything. I finally managed to reproduce the same behaviour by using the following 2 hosts:

8.8.8.8:27017 (this is a Google DNS server, not a MongoDB server)
127.0.0.1:27017

The Rails application hangs for 10 seconds while trying to connect to the 8.8.8.8 server.
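
The reproduction comes down to how long a TCP connect to an unreachable address is allowed to block. Here is a minimal sketch using only Ruby's standard `socket` library, not Moped's own code; the helper name `reachable?` and the 1-second budget are assumptions chosen for illustration:

```ruby
require "socket"

# Hypothetical helper (not part of Moped): returns true if a TCP connection
# to host:port can be opened within `timeout` seconds, false otherwise.
def reachable?(host, port, timeout: 1)
  # Socket.tcp closes the socket when the block returns and yields the
  # block's value; connect_timeout bounds how long the connect may block.
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  # Covers refused connections, unreachable networks, and connect timeouts.
  false
end
```

With a helper like this, `reachable?("8.8.8.8", 27017, timeout: 1)` should give up after roughly one second rather than ten when the remote end silently drops the packets, though the exact behaviour depends on the network in between.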

I would expect Moped (or another driver) to be a lot smarter than that:

- Take less than 10 seconds to realize that a host is not even running MongoDB, or not even listening on the port.
- Remember this state for more than a few minutes. In case of a failover, this caused my applications to become really slow. Moped could, for example, flag the host as down and check it again asynchronously after some time (30 seconds could be fine), without blocking the whole application.
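
To make the second point concrete, here is a rough sketch of the bookkeeping I have in mind. It is not how Moped is implemented; the class name `DownHostTracker`, its methods, and the injectable `clock` are all hypothetical, and the 30-second default is just the value suggested above:

```ruby
# Hypothetical sketch (not Moped's implementation): remembers which hosts
# failed recently so callers can skip them until a retry interval elapses.
class DownHostTracker
  def initialize(retry_after: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @retry_after = retry_after # seconds before a down host is tried again
    @clock = clock             # injectable so the logic can be tested
    @down_since = {}           # host => time the failure was recorded
    @lock = Mutex.new
  end

  # Record that a connection attempt to `host` just failed.
  def mark_down(host)
    @lock.synchronize { @down_since[host] = @clock.call }
  end

  # Record that `host` answered again.
  def mark_up(host)
    @lock.synchronize { @down_since.delete(host) }
  end

  # True while `host` is flagged down and the retry interval has not elapsed.
  def skip?(host)
    @lock.synchronize do
      since = @down_since[host]
      !since.nil? && (@clock.call - since) < @retry_after
    end
  end

  # Hosts from `seeds` that are worth trying right now.
  def candidates(seeds)
    seeds.reject { |host| skip?(host) }
  end
end
```

This only covers remembering the state. The asynchronous part would be a background thread that re-probes flagged hosts and calls `mark_up`, so that request threads never pay the connect timeout themselves.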

This is a serious issue (at least for me), since it causes my MongoDB setup to handle failover poorly. I don't know if there is any workaround for this, or if other drivers have the same flaws. Maybe @durran or @estolfo, who are working on replacing Moped with the native MongoDB driver, could be interested in this behaviour, at least to make sure that the official MongoDB driver works better than this.

Thank you in advance for your input.

rpechayr mentioned this issue May 12, 2015

Intgeration with official Mongo driver 2.0.0. mongodb/mongoid#3941

Merged
