This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Moped is utilizing 'bad' Connections from the connection Pool #346

Open
steve-rodriguez opened this issue Feb 5, 2015 · 58 comments

@steve-rodriguez

I have a 3-node replica set with 1 primary and 2 secondaries, and we read from the primary. We are using the 4.0.1 Mongoid gem, the 2.0.3 Moped gem and the 2.1.1 connection_pool gem. When we directly kill a mongo node and a new primary is elected, 'broken' connections remain in the pool, and when the app grabs one it throws ConnectionPool::PoolShuttingDownError. When we gracefully step down the primary and a new primary is elected, 'old' connections to the 'old' primary still exist in the pool and the app throws a 13435 error of "not master and slaveOK=false". Any time a connection is 'broken' or no longer usable, it should NOT be handed out by Mongoid. Attaching screen prints of the error and the MongoDB REST interface status (for the 'killed' scenario). Please advise ASAP. Thanks!

[Screenshots attached: mongo issue-1, mongo issue-2, mongo issue-3]

@dblock
Contributor

dblock commented Feb 8, 2015

This looks like a more detailed description of #345. @arthurnn can you please take a look? This seems new in the most recent version of moped and is affecting production for an increasing number of people.

@dblock
Contributor

dblock commented Feb 8, 2015

I also opened #348 which seems possibly related.

@argami

argami commented Feb 8, 2015

I'm not sure how much this can help, but most of my errors are related and look like this:

A Timeout::Error occurred in background at 2015-02-07 01:02:20 UTC :

 Waited 0.5 sec
 path/shared/bundle/ruby/2.0.0/gems/connection_pool-2.0.0/lib/connection_pool/timed_stack.rb:42:in `block (2 levels) in pop'

I've changed my pool timeout but it seems to still be using 0.5.

A Moped::Errors::ConnectionFailure occurred in users#show:

 Could not connect to a primary node for replica set #<Moped::Cluster:100465340 @seeds=[<Moped::Node resolved_address="10.101.1.100:27017">]>
 app/controllers/api/base_api_controller.rb:83:in `authenticate_user_from_token!'

Not sure if it's because authenticate_user_from_token is the most used one, but it seems the errors come mainly from here. It looks up a user in the db and signs in with Devise.

One more thing that could be related to the error is the way I look up the user:

def find_by_username(username)
  Rails.cache.fetch(username, :expires_in => 30.seconds) do
    self.where(username: /^#{username}$/i).first
  end
end
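(A side note, not from the thread: a case-insensitive regex built from user input cannot use a plain index, so each call may hold a pooled connection longer than necessary and amplify pool timeouts under load, and interpolating raw input into a pattern is risky. A minimal alternative sketch, assuming nothing beyond standard Rails/Mongoid APIs:)

# Sketch only: Regexp.escape keeps user input from being interpreted as a
# pattern, and anchoring with \A..\z matches the whole value. A hypothetical
# denormalized, downcased field (e.g. username_lower) backed by an index
# would be faster still, since case-insensitive regexes cannot use one.
def find_by_username(username)
  Rails.cache.fetch("user:#{username}", :expires_in => 30.seconds) do
    where(username: /\A#{Regexp.escape(username)}\z/i).first
  end
end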

@argami

argami commented Feb 9, 2015

I was able to fix it. In my case I found out the error came down to two things.

First, the pool_size and pool_timeout weren't big enough; and second, for some reason my postfix queue was re-sending old messages as new ones, so from my point of view none of my modifications seemed to have any effect.

Anyway, with a small pool_size and pool_timeout you should expect those errors; I think that is the expected behaviour.

@dblock
Contributor

dblock commented Feb 9, 2015

@argami What are the values you use?

@klamontagne

The "not master and slaveOK=false" after a stepdown or restart of the primary was fixed for us by #351 .

@argami

argami commented Feb 9, 2015

@dblock these are my values:

      options:
        pool_size: 200
        pool_timeout: 15
        retry_interval: 1
        timeout: 60
        max_retries: 20
        refresh_interval: 30

But because I had a lot of noise from the mail stuff, the last value I used (100) was probably fine too.

I'm using Puma with 3 workers and 0,8 threads.
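(For context, and not from this thread: the pool generally needs at least as many connections per process as there are threads that can hit Mongo concurrently. A sketch of how that Puma setup relates to pool_size, with illustrative numbers only:)

# config/puma.rb (illustrative sketch)
# 3 worker processes, each running up to 8 threads. Each thread checks out
# one connection at a time, so a per-process pool_size of at least 8
# (plus some headroom) avoids queueing on the pool under full load.
workers 3
threads 0, 8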

@mateusdelbianco

This is fixed in #351

@dblock
Contributor

dblock commented Feb 13, 2015

Thanks @mateusdelbianco! Looking forward to a moped release soon with these.

@mateusdelbianco

You should thank @wandenberg for his patch! :)

@steve-rodriguez
Author

So are BOTH issues fixed? The master/slave issue AND the one where the connection_pool gem throws ConnectionPool::PoolShuttingDownError when you restart a VM where a mongo primary is running? If so, when will a new release of Moped be cut? Thanks!

@niedfelj

Please see my PR #348

@steve-rodriguez
Author

Still seeing this error when a mongo db node is killed. Moped should evict the bad connection. I am assuming, since this issue is still open, that this was NOT resolved with 2.0.4. I am using Mongoid 4.0.2, Moped 2.0.4 and connection_pool 2.1.1. Any ideas when this will be addressed? Thanks!

2015-02-23T15:06:11-07:00 INFO : [ebdb5491-645e-49b7-9535-07ec2412573f] [10.0.32.86] [TBD] [visitor_id: 140555603] Completed 500 Internal Server Error in 21ms
2015-02-23T15:06:11-07:00 FATAL : [ebdb5491-645e-49b7-9535-07ec2412573f] [10.0.32.86] [TBD] [visitor_id: 140555603]
ConnectionPool::PoolShuttingDownError (ConnectionPool::PoolShuttingDownError
controller: portfolio_summary/portfolios
action: show
params: {"action"=>"show", "controller"=>"portfolio_summary/portfolios"}

@michaelrobinette

Still seeing it with Moped 2.0.4 as well

@steve-rodriguez
Author

To provide more information on this issue: I have a 3-node replica set (node1, node2 and node3), with node1 as the primary. When node2 (a secondary node) is shut down following the instructions on MongoDB's website for upgrading a replica set, Mongoid/Moped throws a ConnectionPool::PoolShuttingDownError when it attempts to grab a connection. WHY would it throw this error when it is generated from a SECONDARY node that should NOT be in the pool anyway? The mongoid.yml default read option is primary, so it should never be creating pool connections to SECONDARY nodes. It appears to me that Mongoid/Moped is surfacing SECONDARY node exceptions to the client when the read preference is PRIMARY. And if this error is thrown from logic that is monitoring the replica set, it should NEVER be returned to the client. Mongoid/Moped should always return a connection to the PRIMARY node if the primary node is UP and AVAILABLE. Seems like a fairly MAJOR bug.

@niedfelj

@steve-rodriguez

Can you try it using my branch and let me know the results? I've added more logging to moped and pulled in a few pull requests that have not been merged into Moped master/the gem. We are using this in production without issue, although there are still some wonky things with IP changes etc.:

gem 'moped', :git=>'git://github.com/niedfelj/moped.git', :branch=>'all_patches'

@steve-rodriguez
Author

Will do. Thx.


@fedenusy
Contributor

fedenusy commented Mar 3, 2015

@niedfelj I tried your all_patches branch on prod and still got tons of Moped::Errors::ConnectionFailure: Could not connect to a primary node for replica set :(

@steve-rodriguez
Author

@niedfelj I tried this branch and it appears the ConnectionPool::PoolShuttingDownError has gone away, BUT when stepping down the primary, I see these errors now:

Moped::Errors::ConnectionFailure (Could not connect to a primary node for replica set #<Moped::Cluster:1553548 @seeds=[<Moped::Node resolved_address="ip1:33478">, <Moped::Node resolved_address="ip2:33478">, <Moped::Node resolved_address="ip3:33478">]>

where ip1, ip2 and ip3 are our internal IP addresses for each MongoDB node in the replica set.

So to test, I had our DBAs shut down each secondary node independently and restart it. I saw no connection-pool-shutting-down errors. Then, with all nodes online, I had them do a stepdown of the primary node, and I started getting the "Could not connect to a primary node in the replica set" error. Version 2.0.4 of Moped appears to have solved that issue, but your branch/patches contain the could-not-connect-to-primary issue, I believe. Please advise. Thanks for your help! I think this is getting closer to working... so if you merge your changes with the 2.0.4 changes, we may have something that is close to working.

@fedenusy
Contributor

fedenusy commented Mar 3, 2015

@steve-rodriguez moped 2.0.4 without @niedfelj's patch also throws Moped::Errors::ConnectionFailure: Could not connect to a primary node. So I'm not sure that merging 2.0.4 into @niedfelj's branch will do the trick.

@steve-rodriguez
Author

The 2.0.4 branch does throw those, BUT it actually does reconnect when retrying, I believe. I will do some more testing to verify.

@steve-rodriguez
Author

OK, here's the scoop on reproducing this issue. You have a 3-node replica set (n1-primary, n2-secondary and n3-secondary).

TEST1 - Good Behavior: If you have just restarted your app and have ALL good connections in your connection pool, you can do an rs.stepDown() to elect a new primary. The app will work properly and you get no errors.

BUT, if you shut down one of the secondaries (n2), restart it, AND then do an rs.stepDown() so that n2 is now the primary (n2-primary, n1-secondary, n3-secondary), the app will attempt to talk to the primary and will throw the Moped::Errors::ConnectionFailure: Could not connect to a primary node error. I have my number of retries set to 20 and I can see it count down, retrying every second, and then it fails after 20 tries. So I suspect that Moped is mistakenly keeping around connections that still think n1 is primary when it is actually now a secondary.
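(A small probe loop along these lines - a sketch, not from the thread - run in a rails console while the shutdown/stepDown sequence is performed makes it easy to see exactly which iterations raise and for how long:)

# Sketch: hit the cluster with a trivial command once per second and log
# what Moped raises while the shutdown + rs.stepDown() sequence happens.
session = Mongoid.default_session
120.times do |i|
  begin
    session.command(ping: 1)
    puts "#{i}: ok"
  rescue Moped::Errors::ConnectionFailure, ConnectionPool::PoolShuttingDownError => e
    puts "#{i}: #{e.class}: #{e.message}"
  end
  sleep 1
end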

@steve-rodriguez
Author

@niedfelj @fedenusy @durran . Please advise on my prior comments. Thanks!

@wandenberg
Contributor

Hi @steve-rodriguez, I'm trying to reproduce the problem and solve it, but I haven't been able to.
I fixed a problem with retry after a stepDown, but didn't have this scenario of shutdown + stepDown.
Can you help me reproduce it?
When you say 3 nodes, do they have to be on different servers/VMs, or can they be 3 mongod instances on the same machine?
Which value are you using for pool_size?
By shutdown do you mean a kill signal, kill -9, the shutdown command...?
Which version of MongoDB are you using?
With or without authentication?

@steve-rodriguez
Author

@wandenberg Here is our environment:

  1. 3 nodes (1 primary and 2 secondaries) on 3 different RHEL VMs
  2. pool_size for testing purposes is 10. I can make it larger but it doesn't solve the issue; I made it 10 so I can easily reproduce the issue.
  3. shutdown command
  4. MongoDB 2.6.6
  5. No authentication

Here are my mongoid.yml settings:

production:
  sessions:
    default:
      database: edist
      hosts:
        ['web-node-db1:33478', 'web-node-db2:33478', 'web-node-db3:33478']
      options:
        pool_size: 10
        pool_timeout: 15
        retry_interval: 1
        timeout: 30
        max_retries: 20
        refresh_interval: 20

    secondary:
      database: fund
      hosts:
        ['cms-node-db1:33478', 'cms-node-db2:33478', 'cms-node-db3:33478']
      options:
        pool_size: 10
        pool_timeout: 15
        retry_interval: 1
        timeout: 30
        max_retries: 20
        refresh_interval: 20

If you need anything else, let me know. I will be doing more testing today to see if I can further isolate the issue. Thanks! -Steve

@steve-rodriguez
Author

Here are some screen prints of the steps to reproduce the issue. Read preference is primary. Hope my description and screen prints help.

Each screen print starts with a number corresponding to the 7 steps, in sequential order, to reproduce the issue. I think that somehow Moped is creating pool connections to the secondary nodes, but ONLY hands back connections to the primary node. I believe Moped keeps track of the 3 nodes (1 primary and 2 secondaries). When a secondary is brought down, all is good, as Moped is reading from the primary, but somehow Moped keeps state that the secondary node is unavailable. When the secondary node is brought back up, all is still good, as it is still a secondary. Once you do a stepDown and the secondary node that was JUST down becomes primary, you start seeing occasional issues where it cannot connect to the primary node. I think this occurs when Moped returns a connection that was ORIGINALLY a connection to the secondary; that node is now primary, but since it was brought down, the status of the node remains down, I believe.

1 - initial_state
2 - shutdown node 2 - secondary
3 - bring-up-secondary
4 - stepdown_primary node 3
5 - node - 2 elected primary
6 - moped trying to connect to node 2
7 - eventual could not connect to primary error

@niedfelj

niedfelj commented Mar 4, 2015

Can you make sure you have your Rails logger set to debug, and make sure Moped/Mongoid are set up like this in an initializer:

Mongoid.logger = Rails.logger
Moped.logger = Rails.logger

You should be able to see MOPED asking for primaries, which I don't see in your output:

MOPED: 172.169.54.93:6210 COMMAND      database=admin command={:ismaster=>1} runtime: 1.3853ms

I'm also running a separate version of Mongoid, because I wanted to be able to see the exceptions during operations, which both mongoid/moped often just swallow:

https://github.com/mongoid/mongoid/pull/3884/files

I would set your pool_timeout to 0.5 ... 15 is too high and it's only the setting for "checking out" a connection (not connecting or making a connection).

And refresh_interval should be higher also...I would set it to at least 120 (the default is 300)

And I would lower timeout to 5.

I have this PR out for showing more configs (you may be interested in tuning down_interval)

https://github.com/mongoid/mongoid-site/pull/254/files
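(Pulling those suggestions together, a minimal initializer sketch - the file name is arbitrary and the debug level is only needed while investigating - would be:)

# config/initializers/mongoid_logging.rb (name is arbitrary)
# Route Mongoid/Moped output through the Rails logger at debug level so the
# MOPED command lines (e.g. the {:ismaster=>1} checks) show up in the app log.
Rails.logger.level = Logger::DEBUG
Mongoid.logger = Rails.logger
Moped.logger = Rails.logger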

@steve-rodriguez
Author

@niedfelj Will do this today, let you know the results, and post some log files.

@wandenberg
Contributor

Hi @steve-rodriguez

Can you test the code at this branch on your server?
Be sure that the moped-2.0.4.1.gem was generated.

It will produce some log messages like the ones below.
Maybe this branch does not solve the problem, but the log messages will help me try to fix it.

W, [2015-03-06T17:18:26.081014 #8846]  WARN -- :   MOPED: Getting configuration for node #<Moped::Node:0x00000003a8ed20>, configuration: {"setName"=>"dev", "setVersion"=>1, "ismaster"=>false, "secondary"=>true, "hosts"=>["localhost:27117", "localhost:27317", "localhost:27217"], "primary"=>"localhost:27217", "me"=>"localhost:27117", "maxBsonObjectSize"=>16777216, "maxMessageSizeBytes"=>48000000, "maxWriteBatchSize"=>1000, "localTime"=>2015-03-06 20:18:26 UTC, "maxWireVersion"=>2, "minWireVersion"=>0, "ok"=>1.0}. runtime: n/a
W, [2015-03-06T17:18:26.082225 #8846]  WARN -- :   MOPED: connection failure investigation: failed with class 'Moped::Errors::OperationFailure' and message 'The operation: #<Moped::Protocol::Command
  @length=80
  @request_id=7
  @response_to=0
  @op_code=2004
  @flags=[]
  @full_collection_name="test_db.$cmd"
  @skip=0
  @limit=-1
  @selector={:getlasterror=>1, :w=>"majority"}
  @fields=nil>
failed with error 10058: "not master"

See https://github.com/mongodb/mongo/blob/master/docs/errors.md
for details about this error.'. runtime: n/a
W, [2015-03-06T17:18:26.082284 #8846]  WARN -- :   MOPED: Retrying connection attempt 30 more time(s). runtime: n/a
W, [2015-03-06T17:18:27.083077 #8846]  WARN -- :   MOPED: Getting configuration for node #<Moped::Node:0x00000003a8ed20>, configuration: {"setName"=>"dev", "setVersion"=>1, "ismaster"=>false, "secondary"=>true, "hosts"=>["localhost:27117", "localhost:27317", "localhost:27217"], "me"=>"localhost:27117", "maxBsonObjectSize"=>16777216, "maxMessageSizeBytes"=>48000000, "maxWriteBatchSize"=>1000, "localTime"=>2015-03-06 20:18:27 UTC, "maxWireVersion"=>2, "minWireVersion"=>0, "ok"=>1.0}. runtime: n/a
W, [2015-03-06T17:18:27.083978 #8846]  WARN -- :   MOPED: Getting configuration for node #<Moped::Node:0x00000003a8e780>, configuration: {"setName"=>"dev", "setVersion"=>1, "ismaster"=>false, "secondary"=>true, "hosts"=>["localhost:27217", "localhost:27317", "localhost:27117"], "me"=>"localhost:27217", "maxBsonObjectSize"=>16777216, "maxMessageSizeBytes"=>48000000, "maxWriteBatchSize"=>1000, "localTime"=>2015-03-06 20:18:27 UTC, "maxWireVersion"=>2, "minWireVersion"=>0, "ok"=>1.0}. runtime: n/a
W, [2015-03-06T17:18:27.084718 #8846]  WARN -- :   MOPED: Getting configuration for node #<Moped::Node:0x00000003a8e230>, configuration: {"setName"=>"dev", "setVersion"=>1, "ismaster"=>false, "secondary"=>true, "hosts"=>["localhost:27317", "localhost:27217", "localhost:27117"], "me"=>"localhost:27317", "maxBsonObjectSize"=>16777216, "maxMessageSizeBytes"=>48000000, "maxWriteBatchSize"=>1000, "localTime"=>2015-03-06 20:18:27 UTC, "maxWireVersion"=>2, "minWireVersion"=>0, "ok"=>1.0}. runtime: n/a
W, [2015-03-06T17:18:27.085048 #8846]  WARN -- :   MOPED: Refreshed none #<Moped::Node:0x00000003a8ed20>, primary: false refreshed_at: 2015-03-06 17:18:27 -0300. runtime: n/a
W, [2015-03-06T17:18:27.085095 #8846]  WARN -- :   MOPED: Refreshed none #<Moped::Node:0x00000003a8e780>, primary: false refreshed_at: 2015-03-06 17:18:27 -0300. runtime: n/a
W, [2015-03-06T17:18:27.085136 #8846]  WARN -- :   MOPED: Refreshed none #<Moped::Node:0x00000003a8e230>, primary: false refreshed_at: 2015-03-06 17:18:27 -0300. runtime: n/a
W, [2015-03-06T17:18:27.085274 #8846]  WARN -- :   MOPED: connection failure investigation: failed with class 'Moped::Errors::ConnectionFailure' and message 'Could not connect to a primary node for replica set #<Moped::Cluster:30701240 @seeds=[<Moped::Node resolved_address="127.0.0.1:27117">, <Moped::Node resolved_address="127.0.0.1:27217">, <Moped::Node resolved_address="127.0.0.1:27317">]>'. runtime: n/a

@steve-rodriguez
Author

@wandenberg Yep give me a bit and I will post the results. Didn't have a chance to post results yesterday...

@steve-rodriguez
Author

@niedfelj Here is a RECAP of my testing with your branch, with the following mongoid.yml setting:

production:
  sessions:
    default:
      database: edist
      hosts:
        ['qa2-web-db1:33478', 'qa2-web-db2:33478', 'qa2-web-db3:33478']
      options:
        read: primary_preferred
        pool_size: 10
        pool_timeout: 0.5
        retry_interval: 1
        timeout: 5
        max_retries: 20
        refresh_interval: 120

    secondary:
      database: fund
      hosts:
        ['qa2-cms-db1:33478', 'qa2-cms-db2:33478', 'qa2-cms-db3:33478']
      options:
        read: primary_preferred
        pool_size: 10
        pool_timeout: 0.5
        retry_interval: 1
        timeout: 5
        max_retries: 20
        refresh_interval: 120

I changed my read preference to primary_preferred. Ideally a read preference of primary should work, but I believe something in Moped caused it NOT to work. I had a 3-node replica set that looked like this to start (n1-secondary, n2-primary and n3-secondary). I brought down n1 and tested for a bit with no errors visible to the app, though there were Moped errors in the log, which is fine. I brought n1 back up and took down n3. Same behavior. I brought n3 back up and stepped down n2. Then I took down n2 and brought it back up. All of this was done to simulate a MongoDB upgrade. No errors were visible to the app, but the attached Moped errors occurred in the log. I started the test at 1:34 p.m. and ended it around 1:38 p.m.

Please advise on this TEST, and on getting your niedfelj branch changes into the main Moped code base and a NEW release cut. It would be nice if the read preference could be primary instead of primary_preferred. Thanks! -Steve
[Attachment: moped_errors]

@wandenberg
Contributor

Hi @steve-rodriguez

At some point in your application the logger was set to nil, and Moped.logger does not create the default logger again.
You can test this by running something like the following in the console:

Moped.logger # will return a logger instance, Rails logger or Logger
Moped.logger = nil
Moped.logger # will return nil

I changed Moped.logger to always return an instance.
Doing the same test with 27cb994, the final call will return at least a Logger instance.

Sorry for any problems this may have caused.
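(For readers following along, the behaviour described above boils down to a memoized reader with no fallback; a generic sketch of that kind of fix - not the actual 27cb994 commit, names are illustrative - would be:)

require "logger"

# Sketch: the reader re-creates a default logger instead of staying nil once
# someone has assigned `logger = nil`.
module LoggerWithFallback
  attr_writer :logger

  def logger
    @logger ||= (defined?(Rails) && Rails.logger) || Logger.new($stdout)
  end
end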

@wandenberg
Contributor

Hi @steve-rodriguez

Did you have an opportunity to test the code?

@steve-rodriguez
Author

@wandenberg Sorry for the delay. I had to get some QA resources to help test, and they were unavailable at the end of last week. I will be doing some testing today and will respond by the end of the day. Thanks!

@steve-rodriguez
Author

@wandenberg Here is a snippet of my mongoid.yml and the log file from the app that shows the ConnectionPool::PoolShuttingDownError. All I basically did to generate this error was shut down a secondary node in the replica set and then click around the app a few times; the error would show up occasionally.

Mongoid.yml snippet:

production:
  sessions:
    default:
      database: edist
      hosts:
        ['qa2-web-db1:33478', 'qa2-web-db2:33478', 'qa2-web-db3:33478']
      options:
        read: primary
        pool_size: 10
        pool_timeout: 0.5
        retry_interval: 1
        timeout: 5
        max_retries: 20
        refresh_interval: 120

    secondary:
      database: fund
      hosts:
        ['qa2-cms-db1:33478', 'qa2-cms-db2:33478', 'qa2-cms-db3:33478']
      options:
        read: primary
        pool_size: 10
        pool_timeout: 0.5
        retry_interval: 1
        timeout: 5
        max_retries: 20
        refresh_interval: 120

jruby_rack.txt | uploaded via ZenHub

@steve-rodriguez
Author

@niedfelj Any feedback on what I mentioned to you 8 days ago? thanks!

@niedfelj

@steve-rodriguez Those log messages aren't necessarily "errors". I just expanded the logging facilities in my branch because I wanted more visibility into what is happening inside moped/mongoid (IMHO, the current implementations take the stance of showing almost no logging, and that's not helpful). I'm glad my branch worked for you; unfortunately I don't have any ability to get the changes into the main codebase. My PRs are sitting there waiting...

@dblock
Contributor

dblock commented Mar 18, 2015

Bump @durran @arthurnn.

@wandenberg
Contributor

Hi @steve-rodriguez

Can you test the code in #359?
I think it will solve the problem with ConnectionPool::PoolShuttingDownError.
If it works and you need retry for write operations, you can test this other branch.

@aponsin

aponsin commented Mar 25, 2015

Is there any progress on this issue? It's really a plague for production applications running with it. Thanks @steve-rodriguez @niedfelj @wandenberg for the work, and please, @durran @arthurnn, could we incorporate the fixes into a new release ASAP? Thanks.

@dblock
Contributor

dblock commented Mar 25, 2015

A little bird tells me we're going to get relatively little support for this, because Mongoid 5 + the official MongoDB Ruby driver 2.0 are in fairly advanced shape. It does seem like @arthurnn is willing to merge commits and release things, though - let's all try to give him some PRs he can vote yes/no on?

@matsimitsu

We're seeing this issue as well when one of the secondaries is not reachable at all (this is with 2.0.4 and the patches from @wandenberg in the wandenberg/operation_timeout branch).

To reproduce (in a 3-node replica set): simply shut down one of the secondaries (the entire host, not just mongod) and you'll see:

2015-04-29T14:25:01.841Z 27232 TID-otcy5hal8 WARN: ConnectionPool::PoolShuttingDownError
2015-04-29T14:25:01.842Z 27232 TID-otcy5hal8 WARN: /home/app/vendor/bundle/ruby/2.2.0/gems/connection_pool-2.2.0/lib/connection_pool/timed_stack.rb:78:in `block (2 levels) in pop'

When running this in Sidekiq, it doesn't recognise this as an error and drops the job completely, causing data to be lost from the queue.
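(One hedged workaround at the job layer - a sketch using Sidekiq's standard server-middleware API, with a hypothetical class name - is to convert the pool error into an ordinary connection failure so Sidekiq's retry machinery keeps the job instead of losing it:)

# app/middleware/mongo_pool_retry.rb (hypothetical)
class MongoPoolRetry
  def call(_worker, _job, _queue)
    yield
  rescue ConnectionPool::PoolShuttingDownError => e
    # Re-raise as a ConnectionFailure so the failure goes through Sidekiq's
    # normal retry handling rather than the job silently disappearing.
    raise Moped::Errors::ConnectionFailure, e.message
  end
end

# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add MongoPoolRetry
  end
end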

@matsimitsu

So I did a little digging today and this is the relevant stack trace:

2015-04-30T10:59:24.371Z 24905 TID-ow5881u84 WARN: ConnectionPool::PoolShuttingDownError
2015-04-30T10:59:24.371Z 24905 TID-ow5881u84 WARN: /home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/connection_pool-2.2.0/lib/connection_pool/timed_stack.rb:78:in `block (2 levels) in pop'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/connection_pool-2.2.0/lib/connection_pool/timed_stack.rb:77:in `loop'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/connection_pool-2.2.0/lib/connection_pool/timed_stack.rb:77:in `block in pop'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/connection_pool-2.2.0/lib/connection_pool/timed_stack.rb:76:in `synchronize'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/connection_pool-2.2.0/lib/connection_pool/timed_stack.rb:76:in `pop'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/connection_pool-2.2.0/lib/connection_pool.rb:89:in `checkout'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/connection_pool-2.2.0/lib/connection_pool.rb:61:in `block in with'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/connection_pool-2.2.0/lib/connection_pool.rb:60:in `handle_interrupt'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/connection_pool-2.2.0/lib/connection_pool.rb:60:in `with'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:116:in `connection'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:147:in `disconnect'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/failover/retry.rb:27:in `execute'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:193:in `rescue in ensure_connected'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:196:in `ensure_connected'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:598:in `block in flush'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:626:in `block in logging'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/activesupport-4.1.7/lib/active_support/notifications.rb:159:in `block in instrument'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/activesupport-4.1.7/lib/active_support/notifications/instrumenter.rb:20:in `instrument'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/gems/activesupport-4.1.7/lib/active_support/notifications.rb:159:in `instrument'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/instrumentable.rb:31:in `instrument'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:625:in `logging'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:596:in `flush'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:399:in `process'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/operation/read.rb:48:in `execute'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:657:in `read'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:90:in `command'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/node.rb:440:in `refresh'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/cluster.rb:188:in `block in refresh'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/cluster.rb:200:in `each'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/cluster.rb:200:in `refresh'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/cluster.rb:151:in `nodes'
/home/app/releases/6c1b2cea61f123c5d3298eb4f75c06f015cf3c6c/vendor/bundle/ruby/2.2.0/bundler/gems/moped-4990ec2a8fab/lib/moped/cluster.rb:246:in `with_primary'

The issue lies in the retry being attempted when a node is down (I used the @wandenberg debug branch to add even more logging):

W, [2015-04-30T11:21:02.112642 #1386]  WARN -- :   MOPED: Failover strategy source: 'Moped::Errors::ConnectionFailure' and message 'Timed out connection to Mongo on xxx.xxx.xxx.85:27017'. runtime: n/a
W, [2015-04-30T11:21:02.113800 #1386]  WARN -- :   MOPED:  Disconnecting from <Moped::Node resolved_address="xxx.xxx.xxx.85:27017"> #<Moped::Node:0x007fdd3b7cf900> runtime: n/a

The retry fails (from what I have gathered) because it tries to disconnect the node through a connection pool that is already shutting down.

When I change the Failover strategy (lib/moped/failover.rb) from:

Errors::ConnectionFailure => Retry,

to

Errors::ConnectionFailure => Disconnect,

The issue disappears.
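(For context, that mapping lives in Moped's failover table; a simplified sketch of the change - the surrounding entries are reproduced from memory of the 2.0.x source and may not be exact - looks like:)

# lib/moped/failover.rb (simplified sketch, entries other than the changed
# line are assumptions and may differ from the actual source)
STRATEGIES = {
  Errors::AuthenticationFailure  => Ignore,
  Errors::ConnectionFailure      => Disconnect,   # was Retry
  Errors::CursorNotFound         => Ignore,
  Errors::OperationFailure       => Reconfigure,
  Errors::QueryFailure           => Reconfigure,
  Errors::ReplicaSetReconfigured => Reconfigure
}.freeze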

I tried rescuing the connection pool error, but I had little luck:

(lib/moped/failover/retry.rb)

def execute(exception, node)
  begin
    node.disconnect unless exception.is_a?(Errors::PoolTimeout)
  rescue => e
    node.down!
    raise(e)
  end

  begin
    node.connection do |conn|
      yield(conn) if block_given?
    end
  rescue Errors::PoolTimeout => e
    raise Errors::ConnectionFailure.new e
  rescue Exception => e
    node.down!
    raise(e)
  end
end

It would still raise the connection pool error.

So my question is: why would we want to retry on Moped::Errors::ConnectionFailure? Isn't it better to just disconnect, and then after the refresh_interval in the moped config it would try again, right?

@sahin

sahin commented May 22, 2015

+1, @arthurnn any update on this one?

@sahin

sahin commented May 22, 2015

+1

Right now we have some issues in production across dozens of servers and websites, plus an API that is used by many vendors, movie studios and our apps.

Right now, if anything happens to a node in the replica set, we are getting:
No route to host - connect(2) for "20.0.0.16" port 27017
ConnectionPool::PoolShuttingDownError

@sahin

sahin commented May 22, 2015

@steve-rodriguez, @matsimitsu's solution seems fair. What's the final say? It would be great if we could merge it and bump the version.

@steve-rodriguez
Author

@arthurnn @durran @sahin @matsimitsu As you can see, I have been pushing to get this fixed for months, and for some reason there seems to be no activity on this project with respect to this issue or getting the various pull requests committed. Quite a few developers have provided potential solutions, but nothing seems to be getting pulled into the code base. Can one of the committers address this issue and explain why no pull requests or ideas are being committed/implemented? Thanks!

@sahin

sahin commented May 22, 2015

@arthurnn @durran Moped is in production powering some super famous movie sites. I can't tell you which ones, but I can give a hint: skynet :) - go check the site and you'll see the movielala.com logo in the videos. So I can say there are many people all over the world affected by this issue. It is also affecting our production API + web site.

I know that you guys are busy and moped is being superseded by the official Mongo driver :) but before that happens...

We are having crazy issues with this error, so a fix would be super great. You guys could consider @steve-rodriguez's solution and the other proposed solutions.

@sahin

sahin commented May 23, 2015

@arthurnn @durran @matsimitsu @steve-rodriguez I just had around 5K errors during the primary election.

[Screenshot: Bugsnag error counts]

@sahin

sahin commented May 25, 2015

@arthurnn @durran @matsimitsu @steve-rodriguez any updates?

@isair

isair commented May 25, 2015

+1

@deepredsky

Bump! @durran @arthurnn

@badeozgule

+1

@dblock
Contributor

dblock commented May 26, 2015

Can everyone comment on what they are running in production right now (and with what effect)? Maybe someone can push a version of the moped gem to RubyGems under a different name (e.g. moped-pool-shutting-down-fixes)?

@dblock
Contributor

dblock commented May 26, 2015

#351 has been merged; can people on this thread please try that solution?

@steve-rodriguez
Author

@dblock Didn't get a chance to get to this yesterday. Will test it this afternoon and provide feedback. Thanks!

@chenqiangzhishen

chenqiangzhishen commented Jul 27, 2016

Using the config @argami referred to, the Moped::Errors::ConnectionFailure: Could not connect to a primary nod... issue does not seem to appear again.
