-
Notifications
You must be signed in to change notification settings - Fork 151
Fix 16550 "not authorized", TypeError, and others #324
Conversation
Call to disconnect triggers call to pool, which uses address.resolved as a hash key. This in turn causes pollution of the connection pool.
Test for equality of Node (==,#eql?) depends on node.address.resolved. Therefore avoid such tests (such as .include?() ) until address is resolved. This fixes unbounded growth (read: memory leak) of the seeds array in unstable network environments.
If existing socket is reset, perhaps due to mongod failure, credentials hash isn't reset. This causes them to not be sent upon reconnection. Changed to reset at connect() instead of relying on a disconnect() that may or may not happen. Also cache last set of credentials so they can be reused when connection loss is detected by with_connection().
We just ran a test against this version, and we still received the "TypeError: no implicit conversion of nil into String" (#290).
|
Testing this branch with the MongoidRnR script has shown the issue to be resolved. We're about to push this out to production to see what happens. While this does seem to resolve the eventual failure of DNS resolution (our connection works for some time, but eventually fails), there still does seem to be some low level Resolv failure which is leading to the trouble. We added some debugging and found that in |
@leemhenson Thanks for testing. Your bundle path (/vendor/bundle/ruby/2.1.0/bundler/gems/moped-4c2a0a29122e) shows a commit for the simplybusiness/moped/fix-retries-and-failover tree, not mine (zarqman/moped/fix-auth-errors). Of course, you may have manually applied one patchset on top of the other, so could you confirm which patches you're running before we dig in further? |
@mhoran It's possible the issue is with Resolv itself. In the past, I did extensive DNS work and found the Resolv library to be quite unreliable. Most of that was on Ruby 1.8. As I recall, we ended up using an alternate library instead. My own tests when building the above patches were set to cause DNS to fail 2/3s of the time, at random. I was able to run tests for several hours without it puking, so I'm hopeful these patches at least accommodate Resolv's behavior. |
@zarqman whoopsadaisy! we'll give it another go, this time with the right commit! sorry for the noise. |
Ok, we've pushed the right commit this time. We'll leave it overnight to see if we get any #290 -s, but one thing I did try was trigger a stepdown on our compose.io replica set. Moped immediately failed with the error below, and did not recover without an app restart:
|
After 24 hours and not a single "no implicit conversion of nil into String" error in production, I'd say this is a reasonable fix for #290. Of course, DNS resolution is still failing sporadically, but that's something else. We tried adding "resolv-replace" to our project, which uses C to re-implement Resolv, but that didn't help. We also raised the problem with Heroku, given there is only a single nameserver in /etc/resolv.conf and their DNS is a black box. |
@@ -46,6 +46,7 @@ def alive? | |||
# | |||
# @since 1.0.0 | |||
def connect | |||
credentials.clear |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we clear credentials on connect now , instead of on disconnect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arthurnn If the socket disconnects (network issues, mongod failover, etc.), with_connection() can trigger a reconnect (sometimes without disconnect() being called). In this situation, credentials haven't been cleared and so they won't be resent, which causes the 16550 not authorized errors.
Resetting credentials ensures they're are always resent on reconnection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@jonhyman , is this related to your PRs too? |
Fix 16550 "not authorized", TypeError, and others
Thanks... pr looks good |
This is awesome! @arthurnn is it possible to create a new minor version so we can start deploying this into production without relying on git? |
http://rubygems.org/gems/moped/versions/2.0.1 . version 2.0.1 released with this fix.. thanks all |
This set of commits fixes a number of outstanding errors, including 16550: "not authorized for query" and TypeError: no implicit conversion of nil into String.
I believe this addresses #288, #278, #252, #247, #275, #290, and others.
Nearly everything here is triggered by unstable network environments, hence the difficulty in testing them. Thanks to MongoLab for their flipflop server which was a great help. I also injected random intentional DNS failures. The combination of the two, along with the helpful script in #323, were the basis for discovery. On to the issues:
DNS address resolution did not handle the case of DNS failure on the initial call to Address#resolve. This caused node.address.ip and node.address.resolved to be invalid which had a variety of cascading effects, including a memory leak in the seeds array and the dreaded "TypeError: no implicit conversion of nil into String" error.
There are multiple codepaths to trigger Connection#connect. Only some of them actually call apply_credentials(). Further, it was possible to call #connect without resetting @credentials, causing apply_credentials() to no-op. Each of these could individually trigger the 16550 errors. All of this has been addressed.
Using the flipflop server + random DNS failures and the script noted above, the only remaining error I can cause is '13435: "not master and slaveOk=false"'. The patches in #320 fix those as well and I recommend that set of patches. I have tested the below set of patches against both moped/master as well as the tree from #320.