Fix for disconnect issues in #732 #784

Closed
wants to merge 1 commit

Conversation

@rolette commented Sep 10, 2016

I haven't finished all of my testing yet, but initial signs are all good and it passes your tox tests. There were a couple of behavior changes required to fully fix it, so I wanted to get your feedback on my approach.

  1. The socket.shutdown() call in Connection.disconnect() wasn't safe if the process was forked and the child process inherited the file descriptor. Since the shutdown() call is desirable for faster cleanup in the non-forked case, I added a new fork_safe option to Connection.__init__() that lets the code do the safe thing by default or take advantage of faster socket cleanup in a non-forked environment.

That also allowed us to skip doing anything in _checkpid() if we know the process isn't forked.

  2. The other change fixes ConnectionPool.disconnect() so it doesn't rip connections that are in use out from under the threads that own them. I use a generation counter to flush connections safely.

A side-effect of this is that ConnectionPool.disconnect() doesn't actually disconnect anything immediately; connections are disconnected gradually instead. I added an option to force the old "damn the torpedoes!" behavior of killing all connections at once, but the default is the safe path. A rough sketch of both changes follows below.
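
To make the approach concrete, here's a rough sketch of the shape of both changes (simplified, not the actual diff: fork_safe is the option described above, but the _generation counter name and the get_connection()/release() plumbing shown here are just illustrative):

```python
import os
import socket


class Connection(object):
    """Illustrative only -- not the actual redis-py Connection."""

    def __init__(self, fork_safe=True):
        # fork_safe defaults to the safe behavior; callers that never fork can
        # pass fork_safe=False to keep the faster socket.shutdown() cleanup
        # (and _checkpid() can then be skipped entirely).
        self.fork_safe = fork_safe
        self.pid = os.getpid()      # used by _checkpid() in the real class
        self.generation = None      # stamped by the pool when handed out
        self._sock = None

    def disconnect(self):
        if self._sock is None:
            return
        try:
            if not self.fork_safe:
                # Only safe when no forked child shares this file descriptor.
                self._sock.shutdown(socket.SHUT_RDWR)
            self._sock.close()
        except OSError:
            pass
        self._sock = None


class ConnectionPool(object):
    """Generation-counter flush: disconnect() retires connections lazily."""

    def __init__(self, connection_class=Connection, **connection_kwargs):
        self.connection_class = connection_class
        self.connection_kwargs = connection_kwargs
        self._generation = 0
        self._available_connections = []
        self._in_use_connections = set()

    def get_connection(self):
        connection = None
        while self._available_connections:
            candidate = self._available_connections.pop()
            if candidate.generation == self._generation:
                connection = candidate
                break
            # Created before the last disconnect() call -- retire it now.
            candidate.disconnect()
        if connection is None:
            connection = self.connection_class(**self.connection_kwargs)
        connection.generation = self._generation
        self._in_use_connections.add(connection)
        return connection

    def release(self, connection):
        self._in_use_connections.discard(connection)
        if connection.generation != self._generation:
            # Stale connection; its owner is done with it, so drop it here.
            connection.disconnect()
        else:
            self._available_connections.append(connection)

    def disconnect(self, force=False):
        if force:
            # Old behavior: tear everything down immediately, including
            # connections other threads are still using.
            conns = self._available_connections + list(self._in_use_connections)
            for connection in conns:
                connection.disconnect()
        else:
            # New default: close nothing here; bumping the generation makes
            # every existing connection stale, and each one is closed safely
            # as it passes back through get_connection()/release().
            self._generation += 1
```

The key property is that the pool's disconnect() never touches a socket another thread currently owns unless you explicitly pass force=True.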

Let me know what you think.

Connection.disconnect() was ripping connections out from under the threads or processes that owned them

redis#732
@rolette (Author) commented Sep 15, 2016

Extended testing still looks good on the fix.

@andymccurdy (Contributor) commented

@rolette I don't see any tests included with the PR. Maybe you forgot to push them?

@rolette (Author) commented Sep 15, 2016

@andymccurdy I didn't create any new redis-py tests... I ran the existing tests to make sure I didn't break anything. The rest of the testing I was referring to was done in my product, where I was originally seeing the various issues.

@rolette (Author) commented Sep 16, 2016

@andymccurdy To add a little more context, I'm not really sure how you'd go about adding a test that fits into the redis-py test infrastructure.

The errors generated by disconnects ripping connections out from under their owners are very timing-dependent. It takes some run time to trigger the problem, and there's no fixed number of iterations you can run and then say definitively that it is fixed.

I was previously able to repro some form of the issue in a reasonable amount of time within my product, but it's a complex environment (see my previous description in the bug thread). With this patch, I haven't been able to reproduce it yet.

Beyond gaining more confidence in the fix with accumulated runtime, IMO code review is probably the best way to validate it.

@rolette (Author) commented Oct 7, 2016

I've been running with this fix for a month and haven't seen the disconnect issues from #732 since applying it. I'm calling it done from my side. Can we get this merged into the main repo, @andymccurdy?

@jhgg (Contributor) commented Feb 9, 2017

Would love to see this merged. We run into a similar issue as well, especially when using the sentinel connection pool: when the sentinel elects a new master, we get a lot of exceptions until we fully restart the API servers.

@jhgg (Contributor) commented Feb 21, 2017

We've merged this into our fork and found that this implementation does not work with BlockingConnectionPool. I implemented similar logic on top of it in our fork: discord#1

@rolette - take a look if you have a chance and see what you think, and whether you want to add that to this PR.

@rolette (Author) commented Feb 22, 2017

@jhgg - Yeah, we don't use BlockingConnectionPool in my product, so I didn't touch it.

It's been 5 months since I submitted the PR, so I'm not sure @andymccurdy has any interest in merging it, but I'll try to take a look at your additions this weekend.

@jhgg (Contributor) commented Feb 22, 2017

Thanks! I also implemented some tests for the blocking connection pool logic.

@andymccurdy (Contributor) commented

@rolette I've finally gotten around to resolving this. We already have a "unique-ish" id in the PID stored on both the Connection and the ConnectionPool, so I'm using that as the guard against a child calling shutdown() on a socket it didn't create. I've also added some tests using the multiprocessing lib (which uses fork) to make sure we're doing the right thing.
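
For reference, the guard looks roughly like this (a minimal sketch of the PID-based approach described above, plus a fork-style check in the spirit of the new multiprocessing tests; not necessarily the exact code that shipped):

```python
import multiprocessing
import os
import socket


class Connection(object):
    """Illustrative stand-in; the real class lives in redis.connection."""

    def __init__(self, host='localhost', port=6379):
        self.pid = os.getpid()          # the process that created this connection
        self.host, self.port = host, port
        self._sock = None

    def connect(self):
        self._sock = socket.create_connection((self.host, self.port))

    def disconnect(self):
        if self._sock is None:
            return
        try:
            if os.getpid() == self.pid:
                # Only the creating process shuts the socket down; a forked
                # child that inherited the descriptor just closes its copy.
                self._sock.shutdown(socket.SHUT_RDWR)
            self._sock.close()
        except OSError:
            pass
        self._sock = None


def _child_disconnect(conn):
    # Runs in the child: os.getpid() differs from conn.pid, so shutdown() is
    # skipped and only the child's copy of the descriptor is closed.
    conn.disconnect()


if __name__ == '__main__':
    # Assumes something is listening on localhost:6379 and a platform where
    # the 'fork' start method is available (i.e. not Windows).
    conn = Connection()
    conn.connect()
    child = multiprocessing.get_context('fork').Process(
        target=_child_disconnect, args=(conn,))
    child.start()
    child.join()
    # The parent's socket was never shut down, so it is still usable.
    assert conn._sock is not None
```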

Sorry again for the length of time, but thanks for raising this issue and putting together this PR. It was definitely helpful.

@andymccurdy (Contributor) commented

3.2.0 has been released with these changes.
