Give grace period for `riakc_pb_socket` to reconnect #1230
When a strong user get fails with a timeout or disconnected error, the weak user get that follows immediately will likely fail too, because riakc_pb_socket is still reconnecting and is not yet connected. This commit gives riakc_pb_socket a grace period in which to reconnect to Riak after timeout or disconnected errors.
By default, a block get is first executed with N=one. If the node for the first primary vnode is too slow, hung, frozen, or almost-but-not-dead, the get operation may fail with a timeout. riakc_pb_socket then disconnects the TCP connection and enters its reconnection phase. An N=all get issued right after that will likely fail because the TCP connection is not yet re-established.
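The failure sequence described in the commit message can be illustrated with a small simulation. This is only a sketch in Python, not the Riak CS code: `FlakySocket`, `get_block`, `reconnect_delay`, and `grace_period` are hypothetical names, and the socket is a stand-in that merely mimics "timeout, then unavailable until reconnected".

```python
import time


class FlakySocket:
    """Hypothetical stand-in for a client socket that reconnects after a timeout."""

    def __init__(self, reconnect_delay):
        self.reconnect_delay = reconnect_delay
        self.reconnected_at = None  # None means "currently connected"

    def _connected(self):
        return self.reconnected_at is None or time.monotonic() >= self.reconnected_at

    def get(self, key, n, slow_primary=False):
        if not self._connected():
            # Request arrived while the socket is still reconnecting.
            return ("error", "disconnected")
        self.reconnected_at = None
        if n == "one" and slow_primary:
            # Timeout: the socket drops the connection and starts reconnecting.
            self.reconnected_at = time.monotonic() + self.reconnect_delay
            return ("error", "timeout")
        return ("ok", f"value-of-{key}")


def get_block(sock, key, grace_period, slow_primary=False):
    """Try N=one first; on timeout/disconnect, wait, then fall back to N=all."""
    result = sock.get(key, n="one", slow_primary=slow_primary)
    if result[0] == "ok":
        return result
    if result[1] in ("timeout", "disconnected"):
        # Grace period: give the socket time to reconnect before retrying.
        time.sleep(grace_period)
    return sock.get(key, n="all")
```

With no grace period the fallback request hits the socket while it is still reconnecting; with a grace period longer than the assumed reconnect delay, the fallback succeeds.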
2470de1 to 0e35cb6
Does "the process" indicate |
I think it was riak_cs_riak_client. |
This may just be my mental model of the Riak CS code base, but some processes share a single riak_cs_riak_client checked out from a pool, and (at least I think) its pid is unchangeable within a request-response lifecycle. If it had to be changed or renewed, all the processes that share it would need to be notified and switch to the new pid. With the current implementation, I'm afraid that would be a big diff covering many modules in the final stage of the 2.1.0 development cycle. By the example of this PR's current diff, block server process detects Although, in general, I would prefer some way other than just sleeping if possible. This PR is not mandatory for 2.1.0 because it's a long-standing issue 😓 Some random thoughts for possible next PR creator:
|
…tion Give grace period for `riakc_pb_socket` to reconnect Reviewed-by: kuenishi
@borshop merge |
GC also fails at
Maybe we'd also need this here. ... but the code change couldn't be small. fmm. |
This PR addresses #1201 (RCS-250).
When `riakc_pb_socket` detects timeout errors, it disconnects the TCP connection and enters the reconnection phase. The next request sent to the `riakc_pb_socket` process immediately afterwards will likely fail with the infamous `disconnect` errors. A typical case is described in #1201 (RCS-250) for user object fetch.
There is also another case, for block get. If the first primary vnode for a block is very slow, the request will time out because it is `N=one`. The subsequent `N=all` request will then likely fail.
This PR fixes these two cases by adding a grace period so that `riakc_pb_socket` can reconnect if possible.
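A grace period does not have to be a fixed sleep. As a rough sketch of one alternative (not what this PR implements), the caller could poll a connectivity probe until a deadline and return as soon as the socket is back. The names `wait_until_connected`, `is_connected`, `grace_period`, and `poll_interval` below are hypothetical; in the Erlang client a check such as `riakc_pb_socket:is_connected/1` might serve as the probe, but that wiring is an assumption.

```python
import time


def wait_until_connected(is_connected, grace_period, poll_interval=0.01):
    """Poll is_connected() until it returns True or grace_period seconds elapse.

    Returns True if the probe succeeded within the grace period, else False.
    """
    deadline = time.monotonic() + grace_period
    while True:
        if is_connected():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

Compared with an unconditional sleep, this waits only as long as the reconnection actually takes and still gives up after the same upper bound.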