Give grace period for `riakc_pb_socket` to reconnect #1230

shino · 2015-08-31T08:28:52Z

This PR addresses #1201 (RCS-250) (RCS-250) .

When riakc_pb_socket detects timeout errors, it disconnects the
TCP connection and enters reconnection phase. Immediate next
request to riakc_pb_socket process will likely fail with
infamous disconnect errors.

Typical case is described in #1201 (RCS-250) (RCS-250) for user object fetch.
Also, there is another case for block get. If the first
primary vnode for a block is very slow, the reuqest will
timeout because it is N=one. Then subsequent N=all
request will likely fail.

This PR fixes these two cases by adding grace period so that
riakc_pb_socket can reconnect if possible.

When strong user get fails with timeout or disconnected error, weak user get which happens just after it will likely fail because reconnection of riakc_pb_socket still undergo and not connected yet. This commit gives some time period for riakc_pb_socket can reconnect to Riak after timeout or disconnected errors.

Block get is executed N=one at first by default, if the node for the first primary vnode is too-slow/hang/frozen/almost-but-not-dead, the get operation may fail with timeout. Then riakc_pb_socket disconnects the TCP connection and enter reconnection phase. Doing N=all get just after that likely fails because TCP connection is not connected yet.

shino · 2015-09-01T02:21:53Z

Rebased on develop 1643c8d to incorporate r_t fix #1231

kuenishi · 2015-09-07T04:33:58Z

~~Currently we have auto_reconnect as false by default, so the recovery from disconnected state is by just re-launching the process by poolboy?~~ [edit]We had auto_reconnect as true by default. I was wrong.

shino · 2015-09-07T05:20:39Z

by just re-launching the process by poolboy?

Does "the process" indicate riak_cs_riak_client process? Or riakc_pb_socket?

kuenishi · 2015-09-07T07:12:45Z

I think it was riak_cs_riak_client.

shino · 2015-09-07T07:28:52Z

It may be my mind model of riak cs code base... some processes share single riak_cs_riak_client checked out from a pool, and (at least I think) the pid is unchangable in a request-response lifecycle. If it should be changed or renewed, all the processes that share it should be notified and renew to new pid. At the current implementation, I'm afraid that it might be a big diff covering many modules in the final stage of 2.1.0 development cycle.

By the example of this PR's current diff, block server process detects timeout but it can not notify put/get FSM and/or mochiweb acceptor processes which has the pid in #context.

Although, I prefer any other way to just sleeping in general if possible. This PR is not mandatory for 2.1.0 because it's long standing issue 😓

Some random thoughts for possible next PR creator:

Slow node or silent node failure may prevent some users from accessing Riak CS system completely (because user object can not be fetched) typically for 60 seconds.
Using queue_if_disconnected encapsulates disconnected state in riakc_pb_socket and hides it from riak cs, it will be not good for visibility.

…tion Give grace period for `riakc_pb_socket` to reconnect Reviewed-by: kuenishi

shino · 2015-09-07T08:14:02Z

@borshop merge

kuenishi · 2015-09-07T09:19:45Z

GC also fails at riak_cs_gc_key_list.erl like this:

2015-09-07 18:17:29.172 [warning] <0.10473.4>@riak_cs_gc_key_list:eligible_manifest_keys:122 Error occurred trying to query from time <<"0">> to <<"1441531049">>in gc key index. Reason: disconnected

Maybe we'd also need this here. ... but the code change couldn't be small. fmm.

shino added this to the 2.2.0 milestone Aug 31, 2015

shino added 2 commits September 1, 2015 11:21

shino force-pushed the feature/grace-period-for-reconnection branch from 2470de1 to 0e35cb6 Compare September 1, 2015 02:21

shino modified the milestones: 2.1.0, 2.2.0 Sep 3, 2015

borshop added a commit that referenced this pull request Sep 7, 2015

Merge pull request #1230 from basho/feature/grace-period-for-reconnec…

168269a

…tion Give grace period for `riakc_pb_socket` to reconnect Reviewed-by: kuenishi

shino mentioned this pull request Sep 7, 2015

Timeout in get user with strong option makes subsequent weak get fail [JIRA: RCS-250] #1201

Closed

borshop merged commit 0e35cb6 into develop Sep 7, 2015

shino deleted the feature/grace-period-for-reconnection branch September 7, 2015 08:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Give grace period for `riakc_pb_socket` to reconnect #1230

Give grace period for `riakc_pb_socket` to reconnect #1230

shino commented Aug 31, 2015

shino commented Sep 1, 2015

kuenishi commented Sep 7, 2015

shino commented Sep 7, 2015

kuenishi commented Sep 7, 2015

shino commented Sep 7, 2015

shino commented Sep 7, 2015

kuenishi commented Sep 7, 2015

Give grace period for riakc_pb_socket to reconnect #1230

Give grace period for riakc_pb_socket to reconnect #1230

Conversation

shino commented Aug 31, 2015

shino commented Sep 1, 2015

kuenishi commented Sep 7, 2015

shino commented Sep 7, 2015

kuenishi commented Sep 7, 2015

shino commented Sep 7, 2015

shino commented Sep 7, 2015

kuenishi commented Sep 7, 2015

Give grace period for `riakc_pb_socket` to reconnect #1230

Give grace period for `riakc_pb_socket` to reconnect #1230