Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Give grace period for riakc_pb_socket to reconnect #1230

Merged
merged 2 commits into from
Sep 7, 2015

Conversation

shino
Copy link
Contributor

@shino shino commented Aug 31, 2015

This PR addresses #1201 (RCS-250) (RCS-250) .

When riakc_pb_socket detects timeout errors, it disconnects the
TCP connection and enters reconnection phase. Immediate next
request to riakc_pb_socket process will likely fail with
infamous disconnect errors.

Typical case is described in #1201 (RCS-250) (RCS-250) for user object fetch.
Also, there is another case for block get. If the first
primary vnode for a block is very slow, the reuqest will
timeout because it is N=one. Then subsequent N=all
request will likely fail.

This PR fixes these two cases by adding grace period so that
riakc_pb_socket can reconnect if possible.

@shino shino added this to the 2.2.0 milestone Aug 31, 2015
When strong user get fails with timeout or disconnected error,
weak user get which happens just after it will likely fail
because reconnection of riakc_pb_socket still undergo and not
connected yet.
This commit gives some time period for riakc_pb_socket can
reconnect to Riak after timeout or disconnected errors.
Block get is executed N=one at first by default, if the node
for the first primary vnode is too-slow/hang/frozen/almost-but-not-dead,
the get operation may fail with timeout. Then riakc_pb_socket
disconnects the TCP connection and enter reconnection phase.
Doing N=all get just after that likely fails because TCP connection
is not connected yet.
@shino shino force-pushed the feature/grace-period-for-reconnection branch from 2470de1 to 0e35cb6 Compare September 1, 2015 02:21
@shino
Copy link
Contributor Author

shino commented Sep 1, 2015

Rebased on develop 1643c8d to incorporate r_t fix #1231

@shino shino modified the milestones: 2.1.0, 2.2.0 Sep 3, 2015
@kuenishi
Copy link
Contributor

kuenishi commented Sep 7, 2015

Currently we have auto_reconnect as false by default, so the recovery from disconnected state is by just re-launching the process by poolboy? [edit]We had auto_reconnect as true by default. I was wrong.

@shino
Copy link
Contributor Author

shino commented Sep 7, 2015

by just re-launching the process by poolboy?

Does "the process" indicate riak_cs_riak_client process? Or riakc_pb_socket?

@kuenishi
Copy link
Contributor

kuenishi commented Sep 7, 2015

I think it was riak_cs_riak_client.

@shino
Copy link
Contributor Author

shino commented Sep 7, 2015

It may be my mind model of riak cs code base... some processes share single riak_cs_riak_client checked out from a pool, and (at least I think) the pid is unchangable in a request-response lifecycle. If it should be changed or renewed, all the processes that share it should be notified and renew to new pid. At the current implementation, I'm afraid that it might be a big diff covering many modules in the final stage of 2.1.0 development cycle.

By the example of this PR's current diff, block server process detects timeout but it can not notify put/get FSM and/or mochiweb acceptor processes which has the pid in #context.

Although, I prefer any other way to just sleeping in general if possible. This PR is not mandatory for 2.1.0 because it's long standing issue 😓

Some random thoughts for possible next PR creator:

  • Slow node or silent node failure may prevent some users from accessing Riak CS system completely (because user object can not be fetched) typically for 60 seconds.
  • Using queue_if_disconnected encapsulates disconnected state in riakc_pb_socket and hides it from riak cs, it will be not good for visibility.

borshop added a commit that referenced this pull request Sep 7, 2015
…tion

Give grace period for `riakc_pb_socket` to reconnect

Reviewed-by: kuenishi
@shino
Copy link
Contributor Author

shino commented Sep 7, 2015

@borshop merge

@borshop borshop merged commit 0e35cb6 into develop Sep 7, 2015
@shino shino deleted the feature/grace-period-for-reconnection branch September 7, 2015 08:35
@kuenishi
Copy link
Contributor

kuenishi commented Sep 7, 2015

GC also fails at riak_cs_gc_key_list.erl like this:

2015-09-07 18:17:29.172 [warning] <0.10473.4>@riak_cs_gc_key_list:eligible_manifest_keys:122 Error occurred trying to query from time <<"0">> to <<"1441531049">>in gc key index. Reason: disconnected

Maybe we'd also need this here. ... but the code change couldn't be small. fmm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants