Skip to content

Bugfix/mw/keepalive timer from interval to send after 1.4 [JIRA: RIAK-1888] #683

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jonmeredith
Copy link
Contributor

#680 (RIAK-1507) applied to 1.4 series for customer.

lordnull added 3 commits May 20, 2015 09:52
This resolves an issue where the provider process could become blocked or
stuck for an extended period of time, only to receive many pointless
messages from the timer system.

First change is to up the keepalive from 1 second to 60. This should be
enough for the remote end to realize if the socket connection has died, but
not so much as to be spammy.

The second, more invasive change, is to use the erlang:send_after/3 reset
pattern rather than timer:send_interval/3. By using the erlang:send_after,
messages will not be sent if the process becomes stuck, as no code capable
of resetting the timer will run. Secondly, the return values of the
erlang:send_after and erlang:cancel_timer don't need to be pattern matched
to ensure correctness (since the former returns the reference itself, and
the latter returns an integer or false).
Based on feedback, resetting the keep-alive timer on each socket message
may not be strictly needed. We just need some data every 60 seconds or so
to ensure a node is alive (rather than the tcp timeout which could be
much longer). An interval is still not used (reasons in previous commit
still stand), so instead it's just re-creating the timer on each keepalive.
We never cancel it, nor refer to the timer in the code, ever. One could
flood the process by manually sending 'keepalive' to it, but if someone is
doing that, you've got bigger problems.

This means that now the timer is like a send interval that only intervals
if the process is not stuck, and does not add a timer on proxy get
requests.
@jonmeredith
Copy link
Contributor Author

+1 3a2863d

Already created branch for 1.4 to give patch to customer - needs to be applied against 2.0 and 2.1 branches. @lordnull could you prepare PRs (format-patch/am so we keep sha's if we can) against those two and I'll approve.

@jonmeredith
Copy link
Contributor Author

@bors retry

Unrelated failure

module 'riak_core_cluster_mgr'
  module 'riak_core_cluster_mgr_tests'
    *** context setup failed ***
::{badmatch,{error,{already_started,<0.1301.0>}}}

jonmeredith pushed a commit that referenced this pull request May 20, 2015
…erval-to-send-after-1.4

Bugfix/mw/keepalive timer from interval to send after 1.4

Good effort bors, merging manually as eunit not reliable enough back in prehistory.
@jonmeredith jonmeredith merged commit cf1e203 into 1.4 May 20, 2015
@jonmeredith
Copy link
Contributor Author

Quick test - mention of #680 (RIAK-1507) again to see if annotated with JIRA number.

@jonmeredith
Copy link
Contributor Author

Also - does it work the other way - RIAK-1507 (#680)

jonmeredith pushed a commit that referenced this pull request Jun 23, 2015
…mer-from-interval-to-send-after-1.4

Bugfix/mw/keepalive timer from interval to send after 1.4

Good effort bors, merging manually as eunit not reliable enough back in prehistory.

(cherry picked from commit cf1e203)
@Basho-JIRA Basho-JIRA changed the title Bugfix/mw/keepalive timer from interval to send after 1.4 Bugfix/mw/keepalive timer from interval to send after 1.4 [JIRA: RIAK-1888] Jun 24, 2015
jonmeredith pushed a commit that referenced this pull request Jun 24, 2015
…om-interval-to-send-after-2.0

Backport of merge pull request #683 from basho/bugfix/mw/keepalive-ti…

bors in poor state due to 2.0.6 reversioning - merging manually.
@Basho-JIRA
Copy link

Merged in the fix.

_[posted via JIRA by Jon Meredith]_

@kuenishi kuenishi mentioned this pull request Jul 3, 2015
@jonmeredith jonmeredith removed their assignment Apr 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants