KV i1871 - Handle timeout on remote connection #1

Merged: 5 commits, Sep 2, 2023

martinsumner

basho#1871

The riak_kv_replrtq_snk will now determine the work to be run after initialisation - so the initial connection time of PB clients no longer delays the startup of riak_kv.

The riak_kv_replrtq_snk functions called by the riak_kv_replrtq_peer have timeouts disabled so that a riak_kv_replrtq_snk process which is hanging on a connection timeout does not crash a calling riak_kv_replrtq_peer process.

The riak_kv_replrtq_peer will now catch and ignore replies from the Erlang HTTP client. The client catches timeouts on calls to the connection process, so when the connection process eventually replies, that reply is delivered to the process which launched the client.

The peer needs to call current_peers, and this may time out if the process is held busy by a client initialisation that is itself timing out because a peer is unavailable.
The ibrowse HTTP client will gen_server:call on send_req with a timeout, and catch any timeout error.

This means that the client may receive an {error, reqd_timeout}, but the gen_server has not crashed and so may later send a gen_server:reply to the riak_kv_replrtq_peer.

These errors need to be picked up in handle_info to avoid noise of repeated crashes.

This is not a problem in riak_kv_replrtq_snk, as in that process we always use the HTTP client from within a short-lived spawned process. Likewise in riak_kv_ttaaefs_manager the client is used within a short-lived aae_exchange.
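The late-reply handling described above can be sketched as a `handle_info/2` clause that swallows stale replies. This is a minimal illustration, not the code merged in this PR: the module name `late_reply_sketch`, the `ignored` counter and `ignored_count/1` are hypothetical, and it assumes the stale reply arrives in the standard gen_server reply shape `{Ref, Reply}`. On OTP 24 and later, `gen_server:call` uses process aliases, so late replies are normally dropped before they reach the caller.

```erlang
-module(late_reply_sketch).
-behaviour(gen_server).

-export([start_link/0, ignored_count/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Hypothetical helper, used only to observe the sketch's state.
ignored_count(Pid) ->
    gen_server:call(Pid, ignored_count, infinity).

init([]) ->
    {ok, #{ignored => 0}}.

handle_call(ignored_count, _From, State = #{ignored := N}) ->
    {reply, N, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% A reply to a call whose timeout was already caught by the client
%% arrives as an ordinary message tagged with the call's reference.
%% Ignore it (counting it here purely for illustration).
handle_info({Ref, _LateReply}, State = #{ignored := N})
        when is_reference(Ref) ->
    {noreply, State#{ignored := N + 1}};
handle_info(_Other, State) ->
    {noreply, State}.
```

Sending `Pid ! {make_ref(), {error, late}}` to a process started with `late_reply_sketch:start_link/0` simulates a stale reply; the process keeps running and the counter increments.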
Riak is now OTP 22+ only, so it is OK to use a feature added in OTP 21 rather than an artificial method for deferring work after the init reply.
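The OTP 21 feature for deferring work after init is presumably the `handle_continue/2` callback, which a later commit message on this page also mentions. A minimal sketch of that pattern follows; the module name `deferred_init_sketch`, the `work_ready/1` helper and `slow_setup/0` are hypothetical stand-ins and are not taken from riak_kv_replrtq_snk.

```erlang
-module(deferred_init_sketch).
-behaviour(gen_server).

-export([start_link/0, work_ready/1]).
-export([init/1, handle_continue/2, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Hypothetical helper, used only to observe the sketch's state.
work_ready(Pid) ->
    gen_server:call(Pid, work_ready, infinity).

init([]) ->
    %% Return immediately so the starter is not blocked; the slow
    %% work is deferred to handle_continue/2.
    {ok, #{work => undefined}, {continue, determine_work}}.

handle_continue(determine_work, State) ->
    %% Runs after init/1 has returned and before any other message
    %% in the mailbox is processed.
    Work = slow_setup(),
    {noreply, State#{work := Work}}.

handle_call(work_ready, _From, State = #{work := Work}) ->
    {reply, Work =/= undefined, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Stand-in for slow initial work such as opening client connections.
slow_setup() ->
    timer:sleep(100),
    [placeholder_work_item].
```

With this shape `start_link/0` returns as soon as `init/1` does, while any call made straight afterwards is queued until the deferred work has completed.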
@martinsumner martinsumner merged commit dee16d5 into nhse-develop-3.0 Sep 2, 2023
1 check passed
@martinsumner martinsumner deleted the nhse-contrib-kv1871 branch September 2, 2023 08:12
martinsumner added a commit that referenced this pull request Nov 14, 2023
KV i1871 - Handle timeout on remote connection
martinsumner added a commit that referenced this pull request Nov 14, 2023
KV i1871 - Handle timeout on remote connection
@martinsumner martinsumner mentioned this pull request Nov 14, 2023
martinsumner added a commit that referenced this pull request Feb 13, 2024
* Merge pull request #1 from nhs-riak/nhse-contrib-kv1871

KV i1871 - Handle timeout on remote connection

* Trigger batch correctly at each size (#4)

* Force timeout to trigger (#3)

Previously, the inactivity timeout on handle_continue could be cancelled by a call to riak_kv_replrtq_snk (e.g. from riak_kv_replrtq_peer). This might lead to the log_stats loop never being triggered.

* Configurable %key query on leveled (#8)

Can be configured to ignore tombstone keys by default.

* Allow nextgenrepl to real-time replicate reaps (#6)

* Allow nextgenrepl to real-time replicate reaps

This is to address the issue of reaping across sync'd clusters.  Without this feature it is necessary to disable full-sync whilst independently replicating on each cluster.

Now if reaping via riak_kv_reaper the reap will be replicated assuming the `riak_kv.repl_reap` flag has been enabled.  At the receiving cluster the reap will not be replicated any further.

There are some API changes to support this.  The `find_tombs` aae_fold will now return Keys/Clocks and not Keys/DeleteHash.  The ReapReference for riak_kv_reaper will now expect a clock (version vector) not a DeleteHash, and will also now expect an additional boolean to indicate if this reap is a replication candidate (it will be false for all pushed reaps).

The object encoding for nextgenrepl now has a flag to indicate a reap, with a special encoding for reap references.

* Update riak_object.erl

Clarify specs

* Take timestamp at correct point (after push)

* Updates following review

* Update rebar.config

* Make current_peers empty when disabled (#10)

* Make current_peers empty when disabled

* Peer discovery to recognise suspend and disable of sink

* Update src/riak_kv_replrtq_peer.erl

Co-authored-by: Thomas Arts <thomas.arts@quviq.com>

* Update src/riak_kv_replrtq_peer.erl

Co-authored-by: Thomas Arts <thomas.arts@quviq.com>

---------

Co-authored-by: Thomas Arts <thomas.arts@quviq.com>

* De-lager

* Add support for v0 object in parallel-mode AAE (#11)

* Add support for v0 object in parallel-mode AAE

Cannot assume that v0 objects will not happen: capability negotiation can fall back to v0 on Riak 3.0 during failure scenarios.

* Update following review

As ?MAGIC is a distinctive constant, it should be the one in the pattern match, with everything else assumed to be convertible by term_to_binary.

* Update src/riak_object.erl

Co-authored-by: Thomas Arts <thomas.arts@quviq.com>

---------

Co-authored-by: Thomas Arts <thomas.arts@quviq.com>

* Update riak_kv_ttaaefs_manager.erl (#13)

For bucket-based full-sync `{tree_compare, 0}` is the return on success.

* Correct log macro typo

---------

Co-authored-by: Thomas Arts <thomas.arts@quviq.com>