Wday merge - exposed issues #21
Groups passing so far:
New tests have been added since the wday merge, so these have now been de-lagered. There are a number of issues arising as these tests are run, documented in #21. This commit provides initial workarounds to problems highlighted in this issue with:
- location
- verify_riak_stats
- verify_2i_limit
Helpful comments, so much left to update ...
As part of this testing, I spotted an issue with group.sh. Within group.sh there is a construct that only behaves correctly when the group file ends with a newline. I think in the past I've spotted this in some group files and added a newline, but we shouldn't rely on remembering to do this.
Some issues exposed by merging in wday tests.
group.sh
The group test script now stops when it has a failed test, rather than continuing and reporting the list of tests and their results at the end.
location
The location change broke backwards compatibility by limiting the configuration of claim to a function name only, e.g. choose_claim_v; previously it had been {Module, Function}. Anyone using this config, such as to use choose_claim_v3, would now be broken (including the workday test). As an additional complication, the claim functions moved from the riak_core_claim module (which was renamed in refactoring to riak_core_membership_claim). The algorithm_supported function needs to be changed, as does the configuration of the claim function.
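As a rough illustration of the configuration change described above, a minimal advanced.config sketch - the choose_claim_fun app env key is an assumption here, not something confirmed in this issue:

```erlang
%% Sketch only. Previously the claim algorithm could be configured as a
%% {Module, Function} pair:
[{riak_core, [
    {choose_claim_fun, {riak_core_claim, choose_claim_v3}}
]}].

%% After the location change only the function name is accepted, with the
%% module implied (now riak_core_membership_claim after the rename):
[{riak_core, [
    {choose_claim_fun, choose_claim_v3}
]}].
```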
verify_handoff_zlib
Support for the legacy encode_zlib has been removed as part of Riak 2.9.0, but the test continued to pass, as the test had never set the configuration of the encoding correctly. Setting the encoding to encode_zlib correctly causes handoff to fail.
The test should be removed, along with the defaulting of the capability fetch to the legacy value (or indeed the capability check should be removed altogether). There can be issues when nodes have just started, with capability checks falling back to default values even where no cluster members require the legacy setting.
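For reference, a minimal sketch of how the test could have pinned the encoding, assuming the standard riak_core capability override mechanism (the capability name and override shape are assumptions):

```erlang
%% Sketch only: force the legacy handoff encoding via a capability
%% override in the riak_test cluster config.
Config = [{riak_core,
           [{override_capability,
             [{handoff_data_encoding, [{use, encode_zlib}]}]}]}],
Nodes = rt:build_cluster(3, Config).
```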
verify_riak_stats
There are a number of additional stats that may be added in organisation-specific versions. A specific rt_config item of organisation should exist so that organisation-specific stats can be checked for inclusion, without others having issues with missing keys, e.g.:
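Something along these lines - the organisation key, the workday value and the stat name are all illustrative:

```erlang
%% Sketch only: gate organisation-specific stats behind an rt_config item,
%% so other organisations do not fail on keys they never emit.
-spec expected_org_stats() -> [binary()].
expected_org_stats() ->
    case rt_config:get(organisation, undefined) of
        workday -> [<<"org_specific_stat_total">>];
        _       -> []
    end.
```

The common stat list would then be extended with expected_org_stats() before checking the keys returned from the stats endpoint.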
verify_2i_limit
The wday branch fails due to a change to the secondary_index_tests stream_loop/1 function to error on the "Got a wat" message, rather than loop. This change seems like the right thing to do. The issue is that a previous HTTP query has a straggling message, which then leads to a new PB stream receiving a rogue {Ref, done} message prior to receiving any genuine data. This looks like an error in the riak erlang http client: it should be waiting for a done message before closing its receive loop.
How to fix this in the riak erlang client isn't obvious, though. Even when streaming isn't requested, ibrowse streaming is used to fetch the results (and, in turn, streaming index queries within riak). This is then processed with the help of the mochiweb_multipart module, for added complexity. In this case, as a full response has been received, there is a {done, Continuation} followed by a done; but a done might not always follow in other circumstances.
The riak erlang http client probably needs radical changes rather than a simple fix. The use of stream_index in response to get_index is wrong, I think: using stream_index generates extra work which might not be desired. It would be preferable for get_index to do a simple httpc:request, or at least an ibrowse:send_req without stream_to (in ibrowse) and stream in the GET options.
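A minimal sketch of that non-streaming alternative, assuming the standard HTTP 2i endpoint - the function, URL shape and error handling are illustrative, not the client's actual API:

```erlang
%% Sketch only: fetch a secondary index result without ibrowse streaming,
%% so no {ibrowse_async_*, ...} or {Ref, done} messages are left behind.
get_index_simple(Host, Port, Bucket, Index, Key) ->
    Url = lists:flatten(
            io_lib:format("http://~s:~b/buckets/~s/index/~s/~s",
                          [Host, Port, Bucket, Index, Key])),
    case ibrowse:send_req(Url, [], get, []) of
        {ok, "200", _Headers, Body} ->
            %% Body is a JSON object of the form {"keys": [...]}
            {ok, mochijson2:decode(Body)};
        {ok, Code, _Headers, _Body} ->
            {error, {http_status, Code}};
        {error, Reason} ->
            {error, Reason}
    end.
```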
basic_command_line
Fails on testing riak console in OTP 26 (with node down). No obvious issues with running riak console.
This passes on OTP26 on nhse-develop-3.4, but not when merged in with wday-develop-3.2 - either using the rt:interact or the original rt:console form of the test.
This has been corrected; it was an issue with the regular expression. The debug logging of rt_exec:interact_seq/3 was helpful in resolving it.
nextgenrepl_bouncingtomb
The test was not waiting after the config change for the node to fully restart - i.e. it waited for the node to be pingable, but not for the riak_kv service to start, and so was failing to have sufficient vnodes available when the config-change/restart was followed by a coverage query. There appears to be no behaviour change between the wday and nhse branches, but the additional wait resolves the issue.
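A minimal sketch of the extra wait, using the existing rt helpers (the config value and restart sequence are illustrative):

```erlang
%% Sketch only: after a config change, wait for the riak_kv service rather
%% than just for the node to respond to ping.
restart_with_config(Node, NewConfig) ->
    rt:stop_and_wait(Node),
    rt:update_app_config(Node, NewConfig),
    rt:start_and_wait(Node),
    rt:wait_for_service(Node, riak_kv).
```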
nextgenrepl_rtq_peerdiscovery
After forcing a change (in Cluster B), there is a non-immediate call to verify that the change leads to a change of peers in Clusters A and C when manually prompting peer discovery. However, if automated peer discovery has occurred before the manual prompt, the manual prompt may lead to no changes - and this will lead to fewer than length(Nodes) nodes reporting a change of peer. Auto discovery kicks in every 60 seconds, so this should be intermittent at worst, but the timings of the test seem to regularly align to prompt this.
Some stability was gained by removing the wait after the change, and also by making sure the initial max time has elapsed before commencing the test (so that the next tick will be at a random interval in the larger ?STD_MAX time). Being strict on equality is probably not necessary either.
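A rough sketch of that timing and assertion change - prompt_peer_discovery/1 is a hypothetical helper standing in for the test's manual prompt, and the interval argument is illustrative:

```erlang
%% Sketch only: let the initial discovery interval elapse before prompting,
%% then accept an at-least count rather than strict equality.
check_after_interval(MaxIntervalMs, NodesA, NodesC) ->
    timer:sleep(MaxIntervalMs),
    Changed = prompt_peer_discovery(NodesA ++ NodesC),  %% hypothetical helper
    true = (Changed >= length(NodesA)),
    ok.
```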
verify_mr_prereduce_node_down
Potential {error, timeout} when doing a PUT during the node stop. It is a stop, not a stop_and_wait (and there is no wait for handoffs), so preflists could still be changing; forwarding and re-forwarding need to be looked at. As this isn't a test of PUT, this is bypassed by waiting for now.
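If waiting proves insufficient, a retry wrapper is one alternative way to tolerate the transient timeout - a minimal sketch, with the retry policy purely illustrative:

```erlang
%% Sketch only: retry a PUT that times out while the stopped node's
%% preflists are still settling.
put_with_retry(_Pid, _Obj, 0) ->
    {error, timeout};
put_with_retry(Pid, Obj, Retries) ->
    case riakc_pb_socket:put(Pid, Obj) of
        ok ->
            ok;
        {error, timeout} ->
            timer:sleep(1000),
            put_with_retry(Pid, Obj, Retries - 1);
        Other ->
            Other
    end.
```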
verify_bitcask_expiry_stats
This fails when gathering the stat (perhaps not available until a future merge). Using the organisation config to bypass the stat check for now.
replication2
Not sure why this is failing now when it didn't before. The issue occurs fairly consistently in the part of the test that fails the cluster manager leader in each cluster.
Connection Manager located no endpoints for: {rt_repl,"B"}. Will retry.
<0.2772.0>@riak_core_connection_mgr:disconnect_from_target/2:439 Disconnecting from: {rt_repl,"B"}
I think this is prompted by the leader being stopped on ClusterB. Any cast to start_fullsync will be ignored (silently) as the socket is undefined. I suspect something has changed in the timing of the test (perhaps the stops in ClusterA and ClusterB are closer together) to mean this now fails, but it appears to be a genuine issue. The riak_repl2_fscoordinator appears to assume that starting a connection request will always result in a success or failure response, when instead it can get no response; there then appears to be nothing to prompt the system out of this state. The obvious fix is to fail any requests that are cancelled, when they are cancelled (in riak_core_connection_mgr), and this appears to make the test work.
The test then intermittently fails when checking real-time replication during a node stop. This does PUT, and stops the node being PUT to during the stream of PUTs. It then counts how many PUTs fail, and how many GETs fail for the same key-space in the other cluster - in effect, expecting every PUT that succeeded to be found in the other cluster.
However, I'm not sure this is right by design. The repl being queued is done via a postcommit hook, and the postcommit hook is fired after the reply to the client from the FSM, so it remains possible that a write could succeed despite the node stopping before the write was added to the replication queue. Hence the read errors on the sink could exceed the write errors on the source.
This might be very unlikely, but with changes to OTP versions the possibility of the client winning the race could increase. The postcommit state is triggered by a zero timeout, so it could be interrupted by a shutdown message.
replication2_ssl
This test will pass for OTP24, but not for OTP26. This is because the default tls_options used within riak_core_ssl_util are passed to both the client and server side, even when the options are client-only or server-only. OTP26 will error rather than ignore such options.
OpenRiak/riak_core#12
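A minimal sketch of the kind of split OTP 26 requires, keeping side-specific options out of the shared list - file names, port and hostname are illustrative:

```erlang
%% Sketch only: OTP 26 errors on TLS options passed to the wrong side, so
%% client-only and server-only options must not be in the common list.
CommonOpts = [{certfile, "site.crt"},
              {keyfile, "site.key"},
              {cacertfile, "ca.crt"},
              {verify, verify_peer}],
ServerOnly = [{fail_if_no_peer_cert, true}],          %% server side only
ClientOnly = [{server_name_indication, "cluster-b"}], %% client side only
{ok, _} = application:ensure_all_started(ssl),
{ok, ListenSocket} = ssl:listen(9085, CommonOpts ++ ServerOnly),
{ok, Socket} = ssl:connect("cluster-b", 9085, CommonOpts ++ ClientOnly, 5000).
```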
repl_aae_fullsync_custom_n & repl_aae_fullsync_bench
There is a potential issue in the call to repl_aae_fullsync_util:make_clusters/3 where it runs rt:wait_until_aae_trees_built/1. This has been seen to hang, as one tree is not built - but there is no log evidence as to why the tree has not been built. This looks like a potentially genuine issue in riak, when rebuilds are not triggered as expected - but it has not been possible to recreate this (i.e. it intermittently failed for a while, then on the same machine and code stopped intermittently failing). A potential resolution may be to use staged_join in riak_repl tests. Using a rolling sequence of joins can lead to a vnode (on a given node) starting, exiting and re-starting rapidly. This probably shouldn't cause a problem for aae tree rebuilds, but rt:join is a legacy and non-production method for doing joins - so avoiding the problem this way should be OK given this is a legacy feature.
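A minimal sketch of building the cluster with staged joins using the existing rt helpers - the exact helper sequence used by repl_aae_fullsync_util may differ:

```erlang
%% Sketch only: use staged joins (plan/commit) rather than a rolling
%% sequence of rt:join/2 calls before waiting for AAE trees.
build_cluster_staged([Node1 | Rest] = Nodes) ->
    [rt:staged_join(N, Node1) || N <- Rest],
    rt:plan_and_commit(Node1),
    rt:wait_until_nodes_ready(Nodes),
    rt:wait_until_no_pending_changes(Nodes),
    rt:wait_until_aae_trees_built(Nodes).
```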