-
Notifications
You must be signed in to change notification settings - Fork 600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handling moving not initialized partition cross shards #1414
handling moving not initialized partition cross shards #1414
Conversation
do you think this might also be some of the other non-ducktape CI failures like #1426 ? |
This is unrelated, the problem we are seeing in CI is caused by one of raft-0 consensus instances not responding to requests and all the replicate requests times out. |
2754fd3
to
60e40f1
Compare
60e40f1
to
d886397
Compare
src/v/redpanda/admin_server.cc
Outdated
@@ -666,3 +675,86 @@ void admin_server::register_partition_routes() { | |||
co_return ss::json::json_void(); | |||
}); | |||
} | |||
|
|||
void admin_server::register_hbadger_routes() { | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to further protect this via #ifndef NDEBUG return; #endif
- we can't release this to a prod cluster too risky yet. let's get confidence in honey badger. even if the is_enabled
is there. this is our last line of defense.
import requests | ||
|
||
|
||
class HoneyBadger: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might consider adding the raw requests to admin.py. then having a hb specific interface on top in a class like this.
src/v/redpanda/admin_server.cc
Outdated
@@ -121,6 +127,9 @@ void admin_server::configure_admin_routes() { | |||
register_status_routes(); | |||
register_broker_routes(); | |||
register_partition_routes(); | |||
if (finjector::shard_local_badger().is_enabled()) { | |||
register_hbadger_routes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should either (1) register a GET route at v1/failure-probes
which returns a flag indicating if the probes are enabled or not or (2) register all the routes but have them return an appropriate http error code when things are disabled.
by not registering the routes we'll send the tests a 404 which is hard to work with in the context of a feature that can be enabled/disabled because its the same code used for so many other common mistakes, typos, etc..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -147,6 +148,62 @@ def derived_done(): | |||
|
|||
wait_until(derived_done, timeout_sec=30, backoff_sec=1) | |||
|
|||
@cluster(num_nodes=3) | |||
def test_moving_not_fully_initialized_partition(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does the test work with builds that have honey badger disabled? I suggested a slight change to the routes in a previous commit, and I'm wondering if that would be the basis for dynamically determining if we'll enable/disable this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In current situation without failure injector, it would just execute like there were no errors. I have applied your suggestion and I am disabling this test if honey badger isn't enabled.
Signed-off-by: Michal Maslanka <michal@vectorized.io>
When partition is being moved cross shards, it is shut down on source core and recreated on target one. X-core move doesn't involve any data movement. When partition is recreated on target core we relay on data in partition underlying log to recover partition state. The bug that was triggered in ducktape tests was caused by moving partition before any data were persisted to disk. This prevented partition on target core from recovering its initial configuration. Fixed the problem by initializing partition on target core with its original configuration. This way if there is no data it will start from valid initial state, or if data are in log this initial configuration will be overridden. Signed-off-by: Michal Maslanka <michal@vectorized.io>
Added admin API that allow us to control failures injected by honey badger failure injector. Signed-off-by: Michal Maslanka <michal@vectorized.io>
Signed-off-by: Michal Maslanka <michal@vectorized.io>
Added test that tries to move cross core not yet initialized partition. Signed-off-by: Michal Maslanka <michal@vectorized.io>
d886397
to
7597945
Compare
Cover letter
When partition is being moved cross shards, it is shut down on source core and recreated on target one. X-core move doesn't involve any data movement. When partition is recreated on target core we relay on data in partition underlying log to recover partition state. The bug that was triggered in ducktape tests was caused by moving partition before any data were persisted to disk. This prevented partition on target core from recovering its initial configuration.
Fixed the problem by initializing partition on target core with its original configuration. This way if there is no data it will start from valid initial state, or if data are in log this initial configuration will be overridden.
Additionally exposed honey badger as an admin API endpoint and added test that reproduce the described issue.