Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handling moving not initialized partition cross shards #1414

Conversation

mmaslankaprv
Copy link
Member

@mmaslankaprv mmaslankaprv commented May 18, 2021

Cover letter

When partition is being moved cross shards, it is shut down on source core and recreated on target one. X-core move doesn't involve any data movement. When partition is recreated on target core we relay on data in partition underlying log to recover partition state. The bug that was triggered in ducktape tests was caused by moving partition before any data were persisted to disk. This prevented partition on target core from recovering its initial configuration.

Fixed the problem by initializing partition on target core with its original configuration. This way if there is no data it will start from valid initial state, or if data are in log this initial configuration will be overridden.

Additionally exposed honey badger as an admin API endpoint and added test that reproduce the described issue.

@dotnwat
Copy link
Member

dotnwat commented May 19, 2021

do you think this might also be some of the other non-ducktape CI failures like #1426 ?

@mmaslankaprv
Copy link
Member Author

do you think this might also be some of the other non-ducktape CI failures like #1426 ?

This is unrelated, the problem we are seeing in CI is caused by one of raft-0 consensus instances not responding to requests and all the replicate requests times out.

@mmaslankaprv mmaslankaprv changed the title c/backend: handling moving not initialized partition cross shards handling moving not initialized partition cross shards May 20, 2021
@mmaslankaprv mmaslankaprv requested a review from dotnwat May 20, 2021 12:44
@mmaslankaprv mmaslankaprv force-pushed the fix-moving-not-fully-initialized-partition branch from 2754fd3 to 60e40f1 Compare May 20, 2021 12:44
@mmaslankaprv mmaslankaprv requested a review from emaxerrno May 20, 2021 12:44
@mmaslankaprv mmaslankaprv marked this pull request as ready for review May 20, 2021 12:44
@mmaslankaprv mmaslankaprv requested a review from a team as a code owner May 20, 2021 12:44
@mmaslankaprv mmaslankaprv force-pushed the fix-moving-not-fully-initialized-partition branch from 60e40f1 to d886397 Compare May 20, 2021 13:22
@@ -666,3 +675,86 @@ void admin_server::register_partition_routes() {
co_return ss::json::json_void();
});
}

void admin_server::register_hbadger_routes() {
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to further protect this via #ifndef NDEBUG return; #endif - we can't release this to a prod cluster too risky yet. let's get confidence in honey badger. even if the is_enabled is there. this is our last line of defense.

import requests


class HoneyBadger:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might consider adding the raw requests to admin.py. then having a hb specific interface on top in a class like this.

@@ -121,6 +127,9 @@ void admin_server::configure_admin_routes() {
register_status_routes();
register_broker_routes();
register_partition_routes();
if (finjector::shard_local_badger().is_enabled()) {
register_hbadger_routes();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should either (1) register a GET route at v1/failure-probes which returns a flag indicating if the probes are enabled or not or (2) register all the routes but have them return an appropriate http error code when things are disabled.

by not registering the routes we'll send the tests a 404 which is hard to work with in the context of a feature that can be enabled/disabled because its the same code used for so many other common mistakes, typos, etc..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -147,6 +148,62 @@ def derived_done():

wait_until(derived_done, timeout_sec=30, backoff_sec=1)

@cluster(num_nodes=3)
def test_moving_not_fully_initialized_partition(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does the test work with builds that have honey badger disabled? I suggested a slight change to the routes in a previous commit, and I'm wondering if that would be the basis for dynamically determining if we'll enable/disable this test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In current situation without failure injector, it would just execute like there were no errors. I have applied your suggestion and I am disabling this test if honey badger isn't enabled.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
When partition is being moved cross shards, it is shut down on
source core and recreated on target one. X-core move doesn't involve
any data movement. When partition is recreated on target core we relay
on data in partition underlying log to recover partition state. The bug
that was triggered in ducktape tests was caused by moving partition
before any data were persisted to disk. This prevented partition on
target core from recovering its initial configuration.

Fixed the problem by initializing partition on target core with its
original configuration. This way if there is no data it will start from
valid initial state, or if data are in log this initial configuration
will be overridden.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
Added admin API that allow us to control failures injected by
honey badger failure injector.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
Signed-off-by: Michal Maslanka <michal@vectorized.io>
Added test that tries to move cross core not yet initialized partition.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
@mmaslankaprv mmaslankaprv force-pushed the fix-moving-not-fully-initialized-partition branch from d886397 to 7597945 Compare May 21, 2021 08:46
@mmaslankaprv mmaslankaprv requested review from dotnwat and emaxerrno May 21, 2021 08:52
@dotnwat dotnwat merged commit caf7679 into redpanda-data:dev May 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants