Use remote recovery when topic is configured to use remote data #22908

Merged
merged 5 commits into from
Oct 10, 2024

Conversation

mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Aug 16, 2024

When a topic is configured to use remote data, that data can be used when
force-recovering partitions that lost their majority. In this case, instead of
creating an empty partition replica instance, we customize the arguments
passed into the partition_manager::manage method to enable remote
recovery of the replica data.
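
A minimal conceptual sketch of that decision, written in Python rather than the project's C++ and not taken from the actual change. `TopicConfig`, `ManageArgs`, and `manage_args_for_force_recovery` are hypothetical names; the two fields that get set loosely mirror the `recovery_enabled` override and the (initial revision, partition count) pair visible in the diff excerpt quoted later in this conversation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TopicConfig:
    """Hypothetical, simplified view of a topic's configuration."""
    remote_data_enabled: bool
    partition_count: int


@dataclass
class ManageArgs:
    """Hypothetical bundle of arguments handed to the 'manage' call."""
    recovery_enabled: bool = False
    # (initial_revision, partition_count) when remote recovery is requested
    remote_topic_properties: Optional[Tuple[int, int]] = None


def manage_args_for_force_recovery(topic: TopicConfig,
                                   initial_revision: int) -> ManageArgs:
    """Choose arguments for a replica created during force recovery."""
    args = ManageArgs()
    if topic.remote_data_enabled:
        # Remote data exists, so ask for the replica to be recovered from it
        # instead of starting from an empty log.
        args.recovery_enabled = True
        args.remote_topic_properties = (initial_revision,
                                        topic.partition_count)
    return args
```

With this model, a topic without remote data still yields the default arguments (an empty replica), which matches the behaviour described above for topics that are not cloud-enabled.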

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Improvements

  • With this improvement, force-reconfigured partitions are backfilled from Tiered Storage even if they have lost all local data.

@bashtanov bashtanov changed the title Use remove recovery when topic is configured to use remote data Use remote recovery when topic is configured to use remote data Aug 16, 2024
@vbotbuildovich
Collaborator

vbotbuildovich commented Aug 16, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/53044#01915b88-9afa-4011-9e70-1a139c563e0f:

"rptest.tests.cluster_config_test.ClusterConfigTest.test_valid_settings"

new failures in https://buildkite.com/redpanda/redpanda/builds/53548#01918f2f-9bd9-49ec-b693-5be4f3e5348e:

"rptest.tests.partition_force_reconfiguration_test.PartitionForceReconfigurationTest.test_basic_reconfiguration.acks=1.restart=True.controller_snapshots=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/54942#01921e82-34f4-4454-997c-574449687a2a:

"rptest.tests.leadership_transfer_test.AutomaticLeadershipBalancingTest.test_automatic_rebalance"

@vbotbuildovich
Collaborator

vbotbuildovich commented Aug 16, 2024

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53044#01915b87-27a9-43bf-9b32-91e32ab0071a

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53044#01915b88-9afd-4965-b953-53c8457c84fc

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53119#019169bf-5f90-4ecd-b3f3-24507e924a68

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53207#01916feb-b5ff-484b-a33d-613e63733355

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53207#01916fed-9b30-482e-bac9-8fe6c015522d

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53207#01916fed-9b2f-4907-b3ce-ac2f712a72e7

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53207#01916fed-9b2d-4d0b-9df3-e6bdb31126dd

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53207#0191752f-380d-4ad6-b97a-6cf7cba5dbfd

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53548#01918f2f-9bd4-493d-a3a1-88bb8ef2a647

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53548#01918f2f-9bd9-49ec-b693-5be4f3e5348e

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53548#01918f49-493a-47a4-9e86-81a155b3d090

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53548#01918f49-4934-4011-9d02-c276f30102a7

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53600#01919316-c335-432e-a052-8a191ce548c4

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53600#01919316-c330-4abb-8d73-4d50eedf87d4

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53662#01919807-fe50-4ae8-a476-72514e6cfd3f

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53662#01919807-fe4c-4b13-9001-086a4c517a9b

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53880#0191b189-f141-4c76-98c8-b35d46fb5f91

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/53880#0191b189-f148-4ad6-8b9a-ec57a5c7ad2f

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54508#0191f9aa-8240-4ac8-9c5a-d63ce498fad7

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54508#0191f9ab-d05d-4db2-a684-8dfb53170198

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54508#0191f9ab-d05a-4a42-858e-5dbdaa42692e

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54508#0191f9ab-d061-484c-b2a5-28634b79e61d

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54508#0191f9ab-d064-4acb-a24a-0be7aec62676

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54942#01921e82-34f3-4107-86b6-4496dd60bcb0

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54942#01921e82-34f4-4454-997c-574449687a2a

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54942#01921e82-34f6-4e04-85d8-30f6c0f7f81f

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55444#0192436b-e48b-420f-ac68-83da7bf83a1e

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55444#0192436b-e487-4b93-b517-6f6854e2c381

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/56133#01927238-3452-412c-96e7-637887f14eaf
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/56133#01927238-3456-4e01-b896-a0248bad2ad4

tests/rptest/services/redpanda.py
admin_client = admin_client or self._admin
if tolerate_stopped_nodes:
started_node_ids = {self.node_id(n) for n in self.started_nodes()}
Contributor

TBH I find having a variable declared/undeclared depending on the condition somewhat error-prone and harder to read too. What do you think of something like the following?

if tolerate_stopped_nodes:
    started_node_ids = {self.node_id(n) for n in self.started_nodes()}
    node_check_predicate = lambda n: n in started_node_ids
else:
    node_check_predicate = lambda n: True
...
ready = all([n['config_version'] >= config_version for n in status if node_check_predicate(n)])
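
A self-contained sketch of the predicate approach suggested above, for illustration only. It assumes each status entry is a dict carrying a `node_id` field next to `config_version` (an assumption, not confirmed by this thread) and applies the predicate to that id rather than to the whole entry, since `started_node_ids` holds node ids. The function name `config_ready` and the sample data are hypothetical.

```python
def config_ready(status, config_version, started_node_ids=None):
    """Return True when every node we care about reached config_version.

    status: list of dicts with 'node_id' and 'config_version' (assumed shape).
    started_node_ids: set of ids to check, or None to check every node.
    """
    if started_node_ids is not None:
        node_check_predicate = lambda node_id: node_id in started_node_ids
    else:
        node_check_predicate = lambda node_id: True

    # The predicate is always defined, whichever branch ran above.
    return all(n["config_version"] >= config_version
               for n in status
               if node_check_predicate(n["node_id"]))


# Hypothetical status report: node 3 is stopped and lags behind.
status = [
    {"node_id": 1, "config_version": 5},
    {"node_id": 2, "config_version": 5},
    {"node_id": 3, "config_version": 3},
]
```

Calling `config_ready(status, 5)` considers all three nodes, while `config_ready(status, 5, started_node_ids={1, 2})` ignores the stopped one.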

src/v/cluster/controller_backend.cc Outdated
tests/rptest/tests/partition_force_reconfiguration_test.py Outdated
tests/rptest/tests/partition_force_reconfiguration_test.py Outdated
tests/rptest/tests/partition_force_reconfiguration_test.py Outdated
tests/rptest/tests/partition_force_reconfiguration_test.py Outdated
tests/rptest/tests/partition_force_reconfiguration_test.py Outdated
bashtanov
bashtanov previously approved these changes Aug 22, 2024
Contributor

@bashtanov bashtanov left a comment

only nits left, so approving in case it's really urgent to merge

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added a log entry emitted when a partition instance is being created.
The entry will allow us to quickly identify the partition configuration.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
When a topic is configured to use remote data, that data can be used when
force-recovering partitions that lost their majority. In this case, instead of
creating an empty partition replica instance, we customize the arguments
passed into the `partition_manager::manage` method to enable remote
recovery of the replica data.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@mmaslankaprv
Member Author

/ci-repeat 1

@mmaslankaprv
Member Author

unrelated test failure: https://redpandadata.atlassian.net/issues/CORE-7002

bashtanov
bashtanov previously approved these changes Aug 30, 2024
tests/rptest/tests/partition_force_reconfiguration_test.py Outdated
Added replicating some data and waiting for it to be uploaded to the
cloud when executing node-wise recovery. This way the test is able to
verify that cloud storage data is used when force-reconfiguring
partitions with a lost majority.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
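
The "produce, then wait for upload" step described in this commit message can be pictured with a small polling sketch. This is not the test's code: `wait_until` here is a local helper written for the example, and `FakeCloud` is a hypothetical stand-in for the object store that pretends one more batch has been uploaded on every poll.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll condition() until it is truthy or timeout_sec elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


class FakeCloud:
    """Hypothetical bucket: every poll reports one more uploaded batch."""

    def __init__(self):
        self.uploaded = 0

    def poll(self):
        self.uploaded += 1
        return self.uploaded


def produce_and_wait_for_upload(cloud, produced, timeout_sec=5):
    """Block until the cloud reports at least `produced` batches uploaded."""
    wait_until(lambda: cloud.poll() >= produced,
               timeout_sec=timeout_sec,
               backoff_sec=0.01,
               err_msg=f"only {cloud.uploaded} of {produced} batches uploaded")
    return cloud.uploaded
```

Only once this wait returns would a test go on to stop nodes and force-reconfigure the partition, so that any data seen afterwards must have come from cloud storage.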
@mmaslankaprv
Member Author

/ci-repeat 1

// topic being cloud enabled implies existence of overrides
ntp_config.get_overrides().recovery_enabled
= storage::topic_recovery_enabled::yes;
rtp.emplace(*initial_rev, cfg->partition_count);
Contributor

I'm not too familiar with initial_version; perhaps @ztlpn can take another look.

@mmaslankaprv
Member Author

/ci-repeat 1

@vbotbuildovich
Collaborator

Retry command for Build#56133

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_storage_timing_stress_test.py::CloudStorageTimingStressTest.test_cloud_storage@{"cleanup_policy":"delete"}

@mmaslankaprv
Member Author

known ci failure:

@mmaslankaprv mmaslankaprv merged commit 082c700 into redpanda-data:dev Oct 10, 2024
15 of 18 checks passed
@mmaslankaprv mmaslankaprv deleted the recovery-ts branch October 10, 2024 06:38
@dotnwat dotnwat requested a review from Lazin October 10, 2024 21:11
4 participants