persist learner info #3771

Merged
merged 2 commits into from
Mar 17, 2022

Conversation

@pengweisong pengweisong commented Jan 20, 2022

What type of PR is this?

  • bug
  • feature
  • enhancement

What problem(s) does this PR solve?

Issue(s) number:
#3689

Description:
Balancing is a long-running process. If storaged restarts while the cluster is balancing, some information is currently lost.
Specifically, all learner info is lost, because it is persisted in neither storaged nor metad.

How do you solve it?

Now the partition balance process includes:

  1. ADD_PART_ON_DST: create a new part on the target storaged.
  2. ADD_LEARNER: add the new part as a raft learner to all of the partition's peers.
  3. CATCH_UP_DATA: wait for the new part to catch up on data.
  4. MEMBER_CHANGE_ADD: promote the new part from learner to follower.
  5. MEMBER_CHANGE_REMOVE: remove the old part from all peers.
  6. UPDATE_PART_META: update the part's host info in metad.
  7. REMOVE_PART_ON_SRC: remove the unused data on the source storaged.
  8. CHECK: check that the part is healthy.
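
As a rough illustration of the ordering above, the steps can be modeled as a linear state machine. This is only a sketch: the step labels come from the list, but the `BalanceStep` enum and the helpers `stepName`, `nextStep`, and `newPartIsLearner` are hypothetical names, not NebulaGraph's actual types. The learner window in `newPartIsLearner` is an assumption about when the new part holds the learner role.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical enum: labels are taken from the PR description,
// the type itself is illustrative only.
enum class BalanceStep {
  ADD_PART_ON_DST,
  ADD_LEARNER,
  CATCH_UP_DATA,
  MEMBER_CHANGE_ADD,
  MEMBER_CHANGE_REMOVE,
  UPDATE_PART_META,
  REMOVE_PART_ON_SRC,
  CHECK,
};

constexpr std::array<std::string_view, 8> kStepNames = {
    "ADD_PART_ON_DST",      "ADD_LEARNER",      "CATCH_UP_DATA",
    "MEMBER_CHANGE_ADD",    "MEMBER_CHANGE_REMOVE",
    "UPDATE_PART_META",     "REMOVE_PART_ON_SRC", "CHECK",
};

// Human-readable label for a step.
constexpr std::string_view stepName(BalanceStep s) {
  return kStepNames[static_cast<std::size_t>(s)];
}

// The step that follows `s`, or nullopt once CHECK has been reached.
constexpr std::optional<BalanceStep> nextStep(BalanceStep s) {
  if (s == BalanceStep::CHECK) {
    return std::nullopt;
  }
  return static_cast<BalanceStep>(static_cast<int>(s) + 1);
}

// Assumption: the new part holds the learner role while data is catching up
// and until its promotion completes. A restart inside this window is the
// case in which learner info must survive.
constexpr bool newPartIsLearner(BalanceStep s) {
  return s == BalanceStep::CATCH_UP_DATA ||
         s == BalanceStep::MEMBER_CHANGE_ADD;
}
```

Walking `nextStep` from `ADD_PART_ON_DST` visits the eight steps in the listed order and stops after `CHECK`.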

We will persist all partition peer info locally in the storage, including each peer's status during balancing.
When storaged restarts, we will join the locally persisted info with the info from meta to decide whether the part should be kept or removed, whether it should start as a learner or as a normal peer, and which peers it should start with.
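
The restart decision described above could be sketched roughly as follows. This is one plausible rule set under stated assumptions, not the logic actually implemented in this PR: `LocalPartRecord`, `StartDecision`, and `decideOnRestart` are hypothetical names, hosts are plain strings, and the rule "prefer the locally persisted peer list when one exists" is an assumption.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

// Hypothetical: the role this host persisted for itself before the restart.
enum class LocalRole { kPeer, kLearner };

// Hypothetical: what was persisted locally for one part.
struct LocalPartRecord {
  LocalRole selfRole;
  std::vector<std::string> peers;  // may include learners unknown to meta
};

enum class StartAction { kRemove, kStartAsLearner, kStartAsPeer };

struct StartDecision {
  StartAction action;
  std::vector<std::string> peers;  // peers to start the part with
};

// Joins the local record (if any) with the hosts meta reports for the part.
StartDecision decideOnRestart(const std::string& self,
                              const std::vector<std::string>& metaHosts,
                              const std::optional<LocalPartRecord>& local) {
  const bool inMeta =
      std::find(metaHosts.begin(), metaHosts.end(), self) != metaHosts.end();
  if (inMeta) {
    // Meta lists this host as an owner. Assumption: the local peer list
    // wins when present, because it can contain in-flight learners.
    return {StartAction::kStartAsPeer, local ? local->peers : metaHosts};
  }
  if (local && local->selfRole == LocalRole::kLearner) {
    // Meta does not know this host yet, but a balance is in progress.
    return {StartAction::kStartAsLearner, local->peers};
  }
  // Neither meta nor the local record justifies keeping the part.
  return {StartAction::kRemove, {}};
}
```

For example, a host that persisted itself as a learner but is not yet listed in meta would be restarted as a learner with its locally recorded peers, instead of being dropped.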

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

  • Unit test(positive and negative cases)
  • Function test
  • Performance test
  • N/A

Affects:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Please confirm whether to be reflected in release notes and how to describe:

ex. Fixed the bug .....

@pengweisong pengweisong added the ready-for-testing PR: ready for the CI test label Jan 20, 2022
@Sophie-Xie Sophie-Xie added cherry-pick-v3.0 PR: need cherry-pick to this version and removed cherry-pick-v3.0 PR: need cherry-pick to this version labels Jan 20, 2022
@pengweisong pengweisong marked this pull request as ready for review January 21, 2022 04:16

@wenhaocs wenhaocs left a comment

Awesome job!

@critical27 critical27 left a comment

Good job~ It would be good to add a simple test case: write some learner info, restart the NebulaStore, and check that the state is correct. I am concerned about the conversion between the storage address and the raft address.
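
The address-conversion concern can be illustrated with a small round-trip check. This is a sketch under an assumption: that the raft address is derived from the storage address by a fixed port offset of 1. `HostAddr` here is a local stand-in struct, and `toRaftAddr`, `toStoreAddr`, and `roundTrips` are hypothetical helpers, not the functions in NebulaStore.

```cpp
#include <cstdint>
#include <string>

// Local stand-in for a host/port pair; not the project's HostAddr class.
struct HostAddr {
  std::string host;
  std::uint16_t port;

  bool operator==(const HostAddr& other) const {
    return host == other.host && port == other.port;
  }
};

// Assumption: raft listens on the storage port plus a fixed offset.
constexpr std::uint16_t kRaftPortOffset = 1;

// Storage address -> raft address.
HostAddr toRaftAddr(const HostAddr& storeAddr) {
  return {storeAddr.host,
          static_cast<std::uint16_t>(storeAddr.port + kRaftPortOffset)};
}

// Raft address -> storage address.
HostAddr toStoreAddr(const HostAddr& raftAddr) {
  return {raftAddr.host,
          static_cast<std::uint16_t>(raftAddr.port - kRaftPortOffset)};
}

// Persisting one form and reading back the other is only safe if the two
// conversions are exact inverses of each other.
bool roundTrips(const HostAddr& storeAddr) {
  return toStoreAddr(toRaftAddr(storeAddr)) == storeAddr;
}
```

A restart test along the lines suggested above would persist peers in one address form, reload them, and assert that every reloaded peer still satisfies a round-trip check like `roundTrips`.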

@panda-sheep panda-sheep left a comment

Good job!

@Sophie-Xie Sophie-Xie linked an issue Feb 25, 2022 that may be closed by this pull request
@pengweisong pengweisong force-pushed the persist-partition-learner branch from 04b78af to 3a04aae Compare February 28, 2022 09:02
@liwenhui-soul

Why don't you merge them into one commit? It's hard to read.

@pengweisong pengweisong force-pushed the persist-partition-learner branch from ad43bb3 to 98a2dae Compare March 1, 2022 04:01
@liwenhui-soul

Did you consider removing parts in metaclient?

@pengweisong

@liwenhui-soul Metaclient only calculates the diff between two versions of the local cache, but when a peer is added, we do not update the local cache.
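
A minimal sketch of that reasoning, assuming each cache version can be reduced to the set of part ids hosted locally. `PartDiff` and `diffParts` are hypothetical names, not the metaclient API.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

using PartId = int;

// Hypothetical result type: parts that appeared or disappeared between
// two versions of the local cache.
struct PartDiff {
  std::vector<PartId> added;
  std::vector<PartId> removed;
};

// Compares two cache versions. std::set iterates in sorted order, which is
// what std::set_difference requires.
PartDiff diffParts(const std::set<PartId>& oldCache,
                   const std::set<PartId>& newCache) {
  PartDiff diff;
  std::set_difference(newCache.begin(), newCache.end(),
                      oldCache.begin(), oldCache.end(),
                      std::back_inserter(diff.added));
  std::set_difference(oldCache.begin(), oldCache.end(),
                      newCache.begin(), newCache.end(),
                      std::back_inserter(diff.removed));
  return diff;
}
```

If adding a peer leaves the cache unchanged, both versions are identical and the diff is empty, so a diff-based path of this kind would neither add nor remove the part.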

@critical27 critical27 left a comment

Generally LGTM~

@pengweisong pengweisong force-pushed the persist-partition-learner branch from a05bdf4 to e71482f Compare March 15, 2022 07:06
critical27
critical27 previously approved these changes Mar 15, 2022

@critical27 critical27 left a comment

Well done... This bug fix is way more complicated than I expected.

Labels
ready-for-testing PR: ready for the CI test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor meta part table
8 participants