Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support passing a slot range in CLUSTERX MIGRATE #2355

Open
1 of 2 tasks
PragmaTwice opened this issue Jun 7, 2024 · 12 comments
Open
1 of 2 tasks

Support passing a slot range in CLUSTERX MIGRATE #2355

PragmaTwice opened this issue Jun 7, 2024 · 12 comments
Assignees
Labels
enhancement type enhancement feature type new feature help wanted Good for newcomers

Comments

@PragmaTwice
Copy link
Member

PragmaTwice commented Jun 7, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

Motivation

In slot migration, many users would like to be able to pass a slot range to migrate in batches.

Solution

--

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@PragmaTwice PragmaTwice added enhancement type enhancement help wanted Good for newcomers feature type new feature labels Jun 7, 2024
@PragmaTwice PragmaTwice changed the title Support passing a slot range in SLOT MIGRATE Support passing a slot range in CLUSTERX MIGRATE Jun 7, 2024
@PokIsemaine
Copy link
Contributor

I'd like to learn kvrocks' cluster design by solving this problem, if it's suitable please arrange it for me. At the same time, if you have relevant information, please list it. Thank you very much!

@git-hulk
Copy link
Member

@PokIsemaine Sure, thank you. Free feel to ping @PragmaTwice @git-hulk @caipengbo for help or confirm anything.

@caipengbo
Copy link
Contributor

@PokIsemaine I suggest that before you start writing code, outline your solution and let's discuss it together.

@PokIsemaine
Copy link
Contributor

First of all, I would like to discuss the use of commands, and noting that CLUSTERX SETSLOT has similar needs, I decided to refer to the following PR and issues.
#1412
#1414

The current form of CLUSTERX MIGRATE is as follows

CLUSTERX MIGRATE $SLOT_ID $DEST_NODE_ID [$SYNC_FLAG] [$TIMEOUT]

To be compatible with the previous command, we've only modified the form of content that SLOT_ID can accept, allowing it to accept a single slot or a range of slotrange, or both.

CLUSTERX MIGRATE 1 $dest_node_id
CLUSTERX MIGRATE "3 5 9 20-30" $dest_node_id

Do you think this is reasonable? If it makes sense, I'll continue to think about the specific design and implementation.

In addition, in the process of learning cluster, I found that the https://kvrocks.apache.org/docs/cluster/ documentation is not very user-friendly, for example, it does not tell CLUSTERX MIGRATE how to use it, command line arguments, etc., which may need to be improved in the future

@caipengbo
Copy link
Contributor

I agree with this way.

@PokIsemaine
Copy link
Contributor

PokIsemaine commented Jun 21, 2024

The next thing I'd like to discuss is a more specific design: how do we migrate when we have multiple slot ranges?

CLUSTERX MIGRATE "3 5 9 20-30" $dest_node_id

In the current implementation, migration-related functions, including SlotMigrationJob, are implemented for a single slot.

Some rough thoughts:

  • 1-SlotMigrationJob,1-Slot: Still using the existing single slot implementation, adding two layers of traversal of slot_ranges and slot_range.
    • It looks like this is how SETSLOT works, but does it cause too much communication for migrate?
    • At the same time, I noticed that there is a mechanism similar to pipeline, but I don't know much about its design and function, do you have any relevant issues/PR descriptions? Does it alleviate these problems?
  • 1-SlotMigrationJob,1-Slot Range: Migrate at the slot range granularity, which requires a lot of modifications
  • 1-SlotMigrationJob,N-Slot Range: Migrate at the slot ranges granularity

Which option do you think is more suitable? If you have a better plan, you can also come up with it, thank you very much!

@caipengbo
Copy link
Contributor

caipengbo commented Jun 21, 2024

I prefer 1-SlotMigrationJob,1-Slot Range. We can seek only once for a range (the same namespace), making the changes relatively small and making it easier to determine whether a slot belongs to that slot range. HDYT @PragmaTwice @git-hulk

Users are encouraged to migrate slots by slot range, and we can set a maximum number of ranges if the slot passed by the user is not contiguous.

@git-hulk
Copy link
Member

@PokIsemaine @caipengbo To make it simpler, perhaps another approach is implementing a single continuous slot range migration first. That said, we allow to pass 1-N but 2 4 5 is disallowed. How do you think?

@caipengbo
Copy link
Contributor

caipengbo commented Jun 21, 2024

I think ok, just pass a valid slot range. We can extend it in the future to make it more diverse.

@PragmaTwice
Copy link
Member Author

I'm not sure about how the current implementation is and how it handles with migration errors and rollbacks.

But from my side it should not be so difficult to implement the design by @PokIsemaine. Of course we can also concentrate on extend from one slot to a slot range.

@PokIsemaine
Copy link
Contributor

@PokIsemaine @caipengbo To make it simpler, perhaps another approach is implementing a single continuous slot range migration first. That said, we allow to pass 1-N but 2 4 5 is disallowed. How do you think?

I want to confirm what you mean

// support
CLUSTERX MIGRATE 3 $dest_node_id
CLUSTERX MIGRATE "20-30" $dest_node_id
 not support
CLUSTERX MIGRATE "3 5 9" $dest_node_id
CLUSTERX MIGRATE "3 5 9 20-30" $dest_node_id
CLUSTERX MIGRATE "5-10 20-30" $dest_node_id

If so, I think we can do this first, and a single slot can also be unified into a single slot range.

Another question is how do we handle failure situations? If some slots fail, how do we handle it?
It seems that status is now tracked by individual slots, such as cluster info information.

127.0.0.1:6666> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0 
cluster_slots_fail:0
cluster_known_nodes:3
cluster_size:3
cluster_current_epoch:3
cluster_my_epoch:3
importing_slot: 7629  // single slot => slot range?
import_state: success // single slot => slot range?

@caipengbo
Copy link
Contributor

caipengbo commented Jun 24, 2024

how do we handle failure situations?

The migration slot range is migrated from front to back, if a slot fails, [start_slot, failed_slot) should be successful, [failed_slot, end_slot) should fail. If the slot is found to have failed, the migration is stopped.

I think, we can add this information to cluster info, like this:

successful_slot: [start_slot, failed_slot)
failed_slot: [failed_slot, end_slot)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement type enhancement feature type new feature help wanted Good for newcomers
Projects
None yet
Development

No branches or pull requests

4 participants