[RFC]: Cleanup processing replicas if the transferring client is down #975

@nickyc975

Description
Changes proposed

We noticed issue #974 and have also encountered the problem recently. We have implemented and verified a simple solution (sketches of the key pieces follow the list below):

  1. We refactored the client management logic in the master. All clients must register with the master before sending mount/put_start requests, so the master can keep a list of valid clients and detect faulted clients via ping timeouts.
  2. Clients send their client_id along with put_start requests, so the master can identify which client is responsible for transferring the data of a given object. When a client fault is detected, the master checks all transferring objects and discards the processing replicas of objects being written by the faulted client. Note that completed replicas are not discarded, so an object remains usable if it has at least one completed replica; otherwise the object is removed and other clients are allowed to insert the same key.
  3. There are cases where the faulted client is not actually terminated and is still writing data to discarded replicas. Therefore, discarded replicas are put into a staging list instead of being released immediately. After a long-enough, configurable timeout (e.g. 10 or 20 minutes), the discarded replicas can be released safely (see the cleanup sketch after this list).
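
A minimal sketch of the master-side client registry with ping-timeout detection described in item 1. All names here (ClientRegistry, ping, collect_faulted, the 30 s default) are illustrative assumptions, not the actual APIs or values used in the implementation:

```python
# Hypothetical master-side client registry with ping-timeout detection.
import time
import threading
from dataclasses import dataclass, field


@dataclass
class ClientInfo:
    client_id: str
    last_ping: float = field(default_factory=time.monotonic)


class ClientRegistry:
    """Tracks registered clients and flags those whose pings time out."""

    def __init__(self, ping_timeout_s: float = 30.0):
        self.ping_timeout_s = ping_timeout_s
        self._clients: dict[str, ClientInfo] = {}
        self._lock = threading.Lock()

    def register(self, client_id: str) -> None:
        # Clients must register before issuing mount/put_start requests.
        with self._lock:
            self._clients[client_id] = ClientInfo(client_id)

    def ping(self, client_id: str) -> bool:
        # Returns False for unknown clients so they know to re-register.
        with self._lock:
            info = self._clients.get(client_id)
            if info is None:
                return False
            info.last_ping = time.monotonic()
            return True

    def collect_faulted(self) -> list[str]:
        # Called periodically by the master: returns clients whose last ping
        # is older than the timeout and evicts them from the registry.
        now = time.monotonic()
        with self._lock:
            faulted = [
                cid for cid, info in self._clients.items()
                if now - info.last_ping > self.ping_timeout_s
            ]
            for cid in faulted:
                del self._clients[cid]
            return faulted
```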
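And a sketch of the cleanup path from items 2 and 3: discarding the faulted client's processing replicas into a staging list, dropping keys that have no completed replica left, and releasing staged replicas only after the configurable timeout. ReplicaState, ObjectMeta, release_space, and the 20-minute default are hypothetical placeholders for the real metadata structures; the sketch also assumes a key is written by a single client at a time:

```python
# Hypothetical cleanup path run by the master when a client fault is detected.
import time
from dataclasses import dataclass, field
from enum import Enum, auto


class ReplicaState(Enum):
    PROCESSING = auto()
    COMPLETED = auto()


@dataclass
class Replica:
    state: ReplicaState
    writer_client_id: str  # client_id sent with put_start


@dataclass
class ObjectMeta:
    key: str
    replicas: list[Replica] = field(default_factory=list)


def release_space(replica: Replica) -> None:
    """Placeholder for returning the replica's buffer to the allocator."""


class ObjectStoreMaster:
    def __init__(self, staging_release_timeout_s: float = 20 * 60):
        self.staging_release_timeout_s = staging_release_timeout_s
        self.objects: dict[str, ObjectMeta] = {}
        # Discarded replicas wait here until it is safe to reuse their space.
        self._staging: list[tuple[float, Replica]] = []

    def on_client_fault(self, client_id: str) -> None:
        now = time.monotonic()
        for key in list(self.objects):
            meta = self.objects[key]
            keep, discard = [], []
            for r in meta.replicas:
                if r.state is ReplicaState.PROCESSING and r.writer_client_id == client_id:
                    discard.append(r)
                else:
                    keep.append(r)
            meta.replicas = keep
            # Stage discarded replicas: the faulted client may still be writing.
            self._staging.extend((now, r) for r in discard)
            # If no completed replica remains, drop the key so other clients
            # can insert it again (assumes one writer per key).
            if not any(r.state is ReplicaState.COMPLETED for r in meta.replicas):
                del self.objects[key]

    def release_staged(self) -> None:
        # Called periodically: frees replicas staged longer than the timeout.
        now = time.monotonic()
        remaining = []
        for staged_at, replica in self._staging:
            if now - staged_at > self.staging_release_timeout_s:
                release_space(replica)
            else:
                remaining.append((staged_at, replica))
        self._staging = remaining
```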

Limitations

  1. Currently, there is no reliable way to prevent potential data corruption caused by reusing the space of discarded replicas. We can only set a long enough release timeout to minimize the probability.

Before submitting a new issue...

  • Make sure you already searched for relevant issues and read the documentation
