[RFC]: Cleanup processing replicas if the transferring client is down #975

@nickyc975

Description
Changes proposed

We noticed issue #974 and have also encountered the problem recently. We have implemented and verified a simple solution (sketches of the key pieces follow the list below):

  1. We refactored the client management logic in the master. All clients must register with the master before sending mount/put_start requests, so the master can keep a list of valid clients and detect faulted clients via ping timeouts.
  2. Clients send their client_id along with put_start requests, so the master can identify which client is responsible for transferring the data of a given object. When a client fault is detected, the master checks all transferring objects and discards the processing replicas of objects being written by the faulted client. Note that completed replicas are not discarded, so an object remains usable if it has at least one completed replica; otherwise the object is removed and other clients are allowed to insert the same key.
  3. There are cases where the faulted client is not actually terminated and is still writing data to discarded replicas. Therefore, discarded replicas are put into a staging list instead of being released immediately. After a long-enough, configurable timeout (e.g. 10 or 20 minutes), the discarded replicas can be released safely (see the cleanup sketch after this list).
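
A minimal sketch of the master-side client registry with ping-timeout detection described in item 1. All names here (ClientRegistry, ping, collect_faulted, the 30 s default) are illustrative assumptions, not the actual APIs or values used in the implementation:

```python
# Hypothetical master-side client registry with ping-timeout detection.
import time
import threading
from dataclasses import dataclass, field


@dataclass
class ClientInfo:
    client_id: str
    last_ping: float = field(default_factory=time.monotonic)


class ClientRegistry:
    """Tracks registered clients and flags those whose pings time out."""

    def __init__(self, ping_timeout_s: float = 30.0):
        self.ping_timeout_s = ping_timeout_s
        self._clients: dict[str, ClientInfo] = {}
        self._lock = threading.Lock()

    def register(self, client_id: str) -> None:
        # Clients must register before issuing mount/put_start requests.
        with self._lock:
            self._clients[client_id] = ClientInfo(client_id)

    def ping(self, client_id: str) -> bool:
        # Returns False for unknown clients so they know to re-register.
        with self._lock:
            info = self._clients.get(client_id)
            if info is None:
                return False
            info.last_ping = time.monotonic()
            return True

    def collect_faulted(self) -> list[str]:
        # Called periodically by the master: returns clients whose last ping
        # is older than the timeout and evicts them from the registry.
        now = time.monotonic()
        with self._lock:
            faulted = [
                cid for cid, info in self._clients.items()
                if now - info.last_ping > self.ping_timeout_s
            ]
            for cid in faulted:
                del self._clients[cid]
            return faulted
```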
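And a sketch of the cleanup path from items 2 and 3: discarding the faulted client's processing replicas into a staging list, dropping keys that have no completed replica left, and releasing staged replicas only after the configurable timeout. ReplicaState, ObjectMeta, release_space, and the 20-minute default are hypothetical placeholders for the real metadata structures; the sketch also assumes a key is written by a single client at a time:

```python
# Hypothetical cleanup path run by the master when a client fault is detected.
import time
from dataclasses import dataclass, field
from enum import Enum, auto


class ReplicaState(Enum):
    PROCESSING = auto()
    COMPLETED = auto()


@dataclass
class Replica:
    state: ReplicaState
    writer_client_id: str  # client_id sent with put_start


@dataclass
class ObjectMeta:
    key: str
    replicas: list[Replica] = field(default_factory=list)


def release_space(replica: Replica) -> None:
    """Placeholder for returning the replica's buffer to the allocator."""


class ObjectStoreMaster:
    def __init__(self, staging_release_timeout_s: float = 20 * 60):
        self.staging_release_timeout_s = staging_release_timeout_s
        self.objects: dict[str, ObjectMeta] = {}
        # Discarded replicas wait here until it is safe to reuse their space.
        self._staging: list[tuple[float, Replica]] = []

    def on_client_fault(self, client_id: str) -> None:
        now = time.monotonic()
        for key in list(self.objects):
            meta = self.objects[key]
            keep, discard = [], []
            for r in meta.replicas:
                if r.state is ReplicaState.PROCESSING and r.writer_client_id == client_id:
                    discard.append(r)
                else:
                    keep.append(r)
            meta.replicas = keep
            # Stage discarded replicas: the faulted client may still be writing.
            self._staging.extend((now, r) for r in discard)
            # If no completed replica remains, drop the key so other clients
            # can insert it again (assumes one writer per key).
            if not any(r.state is ReplicaState.COMPLETED for r in meta.replicas):
                del self.objects[key]

    def release_staged(self) -> None:
        # Called periodically: frees replicas staged longer than the timeout.
        now = time.monotonic()
        remaining = []
        for staged_at, replica in self._staging:
            if now - staged_at > self.staging_release_timeout_s:
                release_space(replica)
            else:
                remaining.append((staged_at, replica))
        self._staging = remaining
```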

Limitations

  1. Currently, there is no reliable way to prevent potential data corruption caused by reusing the space of discarded replicas. We can only set a long enough release timeout to minimize the probability.

Before submitting a new issue...

  • Make sure you already searched for relevant issues and read the documentation
