
Prevent concurrent topology parallelism changes #1353

Closed
billonahill opened this issue Sep 7, 2016 · 3 comments

@billonahill
Contributor

billonahill commented Sep 7, 2016

We need a way to ensure that multiple scaling events don't collide.

Related to #1292.

@billonahill billonahill changed the title Topology parallelism changes should be atomic Prevent concurrent topology parallelism changes Sep 14, 2016
@billonahill billonahill self-assigned this Sep 14, 2016
@billonahill
Contributor Author

Approach 1 - local lock with current packing plan comparison

  1. Client fetches the current packing plan from the state manager and sends it, along with the proposed changes, to the scheduler when invoking update.
  2. UpdateTopologyManager receives the request and takes out a local lock. If the lock can't be obtained, it returns a concurrent update exception response.
  3. With the local lock held, UpdateTopologyManager compares the packing plan from the request with what's in the state manager. This confirms that the system state hasn't been changed by another agent since the request was initiated. If the packing plans don't match, it returns a concurrent update exception response.
  4. UpdateTopologyManager updates the topology info in the state manager to invoke the scaling change.
  5. UpdateTopologyManager releases the local lock and returns success. (A rough sketch of this flow follows below.)
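
A minimal sketch of that flow, assuming placeholder types (StateManagerClient, PackingPlan, ConcurrentUpdateException) that stand in for the real SPI rather than the actual Heron classes:

```java
import java.util.concurrent.locks.ReentrantLock;

// Placeholder for whatever state manager interface the scheduler holds.
interface StateManagerClient {
  PackingPlan getPackingPlan(String topology);
  void setPackingPlan(String topology, PackingPlan plan);
}

// Placeholder value class; assume the real one implements a deep equals().
final class PackingPlan { }

class ConcurrentUpdateException extends Exception {
  ConcurrentUpdateException(String message) { super(message); }
}

public class UpdateTopologyManagerSketch {
  private final ReentrantLock updateLock = new ReentrantLock();
  private final StateManagerClient stateManager;

  public UpdateTopologyManagerSketch(StateManagerClient stateManager) {
    this.stateManager = stateManager;
  }

  public void updateTopology(String topology,
                             PackingPlan planSeenByClient,
                             PackingPlan proposedPlan) throws ConcurrentUpdateException {
    // Step 2: take the local lock, failing fast if another update is in flight.
    if (!updateLock.tryLock()) {
      throw new ConcurrentUpdateException("another update is already in progress");
    }
    try {
      // Step 3: confirm the plan in the state manager still matches the one the
      // client based its request on; if not, another agent changed it first.
      PackingPlan current = stateManager.getPackingPlan(topology);
      if (!current.equals(planSeenByClient)) {
        throw new ConcurrentUpdateException("packing plan changed since the request was built");
      }
      // Step 4: write the proposed plan to trigger the scaling change.
      stateManager.setPackingPlan(topology, proposedPlan);
    } finally {
      // Step 5: always release the local lock.
      updateLock.unlock();
    }
  }
}
```

The design point is the pairing of the local tryLock (serializing updates within one scheduler process) with the read-compare-write against the state manager (detecting updates initiated from elsewhere).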

Approach 2 - local lock with update request versioning

This approach is similar to Approach 1, except that an atomically incremented request id is used in place of the packing plan comparison.

  1. Client atomically increments an updateRequestId counter in the state manager.
  2. Client submits the update request, including the updateRequestId.
  3. UpdateTopologyManager receives the request and takes out a local lock. If the lock can't be obtained, it returns a concurrent update exception response.
  4. With the local lock held, UpdateTopologyManager compares the request's updateRequestId with the updateRequestId in the state manager. This confirms that the system state hasn't been changed by another agent since the request was initiated. If the updateRequestIds don't match, it returns a concurrent update exception response.
  5. Proceed as described in Approach 1 (the changed check is sketched below).
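
Building on the placeholder types from the Approach 1 sketch, only the check itself changes; getUpdateRequestId is an assumed method here, not an existing state manager call:

```java
public void updateTopology(String topology,
                           long updateRequestId,
                           PackingPlan proposedPlan) throws ConcurrentUpdateException {
  // Step 3: local lock, exactly as in Approach 1.
  if (!updateLock.tryLock()) {
    throw new ConcurrentUpdateException("another update is already in progress");
  }
  try {
    // Step 4: the client atomically incremented the counter before submitting;
    // if the counter has moved again, a newer request has superseded this one.
    long currentId = stateManager.getUpdateRequestId(topology);
    if (currentId != updateRequestId) {
      throw new ConcurrentUpdateException(
          "updateRequestId " + updateRequestId + " is stale, current is " + currentId);
    }
    // Step 5: proceed as in Approach 1 - write the proposed plan, then release the lock.
    stateManager.setPackingPlan(topology, proposedPlan);
  } finally {
    updateLock.unlock();
  }
}
```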

@avflor
Contributor

avflor commented Sep 14, 2016

@billonahill I prefer the second approach since it allows us to do some ordering of the update requests if needed. The UpdateTopologyManager could then detect when a particular request is out of order (e.g. its updateRequestId is much greater than the previous one it processed). I'm not sure if this is useful though. Just thinking.

@billonahill
Contributor Author

If we don't need to keep track of the last successfully handled updateRequestId, that simplifies things. So I was thinking that if the request's updateRequestId isn't the current one in the state manager, we fail. That's more aggressive in that we might fail even when handling the "next in line" request, but it's simple to implement and reason about. It's also easy to recover from with another request.

Since these ids are numerically increasing, it's tempting to use them to infer ordering, but I recommend we use them just as atomic optimistic locks on the request/response cycle.

If that's the case, it really does make Approaches 1 and 2 similar, except that 2 requires additional state storage.
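
To make the "optimistic lock" reading concrete, a rough sketch; compareAndSetUpdateRequestId is hypothetical (e.g. it could be backed by a versioned write in the underlying state manager):

```java
// The counter is only read and bumped via compare-and-set; ordering is never inferred.
long expected = stateManager.getUpdateRequestId(topology);
if (!stateManager.compareAndSetUpdateRequestId(topology, expected, expected + 1)) {
  // Another agent updated in between: fail this request; the client simply retries.
  throw new ConcurrentUpdateException("concurrent update detected, please retry");
}
```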
