
feat: API for dynamic scaling of Sharded daemon process instances #31844

Merged: johanandren merged 50 commits into main from wip-dynamic-sdp on Mar 16, 2023

Conversation

@johanandren johanandren (Member) commented Feb 22, 2023

This allows a user to rescale the number of sharded daemon process workers dynamically, after starting the cluster.
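For orientation, here is a rough usage sketch of what the feature enables. Only initWithContext and ShardedDaemonProcessCommand are taken from this PR; the worker protocol, parameter order, context accessor names and the rescale message mentioned in the comment are assumptions and may not match the merged API exactly.

import akka.actor.typed.{ ActorSystem, Behavior }
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.ShardedDaemonProcess

// hypothetical worker protocol and behavior, for illustration only
sealed trait WorkerCommand
final case class ProcessTag(tag: String) extends WorkerCommand

def worker(processNumber: Int, totalProcesses: Int): Behavior[WorkerCommand] =
  Behaviors.receiveMessage { case ProcessTag(_) =>
    // each worker handles the slice of work that maps to its process number
    Behaviors.same
  }

def startWorkers(system: ActorSystem[_]): Unit = {
  // initWithContext hands the factory a context with the process number and the
  // current total, and returns a coordinator ref accepting ShardedDaemonProcessCommand
  // messages, so the number of workers can be changed after the cluster has started
  val coordinator =
    ShardedDaemonProcess(system).initWithContext[WorkerCommand](
      "tag-processor",
      4, // initial number of processes
      ctx => worker(ctx.processNumber, ctx.totalProcesses))

  // later, rescale dynamically (the exact command message name is an assumption):
  // coordinator ! ChangeNumberOfProcesses(newNumberOfProcesses = 8, replyTo)
}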

@patriknw patriknw (Member) left a comment

looks promising

// FIXME stash or deny new requests if rescale in progress?
def start(): Behavior[ShardedDaemonProcessCommand] = {
  replicatorAdapter.askGet(
    replyTo => Replicator.Get(key, Replicator.ReadLocal, replyTo),
patriknw (Member)

should use some kind of read and write majority to be safe

johanandren (Member Author)

I did a write-all so that I know it reaches all pingers, meaning a local read would be ok, but I'm not convinced that's a good solution for pausing the pingers anyway.

johanandren (Member Author)

Went with read/write majority and an explicit ping protocol in d608bee.
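For reference, a sketch of the corresponding majority read against the askGet snippet quoted above; the 5 second timeout and the response wrapper name are placeholders, not the PR's actual settings or messages.

import scala.concurrent.duration._

// majority read instead of ReadLocal, so a newly started coordinator sees the latest state
replicatorAdapter.askGet(
  replyTo => Replicator.Get(key, Replicator.ReadMajority(5.seconds), replyTo),
  response => InternalGetResponse(response)) // hypothetical wrapper message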

@johanandren johanandren (Member Author)

Two more things we may want here: a query message to ask the coordinator about the current number of workers, and an additional check of the revision from the process/entity id against the local ddata state before starting an entity (in case of delayed delivery of a ping).

@johanandren johanandren marked this pull request as ready for review March 13, 2023 16:20
@johanandren johanandren (Member Author)

Checking the revision/number of workers in the pingers and on worker start is now implemented.

I think I'll push querying the coordinator about current state to a separate PR, as this is large enough as it is and that is not a strictly required feature (although it would be nice, since it could allow for example looking at the current scale/state of a sharded daemon process in management or something like that).

@patriknw patriknw (Member) left a comment

Haven't reviewed the coordinator yet, but here is a first round of comments.

// use process n for shard id
def shardId(entityId: String): String = {
  if (supportsRescale) entityId.split(Separator)(2)
  else entityId
patriknw (Member)

Will this MessageExtractor work in the same way as the old one for a rolling update? Also for the case of switching from init to initWithContext (changing supportsRescale), so that we can support a rolling update that enables the new feature.

johanandren (Member Author)

Hmmm, the pingers live on the old nodes, so they may send the single-id ping until the roll is complete, while the extractor on the new nodes now supports rescale but can't get the revision and total count from anywhere.

Not sure what sharding does for a failing extractor, but we'd sort of want to drop those Start messages somehow when supportsRescale is on but a non-rescale message arrives. Once the roll is completed the pingers will live on new nodes and send the right Start messages.

Maybe we can parse such messages into revision -1, and then the startup check will cancel starting the worker and we'll be fine.
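A minimal sketch of that parsing idea, assuming the rescale-capable entity id layout is "revision, total count, process number" joined by a separator (consistent with shardId picking index 2 above); the names and the separator parameter are placeholders.

final case class DecodedId(revision: Long, totalCount: Int, processNumber: Int)

def decodeEntityId(entityId: String, separator: Char): DecodedId =
  entityId.split(separator) match {
    case Array(revision, total, n) =>
      DecodedId(revision.toLong, total.toInt, n.toInt)
    case Array(n) =>
      // plain single-id ping from a not-yet-rolled node: mark it with revision -1
      // so the startup check refuses to start the worker instead of failing
      DecodedId(-1L, -1, n.toInt)
  }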

johanandren (Member Author)

Added in 7e8904e but I need to test it out, and add a mention in docs.

johanandren (Member Author)

Something more around this: the cluster singleton will not start pinging nodes until all nodes are rolled, since it will run on the oldest node, unless you limit it with a role. In that scenario there could still be old nodes left when all nodes of a certain role have rolled, so there is probably a caveat to document for rolling upgrades there.

patriknw (Member)

Could we handle revision 0 by using the same entity ids as before, same as for the init case?

I think there is a high risk that users will switch from init to initWithContext later, not necessarily at the same time as bumping the Akka version.

@johanandren johanandren (Member Author) commented Mar 14, 2023

Went with just the plain id string for revision 0 in b1998c8. Should work, but I didn't figure out how to add a multi-node test covering it to be sure.

    initialNumberOfProcesses: Int,
    daemonProcessName: String,
    shardingRef: ActorRef[ShardingEnvelope[T]]): Behavior[ShardedDaemonProcessCommand] =
  Behaviors.setup { context =>
patriknw (Member)

Any risk of failure and need for restart supervision?

johanandren (Member Author)

Added in 662adcf


// first step of rescale, pause all pinger actors so they do not wake the entities up again
// when we stop them
private def pauseAllPingers(
patriknw (Member)

One option we have is to move the pinging to the coordinator. Might simplify the process. Drawback is less ping redundancy in case of network partitions.

johanandren (Member Author)

Mmm, that would certainly simplify things quite a bit, I think it is an attractive idea.

Worst case scenario is that a worker crashes in the face of a partition and isn't restarted until the partition heals, but you can work around that by making sure the worker doesn't crash, by supervising it with restart.
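For reference, a sketch of wrapping the worker behavior with restart supervision (standard Akka Typed supervision; the helper name is illustrative, not code from this PR):

import akka.actor.typed.{ Behavior, SupervisorStrategy }
import akka.actor.typed.scaladsl.Behaviors

def superviseWorker[T](worker: Behavior[T]): Behavior[T] =
  // restart on failure so a crashed worker does not stay down until the
  // coordinator pings it again, e.g. after a network partition heals
  Behaviors.supervise(worker).onFailure[Exception](SupervisorStrategy.restart)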

johanandren (Member Author)

Done in 034116a

    revision: Long,
    numberOfProcesses: Int,
    completed: Boolean,
    started: Instant)
patriknw (Member)

What do you think about making this a proper ReplicatedData? Merge should be dead simple since the revision is increasing and completed is changing from false to true.

johanandren (Member Author)

I don't have a great motivation but I don't think I ever did a custom ReplicatedData structure so I'll do it.

johanandren (Member Author)

Changed in d530fe8
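A minimal sketch of a ReplicatedData with the merge semantics described above; the type name mirrors the quoted state fields but is an assumption, not the code from d530fe8.

import java.time.Instant
import akka.cluster.ddata.ReplicatedData

final case class RescaleState(revision: Long, numberOfProcesses: Int, completed: Boolean, started: Instant)
    extends ReplicatedData {
  override type T = RescaleState

  // merge is simple because the revision only grows and completed only flips false -> true:
  // the higher revision wins, and for equal revisions the completed one wins
  override def merge(that: RescaleState): RescaleState =
    if (this.revision > that.revision) this
    else if (that.revision > this.revision) that
    else if (that.completed) that
    else this
}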

* INTERNAL API
*/
@InternalApi
trait ClusterShardingTypedSerializable
patriknw (Member)

private[akka] ?

johanandren (Member Author)

Fixed in a56a07b

replicatorAdapter.askUpdate(
  replyTo =>
    Replicator.Update(key, initialState, Replicator.WriteMajority(settings.rescaleWriteStateTimeout), replyTo)(
      (existingState: ShardedDaemonProcessState) => existingState.merge(newState)),
patriknw (Member)

It would be more correct usage to not merge here, because that is what the Replicator will do. This can just return newState.

Now we know that we are the single writer, but otherwise it would be existingState.startScalingTo(request.newNumberOfProcesses) here instead of changing that in advance.

johanandren (Member Author)

Yeah, I was scratching my head about what the right thing was here. I added that startScaling method for that reason, but then I need the resulting state later. I'll change it to just replace.
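Concretely, the change amounts to the modify function of the quoted Update just returning the new value; this is only a fragment mirroring that call, not the full askUpdate.

Replicator.Update(key, initialState, Replicator.WriteMajority(settings.rescaleWriteStateTimeout), replyTo)(
  // single writer: simply replace the stored value; the Replicator itself merges
  // this write with other replicas, so no explicit existingState.merge(newState) is needed
  _ => newState)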

johanandren (Member Author)

Done in a56a07b

      daemonProcessName,
      request.newNumberOfProcesses,
      newState.revision)
    prepareRescale(newState, request, currentState.numberOfProcesses)
patriknw (Member)

should it also cancel the Tick timer? I know it's ignored but anyway.

johanandren (Member Author)

Cancelled in a56a07b
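For reference, cancelling a named timer key with the standard TimerScheduler API; the Tick key and the surrounding behavior are illustrative, not the PR's coordinator code.

import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors

sealed trait KeepAliveCommand
case object Tick extends KeepAliveCommand

def pausedDuringRescale(): Behavior[KeepAliveCommand] =
  Behaviors.withTimers { timers =>
    // stop the periodic keep-alive while the rescale is in progress;
    // any Tick already in the mailbox is simply ignored below
    timers.cancel(Tick)
    Behaviors.receiveMessage { case Tick => Behaviors.same }
  }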

@patriknw patriknw (Member) left a comment

just one thing, then ready to merge

    totalProcesses: Int,
    name: String,
    revision: Long)
    extends ShardedDaemonProcessContext

def init[T](name: String, numberOfInstances: Int, behaviorFactory: Int => Behavior[T])(
patriknw (Member)

We are missing a corresponding initWithContext for this one. Good to have all 3 to make the "migration" seamless.

johanandren (Member Author)

I tried that but got compiler errors because of ambiguous overloads that I didn't figure out a way around, so I backed out of it completely. I can try giving it another shot.

johanandren (Member Author)

Done in 65acbec, and I noticed that I had left the old ddata state timeout config in place, so I removed that in ff0ae04.

@patriknw patriknw (Member) left a comment

LGTM

@johanandren johanandren merged commit 12a0ad4 into main Mar 16, 2023
@johanandren johanandren deleted the wip-dynamic-sdp branch March 16, 2023 09:58
@johanandren johanandren added this to the 2.8.0 milestone Mar 16, 2023
@patriknw patriknw (Member)

Great work @johanandren 🎉
