Make bootstrap operations deterministic #403

adutra · 2022-09-07T09:18:21Z

What this PR does:
Makes bootstrap order 100% balanced and deterministic.

Which issue(s) this PR fixes:
Fixes #381.

Checklist

Changes manually tested
Automated Tests added/updated
Documentation added/updated
CHANGELOG.md updated (not required for documentation PRs)
CLA Signed: DataStax CLA

burmanm · 2022-09-12T14:48:45Z

If you wish to write a test for this (using fakeclient), it's possible with some helpers already in cass-operator. For example, https://github.com/k8ssandra/cass-operator/blob/master/pkg/reconciliation/reconcile_racks_test.go#L1625 calls startCassandra, but the more interesting one for you is of course https://github.com/k8ssandra/cass-operator/blob/master/pkg/reconciliation/reconcile_racks_test.go#L1406

It shows how to mock the pod endpoint that the fakeclients then call, without doing too much coding. It's not intuitive at all, but copy-pasting parts of those existing tests usually ends up making one understand how to achieve something. You can also run a single test from the editor, which makes iterating quite quick.

adutra · 2022-09-12T18:28:37Z

@burmanm thank you. I managed to create a unit test that is not too lame.

Would you mind reviewing this? It should be a quick review.

burmanm · 2022-09-13T11:43:01Z

pkg/reconciliation/reconcile_racks.go

@@ -617,11 +618,13 @@ func (rc *ReconciliationContext) CheckPodsReady(endpointData httphelper.CassMeta
 		return result.RequeueSoon(5)
 	}

-	needsMoreNodes, err := rc.startAllNodes(endpointData)
+	// step 4 - make sure all nodes are ready


Now the step 3 comment is wrong as it used to be this step.

OK will fix comment for step 3

burmanm · 2022-09-13T12:19:34Z

pkg/reconciliation/reconcile_racks.go


-	labelSeedBeforeStart := readySeeds == 0 && len(rc.Datacenter.Spec.AdditionalSeeds) == 0 && externalSeedPoints == 0
+	for podRankWithinRack := 0; ; podRankWithinRack++ {


Out of interest, could you explain the benefit of this versus, let's say, the following algorithm? (The problem here looks very much like dynamic programming.)

	smallestRack := 0
	for i := 0; i < len(rc.statefulSets); i++ {
		if rc.statefulSets[i].Status.ReadyReplicas < rc.statefulSets[smallestRack].Status.ReadyReplicas {
			smallestRack = i
		}
	}
	fmt.Printf("Next rack to start is: %d\n", smallestRack)

Also, is .Replicas really the number you want to compare to vs. ReadyReplicas?
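For readers who want to try the alternative suggested above in isolation, here is a minimal sketch on plain integers. The function name `smallestRack` and the `[]int32` input are assumptions made for illustration; they stand in for the per-StatefulSet `Status.ReadyReplicas` values and are not part of cass-operator.

```go
package main

import "fmt"

// smallestRack returns the index of the rack with the fewest ready replicas.
// Ties go to the lowest index because the comparison is strict.
// It returns -1 when there are no racks at all.
// Hypothetical helper: the slice stands in for Status.ReadyReplicas values.
func smallestRack(readyReplicas []int32) int {
	if len(readyReplicas) == 0 {
		return -1
	}
	smallest := 0
	for i, ready := range readyReplicas {
		if ready < readyReplicas[smallest] {
			smallest = i
		}
	}
	return smallest
}

func main() {
	fmt.Printf("Next rack to start is: %d\n", smallestRack([]int32{2, 1, 1}))
}
```

The strict comparison is what makes the choice deterministic when several racks are tied.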

First, my understanding is that we want here to use the desired number of pods for each sts, thus we should use .Replicas.

Second, switching to .ReadyReplicas could indeed simplify the code, but can we guarantee that if a pod is counted as a ready replica, it is not only ready, but also started and has joined the ring? Because if that's not the case, we should NOT keep bootstrapping the next pods.

How would Replicas help with that behavior? Isn't .Spec.Replicas updated when we add the pods? They don't have to be ready at all. Our startup process should already prevent the next startup while there is one pod that's Started-but-not-Ready. For that reason, perhaps it doesn't matter which one we use, since by the time we get to this point we should have no starting pods anymore.

The exception is of course the algorithm I mentioned: there ReadyReplicas does matter, since Replicas would give the wrong calculation.

So, what should I do here? Should I change as you suggested above?

No, not necessarily. I only want to ensure that whoever reads it next can easily see what it's doing (and so avoid anything that looks pretty but complex). From that perspective, do you think the code is self-explanatory enough? The comment on the method only describes what we want to achieve, not how we achieve it.

To me, it is self-explanatory as it is now, however that's a subjective notion. You are never going to get everybody on the same page on such matters.
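To make the "how" discussed in this thread concrete, the sketch below illustrates iterating by pod rank within each rack, as the `podRankWithinRack` loop in the diff suggests. It is not the code from this PR: the `rack` type, the `nextPodToStart` and `bootstrapOrder` functions, and the `<rack>-<rank>` pod naming are all hypothetical, and the desired replica count is modeled as a plain integer.

```go
package main

import "fmt"

// rack is a hypothetical stand-in for a StatefulSet: a name, the desired
// number of replicas, and one "started" flag per pod ordinal.
type rack struct {
	name     string
	replicas int
	started  []bool
}

// nextPodToStart visits rank 0 of every rack, then rank 1 of every rack,
// and so on. It returns the rack index and rank of the first pod that has
// not been started, or false once every desired pod is started.
func nextPodToStart(racks []rack) (int, int, bool) {
	for rank := 0; ; rank++ {
		rankExists := false
		for i, r := range racks {
			if rank >= r.replicas {
				continue // this rack has no pod at this rank
			}
			rankExists = true
			if !r.started[rank] {
				return i, rank, true
			}
		}
		if !rankExists {
			return 0, 0, false // no rack is large enough to have this rank
		}
	}
}

// bootstrapOrder simulates starting one pod at a time and records the order.
func bootstrapOrder(names []string, replicas []int) []string {
	racks := make([]rack, len(names))
	for i := range names {
		racks[i] = rack{name: names[i], replicas: replicas[i], started: make([]bool, replicas[i])}
	}
	var order []string
	for {
		i, rank, found := nextPodToStart(racks)
		if !found {
			return order
		}
		racks[i].started[rank] = true
		order = append(order, fmt.Sprintf("%s-%d", racks[i].name, rank))
	}
}

func main() {
	// Prints [r1-0 r2-0 r3-0 r1-1 r2-1 r3-1 r2-2]
	fmt.Println(bootstrapOrder([]string{"r1", "r2", "r3"}, []int{2, 3, 2}))
}
```

Because the traversal depends only on the rack order and the desired replica counts, the same inputs always yield the same start order, and no rack gets more than one pod ahead of the others.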

jsanda · 2022-09-14T12:45:00Z

pkg/reconciliation/reconcile_racks_test.go

@@ -1676,15 +1675,21 @@ func TestFailedStart(t *testing.T) {

 func TestReconciliationContext_startAllNodes(t *testing.T) {

+	// A boolean representing the state of a pod (started or not).


Very helpful. Thank you!

adutra added 5 commits September 7, 2022 11:12

Make bootstrap operations deterministic

7ef3376

Also bootstrap extra pods that shouldn't exist

d658741

Better design around statefulsets

e6d932a

Start from zero

338dee9

Start from zero

248a513

adutra added 2 commits September 12, 2022 17:03

Revert changes

21cc2a9

Add unit test for startAllNodes

c17f80f

adutra force-pushed the bootstrap-order branch from b5ada74 to c17f80f Compare September 12, 2022 17:27

adutra marked this pull request as ready for review September 12, 2022 18:28

adutra requested a review from a team as a code owner September 12, 2022 18:29

Remove debug comment

9f84c24

This was referenced Sep 13, 2022

K8SSAND-1737 ⁃ StatefulSet restart can cause a bug in startNode process.. #392

Closed

K8SSAND-1740 ⁃ Document decommission operations #393

Closed

burmanm reviewed Sep 13, 2022

adutra added 4 commits September 13, 2022 15:13

Fix step 3 comment

e5647d5

Use distinct booleans [skip ci]

36d38a8

Remove named return parameters

3072002

Use getStatefulSetPodNameForIdx

215859f

jsanda reviewed Sep 14, 2022

Explain test struct

97adf47

jsanda reviewed Sep 14, 2022

burmanm approved these changes Sep 16, 2022

burmanm merged commit c767dda into k8ssandra:master Sep 16, 2022

jsanda mentioned this pull request Oct 28, 2022

K8SSAND-1862 ⁃ Decommissioning a DC doesn't finish when racks are used k8ssandra/k8ssandra-operator#746

Closed
