
Task controller for multi-pod jobs in cass-operator #243

Merged (31 commits, Jan 8, 2022)

Conversation

burmanm (Contributor) commented Dec 7, 2021

What this PR does:
Runs jobs serially in every pod that is part of the Datacenter, using either the async or sync functions (depending on the version of the Cassandra container).

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@burmanm burmanm requested a review from a team as a code owner December 7, 2021 14:59
@jsanda jsanda requested review from jsanda and removed request for a team December 9, 2021 02:16
apis/control/v1alpha1/cassandratask_types.go
//+kubebuilder:rbac:groups=control.k8ssandra.io,namespace=cass-operator,resources=cassandratasks/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=control.k8ssandra.io,namespace=cass-operator,resources=cassandratasks/finalizers,verbs=update

// Do we need to repeat this? It's already on the cassandradatacenter_controller.go

jsanda (Contributor):

Good question. My guess is no, but I think it is good to leave it because 1) it provides documentation and 2) it avoids an unnecessary coupling with cassandradatacenter_controller.go.

Additional resolved review threads:
apis/control/v1alpha1/cassandratask_types.go
controllers/control/cassandratask_controller.go
// It is represented in RFC3339 form and is in UTC.
// The completion time is only set when the job finishes successfully.
// +optional
CompletionTime *metav1.Time `json:"completionTime,omitempty"`

jsanda (Contributor):

"The completion time is only set when the job finishes successfully."

Is this correct? If I understand reconcileEveryPodTask correctly, it looks like the job should eventually reach a completion state even if it has failed.

burmanm (Contributor Author):

No, that text was copied over from Job. I don't know yet how to approach this: should it be "CompletionTime + Failed count" to indicate that the job failed, or how else would one check that it will no longer be executed (there are cases where we won't retry)?
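
A rough sketch of the first option above, assuming hypothetical field names modeled on the Kubernetes Job status: the task would count as finished once CompletionTime is set, and as failed when the Failed counter is non-zero.

package control

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CassandraTaskStatus is a trimmed, illustrative version of the status struct
// under discussion, not the final API.
type CassandraTaskStatus struct {
	// CompletionTime is set once the task will no longer be executed,
	// regardless of whether it succeeded.
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`
	// Failed counts pods on which the job ended in error.
	Failed int `json:"failed,omitempty"`
}

// Finished reports whether the task has stopped executing.
func (s *CassandraTaskStatus) Finished() bool {
	return s.CompletionTime != nil
}

// Succeeded reports whether the task finished without any pod-level failures.
func (s *CassandraTaskStatus) Succeeded() bool {
	return s.Finished() && s.Failed == 0
}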

burmanm (Contributor Author) commented Dec 10, 2021:

Trying to write down missing features:

  • TTL to delete expired tasks (a rough sketch follows after this list)
  • Output modifications with kubebuilder annotations so that the display looks nice with kubectl
  • Condition updates to Status
  • Partial failure mode and CompletionTime
  • Decommission support (requires changes elsewhere in the cass-operator)
  • More than one job? (and its failure mode)
  • Event support to track the process for some tools
  • RetryPolicy extensions? MaxRetries and maxMissedSeconds for a scheduled time?
  • ScheduledTimeWindow? Only start jobs inside a given time window and otherwise wait?
  • Webhooks for validation
  • CassandraTaskStatus still has fields from the Job definition that will not be correct

None of these are critical, but they are things I had planned.
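
A purely illustrative sketch of the first item, modeled on the ttlSecondsAfterFinished field of Kubernetes Jobs; the field name here is an assumption, not a decided API.

package control

// CassandraTaskSpec is trimmed to the single hypothetical field discussed above.
type CassandraTaskSpec struct {
	// TTLSecondsAfterFinished limits the lifetime of a task that has finished
	// execution. Once the TTL expires, the task becomes eligible for deletion.
	// +optional
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`

	// ...other fields elided
}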

)

func createDatacenter(dcName, namespace string) func() {
return func() {

jsanda (Contributor):

Why does this need to be wrapped in a function?

burmanm (Contributor Author):

Because of the way ginkgo works (Describe("...", func() { ... })), and that's how it was previously used, until I refactored a bit to do it in the BeforeSuite instead of inside the test.
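
A minimal sketch of the pattern being described, with illustrative names and assuming the ginkgo v1 import path: returning a closure lets the same setup be handed to BeforeSuite (or used inside a Describe) without executing it at package load time.

package control_test

import (
	. "github.com/onsi/ginkgo"
)

// createDatacenter returns a closure so the setup runs when the suite runs,
// not when the package is loaded; the body is elided here.
func createDatacenter(dcName, namespace string) func() {
	return func() {
		// build and create the CassandraDatacenter fixture with the envtest client
	}
}

// The closure can then be registered directly:
var _ = BeforeSuite(createDatacenter("dc1", "test-ns"))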

]
}`

var jobDetailsCompleted = `{"submit_time":"1638545895255","end_time":"1638545895255","id":"%s","type":"Cleanup","status":"COMPLETED"}`

jsanda (Contributor):

When I first saw this I wondered whether there should be separate job details for the rebuild, but looking again at reconcileEveryPodTask I see that it only checks the status. Is type actually used anywhere?

burmanm (Contributor Author):

Not by cass-operator. But if one had a "getAllJobs" endpoint in the management-api, you could create an external tool to monitor the jobs from the pods.
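
For reference, a small standalone sketch of decoding the job details payload from the test fixture above; the struct and field names simply mirror that JSON and are not cass-operator's own types.

package main

import (
	"encoding/json"
	"fmt"
)

// jobDetails mirrors the fields of the fixture string above.
type jobDetails struct {
	SubmitTime string `json:"submit_time"`
	EndTime    string `json:"end_time"`
	ID         string `json:"id"`
	Type       string `json:"type"`
	Status     string `json:"status"`
}

func main() {
	raw := `{"submit_time":"1638545895255","end_time":"1638545895255","id":"42","type":"Cleanup","status":"COMPLETED"}`

	var d jobDetails
	if err := json.Unmarshal([]byte(raw), &d); err != nil {
		panic(err)
	}
	// An external tool could poll these per pod via the management-api.
	fmt.Printf("job %s (%s): %s\n", d.ID, d.Type, d.Status)
}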

return taskKey, task
}

var _ = Describe("Execute jobs against all pods", func() {

jsanda (Contributor):

Thanks for writing these tests. This is good stuff. There are probably more scenarios we could test, but they could make for good follow up PRs, particularly for someone else looking to get more familiar with the code.

burmanm (Contributor Author):

There are more scenarios, but writing them as envtest tests is difficult, for example "prevent this from running concurrently" / "allow running concurrently".

pod.Annotations = make(map[string]string)
}

if podJobId, found := pod.Annotations[podJobIdAnnotation]; found {

jsanda (Contributor):

I have been testing your branch with my k8ssandra-operator branch and hit some issues. I am testing adding a new DC to an existing cluster. I first tested with a single-node DC. I grepped Cassandra's logs to confirm that the rebuild operation ran. Yay for that!

I then decided to try my test with 2 nodes. When the new DC, dc2, comes up in this case, cleanup runs first. When cleanup finishes, the job id and job status annotations are still present on the pods. The rebuild task is created and reported as finished, but no rebuild operations are actually performed. This is because at line 354 we check for the job id annotation. It is already present from cleanup, so we assume that the job is done.

Once I realized what was going on, I thought I might have a quick fix by simply adding a statement to delete the job id annotation at line 384. Unfortunately it leads to additional problems.

This situation made it abundantly clear that there needs to be an integration test for this, i.e., two tasks run in succession.

Removing both the job id and job status annotations when the job completes won't work either, because on a subsequent reconciliation we won't be able to determine if the job has already run on that pod. I think that storing the completed pods in the CassandraTask status would make this easier.

burmanm (Contributor Author):

I think I'll handle this the same way Kubernetes handles affinity status in annotations (JSON-serialized job information).
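
A rough sketch of that approach, using the annotation shape that appears later in this thread; the helper names are assumptions, not the PR's actual functions.

package control

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// jobStatus is the JSON-serialized per-task job state stored on each pod.
type jobStatus struct {
	JobID      string `json:"jobId"`
	JobStatus  string `json:"jobStatus"`
	JobHandler string `json:"jobHandler"`
}

// jobAnnotationKey keys the annotation by the task UID so two tasks running
// against the same datacenter never clash on a shared annotation.
func jobAnnotationKey(taskUID string) string {
	return "control.k8ssandra.io/job-" + taskUID
}

// setJobStatus records the job state for one task on the pod.
func setJobStatus(pod *corev1.Pod, taskUID string, status jobStatus) error {
	b, err := json.Marshal(status)
	if err != nil {
		return err
	}
	if pod.Annotations == nil {
		pod.Annotations = make(map[string]string)
	}
	pod.Annotations[jobAnnotationKey(taskUID)] = string(b)
	return nil
}

// getJobStatus reads back the job state for one task, if present.
func getJobStatus(pod *corev1.Pod, taskUID string) (*jobStatus, error) {
	raw, found := pod.Annotations[jobAnnotationKey(taskUID)]
	if !found {
		return nil, nil
	}
	var status jobStatus
	if err := json.Unmarshal([]byte(raw), &status); err != nil {
		return nil, err
	}
	return &status, nil
}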

jsanda (Contributor):

If we update TaskConfig to include the task name, then we might be able to add a check: if both the job id and job status are set but belong to a task other than the current one, then we know we can proceed with execution. I can try to test this out today.

burmanm (Contributor Author):

Well, I did get it to run twice in the test at least with my approach. The current commit is a bit hacky, but I don't have the energy to make it pretty tonight - I'll continue tomorrow. But it seems to pass the test with two tasks in the same datacenter at least.

jsanda (Contributor):

Thanks for pushing the commit. I tested and it didn't resolve the issue.

burmanm (Contributor Author):

Hmm, I can't see them clashing at least on the annotations part:

Annotations:  control.k8ssandra.io/job-4d7dc3a5-68c4-40cc-9136-cd9b3daac215:
                {"jobId":"448f3370-2aab-42de-8262-257911379814","jobStatus":"COMPLETED","jobHandler":"management-api"}
              control.k8ssandra.io/job-93af73b5-12aa-446e-9e12-720a85227e66:
                {"jobId":"964845be-a7dd-4c1c-977c-08988d8dbf1c","jobStatus":"COMPLETED","jobHandler":"management-api"}

burmanm (Contributor Author), Dec 13, 2021:

I also see that the cleanup endpoint is called multiple times. Hmm..

INFO  [nioEventLoopGroup-2-1] 2021-12-13 10:04:59,643 Cli.java:617 - address=/10.244.1.2:35656 url=/api/v0/metadata/versions/features status=200 OK
INFO  [nioEventLoopGroup-2-2] 2021-12-13 10:04:59,646 Cli.java:617 - address=/10.244.1.2:35658 url=/api/v1/ops/keyspace/cleanup status=202 Accepted
INFO  [nioEventLoopGroup-2-1] 2021-12-13 10:04:59,670 Cli.java:617 - address=/10.244.1.2:35660 url=/api/v0/metadata/versions/features status=200 OK
INFO  [nioEventLoopGroup-2-2] 2021-12-13 10:04:59,672 Cli.java:617 - address=/10.244.1.2:35662 url=/api/v0/ops/executor/job status=200 OK
INFO  [nioEventLoopGroup-2-1] 2021-12-13 10:04:59,708 Cli.java:617 - address=/10.244.1.2:35664 url=/api/v0/metadata/versions/features status=200 OK
INFO  [nioEventLoopGroup-2-2] 2021-12-13 10:05:03,608 Cli.java:617 - address=/10.244.4.1:37512 url=/api/v0/probes/readiness status=200 OK
INFO  [nioEventLoopGroup-2-1] 2021-12-13 10:05:09,670 Cli.java:617 - address=/10.244.1.2:35666 url=/api/v0/metadata/versions/features status=200 OK

This was with live pods, the third task to execute.

func (r *CassandraTaskReconciler) activeTasks(ctx context.Context, dc *cassapi.CassandraDatacenter) ([]api.CassandraTask, error) {
var taskList api.CassandraTaskList
matcher := client.MatchingLabels(utils.MergeMap(dc.GetDatacenterLabels(), map[string]string{taskStatusLabel: activeTaskLabelValue}))
if err := r.Client.List(ctx, &taskList, client.InNamespace(dc.Namespace), matcher); err != nil {

jsanda (Contributor):

This always returns an empty result set because there are no datacenter labels on a CassandraTask. You need to remove the datacenter labels from the matcher, and then I would filter the results based on the .Spec.Datacenter field.
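
A sketch of that suggestion, assuming Spec.Datacenter is an object reference carrying the datacenter name; the surrounding types and label constants are taken from the snippet above, so treat this as illustrative rather than final code.

// activeTasks lists active CassandraTasks in the datacenter's namespace and
// filters them by Spec.Datacenter instead of datacenter labels.
func (r *CassandraTaskReconciler) activeTasks(ctx context.Context, dc *cassapi.CassandraDatacenter) ([]api.CassandraTask, error) {
	var taskList api.CassandraTaskList
	matcher := client.MatchingLabels(map[string]string{taskStatusLabel: activeTaskLabelValue})
	if err := r.Client.List(ctx, &taskList, client.InNamespace(dc.Namespace), matcher); err != nil {
		return nil, err
	}

	tasks := make([]api.CassandraTask, 0, len(taskList.Items))
	for _, task := range taskList.Items {
		// Keep only the tasks targeting this datacenter.
		if task.Spec.Datacenter.Name == dc.Name {
			tasks = append(tasks, task)
		}
	}
	return tasks, nil
}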

}
return ctrl.Result{RequeueAfter: 1 * time.Second}, nil
} else if details.Status == podJobError {
// Log the error, move on

jsanda (Contributor):

When the job finishes, whether successfully or not, the job annotation never gets removed from the pod, so you end up with an accumulation of job annotations, e.g.,

apiVersion: v1
kind: Pod
metadata:
  annotations:
    control.k8ssandra.io/job-0f003348-33db-4272-bfcd-790ad27785cf: '{"id":"0fc51790-c522-4f74-9d87-a93a08d252af","status":"COMPLETED","handler":"management-api"}'
    control.k8ssandra.io/job-91f0112a-5cf6-4e65-b889-54d6676d2d8d: '{"id":"c38a54e5-5a57-4d2e-9067-7957370466cf","status":"COMPLETED","handler":"management-api"}'
    control.k8ssandra.io/job-b7946ffd-2456-4544-bab8-c6f314c6fcc4: '{"id":"8b04c4f2-632b-429d-b678-aeeefe5489a6","status":"COMPLETED","handler":"management-api"}'
  creationTimestamp: "2021-12-13T18:29:52Z"
  generateName: test-dc2-default-sts-

We should be removing those once it is safe to do so.

Since the annotation refers to the CassandraTask uid, do you think it might provide more clarity to change the prefix to task, e.g., control.k8ssandra.io/task-b7946ffd-2456-4544-bab8-c6f314c6fcc4?

burmanm (Contributor Author):

I actually need to add those annotations (so that all resources have those annotations).

controllers/control/cassandratask_controller.go
// We weren't resuming from a stopped state, so we must be growing the
// size of the rack
// size of the rack and this isn't the initialization stage

jsanda (Contributor), Dec 13, 2021:

Currently (in master) the scaling up condition does get set to true initially, before the datacenter is ready. Are there any scenarios in which the Ready condition could be false and scaling up would happen? I assume no, but want to be sure. If it is possible, would it be better/safer to handle this in the CheckClearActionConditions function? That function could be updated to only call cleanupAfterScaling after the Initialized condition is true.

burmanm (Contributor Author):

No, there's no such thing. This is the only place where ScalingUp is set (and I don't think it should be set when the cluster is starting).

jsanda (Contributor):

My main point is that ScalingUp is set before the datacenter becomes ready, or more precisely, before the Initialized condition is set. What is the rationale for changing that existing behavior?

burmanm (Contributor Author):

Because creating a new cluster is not scaling up an existing one. I don't feel like these two should share the same status, especially if we have a different process for the two cases.


// TrackedTasks tracks the tasks for completion that were created by the cass-operator
// +optional
TrackedTasks []corev1.ObjectReference `json:"trackedTasks,omitempty"`

jsanda (Contributor):

Since you have added the DC labels to the CassandraTask, is this still necessary? I see where it is used in reconcile_racks.go, but that could be replaced with a query to fetch the CassandraTask. There are a couple of benefits. First, it eliminates and therefore simplifies some code. Second, it cuts down on requeues, which is big. The status updates for a task would trigger a lot of unnecessary requeues, not only for cass-operator but also for k8ssandra-operator.

burmanm (Contributor Author):

Status updates should not trigger a reconcile. If that happens, it's a bug in the k8ssandra-operator and means the watcher is missing a version check.

jsanda (Contributor):

TBH I forgot about GenerationChangedPredicate when I wrote my comment. Status updates won't trigger a reconcile for cass-operator, but they will for k8ssandra-operator, since the latter does not use a GenerationChangedPredicate for its CassandraDatacenter watch. This is by design though: k8ssandra-operator doesn't use GenerationChangedPredicate since it does not manage the CassandraDatacenter, and we do want status updates to trigger reconciliation in k8ssandra-operator. Aside from requeues (or lack thereof), I would still be in favor of removing TrackedTasks since the information is already available through the CassandraTask.

burmanm (Contributor Author):

That information isn't actually available, since TrackedTasks only includes tasks generated by cass-operator, while fetching by labels would give back all the tasks generated by anyone. At that point, I would need to know which tasks were created before the operation I ran in cass-operator and which after. I don't have that information at the moment.


// TODO Add conditions also
if err = r.Client.Status().Update(ctx, &cassTask); err != nil {
return ctrl.Result{}, err

jsanda (Contributor):

At one point I tested with a bad management-api image, which resulted in an error when taskConfig.AsyncFunc is called. The failure was due to the RPC endpoint not being properly exposed. The error happens against the first pod; subsequent pods are never tried. At some point the job should be marked failed. k8ssandra-operator ends up in a requeue loop waiting for the job to complete, but it never does.

burmanm (Contributor Author):

It seems my reply never got here, odd. Well, the process catches failed jobs and exits with them correctly, and cass-operator should kill pods that are not healthy. However, you hit an edge case where the management-api is simply not working correctly but cass-operator is not killing the pod.
