Task controller for multi-pod jobs in cass-operator #243
Conversation
…gle pod runner. Fix a bug in the httphelper when fetching jobStatus
//+kubebuilder:rbac:groups=control.k8ssandra.io,namespace=cass-operator,resources=cassandratasks/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=control.k8ssandra.io,namespace=cass-operator,resources=cassandratasks/finalizers,verbs=update

Do we need to repeat this? It's already in cassandradatacenter_controller.go.
Good question. My guess is no, but I think it is good to leave it because 1) it provides documentation and 2) it avoids unnecessary coupling to cassandradatacenter_controller.go.
// It is represented in RFC3339 form and is in UTC.
// The completion time is only set when the job finishes successfully.
// +optional
CompletionTime *metav1.Time `json:"completionTime,omitempty"`
"The completion time is only set when the job finishes successfully."

Is this correct? If I understand reconcileEveryPodTask correctly, it looks like the job should eventually reach a completion state even if it has failed.
No, that text is copied over from Job. I don't know yet how to approach this: should it be "CompletionTime + Failed count" to notice that the job failed, or how else would one check that it will no longer be executed (there are cases where we won't retry)?
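A minimal sketch of the "CompletionTime + Failed count" idea, purely for illustration; the Succeeded/Failed counters and helper methods below are assumptions, not the fields in this PR:

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CassandraTaskStatus is a sketch only; the counter fields are assumed.
type CassandraTaskStatus struct {
	// CompletionTime is set once the task will no longer be executed,
	// regardless of whether it succeeded or failed.
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`
	Succeeded      int          `json:"succeeded,omitempty"`
	Failed         int          `json:"failed,omitempty"`
}

// Finished reports whether the task reached a terminal state.
func (s *CassandraTaskStatus) Finished() bool {
	return s.CompletionTime != nil
}

// FinishedSuccessfully distinguishes success from "completed with failures".
func (s *CassandraTaskStatus) FinishedSuccessfully() bool {
	return s.CompletionTime != nil && s.Failed == 0
}

With that split, CompletionTime can keep meaning "will not be executed again" while the Failed counter answers whether the run actually succeeded.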
…nd NodeMgmtClient initialization
Trying to write down missing features:
None critical, but something I had planned.
)

func createDatacenter(dcName, namespace string) func() {
	return func() {
Why does this need to be wrapped in a function?
Because of the way ginkgo works (Describe("...", func() {...})), and that's how it was previously used (until I refactored a bit to do it in the BeforeSuite instead of inside the test).
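For context, a minimal sketch of the ginkgo pattern being described; the spec and datacenter names are placeholders and the setup body is elided:

package controllers_test

import (
	. "github.com/onsi/ginkgo"
)

// createDatacenter returns a closure instead of doing the work immediately,
// so the setup only runs when ginkgo executes the node it is handed to
// (BeforeSuite, BeforeEach, or a call inside a test body).
func createDatacenter(dcName, namespace string) func() {
	return func() {
		// ... create the CassandraDatacenter and wait for it to become ready ...
	}
}

var _ = Describe("Execute jobs against all pods", func() {
	// Nothing happens at tree-construction time; the closure runs per spec.
	BeforeEach(createDatacenter("dc1", "default"))

	It("runs a task against every pod", func() {
		// ... create a CassandraTask and assert on its status ...
	})
})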
]
}`

var jobDetailsCompleted = `{"submit_time":"1638545895255","end_time":"1638545895255","id":"%s","type":"Cleanup","status":"COMPLETED"}`
When I first saw this I wondered whether there should be separate job details for the rebuild, but looking again at reconcileEveryPodTask I see that it only checks the status. Is type actually used anywhere?
Not by cass-operator. But if the management-api had a "getAllJobs" endpoint, you could build an external tool to monitor the jobs from the pods.
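For reference, a minimal sketch of parsing that job-details payload; the struct below is derived from the test fixture above and is not the httphelper's actual type:

package example

import (
	"encoding/json"
	"fmt"
)

// jobDetails mirrors the JSON in the jobDetailsCompleted fixture above.
type jobDetails struct {
	SubmitTime string `json:"submit_time"`
	EndTime    string `json:"end_time"`
	ID         string `json:"id"`
	Type       string `json:"type"`
	Status     string `json:"status"`
}

func main() {
	payload := `{"submit_time":"1638545895255","end_time":"1638545895255","id":"123","type":"Cleanup","status":"COMPLETED"}`

	var details jobDetails
	if err := json.Unmarshal([]byte(payload), &details); err != nil {
		panic(err)
	}
	// reconcileEveryPodTask only inspects the status; type is informational.
	fmt.Println(details.Type, details.Status)
}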
	return taskKey, task
}

var _ = Describe("Execute jobs against all pods", func() {
Thanks for writing these tests. This is good stuff. There are probably more scenarios we could test, but they could make for good follow up PRs, particularly for someone else looking to get more familiar with the code.
There are more scenarios, but writing them as envtests is difficult. Something like "prevent this from running concurrently" / "allow concurrent runs".
	pod.Annotations = make(map[string]string)
}

if podJobId, found := pod.Annotations[podJobIdAnnotation]; found {
I have been testing your branch with my k8ssandra-operator branch and hit some issues. I am testing adding a new DC to an existing cluster. I first tested with a single-node DC. I grepped Cassandra's logs to confirm that the rebuild operation ran. Yay for that!
I then decided to try my test with 2 nodes. When the new DC, dc2, comes up in this case, cleanup runs first. When cleanup finishes, the job id and job status annotations are still present on the pods. The rebuild task is created and reported as finished, but no rebuild operations are actually performed. This is because at line 354 we check for the job id annotation. It is already present from cleanup, so we assume that the job is done.
Once I realized what was going on, I thought I might have a quick fix by simply adding a statement to delete the job id annotation at line 384. Unfortunately it leads to additional problems.
This situation made it abundantly clear that there needs to be an integration test for this, i.e., two tasks run in succession.
Removing both the job id and job status annotations when the job completes won't work either because on a subsequent reconciliation we won't be able to determine if the job has already run on that pod. I think that storing the completed pods in the CassandraTask status would make this easier.
I think I'll handle this the same way as Kubernetes handles affinity status in annotations (json-serialized job information).
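A minimal sketch of that approach, one annotation per task UID holding JSON-serialized job state; the key and field names follow the annotations shown later in this thread, while the helper functions themselves are illustrative:

package example

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// jobStatus is the payload stored per task on each pod.
type jobStatus struct {
	JobID      string `json:"jobId"`
	JobStatus  string `json:"jobStatus"`
	JobHandler string `json:"jobHandler"`
}

// taskAnnotation keys the job state by the CassandraTask UID so two tasks
// running against the same pod never clash.
func taskAnnotation(taskUID string) string {
	return fmt.Sprintf("control.k8ssandra.io/job-%s", taskUID)
}

func setJobStatus(pod *corev1.Pod, taskUID string, status jobStatus) error {
	if pod.Annotations == nil {
		pod.Annotations = make(map[string]string)
	}
	data, err := json.Marshal(status)
	if err != nil {
		return err
	}
	pod.Annotations[taskAnnotation(taskUID)] = string(data)
	return nil
}

func getJobStatus(pod *corev1.Pod, taskUID string) (jobStatus, bool, error) {
	var status jobStatus
	raw, found := pod.Annotations[taskAnnotation(taskUID)]
	if !found {
		return status, false, nil
	}
	err := json.Unmarshal([]byte(raw), &status)
	return status, true, err
}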
If we update TaskConfig to include the task name, then we might be able to add a check to see whether the job id and job status are set but belong to a task other than the current one; in that case we would know we can proceed with execution. I can try to test this out today.
Well, I did get it to run twice in the test at least with my approach. The current commit is a bit hacky, but I don't have the energy to make it pretty tonight - I'll continue tomorrow. But it seems to pass the test with two tasks in the same datacenter at least.
Thanks for pushing the commit. I tested and it didn't resolve the issue.
Hmm, I can't see them clashing at least on the annotations part:
Annotations: control.k8ssandra.io/job-4d7dc3a5-68c4-40cc-9136-cd9b3daac215:
{"jobId":"448f3370-2aab-42de-8262-257911379814","jobStatus":"COMPLETED","jobHandler":"management-api"}
control.k8ssandra.io/job-93af73b5-12aa-446e-9e12-720a85227e66:
{"jobId":"964845be-a7dd-4c1c-977c-08988d8dbf1c","jobStatus":"COMPLETED","jobHandler":"management-api"}
I also see that the cleanup endpoint is called multiple times. Hmm..
INFO [nioEventLoopGroup-2-1] 2021-12-13 10:04:59,643 Cli.java:617 - address=/10.244.1.2:35656 url=/api/v0/metadata/versions/features status=200 OK
INFO [nioEventLoopGroup-2-2] 2021-12-13 10:04:59,646 Cli.java:617 - address=/10.244.1.2:35658 url=/api/v1/ops/keyspace/cleanup status=202 Accepted
INFO [nioEventLoopGroup-2-1] 2021-12-13 10:04:59,670 Cli.java:617 - address=/10.244.1.2:35660 url=/api/v0/metadata/versions/features status=200 OK
INFO [nioEventLoopGroup-2-2] 2021-12-13 10:04:59,672 Cli.java:617 - address=/10.244.1.2:35662 url=/api/v0/ops/executor/job status=200 OK
INFO [nioEventLoopGroup-2-1] 2021-12-13 10:04:59,708 Cli.java:617 - address=/10.244.1.2:35664 url=/api/v0/metadata/versions/features status=200 OK
INFO [nioEventLoopGroup-2-2] 2021-12-13 10:05:03,608 Cli.java:617 - address=/10.244.4.1:37512 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-1] 2021-12-13 10:05:09,670 Cli.java:617 - address=/10.244.1.2:35666 url=/api/v0/metadata/versions/features status=200 OK
This was with live pods, third task to execute.
func (r *CassandraTaskReconciler) activeTasks(ctx context.Context, dc *cassapi.CassandraDatacenter) ([]api.CassandraTask, error) {
	var taskList api.CassandraTaskList
	matcher := client.MatchingLabels(utils.MergeMap(dc.GetDatacenterLabels(), map[string]string{taskStatusLabel: activeTaskLabelValue}))
	if err := r.Client.List(ctx, &taskList, client.InNamespace(dc.Namespace), matcher); err != nil {
This always returns an empty result set because there are no datacenter labels on a CassandraTask. You need to remove the datacenter labels, and then I would filter the results based on the .Spec.Datacenter field.
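A minimal sketch of that suggestion as a variation of the snippet above; it reuses the identifiers from the excerpt and assumes Spec.Datacenter is an object reference naming the target datacenter:

func (r *CassandraTaskReconciler) activeTasks(ctx context.Context, dc *cassapi.CassandraDatacenter) ([]api.CassandraTask, error) {
	var taskList api.CassandraTaskList
	// Match only on the task status label; CassandraTasks carry no datacenter labels.
	matcher := client.MatchingLabels(map[string]string{taskStatusLabel: activeTaskLabelValue})
	if err := r.Client.List(ctx, &taskList, client.InNamespace(dc.Namespace), matcher); err != nil {
		return nil, err
	}

	tasks := make([]api.CassandraTask, 0, len(taskList.Items))
	for _, task := range taskList.Items {
		// Keep only the tasks that target this datacenter.
		if task.Spec.Datacenter.Name == dc.Name {
			tasks = append(tasks, task)
		}
	}
	return tasks, nil
}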
	}
	return ctrl.Result{RequeueAfter: 1 * time.Second}, nil
} else if details.Status == podJobError {
	// Log the error, move on
When the job finishes, whether it succeeds or fails, the job annotation never gets removed from the pod, so you end up with an accumulation of job annotations, e.g.:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    control.k8ssandra.io/job-0f003348-33db-4272-bfcd-790ad27785cf: '{"id":"0fc51790-c522-4f74-9d87-a93a08d252af","status":"COMPLETED","handler":"management-api"}'
    control.k8ssandra.io/job-91f0112a-5cf6-4e65-b889-54d6676d2d8d: '{"id":"c38a54e5-5a57-4d2e-9067-7957370466cf","status":"COMPLETED","handler":"management-api"}'
    control.k8ssandra.io/job-b7946ffd-2456-4544-bab8-c6f314c6fcc4: '{"id":"8b04c4f2-632b-429d-b678-aeeefe5489a6","status":"COMPLETED","handler":"management-api"}'
  creationTimestamp: "2021-12-13T18:29:52Z"
  generateName: test-dc2-default-sts-
We should be removing those once it is safe to do so.
Since the annotation refers to the CassandraTask uid, do you think it might provide more clarity to change the prefix to task, e.g., control.k8ssandra.io/task-b7946ffd-2456-4544-bab8-c6f314c6fcc4?
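A minimal sketch of what that cleanup could look like once it is safe to prune; the helper names are illustrative and assume the task's own status already records which pods completed:

package example

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// jobAnnotationPrefix matches the per-task annotations discussed above.
const jobAnnotationPrefix = "control.k8ssandra.io/job-"

// pruneJobAnnotation removes the annotation belonging to a single finished task.
func pruneJobAnnotation(pod *corev1.Pod, taskUID string) {
	delete(pod.Annotations, jobAnnotationPrefix+taskUID)
}

// pruneAllJobAnnotations sweeps away every accumulated job annotation, e.g.
// after all tracked tasks for the datacenter have completed.
func pruneAllJobAnnotations(pod *corev1.Pod) {
	for key := range pod.Annotations {
		if strings.HasPrefix(key, jobAnnotationPrefix) {
			delete(pod.Annotations, key)
		}
	}
}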
I actually need to add those annotations (so that all resources have those annotations).
  // We weren't resuming from a stopped state, so we must be growing the
- // size of the rack
+ // size of the rack and this isn't the initialization stage
Currently (in master) the scaling up condition does get set to true initially, before the datacenter is ready. Are there any scenarios in which the Ready condition could be false and scaling up would happen? I assume no, but want to be sure. If it is possible, would it be better/safer to handle this in the CheckClearActionConditions function? The function could be updated to only call cleanupAfterScaling after the Initialized condition is true.
No, there's no such thing. This is the only place where ScalingUp is set (and I don't think it should be set when the cluster is starting).
My main point is that ScalingUp is set before the datacenter becomes ready, or more precisely, before the Initialized condition is set. What is the rationale for changing that existing behavior?
Because creating a new cluster is not scaling up an existing one. I don't feel like these two should share the same status - especially if we have a different process for both cases.
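A minimal sketch of the distinction being argued here; the helper is illustrative and assumes the datacenter's Initialized condition can be read via GetConditionStatus:

// shouldMarkScalingUp is a sketch only: creating a brand-new cluster is not
// "scaling up" an existing one, so ScalingUp is set only after initialization.
func shouldMarkScalingUp(dc *api.CassandraDatacenter, desiredSize, currentSize int32) bool {
	initialized := dc.GetConditionStatus(api.DatacenterInitialized) == corev1.ConditionTrue
	return initialized && desiredSize > currentSize
}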
…e task from the targeted Datacenter
…cy issue with cached gets from api-server, add CassandraTask checks to scale_up e2e tests. Also, a small example CassandraTask
// TrackedTasks tracks the tasks for completion that were created by the cass-operator
// +optional
TrackedTasks []corev1.ObjectReference `json:"trackedTasks,omitempty"`
Since you have added the DC labels to the CassandraTask, is this necessary anymore? I see where it is used in reconcile_racks.go, but that could be replaced with a query to fetch the CassandraTask. There are a couple of benefits. First, it eliminates, and therefore simplifies, some code. Secondly, it cuts down on requeues, which is big. The status updates for a task will trigger a lot of unnecessary requeues, not only for cass-operator but also for k8ssandra-operator.
Status updates should not trigger a reconcile. That's a bug in the k8ssandra-operator if that happens and means the watcher is missing a version check.
TBH I forgot about GenerationChangedPredicate when I wrote my comment. Status updates won't trigger a reconcile for cass-operator, but they will for k8ssandra-operator since the latter does not use a GenerationChangedPredicate for its CassandraDatacenter watch. This is by design though. k8ssandra-operator doesn't use GenerationChangedPredicate since it does not manage the CassandraDatacenter. We do want status updates to trigger reconciliation in k8ssandra-operator. Aside from requeues (or lack thereof), I would still be in favor of removing TrackedTasks since the information is already available through the CassandraTask.
That information isn't actually available, since TrackedTasks only includes tasks generated by the cass-operator, while fetching with labels would give back all the tasks generated by anyone. At that point, I would need to know which ones were created before the operation I ran in the cass-operator and which after. I don't have that information at the moment.
// TODO Add conditions also
if err = r.Client.Status().Update(ctx, &cassTask); err != nil {
	return ctrl.Result{}, err
At one point I tested with a bad management-api image, which resulted in an error when taskConfig.AsyncFunc is called. The failure was due to the RPC endpoint not being properly exposed. The error happens against the first pod; subsequent pods are never tried. At some point the job should be marked failed. k8ssandra-operator ends up in a requeue loop waiting for the job to complete, but it never does.
It seems my reply never got here, odd. Well, the process catches failed jobs and exits with them correctly, and cass-operator should kill pods that are not healthy. However, you hit an edge case where the management-api is simply not working correctly but cass-operator is not killing it.
…unts to executed pod counts
What this PR does:
Runs jobs serially on every pod that is part of the Datacenter, using either the async or the sync functions (depending on the version of the Cassandra container).
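A minimal sketch of that per-pod flow as it is discussed in this review; the helper names (startJob, getJobDetails) and the exact control flow are assumptions, not the implementation:

// reconcileEveryPodTask sketch: run the job on one pod at a time, using the
// per-pod annotation to remember progress between reconciliations.
func (r *CassandraTaskReconciler) reconcileEveryPodTask(ctx context.Context, pods []corev1.Pod, taskUID string) (ctrl.Result, error) {
	for idx := range pods {
		pod := &pods[idx]
		if pod.Annotations == nil {
			pod.Annotations = make(map[string]string)
		}

		// If this task already started a job on the pod, poll its status.
		if podJobId, found := pod.Annotations[podJobIdAnnotation]; found {
			details, err := getJobDetails(ctx, pod, podJobId)
			if err != nil {
				return ctrl.Result{}, err
			}
			switch details.Status {
			case "COMPLETED":
				continue // done here, move on to the next pod
			case podJobError:
				// Log the error, move on
				continue
			default:
				// Still running; check again shortly.
				return ctrl.Result{RequeueAfter: 1 * time.Second}, nil
			}
		}

		// No job on this pod yet: start one (async or sync depending on the
		// management-api features) and requeue until it finishes.
		if err := startJob(ctx, pod, taskUID); err != nil {
			return ctrl.Result{}, err
		}
		return ctrl.Result{RequeueAfter: 1 * time.Second}, nil
	}
	// Every pod has completed; the task can be marked finished.
	return ctrl.Result{}, nil
}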
Which issue(s) this PR fixes:
Fixes #
Checklist