Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds job cancellation of flux jobs #22

Merged
merged 4 commits into from
Nov 4, 2021

Conversation

xyloid
Copy link

@xyloid xyloid commented Nov 1, 2021

This PR adds flux job cancellation function to kubeflux scheduler which enables kubeflux to cancel a flux job after it's corresponding pod completed (Its PodPhase can be either PodSucceeded or PodFailed). Thus the previously allocated resources can be freed and reused.

  • Add PodInformer to watch pod event. Job cancellation is implemented in the update event handler.
  • Each pod's jobspec yaml is stored in its own yaml file instead of being overwritten when a new pod is being scheduled.

Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass please squash your commits into functional changes, to take down from current 33.

also a nit, some files need EOF clean up

examples/pi/.gitignore Outdated Show resolved Hide resolved
examples/pi/Dockerfile Outdated Show resolved Hide resolved
examples/pi/Dockerfile.segfault Outdated Show resolved Hide resolved
examples/pi/clean_pods.yaml Outdated Show resolved Hide resolved
examples/pi/demo-pi-job.yaml Outdated Show resolved Hide resolved
examples/pi/init_kind_cluster.sh Outdated Show resolved Hide resolved
examples/pi/pi-job-default.yaml Outdated Show resolved Hide resolved
examples/pi/pi-job-kubeflux-segfault.yaml Outdated Show resolved Hide resolved
examples/pi/pi-job-kubeflux.yaml Outdated Show resolved Hide resolved
examples/pi/pi.yaml Outdated Show resolved Hide resolved
@xyloid xyloid force-pushed the yilin/flux-job-cancellation branch 7 times, most recently from 479957e to a5ca5a3 Compare November 1, 2021 20:46
@dongahn dongahn self-requested a review November 1, 2021 23:24
@dongahn
Copy link
Member

dongahn commented Nov 1, 2021

I just added myself as a reviewer. Welcome @xyloid!!

Two high level comments first:

  1. Please add a high level summary of what this PR is about. Reviewing individual commits without the high level synapsis makes it hard to assess whether this PR has a good transactional topic or not. (As an example you can look at recently merged PRs in other repos like flux-core, flux-sched, flux-accounting -- e.g., Add sched.feasibility service for flux-core's jobspec feasibility validator  flux-sched#871)
  2. As part of that, please comment why your PR includes @milroy's. Is that because a WIP PR was used to create this PR.

Flux's RFC1 on code-development workflow (called C4 which is a variant of fork/pull model) might be useful if you haven't reviewed it yet.

Copy link
Member

@dongahn dongahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my posting in the conversation. Thanks.

@xyloid
Copy link
Author

xyloid commented Nov 1, 2021

I just added myself as a reviewer. Welcome @xyloid!!

Two high level comments first:

1. Please add a high level summary of what this PR is about. Reviewing individual commits without the high level synapsis makes it hard to assess whether this PR has a good transactional topic or not. (As an example you can look at recently merged PRs in other repos like flux-core, flux-sched, flux-accounting -- e.g., [Add `sched.feasibility` service for flux-core's jobspec feasibility validator  flux-sched#871](https://github.com/flux-framework/flux-sched/pull/871))

2. As part of that, please comment why your PR includes @milroy's. Is that because a WIP PR was used to create this PR.

Flux's RFC1 on code-development workflow (called C4 which is a variant of fork/pull model) might be useful if you haven't reviewed it yet.

Thank you for the comments! I will work on comment 1. As for comment 2, I think I mistakenly clicked fetch upstream on my github. Otherwise, I don't know why those commits show up in this branch. Let me try if I can remove them from this branch.

@xyloid xyloid force-pushed the yilin/flux-job-cancellation branch from a5ca5a3 to 29ccee7 Compare November 2, 2021 00:21
@dongahn
Copy link
Member

dongahn commented Nov 2, 2021

As for comment 2, I think I mistakenly clicked fetch upstream on my github. Otherwise, I don't know why those commits show up in this branch. Let me try if I can remove them from this branch.

I think those commits disappeared now!!

I don't know exactly how this state was created. But I typically use the following sequence before post a PR. In your forked dev branch:

% git remote add upstream git@github.com:flux-framework/flux-k8s.git (if you haven't done)
% git fetch upstream
% git rebase upstream/master

This will allow your changed to be on top of the head of the upstream master.

@xyloid xyloid force-pushed the yilin/flux-job-cancellation branch from 29ccee7 to 881626d Compare November 2, 2021 01:37
@xyloid xyloid changed the title Implemented job cancellation of flux jobs Adds job cancellation of flux jobs Nov 2, 2021
@xyloid xyloid force-pushed the yilin/flux-job-cancellation branch from 881626d to d57bb44 Compare November 2, 2021 02:21
@xyloid
Copy link
Author

xyloid commented Nov 2, 2021

As for comment 2, I think I mistakenly clicked fetch upstream on my github. Otherwise, I don't know why those commits show up in this branch. Let me try if I can remove them from this branch.

I think those commits disappeared now!!

I don't know exactly how this state was created. But I typically use the following sequence before post a PR. In your forked dev branch:

% git remote add upstream git@github.com:flux-framework/flux-k8s.git (if you haven't done)
% git fetch upstream
% git rebase upstream/master

This will allow your changed to be on top of the head of the upstream master.

I used git rebase -i HEAD~n to drop those commits. Thank you for the help with git! I guess I clicked something wrong in github webpage that related to branch history. Next time I will use command line.

Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a short discussion with @cmisale , I think this PR must be spitted out into 2
1- the actual changes to the go code
2- the example for the pi calculation

@xyloid xyloid force-pushed the yilin/flux-job-cancellation branch from d57bb44 to cd1576e Compare November 3, 2021 00:25
Previously, all jobspec are written to the same jobspec.yaml.
As a result, only the most recent jobspec.yaml is saved.
For debug purpose, in this commit, each pod's jobspec is
saved in its own file.
A PodInformer is added in kubeflux. For each update pod
event, if the pod is in PodSucceeded PodPhase, then kubeflux
will try to find its corresponding jobid. If jobid exists,
then kubeflux will try to cancel it in flux.
In the update pod event handler, a pod in PodFailed
PodPhase will also trigger job cancellation in order
to free the resources of a failed pod.
@xyloid xyloid force-pushed the yilin/flux-job-cancellation branch from cd1576e to 14c4edc Compare November 3, 2021 00:29
@xyloid
Copy link
Author

xyloid commented Nov 3, 2021

After a short discussion with @cmisale , I think this PR must be spitted out into 2 1- the actual changes to the go code 2- the example for the pi calculation

I have created a new PR(#23) for the pi calculation, and now this PR only contains go code changes.

flux-plugin/kubeflux/kubeflux.go Outdated Show resolved Hide resolved
flux-plugin/kubeflux/kubeflux.go Show resolved Hide resolved
flux-plugin/kubeflux/kubeflux.go Outdated Show resolved Hide resolved
Co-authored-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
@cmisale cmisale merged commit 0a111e6 into flux-framework:main Nov 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants