
Support for Kubernetes Jobs #1906

Merged: 26 commits into spotify:master, Feb 24, 2017

Conversation

mcapuccini
Support for Kubernetes Jobs

Description

I added a Task extension that enables running Jobs in a Kubernetes cluster.

Motivation and Context

This enables distributing tasks that ship as lightweight application containers across a Kubernetes cluster. There is a feature proposal: #1549.

Have you tested this? If so, how?

I have included unit tests. To run them locally you need a minikube cluster up and running.
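For readers new to the extension, here is a minimal sketch of the kind of Job spec such a task might hand to Kubernetes. The pi-computing Perl container mirrors the PerlPi example quoted later in this thread; the function name and exact schema are illustrative assumptions, not the extension's actual API.

```python
# Illustrative only: a container spec for a Kubernetes Job that computes
# pi with Perl, in the spirit of the PerlPi example discussed below. The
# dict follows the Kubernetes Job "containers" layout; names are assumed.
def perl_pi_spec():
    return {
        "containers": [{
            "name": "pi",
            "image": "perl",
            "command": ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"],
        }]
    }
```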

Contributor

@Tarrasch Tarrasch left a comment

I didn't look much at the actual code that matters. Can you find somebody with Kubernetes knowledge to review?

Also, please write about whether and how you've used this in production already. :)

}

if __name__ == "__main__":
luigi.run(['PerlPi', '--local-scheduler'])
Contributor

Let's remove this as it's discouraged nowadays.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
Contributor

Can you add module docs, similar to those of the execution summary example?

Contributor

Is the docstring below good?

Contributor

Yes. I think this is very good. :)

self.assertTrue(job.obj["status"]["failed"] > fail.max_retrials)

if __name__ == "__main__":
unittest.main()
Contributor

Remove the last 2 lines please. :)

@mcapuccini
Author

Hi @Tarrasch, thanks for this first review.

I talked to @pcm32. He doesn't have experience with Luigi, but he made a Kubernetes Job wrapper for Galaxy, so he could take a look at the code. However, he can't do this soon. If you know someone with experience in both Kubernetes and Luigi, that would work better, I think.

I haven't used this in production yet, but I will very soon: I will try to reproduce a scientific workflow in a cloud environment.

I have a question about the CI. I think I can fix some of the failing checks, but ultimately, to run the tests that I wrote, you need a minikube cluster local to Travis (or some other Kubernetes cluster). Is that feasible in your setup?

@Tarrasch
Contributor

Tarrasch commented Nov 3, 2016

As for CI, you can annotate the tests (like we do for hdfs and many other systems). But I think you can skip making an actual Travis build for them.

@tym-xqo
Contributor

tym-xqo commented Nov 22, 2016

We're using Luigi and Kubernetes in production at my shop. I'll try to take a look as time permits.

@mcapuccini
Author

Thanks @tym-xqo, I would really appreciate that. I will report back if I succeed in using this in production too.

Contributor

@Tarrasch Tarrasch left a comment

Sorry. I previously mis-clicked approve ...

@henryrizzi
Contributor

Hello, I'm at the same shop as @tym-xqo.
I'm currently taking a look at this and trying to think of more test situations, but it has worked as I expected so far. I'll update back here with any concerns or problems I run into.

Contributor

@henryrizzi henryrizzi left a comment

I was able to get this to work for a few trivial tests, but these changes would be appreciated. Also, I think the task should be required to define an output, to fit it more easily into a workflow, e.g. by including def output(self): raise NotImplementedError() (unless that seems unreasonable).

A name for this job. This task will automatically append a UUID to the
name before submitting it to Kubernetes.
"""
pass
Contributor

I would probably also have this as raise NotImplementedError("subclass must define name"), or you can keep both this and the one below as pass.
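The UUID-suffixed naming described in the docstring above can be sketched in plain Python. The helper name make_uu_name is hypothetical; the real task exposes the result as uu_name:

```python
import uuid

def make_uu_name(name):
    # Append a hex UUID so every run submits a uniquely named Job object,
    # as the docstring above describes (helper name is illustrative).
    return "%s-%s" % (name, uuid.uuid4().hex)
```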

self.__logger.debug("Kubernetes job " + self.uu_name + " is still running")
time.sleep(self.__POLL_TIME)
if(self.__get_job_status() == "succeeded"):
self.__logger.info("Kubernetes job " + self.uu_name + " succeeded")
Contributor

I think it would make sense to touch some sort of output at this point to signal job completion.

Contributor

Maybe something along the lines of:

with self.output().open('w') as output_file:
    output_file.write('')

Just to touch the required output file.
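Outside of Luigi, the touch-a-marker idea suggested here boils down to creating an empty file whose existence signals completion. A minimal stand-alone sketch, with a plain file path standing in for self.output():

```python
import os

def signal_done(marker_path):
    # Create an empty marker file; its existence signals job completion,
    # mirroring the self.output().open('w') suggestion above.
    with open(marker_path, "w") as output_file:
        output_file.write('')

def is_done(marker_path):
    return os.path.exists(marker_path)
```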

@mcapuccini
Author

Thanks a lot for testing and reviewing! I will wrap this up soon and get back to you. I am also working on a more complex analysis for a bioinformatics paper; I suggest this PR gets merged after I complete that, so that if I find something missing I can add it.

@henryrizzi
Contributor

After looking into it more, it was a configuration issue on my part. I'm getting some issues when trying to require a k8s task from another task, but I think that's expected without the task having an output. :)

Contributor

@henryrizzi henryrizzi left a comment

These changes would be helpful for ascertaining job completion.
If this is added, I would also add an output to the tests, otherwise they will fail.

job.scale(replicas=0) # avoid more retrials
return "failed"
return "running"
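Pieced together from the excerpts in this thread, the status decision could look roughly like the sketch below; the function name and dict shape are illustrative reconstructions, not the exact code under review:

```python
def job_phase(status, max_retrials):
    # Succeeded pods win outright; too many failed pods means "failed"
    # (the caller then scales replicas to 0 to avoid more retrials);
    # anything else counts as still "running".
    if status.get("succeeded"):
        return "succeeded"
    if status.get("failed", 0) > max_retrials:
        return "failed"
    return "running"
```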

Contributor

I think adding this to the method would be really helpful; it fixes an error I had while testing (there was no way to tell that the k8s job is complete).

def output(self):
    """Implement an output to allow for dependency chaining."""
    raise NotImplementedError("Subclass must define output")


@henryrizzi
Contributor

henryrizzi commented Jan 3, 2017

The central scheduler won't behave correctly when the task's __init__ method is overridden. I ran into this issue earlier, but thought it was a configuration issue on my part.
The following changes worked for me:

  • change def __init__(self, *args, **kwargs): to def initialize_k8s_job(self):
  • take out the first line calling super
  • put self.initialize_k8s_job() at the beginning of the run method

You could accomplish this a different way, but this will allow it to run with either the local scheduler or the central scheduler.
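Stripped of Luigi specifics, the refactor described above amounts to moving per-run setup out of the constructor. A minimal sketch with illustrative names:

```python
class K8sTaskSketch(object):
    # Sketch of the suggested refactor: the central scheduler re-creates
    # tasks, so per-run setup is deferred from __init__ to a helper that
    # run() calls first (class and attribute names are illustrative).
    def initialize_k8s_job(self):
        self.auth_method = "service-account"

    def run(self):
        self.initialize_k8s_job()  # setup formerly done in __init__
        return self.auth_method
```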

@henryrizzi
Contributor

@mcapuccini Not sure if you have the time to work on this right now, so I addressed the changes that @Tarrasch and I requested here.
Would it be preferable to open a pull request against your fork (a fork of a fork), or open another pull request against master referencing this one?

No rush on this, just thought I'd do my best to help things along. 😄

@mcapuccini
Author

mcapuccini commented Jan 5, 2017

@henryrizzi thanks for your review and comments, I really appreciate it. I added you as a collaborator to my fork, so you can add your improvements to this PR straight away. At the moment I have other, higher-priority tasks to work on, but in a couple of weeks I'll be able to test this on a real use case.

@henryrizzi
Contributor

@mcapuccini Thanks for adding me as a collaborator to your fork and for making the PR!
I just put in a pull request implementing some of the suggested changes. Feel free to ask questions or change anything in my code that you think is weird. I'll be testing it in more real-world use cases as well, so I'll update that PR if I notice anything strange.

@Tarrasch
Contributor

I believe I replied to all your comments. The last change (which you nicely reminded me of) is to rename the files. Just name them like:

  • luigi/contrib/kubernetes.py
  • test/contrib/kubernetes_test.py
  • examples/kubernetes.py (here you have a bit more freedom)

Does that sound reasonable? Also, the config class you'll create should be called kubernetes, which will automatically make the config section [kubernetes].
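If I understand the naming suggestion, users would then configure the section like this (the keys shown are illustrative guesses, not a confirmed schema):

```
[kubernetes]
auth_method=service-account
max_retrials=3
```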

@mcapuccini
Author

@henryrizzi will you take care of the latest change requests, or shall I do it?

@henryrizzi
Contributor

henryrizzi commented Jan 20, 2017

I can take care of the latest changes and put in another pull request to your branch. 👍

@Tarrasch
Contributor

Tarrasch commented Feb 2, 2017

This looks ready to merge except for that Travis is red. Once that's fixed I'm ok with this getting merged. :)

@mcapuccini
Author

@Tarrasch I am doing some tests on a real pipeline these days. There are some things to be fixed. Next week I'll be at a conference, so it will be ready to merge in a couple of weeks.

@mcapuccini
Author

Waiting for @henryrizzi to review the latest changes before I get them merged. I have successfully run my workflow (https://github.com/phnmnl/jupyter-demo/blob/master/preprocessing_workflow.py) in a real k8s cluster.

@colemanja91
Contributor

@mcapuccini Thanks for this! I'm at the beginning of trying to run tasks on an OpenShift cluster and this will help a ton. Just curious what your thoughts are from a design/implementation perspective:
Are you running the KubernetesJobTask from within a Kube container itself? If so, I'd be curious to hear about any challenges/benefits of doing that.

@mcapuccini
Author

mcapuccini commented Feb 20, 2017

@colemanja91 yes, I run Luigi inside a container. What I like a lot is running a custom Jupyter image where I can edit and run my Luigi workflows. This is not challenging at all; you just need to use the service-account authentication method when setting up Luigi in your container.

I am very soon going to integrate Luigi into KubeNow to enable data science pipelines on top of it.

@mcapuccini
Author

@Tarrasch I am quite confident that the build is going to pass this time. Then it should be ready to be merged IMO 🙂

@Tarrasch Tarrasch merged commit 6ce9708 into spotify:master Feb 24, 2017
@Tarrasch
Contributor

Thanks!

@apierleoni
Contributor

apierleoni commented Feb 24, 2017

@mcapuccini great job here, thanks!
I have a question regarding watching the job status. This implementation periodically queries the API to get the status. Do you know if this will overload the Kubernetes API if too many parallel jobs are executed?

There is a "watch" method in the Kubernetes API that returns status updates as a stream, which might be useful to reduce both the number of calls and the delay in getting a response.
Not sure whether it is available in pykube, but it is already implemented here:
http://python-kube.readthedocs.io/en/latest/reference.html#resourcewatcher
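For contrast, the polling approach being discussed reduces to a loop like the one below. This is a stand-alone sketch, not the PR's code: the status callable is injected so it runs without a cluster.

```python
import time

def wait_for_job(get_status, poll_interval=0.01):
    # Poll-based completion check: ask for the Job status every
    # poll_interval seconds until it reaches a terminal state. A
    # watch-based version would instead block on a status stream.
    while True:
        status = get_status()
        if status in ("succeeded", "failed"):
            return status
        time.sleep(poll_interval)
```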

@mcapuccini
Author

@apierleoni I tested it with 40 parallel Jobs with no problems.

The watch method is interesting; if someone reports problems with polling the Kubernetes API, we should change the implementation.

@apierleoni
Contributor

apierleoni commented Feb 24, 2017 via email
