
Support for Kubernetes Jobs #1906

Merged: 26 commits into spotify:master, Feb 24, 2017

Conversation

mcapuccini
Support for Kubernetes Jobs

Description

I added a Task extension that enables running Jobs in a Kubernetes cluster.

Motivation and Context

This enables distributing tasks that ship as lightweight application containers across a Kubernetes cluster. There is a feature proposal: #1549.

Have you tested this? If so, how?

I have included unit tests. To run them locally you need a minikube cluster up and running.
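For readers new to the extension, here is a minimal sketch of the kind of Job spec such a task might hand to Kubernetes. The pi-computing Perl container mirrors the PerlPi example quoted later in this thread; the function name and exact schema are illustrative assumptions, not the extension's actual API.

```python
# Illustrative only: a container spec for a Kubernetes Job that computes
# pi with Perl, in the spirit of the PerlPi example discussed below. The
# dict follows the Kubernetes Job "containers" layout; names are assumed.
def perl_pi_spec():
    return {
        "containers": [{
            "name": "pi",
            "image": "perl",
            "command": ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"],
        }]
    }
```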

Contributor

@Tarrasch Tarrasch left a comment

I didn't look much at the actual code that matters. Can you find somebody with Kubernetes knowledge to review?

Also, please write about whether and how you've used this in production already. :)

}

if __name__ == "__main__":
luigi.run(['PerlPi', '--local-scheduler'])
Contributor

Let's remove this as it's discouraged nowadays.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
Contributor

Can you add module docs, similar to those of the execution summary example?

Contributor

Is the docstring below good?

Contributor

Yes. I think this is very good. :)

self.assertTrue(job.obj["status"]["failed"] > fail.max_retrials)

if __name__ == "__main__":
unittest.main()
Contributor

Remove the last 2 lines please. :)

@mcapuccini
Author

Hi @Tarrasch, thanks for this first review.

I talked to @pcm32. He doesn't have experience with Luigi, but he made a Kubernetes Job wrapper for Galaxy, so he could take a look at the code. However, he can't do this soon. If you know someone with experience in both Kubernetes and Luigi, that would work better, I think.

I haven't used this in production yet, but I will very soon: I will try to reproduce a scientific workflow in a cloud environment.

I have a question about the CI. I think I can fix some of the failing checks, but ultimately, to run the tests that I wrote, you need a minikube cluster local to Travis (or some other Kubernetes cluster). Is that feasible in your setup?

@Tarrasch
Contributor

Tarrasch commented Nov 3, 2016

As for CI, you can annotate the tests (like we do for hdfs and many other systems). But I think you can skip making an actual Travis build for them.

@tym-xqo
Contributor

tym-xqo commented Nov 22, 2016

We're using Luigi and Kubernetes in production at my shop. I'll try to take a look as time permits.

@mcapuccini
Author

Thanks @tym-xqo, I would really appreciate that. I will report back if I succeed in using this in production too.

Contributor

@Tarrasch Tarrasch left a comment

Sorry. I previously mis-clicked approve ...

@henryrizzi
Contributor

Hello, I'm at the same shop as @tym-xqo.
I'm currently taking a look at this and trying to think of more test situations, but it has worked as I expected so far. I'll update back here with any concerns or problems I run into.

Contributor

@henryrizzi henryrizzi left a comment

I was able to get this to work for a few trivial tests, but these changes would be appreciated. Also, I think the task should be required to define an output, to fit it more easily into a workflow, e.g. by including def output(self): raise NotImplementedError() (unless that seems unreasonable).

A name for this job. This task will automatically append a UUID to the
name before submitting it to Kubernetes.
"""
pass
Contributor

I would probably also have this as raise NotImplementedError("subclass must define name"), or you can keep both this and the one below as pass.
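The UUID-suffixed naming described in the docstring above can be sketched in plain Python. The helper name make_uu_name is hypothetical; the real task exposes the result as uu_name:

```python
import uuid

def make_uu_name(name):
    # Append a hex UUID so every run submits a uniquely named Job object,
    # as the docstring above describes (helper name is illustrative).
    return "%s-%s" % (name, uuid.uuid4().hex)
```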

self.__logger.debug("Kubernetes job " + self.uu_name + " is still running")
time.sleep(self.__POLL_TIME)
if(self.__get_job_status() == "succeeded"):
self.__logger.info("Kubernetes job " + self.uu_name + " succeeded")
Contributor

I think it would make sense to touch some sort of output at this point to signal job completion.

Contributor

Maybe something along the lines of:

with self.output().open('w') as output_file:
    output_file.write('')

Just to touch the required output file.
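Outside of Luigi, the touch-a-marker idea suggested here boils down to creating an empty file whose existence signals completion. A minimal stand-alone sketch, with a plain file path standing in for self.output():

```python
import os

def signal_done(marker_path):
    # Create an empty marker file; its existence signals job completion,
    # mirroring the self.output().open('w') suggestion above.
    with open(marker_path, "w") as output_file:
        output_file.write('')

def is_done(marker_path):
    return os.path.exists(marker_path)
```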

@mcapuccini
Author

Thanks a lot for testing and reviewing! I will wrap this up soon and get back to you. I am also working on a more complex analysis for a bioinformatics paper; I suggest this PR gets merged after I complete that, so that if I find something missing I can add it.

@henryrizzi
Contributor

After looking into it more, it was a configuration issue on my part. I'm getting some issues when trying to require a k8s task from another task, but I think that's expected without the task having an output. :)

Contributor

@henryrizzi henryrizzi left a comment

These changes would be helpful for ascertaining job completion.
If this is added, I would also add an output to the tests, otherwise they will fail.

job.scale(replicas=0) # avoid more retrials
return "failed"
return "running"
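Pieced together from the excerpts in this thread, the status decision could look roughly like the sketch below; the function name and dict shape are illustrative reconstructions, not the exact code under review:

```python
def job_phase(status, max_retrials):
    # Succeeded pods win outright; too many failed pods means "failed"
    # (the caller then scales replicas to 0 to avoid more retrials);
    # anything else counts as still "running".
    if status.get("succeeded"):
        return "succeeded"
    if status.get("failed", 0) > max_retrials:
        return "failed"
    return "running"
```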

Contributor

I think adding this to the method would be really helpful; it fixes an error I had while testing (there was no way to tell that the k8s job is complete).

def output(self):
    """Implement an output to allow for dependency chaining."""
    raise NotImplementedError("Subclass must define output")


@henryrizzi
Contributor

henryrizzi commented Jan 3, 2017

The central scheduler won't behave correctly when the task's __init__ method is overridden. I ran into this issue earlier, but thought it was a configuration issue on my part.
The following changes worked for me:

  • change def __init__(self, *args, **kwargs): to def initialize_k8s_job(self):
  • take out the first line calling super
  • put self.initialize_k8s_job() at the beginning of the run method

You could accomplish this a different way, but this will allow it to run with either the local scheduler or the central scheduler.
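Stripped of Luigi specifics, the refactor described above amounts to moving per-run setup out of the constructor. A minimal sketch with illustrative names:

```python
class K8sTaskSketch(object):
    # Sketch of the suggested refactor: the central scheduler re-creates
    # tasks, so per-run setup is deferred from __init__ to a helper that
    # run() calls first (class and attribute names are illustrative).
    def initialize_k8s_job(self):
        self.auth_method = "service-account"

    def run(self):
        self.initialize_k8s_job()  # setup formerly done in __init__
        return self.auth_method
```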

@henryrizzi
Contributor

@mcapuccini Not sure if you have the time to work on this right now, so I addressed the changes that @Tarrasch and I requested here.
Would it be preferable to open a pull request against your fork (a fork of a fork), or open another pull request against master referencing this one?

No rush on this, just thought I'd do my best to help things along. 😄

@mcapuccini
Author

mcapuccini commented Jan 5, 2017

@henryrizzi thanks for your review and comments, I really appreciate it. I added you as a collaborator to my fork, so you can add your improvements to this PR straight away. At the moment I have other, higher-priority tasks to work on, but in a couple of weeks I'll be able to test this on a real use case.

@henryrizzi
Contributor

@mcapuccini Thanks for adding me as a collaborator to your fork and for making the PR!
I just put in a pull request implementing some of the suggested changes. Feel free to ask questions or change anything in my code that you think is weird. I'll be testing it in more real-world use cases as well, so I'll update that PR if I notice anything strange.

@Tarrasch
Contributor

I believe I replied to all your comments. The last change (which you nicely reminded me of) is to rename the files. Just name them like:

  • luigi/contrib/kubernetes.py
  • test/contrib/kubernetes_test.py
  • examples/kubernetes.py (here you have a bit more freedom)

Does that sound reasonable? Also, the config class you'll create should be called kubernetes, which will automatically make the config section [kubernetes].
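If I understand the naming suggestion, users would then configure the section like this (the keys shown are illustrative guesses, not a confirmed schema):

```
[kubernetes]
auth_method=service-account
max_retrials=3
```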

@mcapuccini
Author

@henryrizzi will you take care of the latest change requests, or shall I do it?

@henryrizzi
Contributor

henryrizzi commented Jan 20, 2017

I can take care of the latest changes and put in another pull request to your branch. 👍

@Tarrasch
Contributor

Tarrasch commented Feb 2, 2017

This looks ready to merge except for that Travis is red. Once that's fixed I'm ok with this getting merged. :)

@mcapuccini
Author

@Tarrasch I am doing some tests on a real pipeline these days. There are some things to be fixed. Next week I'll be at a conference, so it will be ready to merge in a couple of weeks.

@mcapuccini
Author

Waiting for @henryrizzi to review the latest changes before I get them merged. I have successfully run my workflow (https://github.com/phnmnl/jupyter-demo/blob/master/preprocessing_workflow.py) in a real k8s cluster.

@colemanja91
Contributor

@mcapuccini Thanks for this! I'm at the beginning of trying to run tasks on an OpenShift cluster and this will help a ton. Just curious what your thoughts are from a design/implementation perspective:
Are you running the KubernetesJobTask from within a Kube container itself? If so, I'd be curious to hear about any challenges/benefits of doing that.

@mcapuccini
Author

mcapuccini commented Feb 20, 2017

@colemanja91 yes, I run Luigi inside a container. What I like a lot is running a custom Jupyter image where I can edit and run my Luigi workflows. This is not challenging at all; you just need to use the service-account authentication method when setting up Luigi in your container.

I am very soon going to integrate Luigi into KubeNow to enable data science pipelines on top of it.

@mcapuccini
Author

@Tarrasch I am quite confident that the build is going to pass this time. Then it should be ready to be merged IMO 🙂

@Tarrasch Tarrasch merged commit 6ce9708 into spotify:master Feb 24, 2017
@Tarrasch
Contributor

Thanks!

@apierleoni
Contributor

apierleoni commented Feb 24, 2017

@mcapuccini great job here, thanks!
I have a question regarding watching the job status. This implementation periodically queries the API to get the status. Do you know if this will overload the Kubernetes API if too many parallel jobs are executed?

There is a "watch" method in the Kubernetes API that returns status updates as a stream, which might be useful to reduce both the number of calls and the delay in getting a response.
Not sure whether it is available in pykube, but it is already implemented here:
http://python-kube.readthedocs.io/en/latest/reference.html#resourcewatcher
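For contrast, the polling approach being discussed reduces to a loop like the one below. This is a stand-alone sketch, not the PR's code: the status callable is injected so it runs without a cluster.

```python
import time

def wait_for_job(get_status, poll_interval=0.01):
    # Poll-based completion check: ask for the Job status every
    # poll_interval seconds until it reaches a terminal state. A
    # watch-based version would instead block on a status stream.
    while True:
        status = get_status()
        if status in ("succeeded", "failed"):
            return status
        time.sleep(poll_interval)
```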

@mcapuccini
Author

@apierleoni I tested it with 40 parallel Jobs with no problems.

The watch method is interesting; if someone reports problems with polling the Kubernetes API, we should change the implementation.

@apierleoni
Contributor

apierleoni commented Feb 24, 2017 via email
