-
Hi @jkitchin! Good to hear from you.
I agree that should be fairly doable. Out of curiosity, what's the motivation for using a Kubernetes cluster?
I haven't used K8s very much, so I can't say much there specifically. I know that redun has a Kubernetes executor (as does Parsl), so it could be worthwhile to see how they approach things, since I imagine that there might be ideas that carry over to what you are looking to do. In the context of quacc specifically, I recently added an interface to redun, so just using that is one option.

At least within the context of this repo, I've purposefully tried to avoid doing anything daemon-related, with the idea that one of the several supported workflow engines would handle that far more robustly than I would. That said, I think for the K8s approach you're thinking about, you probably would need some kind of long-running service. Happy to hear other thoughts.
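For what it's worth, here is a rough sketch of what the redun route could look like, assuming `WORKFLOW_ENGINE` is set to `redun` in the quacc settings (the EMT relaxation recipe is just a stand-in for whatever you'd actually run):

```python
from ase.build import bulk
from redun import Scheduler

from quacc.recipes.emt.core import relax_job

# With WORKFLOW_ENGINE set to "redun", quacc recipes behave as redun
# tasks, so calling one builds a lazy expression rather than running it.
atoms = bulk("Cu")
expression = relax_job(atoms)

# The redun scheduler resolves the expression. With a Kubernetes executor
# configured on the redun side, the task could run in a pod instead of
# locally.
scheduler = Scheduler()
result = scheduler.run(expression)
```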
-
The main motivation is that we have a Kubernetes cluster that we already use. How does it work if you submit a job to a queue system? What stays running to check on when a job is done?
-
That depends entirely on the workflow manager that you choose to use. Because everyone's needs are different, I specifically wanted to avoid enforcing any one approach. Most of the approaches involve some long-running server/daemon that will periodically poll the queuing system. But all of that logic is intentionally kept isolated from the details of quacc. The relevant details are summarized in the "Deploying Calculations" section of the documentation.
-
I think the gist of this is something like the following pseudocode, where `submit` and `is_done` are placeholders for however you talk to the queue:
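```python
import time

def submit(job) -> str:
    """Placeholder: launch the job (write inputs, run the shell command)
    and return some kind of job ID."""
    ...

def is_done(job_id: str) -> bool:
    """Placeholder: ask the queue/cluster whether the job has finished."""
    ...

def run_and_wait(job, interval: float = 30.0, timeout: float = 3600.0) -> str:
    """Submit a job, then block until it finishes or the timeout expires."""
    job_id = submit(job)
    start = time.time()
    while not is_done(job_id):
        if time.time() - start > timeout:
            raise TimeoutError(f"Job {job_id} did not finish in {timeout} s")
        time.sleep(interval)  # polling interval
    return job_id
```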
You have to decide how often to poll and whether there is a timeout. I still don't understand how else this could work.
-
We have a way to run VASP jobs by launching pods on a Kubernetes cluster. Ultimately, it comes down to writing out the input files, creating a YAML file, and then running a shell command to start it. That part, I think, would not be too hard to set up.
What I am not sure about is what happens after the job is launched. How do you check if the job is done so you can get results from it? Do you need some kind of long-running daemon to keep something from returning until the job is done?
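For concreteness, the launching part looks roughly like this (the manifest and job names here are made up):

```python
import subprocess

# After writing the VASP input files, create a Job manifest and launch it.
# "vasp-job.yaml" and "vasp-job" are placeholder names.
subprocess.run(["kubectl", "apply", "-f", "vasp-job.yaml"], check=True)

# The part I'm unsure about: what stays alive to notice completion?
# Polling the Job status, e.g.,
#   kubectl get job vasp-job -o jsonpath='{.status.succeeded}'
# or blocking in place with
#   kubectl wait --for=condition=complete job/vasp-job --timeout=3600s
# both seem to require *something* that keeps running until the job is done.
```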