slurm: environment improvements #371

Closed
1 task
d4l3k opened this issue Jan 21, 2022 · 0 comments
Labels: enhancement (New feature or request), module: runner (issues related to the torchx.runner and torchx.scheduler modules), slurm (slurm scheduler)

d4l3k commented Jan 21, 2022

The current slurm scheduler specifies the working directory via the image field of the Role. This doesn't match how any of the other schedulers work, since local_cwd has been switched to use the current working directory. We should update the slurm scheduler to be more in line with the other schedulers.

  • make slurm_scheduler use cwd instead of the image path

The current behavior was based on the assumption that users would specify the image as part of the args, but that doesn't match how users use TorchX today.

Slurm also automatically inherits the local conda/virtualenv, so not much needs to be done here.

It would be nice to also support logs, though.
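
The proposed switch from the image path to the current working directory can be sketched as follows. This is only an illustration under stated assumptions: `Role` here is a hypothetical stand-in dataclass, not the real TorchX class, and `workdir_from_image` and `workdir_from_cwd` are made-up helper names rather than anything in the slurm scheduler.

```python
import os
from dataclasses import dataclass


@dataclass
class Role:
    """Hypothetical stand-in for a role; not the real TorchX Role class."""

    name: str
    image: str


def workdir_from_image(role: Role) -> str:
    # Behavior described in this issue: the image field doubles as the
    # working directory for the job.
    return role.image


def workdir_from_cwd(role: Role) -> str:
    # Proposed behavior: ignore the image field and use the directory
    # the user launched from, as the issue says local_cwd now does.
    return os.getcwd()
```

With the second helper, a role launched from a project directory would run there regardless of what its image field contains.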

d4l3k added the enhancement, module: runner, and slurm labels Jan 21, 2022
facebook-github-bot pushed a commit that referenced this issue Jan 26, 2022
…#373)

Summary:
This adds `torchx log` support to the slurm scheduler. #371

It writes out one `slurm-<jobid>-<role>-<replica_id>.out` file per worker containing the combined stdout/stderr. We may want to split these in the future, but for now it's fine. These files are written to the cwd, per normal slurm behavior, so `torchx log` will stop working if the user changes their working directory.

Pull Request resolved: #373

Test Plan:
```
pytest torchx/schedulers/test/slurm_scheduler_test.py
scripts/slurmint.sh
```
Updated integration test to use `torchx log` instead of `slurm-<id>.out`

Reviewed By: kiukchung

Differential Revision: D33755503

Pulled By: d4l3k

fbshipit-source-id: b6944822c406d66318181184dadff3038e68d0c9
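
The log file naming described in the commit summary above can be sketched as below. This is a minimal illustration, not the code from the commit: `slurm_log_path` and its parameters are hypothetical names, and the only detail taken from the summary is the `slurm-<jobid>-<role>-<replica_id>.out` pattern and the fact that the files live in the working directory.

```python
import os
from typing import Optional


def slurm_log_path(
    job_id: str, role: str, replica_id: int, cwd: Optional[str] = None
) -> str:
    """Build the path of one worker's combined stdout/stderr log.

    Hypothetical helper; the filename pattern comes from the commit summary.
    """
    # The files are written relative to the working directory, so the
    # lookup has to resolve against that same directory.
    base = os.getcwd() if cwd is None else cwd
    return os.path.join(base, f"slurm-{job_id}-{role}-{replica_id}.out")
```

Because the default base is whatever `os.getcwd()` returns at lookup time, calling the helper from a different directory than the one the job was submitted from yields a path that does not exist, which mirrors the caveat in the summary about `torchx log` no longer finding the files.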