slurm_scheduler: add support for per replica log files and API access #373

Closed
wants to merge 1 commit into from

d4l3k
Member

@d4l3k d4l3k commented Jan 24, 2022

This adds torchx log support to the slurm scheduler. #371

It writes one slurm-<jobid>-<role>-<replica_id>.out file per worker containing the combined stdout and stderr. We may want to split the two streams in the future, but for now this is fine. The files are written to the cwd, per normal slurm behavior, so torchx log will stop working if the user changes their working directory.
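
As an illustration of the naming scheme described above, here is a minimal sketch of how a per-replica log file could be located and how a file name could be mapped back to its job, role, and replica. The helper names `replica_log_path` and `parse_replica_log_name` are hypothetical and are not taken from this PR or from the TorchX API; only the slurm-<jobid>-<role>-<replica_id>.out pattern comes from the description.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical helpers for illustration only; not the TorchX implementation.
# The file name pattern follows the PR description:
#   slurm-<jobid>-<role>-<replica_id>.out
_LOG_NAME = re.compile(
    r"^slurm-(?P<job_id>\d+)-(?P<role>.+)-(?P<replica_id>\d+)\.out$"
)


def replica_log_path(
    job_id: str, role: str, replica_id: int, log_dir: Optional[Path] = None
) -> Path:
    """Build the expected log path for one replica.

    When log_dir is omitted, the path is resolved against the current
    working directory, mirroring the cwd dependency noted above.
    """
    base = log_dir if log_dir is not None else Path.cwd()
    return base / f"slurm-{job_id}-{role}-{replica_id}.out"


def parse_replica_log_name(name: str) -> Optional[Tuple[str, str, int]]:
    """Split a log file name into (job_id, role, replica_id), or None."""
    match = _LOG_NAME.match(name)
    if match is None:
        return None
    return match["job_id"], match["role"], int(match["replica_id"])


if __name__ == "__main__":
    path = replica_log_path("1234", "trainer", 0, Path("/tmp/run"))
    print(path)
    print(parse_replica_log_name(path.name))
```

Because the default base directory is whatever `Path.cwd()` returns at lookup time, a lookup made from a different directory than the one the job was submitted from would point at a file that does not exist, which is the limitation the description calls out.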

Test plan:

pytest torchx/schedulers/test/slurm_scheduler_test.py
scripts/slurmint.sh

Updated the integration test to use torchx log instead of slurm-<id>.out.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 24, 2022
@d4l3k d4l3k force-pushed the slurmfixes branch 2 times, most recently from 470213e to 1aaa4b3 Compare January 25, 2022 00:04
@codecov

codecov bot commented Jan 25, 2022

Codecov Report

Merging #373 (e2cc483) into main (92e6897) will increase coverage by 0.06%.
The diff coverage is 96.55%.

@@            Coverage Diff             @@
##             main     #373      +/-   ##
==========================================
+ Coverage   94.66%   94.73%   +0.06%     
==========================================
  Files          61       61              
  Lines        3208     3227      +19     
==========================================
+ Hits         3037     3057      +20     
+ Misses        171      170       -1     

Impacted Files                          Coverage Δ
torchx/schedulers/slurm_scheduler.py    98.00% <96.29%> (+1.05%) ⬆️
torchx/schedulers/local_scheduler.py    93.25% <100.00%> (ø)

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@facebook-github-bot
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@d4l3k has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@d4l3k has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@d4l3k d4l3k deleted the slurmfixes branch February 1, 2022 19:13