Too many sacct requests for batched tasks #1759

Open
Fadelis98 opened this issue Jan 11, 2024 · 0 comments

I need to submit thousands of tasks. Because of the maximum size limit on job arrays, the tasks are divided into groups, with one job array per group:

submitted_jobs = []
for group_jobs_to_run in groups:
    with executor.batch():  # one job array per group
        for idx in group_jobs_to_run:  # idx is the user-defined task index, not the Slurm job ID
            task_args, task_kwargs = get_task_args(idx)
            job = executor.submit(slurm_tasks, *task_args, **task_kwargs)
            submitted_jobs.append(job)
# wait for all results
for job in submitted_jobs:
    job.wait()
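
For context, the `groups` iterable used above could be built by chunking the task indices so that no group exceeds the cluster's job array limit. The following is only a minimal sketch: `make_groups` and `max_array_size` are hypothetical names that do not appear in the original code, and the real limit depends on the cluster configuration.

```python
from typing import List, Sequence


def make_groups(task_indices: Sequence[int], max_array_size: int) -> List[List[int]]:
    """Split task indices into consecutive groups of at most max_array_size items.

    Hypothetical helper for illustration; max_array_size stands in for
    whatever job array limit the cluster enforces.
    """
    if max_array_size < 1:
        raise ValueError("max_array_size must be at least 1")
    return [
        list(task_indices[start:start + max_array_size])
        for start in range(0, len(task_indices), max_array_size)
    ]


# Example: 10 user-defined task indices with an assumed limit of 4 per array.
groups = make_groups(range(10), 4)
```

Each inner list then corresponds to one `batch()` context, that is, one job array.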

I use job.wait() to wait for all tasks to complete. However, this often triggers the per-user RPC limit on my Slurm cluster and sometimes even stalls the whole cluster. I get the following warnings:

sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:07:41,285) - Call #6 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:07:41,285) - Call #6 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:08:51,594) - Call #7 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:08:51,594) - Call #7 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:15:56,718) - Call #9 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:15:56,718) - Call #9 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:25:23,108) - Call #10 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:25:23,108) - Call #10 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 23:15:34,829) - Call #15 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 23:15:34,829) - Call #15 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 23:25:36,817) - Call #16 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 23:25:36,817) - Call #16 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.

It seems that submitit issues too many duplicated requests at the same time, which exceeds the per-user RPC limit on my cluster. The job.wait() method is expected to block, so it should not request task states in parallel, and I'm not sure what mechanism in submitit causes the duplicated Slurm calls.
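
One possible way to control how often job states are checked is to replace the per-job `job.wait()` calls with a single polling loop over all jobs that sleeps for a configurable interval between passes. The sketch below is untested against a real cluster. `wait_all` and `poll_interval_s` are hypothetical names, and the only thing assumed about each job object is a `done()` method. Whether this actually lowers the number of `sacct` calls depends on how submitit caches job state internally, which is not confirmed here.

```python
import time
from typing import Callable, Iterable


def wait_all(
    jobs: Iterable,
    poll_interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until every job reports done(); return the number of polling passes.

    Hypothetical helper: each pass checks all still-pending jobs once, then
    sleeps poll_interval_s seconds before the next pass.
    """
    pending = list(jobs)
    passes = 0
    while pending:
        passes += 1
        # Keep only the jobs that are not finished yet.
        pending = [job for job in pending if not job.done()]
        if pending:
            sleep(poll_interval_s)
    return passes
```

With the code from the issue, this would be called as `wait_all(submitted_jobs, poll_interval_s=120)` in place of the final waiting loop. The `sleep` parameter exists only so the loop can be exercised without real delays.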
