-
Notifications
You must be signed in to change notification settings - Fork 537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix signal_file_path to avoid race condition #1253
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ofivite can you elaborate on why this is necessary? I'm not quite sure I understand how it has a race condition since only rank 0 should be writing this file
sure, it's not really ranks which write into the same file, but different slurm jobs which happen to start running at the same time on the cluster, e.g. in our case via an array of tasks from |
For a more general solution, should we instead append some random 6-digit SHA? We could then replace slurm ID and node rank + future proof |
In fact, the same situation apparently happens also in |
yep, think it's better to generalise it that way! |
Although I'm not quite sure, will each slurm task / rank have its own unique SHA in that case? |
Hm I guess this doesn't work since you need each node to have the same path... I guess your original solution is probably best 🤔. @dakinggg any other ideas? |
@ofivite Would attaching the run name as the unique id resolve your issue? Its technically not foolproof, because nothing forces two runs to use different run names, but probably still an improvement? The issue with slurm id is that it would only work for slurm (although maybe this is also still an improvement) |
@dakinggg that would resolve the issue indeed. :) I would still argue for including some other form of identifier into the name so that in the future, other people do not randomly run into this issue. There's still a potential race condition on systems that only initialize random seeds with system time, but if we ignore that, the name could simply be broadcasted from rank 0 to the connected processes using |
Note that if implementing the broadcasting version, the current code would then use the same path across all nodes ( |
Ok I think the right solution is to implement a utility in composer that gets a signal file name + a random id that gets broadcast from rank 0. That way it will be unique per usage. Would be happy to accept a contribution for this! |
FYI, started working on this here: mosaicml/composer#3396. I'm gonna go ahead and close this PR, thanks for raising the issue! |
Hey @janEbert , sorry for the delay. This should be fixed once mosaicml/composer#3485 and #1381 are merged. |
Awesome, thanks all! |
In case of using
slurm
with--array
option, sometimes it happens that multiple tasks start running ~simultaneously. Since the filename of the lock doesn't take into account this parallelism, the tasks start creating/writing/deleting the same file, and create a race condition. AppendingSLURM_JOB_ID
to the filename seems to fix it.