New egs-parallel set of scripts to replace run_user_code_batch #628
This looks so much cleaner!!! |
@crcrewso For slurm, we need someone to copy the |
I know the scripts are a little dense: I am trying here to resolve the lock file issue without changing the EGSnrc code, which is perhaps a bit contorted (see by comparison @mainegra's cleaner uniform run control object solution in #588). I found that separating the two roles of the lock file (indicating that the job is running or not, and serving the number of histories) resolved a lot of the issues. There is now a […]. There is also a lot of code repetition between the sub-scripts, which is not ideal for code maintenance: we have to remember to copy changes to all the sub-scripts. But this is intentional, as I wanted it to be easy to create new sub-scripts for other schedulers (slurm!) or to suit their particular needs, without imposing any logic beyond the arguments passed to the script. If code maintenance becomes an issue, I'll think of something else; I am not worried about this for now. |
I'm not entirely sure if I'll have a test system for slurm until these quarantimes are over but if I do I'll definitely submit something. |
@ftessier are these scripts tied to using the locking file mechanism? The reason I ask is that in HPC environments where the locking file does not work, one might want to resort to the URCO mechanism (uniform load on all jobs), and then these scripts will not work. But perhaps this is not an issue, since in those cases one might have to use different scripts, such as the one I created for the GPSC. I will try these scripts on the GPSC and see what happens! |
For the moment yes they are :-( (except the -cpu script, which does not try to synchronize with the |
In the end, the lock file should only be checked and handled by either the script or EGSnrc, not both, for exactly the reason you point out: we don't want to have to change the submit scripts if we change the lock file handling in EGSnrc. |
@mainegra Could the first job still create an (empty) |
@ftessier in practice it is possible, but the name (extension) would be misleading. |
@mainegra Thanks for the suggested improvement! I removed the dependencies on the lock file in the egs-parallel scripts. The script will still prevent launching the simulation if there is a |
@ftessier that's great news! That way we could potentially use your scripts on the GPSC as well! |
@ftessier I am giving these scripts a try! Quick question: Why is there a delay at the beginning? I turned on the verbose option and got a headache! 🤯 |
@ftessier I tried using the urco with this and it failed complaining about the lock file not being there ... are you sure the dependency is gone? |
Indeed, the scripts still contain a |
Updated to truly remove the lock file dependencies in the |
Awesome!
Save the egs-parallel log inside an *.egsparallel file in the application directory, and add a verbosity option (-v) to also echo the log to screen. By default the scripts proceed silently, unless an error condition arises, which is always echoed to the terminal.
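The logging behaviour described in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `log_message`, its argument order, and the timestamp format are hypothetical and are not taken from the egs-parallel scripts.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the actual egs-parallel code.

# log_message LOGFILE VERBOSE MESSAGE...
# Always append a timestamped line to LOGFILE; also echo it to the
# terminal when VERBOSE is 1.
log_message() {
    local logfile="$1" verbose="$2"
    shift 2
    local line
    # assumed timestamp format (UTC, ISO 8601 style)
    line="$(date -u +"%Y-%m-%dT%H:%M:%SZ") $*"
    printf '%s\n' "$line" >> "$logfile"
    if [ "$verbose" -eq 1 ]; then
        printf '%s\n' "$line"
    fi
}

# Example use (file name is illustrative):
#   log_message slab.egsparallel 0 "job 1 submitted"   # silent, logged only
#   log_message slab.egsparallel 1 "job 2 submitted"   # logged and echoed
```

Keeping the file write unconditional and the terminal echo optional means the log file is complete regardless of the verbosity setting.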
Notably, save log messages to a log file, add a verbosity option (-v), and allow joined single-letter options and arguments (without a space between the option and its argument, as in "-n123").
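One way to accept joined options such as "-n123" in bash is the `getopts` builtin, which handles both "-n 123" and "-n123". The sketch below only illustrates the idea; the egs-parallel scripts also accept long options and may well parse their arguments differently. The function name `parse_options` and the variables `nthread` and `verbose` are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: parse -n <number> and -v with getopts.

parse_options() {
    local OPTIND=1 opt
    nthread=1
    verbose=0
    # leading ':' selects silent error reporting
    while getopts ":n:v" opt "$@"; do
        case "$opt" in
            n) nthread="$OPTARG" ;;
            v) verbose=1 ;;
            *) echo "unknown or incomplete option" >&2; return 1 ;;
        esac
    done
}

parse_options -n123 -v
echo "nthread=$nthread verbose=$verbose"
```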
Apart from format and other minor adjustments, update the standard pbs script egs-parallel-pbs (whereby EGSnrc jobs are submitted individually) so that only the second job waits for the .egsjob file and the .lock file, since the jobs are submitted sequentially anyways.
This egs-parallel-cpu subscript provides the option "--batch cpu" to egs-parallel, to launch a simulation on multiple cores on the local cpu, without requiring a job scheduler. Intentionally, this script is simple: it just launches the jobs sequentially, without waiting around for the .egsjob or .lock files, as in the pbs scripts. However, the logging is consistent with the other egs-parallel scripts. The number of threads is always constrained to the number of threads available on the machine, because it is inefficient to go beyond that, and launching a large number of threads on a cpu by mistake may well stall the computer.
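The thread limit could be enforced along these lines. This is a sketch under assumptions: `clamp_threads` is a hypothetical helper, and querying the processor count with `getconf _NPROCESSORS_ONLN` is one option that works on Linux and macOS, not necessarily what egs-parallel-cpu does.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: never launch more jobs than available threads.

# clamp_threads REQUESTED AVAILABLE
# prints the smaller of the two integers
clamp_threads() {
    local requested="$1" available="$2"
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# number of online logical processors; fall back to 1 if unknown
available=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
nthread=$(clamp_threads 64 "$available")
echo "launching $nthread job(s) on $available available thread(s)"
```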
Improve the script robustness, in particular by forcing the user to specify either the -n (--dry-run) option, or the -f (--force) option to actually remove files, to prevent accidental erasing (to some extent). This script removes files without warnings (when using -f), so use with caution: run with the -n option first to see what will be deleted. Add the concatenation and sorting of egs-parallel log messages into the .egsparallel file for reference. This is useful, because these log messages may be scattered in different files, for example the .eo files from pbs. After cleaning, the .egsparallel contains a time-ordered sequence of messages from egs-parallel and its subscripts.
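The "dry run or force" safeguard can be illustrated with a small sketch. The function `clean_files` and its mode strings are hypothetical; the real egs-parallel-clean script decides by itself which files to remove and has its own option handling.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: refuse to delete anything unless a mode is given.

# clean_files MODE FILE...
# MODE must be "dry-run" (only report) or "force" (actually remove).
clean_files() {
    local mode="$1" f
    shift
    case "$mode" in
        dry-run)
            for f in "$@"; do echo "would remove $f"; done
            ;;
        force)
            for f in "$@"; do rm -f -- "$f" && echo "removed $f"; done
            ;;
        *)
            echo "error: specify -n (--dry-run) or -f (--force)" >&2
            return 1
            ;;
    esac
}
```

Requiring an explicit mode means that calling the script with no option never deletes anything.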
Strictly speaking, there can be multiple threads per hardware core; this is typical in modern workstations. Change "ncore" to "nthread" throughout the egs-parallel scripts, to avoid confusion.
Add a bin directory in HEN_HOUSE/scripts and add it to the PATH in the shell additions scripts. This allows some EGSnrc scripts to be directly executable by a user, without using aliases (which are not inherited by subshells). The immediate motivation is for the top-level egs-parallel script, and the egs-parallel-clean script, to become visible on the path, while the egs-parallel sub-scripts remain in scripts and are not in the path (these should not be invoked directly).
Do not source the shell additions scripts from within the egs-parallel sub-scripts, as this is not necessary and not secure. Sourcing was only needed in the dshtask script to get the path to the EGSnrc executables, because tasks are launched on the pbs nodes without inheriting the environment. In this case, simply export the PATH variable via the pbsdsh qsub script.
Use a more portable date command format for the timestamp string, and tweak the usage message of egs-parallel scripts.
Add -x (--extra) option to clean up egs-parallel log files .egsparallel and .egsparallel.eo. Although this script always echoes progress to the terminal, add a -v (--verbose) option to echo the commands that are run by the script, instead of the more concise messages usually reported. Internally, add an "action" command to ensure that the log messages remain up to date with the commands.
For convenience, add a -l (--list) option to the cleaning script to list all the .egslog file base names in the current directory. This option is checked first and overrides every other argument: the list is printed to the terminal and the script terminates. Also, reformat the usage message and use the extension .egsparallel-eo (with a hyphen) to avoid collision with the pbs .eo extension. Use executable basename in quit function.
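Listing the .egslog base names amounts to something like the following. This is a sketch only; the function name `list_egslog_basenames` is made up for this example and the actual script may format its output differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print the base name of every .egslog file
# found in a directory (default: current directory).

list_egslog_basenames() {
    local dir="${1:-.}" f
    for f in "$dir"/*.egslog; do
        # skip the unexpanded pattern when no file matches
        [ -e "$f" ] || continue
        basename "$f" .egslog
    done
}

list_egslog_basenames .
```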
Change the initial value of the --batch option to "cpu" so that the script invokes the multicore parallel sub-script (egs-parallel-cpu) when no --batch option is specified on the command line. This allows users to try egs-parallel out of the box (most computers are multicore nowadays) without worrying about schedulers.
Don't quit the egs-parallel submit scripts if no lock file is found, and add a -f (--force) option to override existing .egsjob or .lock files. The lock file for parallel jobs is managed inside EGSnrc, so the script should not manage it as well: this creates an obscure correlation between the code and the script. Moreover, the uniform run control method does not create a lock file. Previously, the submit script would quit if there was no lock file. The top-level egs-parallel script now prevents the run if there is an .egsjob file OR a .lock file, for the same reason. This can be overridden with the added --force option.
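The pre-launch check described here can be sketched as below. The helper `may_launch` and its arguments are hypothetical; only the rule itself (refuse if an .egsjob or .lock file exists, unless forced) comes from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the pre-launch guard.

# may_launch BASENAME FORCE
# returns 0 if the run may proceed, 1 otherwise
may_launch() {
    local base="$1" force="${2:-0}"
    if [ "$force" -eq 1 ]; then
        return 0
    fi
    if [ -e "$base.egsjob" ] || [ -e "$base.lock" ]; then
        echo "error: $base.egsjob or $base.lock exists (use --force to override)" >&2
        return 1
    fi
    return 0
}
```

Note that the absence of a lock file is not treated as an error here, which keeps the check compatible with run control methods that never create one.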
Detect pbs jobs that fail to launch in egs-parallel, by looking at the echoed job pid: quit immediately if it is not an integer. If the first job fails, subsequent jobs are not launched. Report the failure in the log. Also adjust the format of a few log messages.
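A plain-bash integer test of the kind described could look like this. It is a sketch: `is_integer` and the variable `jobpid` are illustrative names, and how the scripts actually obtain the job pid from the scheduler is not shown.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: treat anything that is not a string of digits
# as a failed submission.

is_integer() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
        *)           return 0 ;;
    esac
}

jobpid="12345"   # illustrative value echoed at submission
if ! is_integer "$jobpid"; then
    echo "job failed to launch: $jobpid" >&2
fi
```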
Fix a crash that occurred when the 14 character truncation of the filename for an egs-parallel pbsdsh job ended up starting with a '.'. The first character is now trimmed away if that is the case, so the job name is only 13 characters.
Ensure that the PBS job name starts with an alphanumeric character [0-9A-Za-z], following the PBS scheduler requirement. To avoid failed jobs solely on the account of a bad job name, strip all leading non-alphanumeric characters from the job name. Note that the egsinp basename is not affected, this is strictly for the job name passed to qsub via the -N option.
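Stripping the leading non-alphanumeric characters can be done with parameter expansion alone. The following is a sketch with a hypothetical function name, `strip_leading_non_alnum`; the truncation to 14 characters is deliberately left out, since the commit messages do not say which part of the name is kept.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: drop characters from the front of a job name
# until it starts with one of [0-9A-Za-z].

strip_leading_non_alnum() {
    local name="$1"
    while [ -n "$name" ]; do
        case "$name" in
            [0-9A-Za-z]*) break ;;          # valid first character
            *)            name="${name#?}" ;;  # remove first character
        esac
    done
    printf '%s\n' "$name"
}

strip_leading_non_alnum "._slab_521icru"
```

Only the job name is altered this way; the input file base name would be passed through unchanged.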
@ftessier, these scripts aren't expected to work seamlessly on OSX, correct? For example, on a (my) Mac, the line: |
Good point @blakewalters. I always forget that there is no |
It shouldn't work on Windows Git Bash, since there's no gcc or gfortran. I can test WSL 1 and WSL 2 over the weekend. Edit 1: WSL 1 (Pengwin) works without issues.
egs-parallel-clean:
|
This pull request implements egs-parallel: a new set of bash scripts to submit EGSnrc parallel jobs, with the following improvements over the legacy command exb (aliased to the run_user_code_batch script):
- […] --verbose option)
- […] -h option)
- […] .lock files
- […] egs-parallel script dispatches a specific sub-script for each method (e.g., egs-parallel-pbs)
- […] egs-parallel script can launch jobs locally on a multicore computer, with the --batch cpu option
- […] egs-parallel
- […] egs-parallel-clean script is provided to help tidy up intermediate simulation files and logs
- […] exb)

Here is a sample invocation:
egs-parallel --batch pbsdsh -q short -n12 -v -c 'egs_chamber -i slab -p 521icru'
The scripts egs-parallel and egs-parallel-clean are placed inside a new $HEN_HOUSE/scripts/bin/ directory, which is added to the path in the EGSnrc shell additions. The sub-scripts are only meant to be called from the top-level egs-parallel, so they are not placed inside this new bin/ directory to prevent calling them directly (they remain in $HEN_HOUSE/scripts/, which is not added to the path).
Take this out for a spin if you will, but remain cautious (especially with the cleaning script!).