
slurm commands not found #4

Open
henkela opened this issue Sep 8, 2014 · 4 comments

Comments


henkela commented Sep 8, 2014

I'm seeing another issue, which may just be my own fault. When I install the RPMs and then start slurmmond, the daemon can't find the Slurm commands, which are located in /usr/local/bin, even though my PATH environment variable includes /usr/local/bin. Here's a trace of the messages I just captured:

slurmmond[5341]: starting
slurmmond[5341]: started sdiag metrics process, pid [5342]
slurmmond[5341]: started jobcount metrics process, pid [5343]
slurmmond[5341]: started reserved cores metrics process, pid [5345]
slurmmond[5341]: started [probejob-compute,IB] metrics process, pid [5346]
slurmmond(sdiag)[5342]: metrics for [slurmmond(sdiag)] failed with message [[Errno 2] No such file or directory]
slurmmond(jobcount)[5343]: metrics for [slurmmond(jobcount)] failed with message [shell code ["squeue -h -o '%u' -t PD | wc -l"] failed with exit status [0], stderr is ['/bin/sh: squeue: command not found\n']]
slurmmond(probejob-compute,IB)[5346]: metrics for [slurmmond(probejob-compute,IB)] failed with message [job submission ["sbatch '-p' 'compute,IB' '-J' 'probejob' '-n' '1' '-t' '2' '--mem' '10' '-o' '/dev/null' '-e' '/dev/null' --wrap 'true'"] failed with non-zero returncode [127] and/or non-empty stderr ['/bin/sh: sbatch: command not found']]
slurmmond(reservations)[5345]: metrics for [slurmmond(reservations)] failed with message [[Errno 2] No such file or directory]

If I add /usr/local/bin/ to all of the command calls, slurmmond works.
I also tried adding a sys.path.append, but that didn't work. Am I doing something wrong?

Contributor

jabrcx commented Sep 9, 2014

Not having /usr/local/bin in the PATH is normal for RHEL/CentOS services (it's not something slurmmon is doing). /sbin/service aggressively sanitizes the environment: it sources /etc/init.d/functions, which sets PATH="/sbin:/usr/sbin:/bin:/usr/bin".
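
To reproduce this outside the init script, here is a minimal Python sketch that runs a shell command with only the sanitized PATH quoted above. The helper name run_with_service_path is hypothetical (it is not part of slurmmon), and it assumes Python 3.7+ for capture_output.

```python
import subprocess

# The PATH that /etc/init.d/functions sets, as quoted above.
SERVICE_PATH = "/sbin:/usr/sbin:/bin:/usr/bin"


def run_with_service_path(shell_code):
    """Run shell_code via the shell with only the sanitized PATH set.

    Hypothetical helper for diagnosis; returns (returncode, stdout, stderr).
    """
    proc = subprocess.run(
        shell_code,
        shell=True,                  # same style of invocation as the log above
        env={"PATH": SERVICE_PATH},  # replace the environment, do not extend it
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

On a host where squeue lives only in /usr/local/bin, run_with_service_path("squeue -h -o '%u' -t PD") should fail with a "not found" message on stderr. In the jobcount command from the log, the pipeline's exit status is that of wc, which is why that failure is reported with exit status [0].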

Updating sys.path from within slurmmon doesn't help because sys.path is Python's module search path (used for imports), not the path used to look up executables in subprocesses. You'd want to update the PATH environment variable with something like:

os.environ['PATH'] = ':'.join(('/usr/local/bin', os.environ['PATH']))
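
A slightly more defensive version of the same idea, as a minimal sketch: it only prepends the directory if it is not already in PATH, so re-running it does not keep growing the variable. The function name prepend_to_path is hypothetical and not part of slurmmon. Changes to os.environ are inherited by child processes, which is what makes this work for the shell commands slurmmon runs.

```python
import os


def prepend_to_path(directory, environ=None):
    """Put directory at the front of PATH in environ (default: os.environ).

    Hypothetical helper. Does nothing if the directory is already listed.
    Returns the resulting PATH string.
    """
    if environ is None:
        environ = os.environ
    # Split the current PATH, dropping empty entries.
    current = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in current:
        environ["PATH"] = os.pathsep.join([directory] + current)
    return environ["PATH"]
```

In config.py this would be called as prepend_to_path('/usr/local/bin').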

Slurmmon has config.py for site tweaks like this, if you decide to go that route rather than something more global like editing /etc/init.d/functions. Does that suffice?

Author

henkela commented Sep 9, 2014

Hi John,
thanks for the explanations.
I added os.environ['PATH'] to config.py. It solved most of the problems. Just one problem remains: the "sudo -u slurm scontrol" call. When I check sudo -u slurm env, I don't see /usr/local/bin in PATH, so for now I'm leaving /usr/local/bin hardcoded in slurmmond (sorry, I'm not very experienced with Python).

Contributor

jabrcx commented Sep 12, 2014

sudo is another thing that drops environment customizations. There is a -E option to preserve them, but I think it'd be safer for slurmmon to essentially run "which scontrol" first and pass the full path to sudo. I'll put that in.
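
Roughly, that idea could look like the following minimal sketch. It assumes Python 3's shutil.which as the equivalent of the which lookup; the function name sudo_command is hypothetical and this is not slurmmon's actual code.

```python
import shutil


def sudo_command(user, command, args=()):
    """Build an argv list that runs command as user through sudo.

    Hypothetical helper. The command is resolved against the caller's PATH
    first, so sudo receives the resolved path rather than a bare name.
    Raises FileNotFoundError if the command cannot be found.
    """
    full_path = shutil.which(command)
    if full_path is None:
        raise FileNotFoundError("%s not found in PATH" % command)
    # An argv list (no shell string) avoids quoting problems.
    return ["sudo", "-u", user, full_path] + list(args)
```

Something like sudo_command("slurm", "scontrol", scontrol_args) would then be passed to subprocess without shell=True, where scontrol_args stands for whatever arguments the probe job feature needs.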

Contributor

jabrcx commented Sep 22, 2014

I'm having second thoughts about adding the "which scontrol" lookup to fully qualify the command before executing it. This sudo code is a bit sketchy to begin with (and exists solely for a probe job priority feature that arguably isn't needed anyway), and I'd rather not add more ways the command is dynamically constructed, which could create security issues. I think it's better to leave it to the sudo config rather than do an end run around it.

So, in /etc/sudoers, you could modify the default PATH for the slurm user with something like:

Defaults:slurm secure_path="/usr/local/bin:...(rest of what's there already for secure_path)..."

(Or leave off the :slurm to modify it for all users.)
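
To check that the change took effect, you can look at the PATH line in the output of sudo -u slurm env. A minimal Python sketch for pulling the directories out of that output follows; the function name path_dirs_from_env_output is hypothetical.

```python
def path_dirs_from_env_output(env_output):
    """Return the PATH directories found in `env`-style output.

    Hypothetical helper. env_output is text with one NAME=value pair per
    line; returns an empty list if there is no PATH line.
    """
    for line in env_output.splitlines():
        name, sep, value = line.partition("=")
        if sep and name == "PATH":
            return [d for d in value.split(":") if d]
    return []
```

Feeding it the text captured from sudo -u slurm env (for example via subprocess.check_output with text=True) and testing whether '/usr/local/bin' is in the returned list shows whether the slurm user will find the Slurm commands.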

Do you think that's the better approach here and it'll work in your case?

I want to make slurmmon easy to run, but I think the cleaner approach is to keep it simple: assume the Slurm commands are in the directories in PATH, and rely on system configuration to make that happen (rather than having slurmmon go out of its way to find them, or not respect system configuration). What do you think?
