Node states are unknown #37

Closed
percyfal opened this issue Feb 16, 2022 · 3 comments

@percyfal
Contributor

Hi,

I'm using docker-centos7-slurm to test a workflow manager. It has been a while since I last updated, but when trying out the most recent version I noticed that only one node (c1) is up in the container. I am currently testing this in my fork (see pr #1). Briefly, I parametrized test_job_can_run to pass a partition name to the --partition option. The normal partition works as expected, but debug fails.

If one enters the latest image with

docker run -it -h slurmctl giovtorres/docker-centos7-slurm:latest

running sinfo yields

NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
c1             1   normal*        idle 1       1:1:1   1000        0      1   (null) none                
c2             1   normal*    unknown* 1       1:1:1   1000        0      1   (null) none                
c3             1     debug    unknown* 1       1:1:1   1000        0      1   (null) none                
c4             1     debug    unknown* 1       1:1:1   1000        0      1   (null) none                

See the GitHub Actions results, where I added some print statements to see what was going on (never mind that the test actually passed; I was simply looking at the erroneous Slurm output file). I consistently get the feedback that the required nodes are not available; it would seem node c1 is the only node available to sbatch.
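
For reference, the behaviour can be reproduced by hand inside the container; the commands below are illustrative and the wrapped job is arbitrary:

sbatch --partition=normal --wrap="hostname"    # accepted and runs on c1
sbatch --partition=debug --wrap="hostname"     # reports that the required nodes are not available
scontrol show node c3                          # inspect the state of one of the debug nodes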

Are you able to reproduce this?

Cheers,

Per

@giovtorres
Owner

giovtorres commented Feb 17, 2022

Yes, I can reproduce this in 21.08, but 20.11.8 appears to be working. Something related to networking changed in Slurm between those releases, and slurm.conf is probably missing something to accommodate it. I saw errors about NodeHostName and NodeAddr being the same for all four nodes.
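
In the meantime, pinning the last known-good release should sidestep the problem, assuming a tag for that version is published (the exact tag name may differ):

docker run -it -h slurmctl giovtorres/docker-centos7-slurm:20.11.8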

Help is appreciated 😄

@percyfal
Contributor Author

So I actually found a solution: removing the redundant NodeHostName entries in slurm.conf seems to work:

NodeName=c1 NodeHostName=slurmctl NodeAddr=127.0.0.1 RealMemory=1000
NodeName=c2 NodeAddr=127.0.0.1 RealMemory=1000
NodeName=c3 NodeAddr=127.0.0.1 RealMemory=1000 Gres=gpu:titanxp:1
NodeName=c4 NodeAddr=127.0.0.1 RealMemory=1000 Gres=gpu:titanxp:1
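
For comparison, the previous definitions set NodeHostName=slurmctl on every node, roughly like this (reconstructed from the duplicate-hostname errors below, so other attributes may differ):

NodeName=c1 NodeHostName=slurmctl NodeAddr=127.0.0.1 RealMemory=1000
NodeName=c2 NodeHostName=slurmctl NodeAddr=127.0.0.1 RealMemory=1000
NodeName=c3 NodeHostName=slurmctl NodeAddr=127.0.0.1 RealMemory=1000 Gres=gpu:titanxp:1
NodeName=c4 NodeHostName=slurmctl NodeAddr=127.0.0.1 RealMemory=1000 Gres=gpu:titanxp:1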

I couldn't find anything in the docs that would suggest this solution; I basically reacted to this error in the logs:

cat /var/log/slurm/slurmctld.log | grep Duplicated
[2022-02-17T08:33:22.239] error: Duplicated NodeHostName slurmctl in the config file
[2022-02-17T08:33:22.239] error: Duplicated NodeHostName slurmctl in the config file
[2022-02-17T08:33:22.239] error: Duplicated NodeHostName slurmctl in the config file

I'm currently running the test on my fork (see changeset). I'll submit a PR on success.
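
To sanity-check the change after rebuilding the image, something like the following should show all four nodes as idle and accept jobs on the debug partition (illustrative; the image tag and build context may differ):

docker build -t docker-centos7-slurm:test .
docker run -it -h slurmctl docker-centos7-slurm:test
sinfo
sbatch --partition=debug --wrap="hostname"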

@giovtorres
Owner

Fixed in #38
