Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sort_nodes.py #2853

Merged
merged 3 commits into from
Aug 8, 2024
Merged

Add sort_nodes.py #2853

merged 3 commits into from
Aug 8, 2024

Conversation

mr0re1
Copy link
Collaborator

@mr0re1 mr0re1 commented Aug 1, 2024

$ /slurm/scripts/sort_nodes.py --help
usage: sort_nodes.py [-h] [--nodelist NODELIST] [--ntasks-per-node NTASKS_PER_NODE] [--output OUTPUT]

This script sorts nodes based on their `physicalHost`.

See https://cloud.google.com/compute/docs/instances/use-compact-placement-policies

You can reduce latency in tightly coupled HPC workloads (including distributed ML training)
by deploying them to machines that are located close together.
For example, if you deploy your workload on a single physical rack, you can expect lower latency
than if your workload is spread across multiple racks.
Sending data across multiple rack requires sending data through additional network switches.

Example usage:
``` my_sbatch.sh
#SBATCH --ntasks-per-node=8
#SBATCH --nodes=64

export SLURM_HOSTFILE=$(/slurm/scripts/sort_nodes.py)

srun -l hostname | sort
```

options:
  -h, --help            show this help message and exit
  --nodelist NODELIST   Slurm 'hostlist expression' of nodes to sort, if not set the value of SLURM_NODELIST environment variable will be used
  --ntasks-per-node NTASKS_PER_NODE
                        Number of times to repeat each node in resulting sorted list.
                        If not set, the value of SLURM_NTASKS_PER_NODE environment variable will be used,
                        if neither is set, defaults to 1
  --output OUTPUT       Output file to write, defaults to 'hosts.<uuid>'

@mr0re1 mr0re1 added the release-improvements Added to release notes under the "Improvements" heading. label Aug 1, 2024
@mr0re1 mr0re1 self-assigned this Aug 1, 2024
Copy link
Collaborator

@samskillman samskillman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would suggest adding a test of numerical hostnames, and needs fix to get the lowest number first. Could be combined with a test of two groups of three nodes at depth 2, where the lowest number node is in the node with a higher first level string. something like

x/slurm-1
x/slurm-2
x/slurm-3
y/slurm-0
y/slurm-4
z/slurm-10

I'd expect 0,4,10,1,2,3

@mr0re1
Copy link
Collaborator Author

mr0re1 commented Aug 5, 2024

@samskillman , for

something like
x/slurm-1
x/slurm-2
x/slurm-3
y/slurm-0
y/slurm-4
z/slurm-10
I'd expect 0,4,10,1,2,3

The order 0, 4, 1, 2, 3, 10 seems to be more natural is the particular reason you wrote 0,4,10,1,2,3?

@mr0re1 mr0re1 changed the title Add topologically_sort_nodes.py Add sort_nodes.py Aug 5, 2024
@mr0re1
Copy link
Collaborator Author

mr0re1 commented Aug 5, 2024

  • Addressed comments about name-order:
    Don't use naive lexicographic order anywhere, always rely on order of names in nodelist.
    Significantly refactor def order.

  • Rename topologicaly_sort_nodes -> sort_nodes to reduce confusion, since order is not "topological".

@mr0re1 mr0re1 requested a review from samskillman August 5, 2024 22:07
@mr0re1 mr0re1 assigned mr0re1 and unassigned mr0re1 Aug 5, 2024
@samskillman
Copy link
Collaborator

@samskillman , for

something like
x/slurm-1
x/slurm-2
x/slurm-3
y/slurm-0
y/slurm-4
z/slurm-10
I'd expect 0,4,10,1,2,3

The order 0, 4, 1, 2, 3, 10 seems to be more natural is the particular reason you wrote 0,4,10,1,2,3?

I agree - I meant for that z to be a y. I think either would be acceptable solutions in principle, but I agree 0,4,1,2,3,10 is better. My mistake on the typo.

@mr0re1 mr0re1 requested review from nick-stroud and removed request for samskillman August 7, 2024 00:09
@mr0re1 mr0re1 assigned nick-stroud and unassigned mr0re1 Aug 7, 2024
@nick-stroud nick-stroud assigned mr0re1 and unassigned nick-stroud Aug 7, 2024
@nick-stroud nick-stroud removed their assignment Aug 7, 2024
@mr0re1 mr0re1 merged commit cad2282 into GoogleCloudPlatform:develop Aug 8, 2024
8 of 51 checks passed
@mr0re1 mr0re1 deleted the rank_sort branch August 8, 2024 03:29
@rohitramu rohitramu mentioned this pull request Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-improvements Added to release notes under the "Improvements" heading.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants