LSF provided affinity is not supported #791

Closed
jjhursey opened this issue Mar 2, 2021 · 5 comments · Fixed by #1569
@jjhursey
Member

jjhursey commented Mar 2, 2021

LSF allows the user to specify process affinity at bsub time similar to:

bsub -Is -q interactive -W 30 -R "span[ptile=20] affinity[core(1):distribute=pack]" -n 30 /bin/bash

This results in a non-empty file pointed to by $LSB_AFFINITY_HOSTFILE. The file lists the hardware threads that the process should be bound to, using physical IDs. Binding to hardware threads is already handled by setting the PRTE_JOB_HWT_CPUS attribute. The physical hardware thread IDs are the problem, however, because PRTE no longer supports physical IDs.

As of PR #597, we throw an error when we detect this scenario. We need to work on a solution to restore this functionality.
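
For illustration, here is a minimal sketch of the first step: turning a list of physical IDs into integers before any translation is attempted. It assumes the IDs arrive as a comma-separated string such as "0,1,2,3"; the exact layout of the affinity hostfile is not described in this thread, so treat that format as an assumption. `parse_cpu_list` is a hypothetical helper, not an existing PRTE function.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a comma-separated list of physical CPU IDs
 * (assumed format, e.g. "0,1,2,3") into ids[].
 * Returns the number of IDs parsed, or -1 on malformed input or overflow
 * of the caller's buffer. */
int parse_cpu_list(const char *list, int *ids, int max_ids)
{
    assert(list != NULL && ids != NULL);

    int count = 0;
    const char *p = list;

    while (*p != '\0') {
        char *end;
        errno = 0;
        long value = strtol(p, &end, 10);

        /* Reject empty fields, range errors, and negative IDs. */
        if (end == p || errno != 0 || value < 0 || value > INT_MAX) {
            return -1;
        }
        /* Reject lists that do not fit in the caller's buffer. */
        if (count == max_ids) {
            return -1;
        }
        ids[count++] = (int)value;

        if (*end == ',') {
            p = end + 1;
            if (*p == '\0') {
                return -1;  /* trailing comma */
            }
        } else if (*end == '\0') {
            p = end;        /* clean end of list */
        } else {
            return -1;      /* unexpected character */
        }
    }
    return count;
}
```

Each value returned this way is still a physical ID, so it cannot be handed to PRTE unchanged.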

@jjhursey jjhursey added enhancement help wanted NiceHave Would be nice to have fixed in the next release, but may slip labels Mar 2, 2021
@rhc54
Contributor

rhc54 commented Mar 2, 2021

You should just need to use HWLOC to convert the physical IDs to their logical equivalents. You might look at the old ORTE/OPAL code, as that is what we used to do.

@rhc54
Contributor

rhc54 commented Mar 4, 2021

This turns out to be trivial:

obj = hwloc_get_pu_obj_by_os_index(topo, physical_id);
logical_id = obj->logical_index;

Checked and that works all the way back to HWLOC 1.11, so it should be okay to use.

@jjhursey
Member Author

jjhursey commented Mar 4, 2021

I think that'll work fine in a homogeneous configuration. If we detect a heterogeneous configuration, then we might have issues if we do the translation on the node with the HNP. In the short term, that's an OK restriction. In the longer term, we may want to handle this on the backend, but that would require re-introducing physical IDs more broadly, and I don't know if we want to do that.

I'll see if I can get to the short-term fix next week.

@rhc54
Contributor

rhc54 commented Mar 4, 2021

Fair point. I'd still do the translation on the HNP for simplicity, but you could do it in the plm/base, where we receive the hetero topology from the remote node. You'd have to do it that way in the case (which I believe is common for LSF) where the HNP is on a login node and the compute node's topology is different (due to cgroup or whatever), even if the physical architecture is the same.
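
To make the heterogeneous case concrete, here is a rough, self-contained sketch of why the translation has to use the topology of the node the process will run on. It does not call HWLOC; the `node_topology_t` table is a hypothetical stand-in for a node's topology, and `physical_to_logical` plays the role of the `hwloc_get_pu_obj_by_os_index` lookup shown earlier in the thread. The two example nodes are made up, and the compute node assumes a cgroup that leaves only physical IDs 4-7 visible, with logical indexes numbered over what remains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one node's topology: entry i holds the
 * physical (OS) index of the PU whose logical index is i. */
typedef struct {
    const unsigned *os_index;
    size_t num_pus;
} node_topology_t;

/* Translate a physical ID to a logical index on the given node.
 * Returns -1 if the node has no PU with that physical ID, which is the
 * case a real lookup must also handle (a NULL object). */
int physical_to_logical(const node_topology_t *topo, unsigned physical_id)
{
    assert(topo != NULL);
    for (size_t i = 0; i < topo->num_pus; i++) {
        if (topo->os_index[i] == physical_id) {
            return (int)i;
        }
    }
    return -1;
}

/* Made-up example: a login node that sees all eight PUs... */
static const unsigned login_os_index[] = {0, 1, 2, 3, 4, 5, 6, 7};
static const node_topology_t login_node = {login_os_index, 8};

/* ...and a compute node restricted (by assumption) to physical IDs 4-7. */
static const unsigned compute_os_index[] = {4, 5, 6, 7};
static const node_topology_t compute_node = {compute_os_index, 4};
```

With these tables, physical ID 4 maps to logical index 4 on the login node but to logical index 0 on the compute node, and physical ID 1 does not exist on the compute node at all. Translating against the HNP's own topology would therefore give the wrong answer whenever the two differ.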

@jjhursey jjhursey added this to the Future milestone Mar 25, 2021
@rhc54 rhc54 closed this as completed May 24, 2021
@jjhursey
Member Author

I'm working on this now, and think I have a fix in progress.
