Add IBM CSM support #804

Closed
jjhursey opened this issue Mar 5, 2021 · 4 comments

@jjhursey
Member

jjhursey commented Mar 5, 2021

On some IBM systems (most notably CORAL systems such as Summit, Sierra, and Lassen), LSF plays the role of the scheduler and Cluster System Management (CSM) plays the role of the resource manager.

Often in this configuration, LSF does not have a daemon present on the compute nodes, but CSM does. LSF will identify the nodes for the allocation, but CSM may categorize those nodes differently based on their roles. Most notably, CSM distinguishes between 'login' and 'compute' nodes.

The CSM API provides csm_allocation_query, which can be used to query for this information and could serve as the basis for a ras/csm component.

We can detect whether we are in a CSM environment by the presence of the CSM_ALLOCATION_ID envar, the same check that plm/lsf already uses to disqualify itself.
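A minimal sketch of that envar check, for illustration only. The helper names csm_env_detected and csm_parse_allocation_id are hypothetical and are not taken from the PRRTE source, and treating the ID as a non-negative decimal integer is an assumption rather than something confirmed in this thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse an allocation ID string.
 * Assumption: the ID is a non-negative decimal integer. */
bool csm_parse_allocation_id(const char *value, long long *id_out)
{
    char *end = NULL;
    long long id;

    if (value == NULL || value[0] == '\0') {
        return false;               /* envar unset or empty */
    }

    errno = 0;
    id = strtoll(value, &end, 10);
    if (errno != 0 || *end != '\0' || id < 0) {
        return false;               /* overflow, trailing junk, or negative */
    }

    *id_out = id;
    return true;
}

/* Hypothetical helper: true when CSM_ALLOCATION_ID is set and non-empty.
 * A ras/csm component could use a check like this to decide whether to
 * select itself. */
bool csm_env_detected(void)
{
    const char *value = getenv("CSM_ALLOCATION_ID");
    return value != NULL && value[0] != '\0';
}
```

The parsing step is kept separate from the presence check so that a component could disqualify itself on presence alone and only parse the ID when it actually needs to query the allocation.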

The headers for CSM can be found in the repository below

@jjhursey jjhursey changed the title ADD IBM CSM support Add IBM CSM support Mar 5, 2021
@acolinisi
Contributor

I did notice a while ago that plm/lsf self-disqualifies. Does this ticket include implementing plm/lsf so that it may be used as the launcher instead of plm/ssh (just like plm/alps is used as the launcher on ALPS systems)?

At the risk of conflating issues, I was unable to launch the DVM on large allocations (more than roughly 128 nodes) with ras/lsf + plm/ssh. prte produced no output in that particular instance, so I could not investigate at the time.

@jjhursey
Member Author

jjhursey commented Mar 5, 2021

CSM does not have a generic API for launching daemons, but it does have one specific to JSM: csm_jsrun_cmd. If CSM provided a generic launch interface, we could write a plm/csm component for the daemon launch.

@acolinisi
Contributor

IIUC, plm/slurm uses srun and plm/alps uses aprun. Could plm/[csm|lsf] use jsrun?

@jjhursey
Member Author

jjhursey commented Mar 8, 2021

Possibly. I've not tried launching prted daemons via jsrun, but I think that it should work. Since JSM is not in all LSF distributions (it's relatively new) we would probably make a new plm component for it.
