Fix machine discovery on compute nodes #511

Merged

@xylar xylar commented Sep 17, 2023

This merge uses an E3SM-Unified environment variable, E3SMU_MACHINE, to determine the machine if this environment variable has been set (e.g. by sourcing an E3SM-Unified load script). This makes it possible
to detect the machine on compute nodes if you are using E3SM-Unified.

This merge also cleans up a few issues with the docs that I noticed.

Partially addresses #406

If `E3SMU_MACHINE` has been set (e.g. by sourcing an E3SM-Unified
load script), use that as the `machine`.  This makes it possible
to detect the machine on compute nodes if you are using E3SM-Unified.
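
The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `resolve_machine` and its `discover` callback are hypothetical names, and only the `E3SMU_MACHINE` variable comes from the description.

```python
import os


def resolve_machine(environ=None, discover=None):
    """Return a machine name, preferring E3SMU_MACHINE when it is set.

    `discover` is a hypothetical fallback callable (for example,
    hostname-based discovery that works on login nodes).
    """
    environ = os.environ if environ is None else environ
    machine = environ.get("E3SMU_MACHINE")
    if machine:
        # Set when an E3SM-Unified load script has been sourced.
        return machine
    if discover is not None:
        return discover()
    # Nothing could be determined; the caller has to handle this case.
    return None
```

With a development environment the variable is typically unset, so the sketch falls through to `discover` (or returns `None`), which matches the "partially addresses" caveat.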
@xylar xylar force-pushed the fix-machine-discovery-on-compute-nodes branch from fff6a54 to 499332e Compare September 17, 2023 10:14

xylar commented Sep 17, 2023

@forsyth2, let me know if you'd like me to make a separate PR for the docs changes.


@forsyth2 forsyth2 left a comment


From reading the code changes, this looks good to me.

> Partially addresses #406

Why only partially? This seems like a full resolution.

@forsyth2 forsyth2 merged commit aef4552 into E3SM-Project:main Sep 21, 2023

xylar commented Sep 21, 2023

> Why only partially? This seems like a full resolution.

This only fixes the problem if someone is using E3SMU, not if you use a development env.

@chengzhuzhang

@xylar e3sm_diags has the same challenge: when using a development env and running on a compute node, it won't auto-detect the machine and use machine-specific info. I'm wondering if it is possible to update mache for a full solution?

@xylar xylar deleted the fix-machine-discovery-on-compute-nodes branch September 21, 2023 18:02

xylar commented Sep 21, 2023

@chengzhuzhang, I'm sorry but I don't know of any solution to this. The compute nodes have super generic names and there is not a consistent way to detect what machine you are on using things like environment variables or files. If you have a suggestion, I'm up for it. But otherwise we simply need users to identify with a config option or command-line flag what machine they are on when they launch on compute nodes.
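
As a rough sketch of the explicit-identification fallback mentioned here, a launcher could accept the machine from a command-line flag and only then consult the environment. The `--machine` flag name and the `parse_machine` helper are hypothetical, chosen for illustration; they are not taken from zppy or mache.

```python
import argparse
import os


def parse_machine(argv, environ=None):
    """Pick the machine from a flag first, then from E3SMU_MACHINE."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # "--machine" is a hypothetical flag name used only in this sketch.
    parser.add_argument("--machine", default=None)
    args = parser.parse_args(argv)
    machine = args.machine or environ.get("E3SMU_MACHINE")
    if not machine:
        # On a compute node with a development env, nothing identifies
        # the machine, so ask the user to say which one it is.
        raise ValueError("machine is unknown; pass --machine explicitly")
    return machine
```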

@chengzhuzhang

I was thinking that, for the machines we support, we might be able to find some pattern in the compute node names and filter on it in mache. But if there is no pattern, I guess asking users to provide a machine name is not too bad.


xylar commented Sep 21, 2023

@chengzhuzhang, on Chrysalis, it might work since the compute nodes are named something like chr-0472. I can add that.

But on Anvil, they are named something like b677, which seems way too generic to use. But maybe folks feel it's fine, given our limited number of machines, to assume that b followed by 3 digits is Anvil?

Compy is the same story: n0432. Again, is it safe to assume that anything like n followed by 4 digits is a Compy compute node? I have been hesitant.

On Perlmutter, even the login nodes are too generic so we have to fall back to an environment variable, and that probably works on compute nodes, though I can't say I've checked.

Frontier was similar with the added problem that there was no definitive environment variable to use so I had to fall back on an environment variable related to modules. Again, that likely works on compute nodes.

For other supported machines like Chicoma, Andes and ACME1, I don't know what the deal is since I don't use them that much.
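
To make the trade-off concrete, the name patterns mentioned in this comment could be written as regular expressions along these lines. This is only a sketch: the patterns are inferred from the three example names above (chr-0472, b677, n0432), not taken from mache, and `guess_machine` is a hypothetical helper.

```python
import re

# Hypothetical patterns built only from the example names in this thread.
COMPUTE_NODE_PATTERNS = {
    "chrysalis": re.compile(r"^chr-\d{4}$"),
    "anvil": re.compile(r"^b\d{3}$"),   # very generic, as noted above
    "compy": re.compile(r"^n\d{4}$"),   # also very generic
}


def guess_machine(hostname):
    """Return a machine name if exactly one pattern matches, else None."""
    short = hostname.split(".")[0]  # drop any domain suffix
    matches = [
        name
        for name, pattern in COMPUTE_NODE_PATTERNS.items()
        if pattern.match(short)
    ]
    return matches[0] if len(matches) == 1 else None
```

The short patterns for Anvil and Compy would also match similarly named nodes on unrelated clusters, which is the reason for the hesitation expressed above.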


xylar commented Sep 21, 2023

@chengzhuzhang can you open this issue on https://github.com/E3SM-Project/mache/issues? This is not the right place to discuss this further.

@forsyth2 forsyth2 added the semver: small improvement Small improvement (will increment patch version) label Nov 14, 2023