Gaea F2 to F5 file system migration #2101
Comments
@jkbk2004 FYI |
The current gaea-c5 modulefile for UFSWM does not work. Will we be skipping gaea until this is fixed or is there a backup somewhere? |
I'm not aware of an alternate location for spack-stack, @jkbk2004 would you happen to know? If not then yes we may have to skip Gaea until spack-stack is set up again on F5. I'll add a note to clarify that as well in the description. |
@FernandoAndrade-NOAA @BrianCurtis-NOAA @jkbk2004 - |
@jkbk2004 With this change, are we at a point to stop using a MACHINE_ID of gaea-c5 and go back to just gaea ? |
@FernandoAndrade-NOAA @jkbk2004 @BrianCurtis-NOAA: thanks to @RatkoVasic-NOAA, all dependencies for spack-stack have been rebuilt on f5. new stack installations are underway. we will let you know as soon as they are ready for testing. |
spack-stack installations are done on Gaea-C5 - F5 file system. New directory on F5 is: Path for loading modules (stack-intel):
Additional installations: |
The new spack-stack described above works for the UFS-weather-model build (verified 1/25). It should be mentioned that this change from F2 to F5 with no overlap was very disruptive. |
has anyone else had time to test #2115 w/ rocoto or ecflow? seems like the
running the same job in serial to see if rocoto is causing the issue. fyi @jkbk2004 @zach1221 |
@ulmononian running some tests now with ecflow/rocoto. |
As of this moment, the filesystems are no longer cross-mounted. C4 can see /lustre/f2 but not /gpfs/f5. C5 can see /gpfs/f5 but not /lustre/f2. |
GFDL users who need to copy data from F2 to F5 must submit transfer jobs to
special transfer nodes that have both filesystems mounted.
If not using gcp, users must submit a batch job to the C5/F5 DTNs. The
C5/F5 DTNs are the partitions ldtn_c5 and rdtn_c5.
…On Wed, Jan 31, 2024 at 7:58 PM Samuel Trahan (NOAA contractor) < ***@***.***> wrote:
As of this moment, the filesystems are no longer cross-mounted.
C4 can see /lustre/f2 but not /gpfs/f5
C5 can see /gpfs/f5 but not /lustre/f2
--
George W Vandenberghe
*Lynker Technologies at * NOAA/NWS/NCEP/EMC
5830 University Research Ct., Rm. 2141
College Park, MD 20740
***@***.***
301-683-3769(work) 3017751547(cell)
|
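For anyone who has not written a DTN transfer job before, here is a rough sketch of what the batch-job route described above could look like. It only generates the job script text so it can be reviewed first. The job name, time limit, and example paths are my assumptions, not site-confirmed values; only the partition names ldtn_c5 and rdtn_c5 come from this thread. Account and cluster options are left out because they are site-specific.

```shell
#!/bin/sh
# Print a Slurm batch script that copies SRC to DST on a C5/F5 data transfer node.
# SRC and DST are placeholders supplied by the caller; the partition defaults
# to ldtn_c5 (rdtn_c5 is the other partition named in this thread).
make_transfer_job() {
    src=$1
    dst=$2
    partition=${3:-ldtn_c5}
    cat <<EOF
#!/bin/sh
#SBATCH --job-name=f2_to_f5_copy
#SBATCH --partition=${partition}
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
# Site-specific options (account, cluster) are intentionally omitted.
set -eu
mkdir -p "${dst}"
cp -a "${src}/." "${dst}/"
EOF
}

# Example (hypothetical paths): write the script to a file, inspect it, then submit.
# make_transfer_job /lustre/f2/path/to/data /gpfs/f5/path/to/data > copy_job.sh
# sbatch copy_job.sh
```

Generating the script rather than submitting directly makes it easy to check the source and destination before anything is copied.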
The ufs-weather-model fails to build unless it has /lustre/f2 on Gaea C5. |
@SamuelTrahanNOAA Can you take a look at #2115 ? I was able to run full RT tests with rocoto on c5. @ulmononian @RatkoVasic-NOAA @zach1221 @FernandoAndrade-NOAA we need to confirm about ecflow error. |
@SamuelTrahanNOAA we are trying to migrate input files and baseline to /gpfs/f5/epic/world-shared/lustre/epic/UFS-WM_RT/NEMSfv3gfs |
Excellent. I look forward to the result. |
@jkbk2004 - /gpfs/f5/epic/world-shared/UFS-WM_RT/NEMSfv3gfs (without /lustre/epic part ). The "lustre" crept in when the UFS_SRW_data was moved from F2 to F5, and the whole directory structure was replicated. It was later corrected, and the UFS_SRW_data now resides under just /gpfs/f5/epic/world-shared/UFS_SRW_data . |
Yes. The dependencies were on spack-stack, which was installed on
/lustre/f2. Once /lustre/f2 was unmounted, spack-stack broke.
A replacement spack-stack is available with
module use /ncrc/proj/epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
module load stack-intel/2023.1.0
module load stack-cray-mpich/8.1.25
module load stack-python/3.10.8
The UFS build will have to be updated to put this in
./ufs_model.fd/modulefiles
…On Thu, Feb 1, 2024 at 8:53 AM Samuel Trahan (NOAA contractor) < ***@***.***> wrote:
The ufs-weather-model fails to build unless it has /lustre/f2 on GAEA C5.
|
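Since the stack location has already moved once, a small guard before the module commands above may save some confusing build failures. This is only a sketch: `stack_dir_usable` is a hypothetical helper name, and the `module` calls are left as comments because they only make sense on a system with the module environment initialized.

```shell
#!/bin/sh
# Return 0 if DIR exists and the current user can list and enter it.
stack_dir_usable() {
    dir=$1
    [ -d "$dir" ] && [ -r "$dir" ] && [ -x "$dir" ]
}

STACK_CORE=/ncrc/proj/epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core

if stack_dir_usable "$STACK_CORE"; then
    echo "spack-stack modulefiles found: $STACK_CORE"
    # In a login session or job script one would continue with:
    #   module use "$STACK_CORE"
    #   module load stack-intel/2023.1.0
    #   module load stack-cray-mpich/8.1.25
    #   module load stack-python/3.10.8
else
    echo "spack-stack modulefiles not accessible: $STACK_CORE" >&2
fi
```

Failing early with a clear message is easier to diagnose than a later CMake or linker error.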
@GeorgeVandenberghe-NOAA I was able to use the libraries in /ncrc/proj/epic/spack-stack/ until about 9 am this morning. Now my programs cannot find the libraries and when I ls /ncrc/proj/epic/, I get permission denied. |
The directory is inaccessible from the "epic" level:
ls -ld /ncrc/proj/epic
drwxrwx--- 6 root epic 4096 Jan 19 13:58 /ncrc/proj/epic |
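To make the mode string above easier to read: the last three characters describe access for users outside the owning group, and a directory needs the `x` bit there for those users to enter it. A minimal sketch that checks this from an `ls -ld` style mode string follows; the function names are hypothetical, and it deliberately ignores ACLs and group membership.

```shell
#!/bin/sh
# Return 0 if the "other" class has search (x) permission in an ls-style
# mode string such as "drwxrwx---"; return 1 otherwise.
others_can_traverse() {
    case "$1" in
        ?????????[xt]*) return 0 ;;   # 10th character is x, or t (sticky bit plus x)
        *)              return 1 ;;
    esac
}

# Report whether users outside the owning group can enter a directory.
report_dir_access() {
    path=$1
    mode=$2
    if others_can_traverse "$mode"; then
        echo "$path: open to users outside the group"
    else
        echo "$path: closed to users outside the group"
    fi
}

report_dir_access /ncrc/proj/epic 'drwxrwx---'
```

With the listing shown above this prints `/ncrc/proj/epic: closed to users outside the group`, so anyone who is not in the epic group cannot reach the spack-stack tree beneath it.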
I think we should use /gpfs/f5/epic/world-shared. @RatkoVasic-NOAA @ulmononian Can we follow up on this permission issue? project specific location is not good. |
For what it's worth, a couple of us at GFDL have been having trouble this morning as well. Some of our directories within /gpfs/f5/gfdl*/scratch seemed to have reset permissions to be unreadable by others, sometime between last night and this morning |
I sent email to helpdesk. Let's see their response. |
Any word on when this might be addressed? The UFS weather model (and probably a lot of other stuff) is not buildable on Gaea C5 without spack-stack. |
Spack stack remains unavailable on gaeaC5 24 hours after this problem was first reported. It appears to be due to a permission change in /ncrc/proj/epic administratively done by gaea admins, not by epic. We need this to build NCEP and RDHPCS applications. |
@GeorgeVandenberghe-NOAA we are in communication with gfdl to address this issue. we will update here as soon as we hear back. correct me if i'm wrong, but you were able to build and test these stacks last week or so (i.e. your comment here #2101 (comment))? i am just trying to discern if the permissions were recently changed. |
My jobs failed around 9am yesterday when the UFS executable could not find the dynamically linked libraries it needed. |
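A quick way to confirm this kind of failure is to look for unresolved entries in `ldd` output for the executable. The sketch below filters that output; `missing_libs` is a hypothetical helper, and the library and executable names in the example are placeholders rather than the actual UFS dependencies.

```shell
#!/bin/sh
# Read `ldd` output on stdin and print the names of unresolved libraries.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example with captured output. In practice: ldd ./ufs_model | missing_libs
# (the executable name is a placeholder).
printf '%s\n' \
    'libnetcdf.so.19 => not found' \
    'libc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)' |
    missing_libs
```

An empty result means every shared library was resolved; any name printed points to a library directory that has gone missing or become unreadable.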
Yes. The permissions were changed. They were usable prior to the morning
of 2/6/2024.
…On Wed, Feb 7, 2024 at 3:46 PM Cameron Book ***@***.***> wrote:
@GeorgeVandenberghe-NOAA we are in communication with gfdl to address this
issue. we will update here as soon as we hear back.
correct me if i'm wrong, but you were able to build and test these stacks
last week or so? i am just trying to discern if the permissions were
recently changed.
|
I know that others here at GFDL have been having directory permission changes with other projects, and our helpdesk was saying it was an unexpected issue on ORNL's side. You might be getting similar communication |
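Just to inform anyone who is not in the loop: on 2/6/24, GFDL admins changed the permissions on the /ncrc/proj/epic directory WITHOUT prior notice. A ticket has been opened with the GFDL helpdesk; for reference and inquiries, the ticket number is [GFDL#5023877]. |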
So both the destruction of spack-stack with the removal of /lustre/f2 and
the blockage of the emergency replacement on /ncrc/proj/epic
were done by GFDL administrative action. The new spack-stack remains
unavailable due to a permissions change. No word on when or if
this will be fixed, or whether spack-stack will have to be rebuilt in another
location to be determined. Meanwhile NCEP and RDHPCS applications remain
unbuildable and unusable until further notice.
…On Wed, Feb 7, 2024 at 5:38 PM RatkoVasic-NOAA ***@***.***> wrote:
Just to inform anyone who is not in the loop. On 2/6/24 GFDL admins made
change of permissions for the /ncrc/proj/epic directory WITHOUT prior
notice.
Ticket is opened with GFDL helpdesk. For your reference and inquires with
GFDL helpdesk ticket number is [GFDL#5023877].
|
This needs to be elevated. Has anyone told Brian Gross, or Frank Idiviglio about this? |
I have built an alternate set of alligator clips that allows ufs-weather-model and all of global workflow (except GDASApp) to build on Gaea C5. It is not good practice to rely on alligator clips for any length of time, and we need to get spack-stack available again.
export CMAKE_PREFIX_PATH=/autofs/ncrc-svm1_proj/ncep/gwv/simstack/netcdf140.492.460.mapl235.fms2301.crtm240.z
And for later UFS I have these two additional versions and
All of these libraries are static and do not require module loads or path settings at runtime. |
No. Do you want to inform them?
This issue has crippled NCEP and RDHPCS until further notice. A
permissions change on /ncrc/proj/epic would fix it in the short term.
28 hours out, this remains undone.
…On Wed, Feb 7, 2024 at 5:58 PM Phil Pegion ***@***.***> wrote:
This needs to be elevated. Has anyone told Brian Gross, or Frank Idiviglio
about this?
|
They reported for high level attention, I think. @RatkoVasic-NOAA @ulmononian can you confirm? |
yes -- this has been escalated on the epic side to higher level management. |
I also informed Vijay and Brian
On Wed, Feb 7, 2024 at 6:53 PM Cameron Book ***@***.***>
wrote:
yes -- this has been escalated on the epic side to higher level management.
|
I have been told this is an ORNL issue and GFDL (which does not have gaea root) is also waiting for this mistake to be corrected. |
The permissions issue has been corrected and spack-stack should be usable again on Gaea C5. |
i received an email this morning that permissions have been restored. @GeorgeVandenberghe-NOAA |
…lock_atmos_copy routines in fv3atm #2124 (#2115) - Gaea C5 modulefile & DISKNM update: closes issues Gaea F2 to F5 file system migration #2101 - Bring in the global-workflow detect_machine.sh to keep consistent between projects. (Closes Bring in detect_machine.sh from global workflow for consistency across the community. #2096 ) - Fix out of bound errors in block_atmos_copy routines in fv3atm
#2115 has been merged in. Thanks to all for your coordination! |
Description
Gaea's F2 file system will be decommissioned soon, with the decommission of C4 compute scheduled for Feb 2nd, read-only mount on Feb 5th, and final shutdown on March 8th. The new F5 file system is now in production and available on C5 nodes. Necessary files should be transferred to avoid losing data currently stored on F2.
Gaea C5's modulefile is also configured to utilize spack-stack 1.5.1 on F2. Spack-stack 1.5.1 will need to be set up again on the F5 file system.
Solution
Transfer the necessary data and files from the F2 file system to F5.
Update configuration of gaeac5 modulefile to F5 locations along with spack-stack.
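As a starting point for the modulefile update, it may help to list every file that still points at the retired mount. The sketch below is only an illustration: `find_f2_refs` is a hypothetical helper and the `./modulefiles` directory in the example is an assumption about the checkout layout.

```shell
#!/bin/sh
# List files under DIR (default: current directory) that still mention
# the retired /lustre/f2 mount point. Prints nothing if there are none.
find_f2_refs() {
    dir=${1:-.}
    grep -rl '/lustre/f2' "$dir" 2>/dev/null || true
}

# Example (hypothetical directory):
# find_f2_refs ./modulefiles
```

Each file reported would then need its F2 path replaced with the corresponding F5 location once the new spack-stack install is in place.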
Alternatives
Related to