Gaea F2 to F5 file system migration #2101

Closed
FernandoAndrade-NOAA opened this issue Jan 18, 2024 · 42 comments
Assignees
Labels
enhancement New feature or request

Comments

@FernandoAndrade-NOAA
Collaborator

FernandoAndrade-NOAA commented Jan 18, 2024

Description

Gaea's F2 file system will be decommissioned soon: C4 compute is scheduled to be decommissioned on Feb 2nd, followed by a read-only mount on Feb 5th and final shutdown on March 8th. The new F5 file system is now in production and available on C5 nodes. Necessary files should be transferred to avoid losing data currently stored on F2.

Gaea C5's modulefile is also configured to utilize spack-stack 1.5.1 on F2. Spack-stack 1.5.1 will need to be set up again on the F5 file system.

Solution

Transfer the necessary data and files from the F2 file system to F5.
Update the gaeac5 modulefile configuration to point to F5 locations, including the spack-stack installation.

Alternatives

Related to

@FernandoAndrade-NOAA
Collaborator Author

@jkbk2004 FYI

@BrianCurtis-NOAA
Collaborator

The current gaea-c5 modulefile for UFSWM does not work. Will we be skipping gaea until this is fixed or is there a backup somewhere?

@FernandoAndrade-NOAA
Collaborator Author

I'm not aware of an alternate location for spack-stack. @jkbk2004, would you happen to know? If not, then yes, we may have to skip Gaea until spack-stack is set up again on F5. I'll add a note to the description to clarify that as well.

@natalie-perlin
Collaborator

natalie-perlin commented Jan 22, 2024

@FernandoAndrade-NOAA @BrianCurtis-NOAA @jkbk2004 -
Building new spack-stacks for Gaea C5 is currently being taken care of. Please note that NOAA/EPIC was not allowed to start building and testing software in a new location on the new F5 filesystem, which went into production at the end of last week. At that same time, the F2 filesystem, previously used for all software and data locations, was disconnected from Gaea C5. There was no overlapping grace period between disconnecting F2 and connecting F5, which caused a disruption. EPIC was aware of this issue, yet the GFDL team offered no other options for a smooth transition.

@BrianCurtis-NOAA
Collaborator

@jkbk2004 With this change, are we at a point to stop using a MACHINE_ID of gaea-c5 and go back to just gaea ?

@ulmononian
Collaborator

@FernandoAndrade-NOAA @jkbk2004 @BrianCurtis-NOAA: thanks to @RatkoVasic-NOAA, all dependencies for spack-stack have been rebuilt on f5. new stack installations are underway. we will let you know as soon as they are ready for testing.

@RatkoVasic-NOAA
Collaborator

spack-stack installations are done on the F5 file system for Gaea C5. The new directory on F5 is:
/ncrc/proj/epic/spack-stack/

Path for loading modules (stack-intel):

/ncrc/proj/epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/modulefiles/Core
/ncrc/proj/epic/spack-stack/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core
/ncrc/proj/epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
/ncrc/proj/epic/spack-stack/spack-stack-1.6.0/envs/unified-env/install/modulefiles/Core

Additional installations:
Rocoto: /ncrc/proj/epic/rocoto/modulefiles/
Miniconda: /ncrc/proj/epic/miniconda3/modulefiles/
ecFlow, mysql, qt: /ncrc/proj/epic/spack-stack/modulefiles/
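
For anyone updating scripts by hand, the modulefile paths listed above all follow one pattern, so the path for a given version can be built rather than hard-coded. This is only a sketch: the function name `spack_stack_core_path` is hypothetical, and the `module use` / `module load stack-intel` lines in the comments assume the usual Lmod-style setup on the C5 login nodes.

```shell
# Sketch: build the modulefiles/Core path for a spack-stack version on F5.
# The base directory and layout are copied from the listing above;
# the function name is made up for illustration.
spack_stack_core_path() {
    version="$1"
    base="${2:-/ncrc/proj/epic/spack-stack}"
    printf '%s/spack-stack-%s/envs/unified-env/install/modulefiles/Core\n' "$base" "$version"
}

# Possible usage on a C5 login node (assumed, not verified here):
#   module use "$(spack_stack_core_path 1.5.1)"
#   module load stack-intel
```

Passing a second argument overrides the base directory, which may help if the installation is later moved to another location.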

@DeniseWorthen
Collaborator

@GeorgeVandenberghe-NOAA

@GeorgeVandenberghe-NOAA
Collaborator

The new spack-stack, as described above, works for the UFS-weather-model build (verified 1/25). It should be mentioned that this change from F2 to F5 with no overlap was very disruptive.

@ulmononian
Collaborator

ulmononian commented Jan 26, 2024

has anyone else had time to test #2115 w/ rocoto or ecflow?

seems like the --partition is being overwritten when using rocoto to run a few RTs. it's set as --partition=batch in https://github.com/ulmononian/ufs-weather-model/blob/feature/update_gaea/tests/fv3_conf/fv3_slurm.IN_gaea, but when the job is submitted, it gets overwritten to --partition=c5. snippet of the output is

#SBATCH --account=epic
#SBATCH --qos=normal
#SBATCH --partition=c5
#SBATCH --ntasks=8
#SBATCH -t 00:30:00
#SBATCH -o /gpfs/f5/epic/scratch/Cameron.Book/FV3_RT/rt_250240/compile_s2swa_intel.log
#SBATCH --comment=e8e4567a647c6b979dc8e98f9d96eb3f
/gpfs/f5/epic/scratch/Cameron.Book/ufs-weather-model/tests/run_compile.sh /gpfs/f5/epic/scratch/Cameron.Book/ufs-weather-model/tests /gpfs/f5/epic/scratch/Cameron.Book/FV3_RT/rt_250240 "-DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8" s2swa_intel
}}
01/26/24 12:09:18 EST :: rocoto_workflow.xml :: WARNING: job submission failed: sbatch: error: invalid partition specified: c5
sbatch: error: Batch job submission failed: Invalid partition name specified

running the same job in serial to see if rocoto is causing the issue. fyi @jkbk2004 @zach1221
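
To narrow down where the partition gets rewritten, it may help to pull the `--partition` value out of both the template and the generated job card and compare them. A minimal sketch, assuming the directive is written in the `#SBATCH --partition=NAME` form shown in the snippet above; the function name `sbatch_partitions` is hypothetical.

```shell
# Print the value of every "#SBATCH --partition=..." directive in a file.
# Only the "--partition=NAME" spelling is handled; "-p NAME" is not.
sbatch_partitions() {
    sed -n 's/^#SBATCH[[:space:]]\{1,\}--partition=\([^[:space:]]*\).*$/\1/p' "$1"
}

# Possible usage: compare the template against a generated job card.
#   sbatch_partitions tests/fv3_conf/fv3_slurm.IN_gaea
#   sbatch_partitions /path/to/generated/job_card
```

If the template prints `batch` and the generated card prints `c5`, the substitution is happening somewhere between the template and submission rather than in the template itself.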

@zach1221
Collaborator

@ulmononian running some tests now with ecflow/rocoto.

@SamuelTrahanNOAA
Collaborator

As of this moment, the filesystems are no longer cross-mounted.

C4 can see /lustre/f2 but not /gpfs/f5
C5 can see /gpfs/f5 but not /lustre/f2
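
A quick way to confirm this from whichever node you are logged in to is to test for the two mount points directly. This is just a sketch; the function name `visible_filesystems` is hypothetical, and it only checks that the directory exists, not that it is readable.

```shell
# Report which of the given mount points exist on the current node.
visible_filesystems() {
    for fs in "$@"; do
        if [ -d "$fs" ]; then
            echo "$fs: visible"
        else
            echo "$fs: not visible"
        fi
    done
}

# Possible usage:
#   visible_filesystems /lustre/f2 /gpfs/f5
```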

@GeorgeVandenberghe-NOAA
Collaborator

GeorgeVandenberghe-NOAA commented Feb 1, 2024 via email

@SamuelTrahanNOAA
Collaborator

The ufs-weather-model fails to build unless it has /lustre/f2 on GAEA C5.

@jkbk2004
Collaborator

jkbk2004 commented Feb 1, 2024

The ufs-weather-model fails to build unless it has /lustre/f2 on GAEA C5.

@SamuelTrahanNOAA Can you take a look at #2115? I was able to run the full RT tests with rocoto on C5. @ulmononian @RatkoVasic-NOAA @zach1221 @FernandoAndrade-NOAA we still need to confirm the ecflow error.

@jkbk2004
Collaborator

jkbk2004 commented Feb 1, 2024

@SamuelTrahanNOAA we are trying to migrate input files and baseline to /gpfs/f5/epic/world-shared/lustre/epic/UFS-WM_RT/NEMSfv3gfs

@SamuelTrahanNOAA
Collaborator

we are trying to migrate input files and baseline to /gpfs/f5/epic/world-shared/lustre/epic/UFS-WM_RT/NEMSfv3gfs

Excellent. I look forward to the result.

@natalie-perlin
Collaborator

@jkbk2004 -
As a suggestion:
maybe the new path could be more like

/gpfs/f5/epic/world-shared/UFS-WM_RT/NEMSfv3gfs

(without the /lustre/epic part). The "lustre" crept in when UFS_SRW_data was moved from F2 to F5 and the whole directory structure was replicated. It was later corrected, and UFS_SRW_data now resides at /gpfs/f5/epic/world-shared/UFS_SRW_data.

@GeorgeVandenberghe-NOAA
Collaborator

GeorgeVandenberghe-NOAA commented Feb 1, 2024 via email

@pjpegion
Collaborator

pjpegion commented Feb 6, 2024

@GeorgeVandenberghe-NOAA I was able to use the libraries in /ncrc/proj/epic/spack-stack/ until about 9 am this morning. Now my programs cannot find the libraries and when I ls /ncrc/proj/epic/, I get permission denied.

@SamuelTrahanNOAA
Collaborator

cannot find the libraries when..

The directory is inaccessible from the "epic" level:

ls -ld /ncrc/proj/epic
drwxrwx--- 6 root epic 4096 Jan 19 13:58 /ncrc/proj/epic
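
In that listing the last permission triplet is `---`, so anyone who is neither root nor in the epic group cannot list or traverse the directory, which is why everything below it disappears. A small sketch for checking an `ls -ld` style mode string; the function name `others_can_read_and_traverse` is hypothetical.

```shell
# Succeed if the "other" class has read and execute (traverse) permission
# in an ls-style mode string such as "drwxr-xr-x".
others_can_read_and_traverse() {
    mode="$1"
    other_bits="${mode#???????}"   # drop the type character plus the user and group triplets
    case "$other_bits" in
        r?[xt]*) return 0 ;;       # "t" is the sticky bit shown together with execute
        *)       return 1 ;;
    esac
}

# Possible usage:
#   mode=$(ls -ld /ncrc/proj/epic | cut -d' ' -f1)
#   others_can_read_and_traverse "$mode" || echo "not world-readable"
```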

@jkbk2004
Collaborator

jkbk2004 commented Feb 6, 2024

I think we should use /gpfs/f5/epic/world-shared. @RatkoVasic-NOAA @ulmononian Can we follow up on this permission issue? A project-specific location is not good.

@JustinPerket
Contributor

For what it's worth, a couple of us at GFDL have been having trouble this morning as well. Some of our directories within /gpfs/f5/gfdl*/scratch seem to have had their permissions reset to be unreadable by others sometime between last night and this morning.

@RatkoVasic-NOAA
Collaborator

I sent an email to the helpdesk. Let's see their response.

@GeorgeVandenberghe-NOAA
Collaborator

Any word on when this might be addressed? The UFS weather model (and probably a lot of other stuff) is not buildable on Gaea C5 without spack-stack.

@GeorgeVandenberghe-NOAA
Collaborator

Spack-stack remains unavailable on Gaea C5 24 hours after this problem was first reported. It appears to be due to a permission change on /ncrc/proj/epic made administratively by Gaea admins, not by EPIC.

We need this to build NCEP and RDHPCS applications.

@ulmononian
Collaborator

ulmononian commented Feb 7, 2024

@GeorgeVandenberghe-NOAA we are in communication with gfdl to address this issue. we will update here as soon as we hear back.

correct me if i'm wrong, but you were able to build and test these stacks last week or so (i.e. your comment here #2101 (comment)) i am just trying to discern if the permissions were recently changed.

@pjpegion
Collaborator

pjpegion commented Feb 7, 2024

My jobs failed around 9am yesterday when the UFS executable could not find the dynamically linked libraries it needed.

@GeorgeVandenberghe-NOAA
Collaborator

GeorgeVandenberghe-NOAA commented Feb 7, 2024 via email

@JustinPerket
Contributor

I know that others here at GFDL have been having directory permission changes with other projects, and our helpdesk was saying it was an unexpected issue on ORNL's side. You might be getting similar communication

@RatkoVasic-NOAA
Collaborator

Just to inform anyone who is not in the loop: on 2/6/24, GFDL admins changed the permissions on the /ncrc/proj/epic directory WITHOUT prior notice.
A ticket has been opened with the GFDL helpdesk. For your reference and for inquiries with the GFDL helpdesk, the ticket number is [GFDL#5023877].

@GeorgeVandenberghe-NOAA
Collaborator

GeorgeVandenberghe-NOAA commented Feb 7, 2024 via email

@pjpegion
Collaborator

pjpegion commented Feb 7, 2024

This needs to be elevated. Has anyone told Brian Gross, or Frank Idiviglio about this?

@GeorgeVandenberghe-NOAA
Collaborator

I have built an alternate set of alligator clips that allows for ufs-weather-model and all of global workflow (except GDASApp) to build on gaeaC5. It is not good practice to rely on alligator clips for any length of time and we need to get spack-stack available again.

export CMAKE_PREFIX_PATH=/autofs/ncrc-svm1_proj/ncep/gwv/simstack/netcdf140.492.460.mapl235.fms2301.crtm240.z
export ESMFMKFILE=/autofs/ncrc-svm1_proj/ncep/gwv/simstack/netcdf140.492.460.mapl235.fms2301.crtm240.z/ESMF_8_4_1/lib/esmf.mk
export FMS_ROOT=/autofs/ncrc-svm1_proj/ncep/gwv/simstack/netcdf140.492.460.mapl235.fms2301.crtm240.z/fms.2023.02.01
export SCOTCH_ROOT=/autofs/ncrc-svm1_proj/ncep/gwv/simple/simple.0129.2024/libs/ufslibs/install/scotch

And for later UFS I have these two additional versions
export ESMFMKFILE=/autofs/ncrc-svm1_proj/ncep/gwv/simstack/netcdf140.492.460.mapl235.fms2301.crtm240.z/ESMF_8_5_0/lib/esmf.mk

and
export MAPL_ROOT=/autofs/ncrc-svm1_proj/ncep/gwv/simstack/netcdf140.492.460.mapl235.fms2301.crtm240.z/MAPL-2.40.3

All of these libraries are static and do not require module loads or path settings at runtime
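
Since these overrides point at hand-built locations, it may be worth checking that each exported variable still resolves to something on disk before starting a long build. A rough sketch; the function name `missing_paths` is hypothetical, and it assumes each variable holds a single path rather than a list.

```shell
# Print the name of each given variable that is unset, empty,
# or points at a path that does not exist.
missing_paths() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ] || [ ! -e "$value" ]; then
            echo "$name"
        fi
    done
}

# Possible usage after the exports above:
#   missing_paths CMAKE_PREFIX_PATH ESMFMKFILE FMS_ROOT SCOTCH_ROOT
```

No output means every listed variable points at an existing file or directory.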

@GeorgeVandenberghe-NOAA
Collaborator

GeorgeVandenberghe-NOAA commented Feb 7, 2024 via email

@jkbk2004
Collaborator

jkbk2004 commented Feb 7, 2024

No. Do you want to inform them? This issue has crippled NCEP and RDHPCS until further notice. A permissions change on /ncrc/proj/epic would fix it short term. 28 hours out, this remains undone.

They reported for high level attention, I think. @RatkoVasic-NOAA @ulmononian can you confirm?

@ulmononian
Collaborator

No. Do you want to inform them? This issue has crippled NCEP and RDHPCS until further notice. A permissions change on /ncrc/proj/epic would fix it short term. 28 hours out, this remains undone.

They reported for high level attention, I think. @RatkoVasic-NOAA @ulmononian can you confirm?

yes -- this has been escalated on the epic side to higher level management.

@GeorgeVandenberghe-NOAA
Collaborator

GeorgeVandenberghe-NOAA commented Feb 7, 2024 via email

@GeorgeVandenberghe-NOAA
Collaborator

I have been told this is an ORNL issue and GFDL (which does not have Gaea root) is also waiting for this mistake to be corrected.
The mistake was an account management script that removes world read and execute permissions from project root directories, e.g. epic. EPIC does not own the directory and cannot fix it. GFDL does not have root and can't fix it. It's an ORNL problem now.

@GeorgeVandenberghe-NOAA
Collaborator

The permissions issue has been corrected and spack-stack should be usable again on Gaea C5.

@ulmononian
Collaborator

i received an email this morning that permissions have been restored. @GeorgeVandenberghe-NOAA
thanks for confirming.

zach1221 pushed a commit that referenced this issue Feb 10, 2024
…lock_atmos_copy routines in fv3atm #2124 (#2115)

- Gaea C5 modulefile & DISKNM update: closes issue "Gaea F2 to F5 file system migration" #2101
- Bring in the global-workflow detect_machine.sh to keep consistent between projects. (Closes "Bring in detect_machine.sh from global workflow for consistency across the community." #2096)
- Fix out of bound errors in block_atmos_copy routines in fv3atm

@FernandoAndrade-NOAA
Collaborator Author

#2115 has been merged in. Thanks to all for your coordination!
