The enkfgdaseobs job can fail to collect all necessary data #2092

DavidHuber-NOAA · 2023-11-28T18:00:57Z

What is wrong?

If the enkfgdaseobs job is run with more processors than (MPI tasks) x (threads), some data will not be collected, resulting in an incomplete analysis. Kludges have been placed for S4 and Jet, but new systems with different core/node counts will need similar kludges.
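To make the failing condition concrete, here is a minimal sketch of a pre-flight check that flags a mismatch between allocated cores and (MPI tasks) x (threads). The function name `check_core_layout` and its three arguments are hypothetical and do not exist in global-workflow; this only illustrates the comparison described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of global-workflow): compare the number of
# allocated cores against MPI tasks x threads and report any surplus.
check_core_layout() {
  local ncpus=$1      # assumed: total cores allocated to the job
  local ntasks=$2     # assumed: number of MPI tasks
  local nthreads=$3   # assumed: threads per MPI task
  local used=$(( ntasks * nthreads ))

  if (( ncpus > used )); then
    echo "WARNING: ${ncpus} cores allocated but only ${used} used (${ntasks} tasks x ${nthreads} threads)"
    return 1
  fi
  echo "OK: ${ncpus} cores allocated, ${used} used"
  return 0
}

# Illustrative values only.
check_core_layout 40 40 1 || true
check_core_layout 48 20 2 || true
```

A check like this would only detect the layout that triggers the problem; it would not collect the missing data.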

What should have happened?

The enkfgdaseobs job should be able to collect all necessary data regardless of how many cores are used.

What machines are impacted?

All or N/A

Steps to reproduce

Set up a cycled experiment and modify config.resources to use a different number of PEs for the eobs job
Run the enkfgdaseobs job and plot the resulting ingested data points

An example pair of plots from @CoryMartin-NOAA is below:

Additional information

This was first captured in #154.

Do you have a proposed solution?

I'm not sure if this is a scripting change in the global-workflow or a code change in the GSI. But once it is fixed, the config.resources file should be simplified to use the same number of processes across all systems.

DavidHuber-NOAA · 2024-07-02T13:18:09Z

I believe that the problematic code is located here:

global-workflow/scripts/exglobal_atmos_analysis.sh

Lines 576 to 593 in de87067

    
    ##############################################################
    # Diagnostic files
    # if requested, link GSI diagnostic file directories for use later
    if [ ${GENDIAG} = "YES" ] ; then
       if [ ${lrun_subdirs} = ".true." ] ; then
          if [ -d ${DIAG_DIR} ]; then
             rm -rf ${DIAG_DIR}
          fi
          npe_m1="$((${npe_gsi}-1))"
          for pe in $(seq 0 ${npe_m1}); do
            pedir="dir."$(printf %04i ${pe})
            mkdir -p ${DIAG_DIR}/${pedir}
            ${NLN} ${DIAG_DIR}/${pedir} ${pedir}
          done
       else
          err_exit "FATAL ERROR: lrun_subdirs must be true. lrun_subdirs=${lrun_subdirs}"
       fi
    fi

Looping from 0 to npe_gsi-1 will not create all of the necessary links if npe_gsi does not equal ncpus = npe_node * nnodes. To fix this, the loop should instead run from 0 to npe_node * nnodes - 1.
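A minimal sketch of that change, run in a scratch directory so it is self-contained. The values of `npe_node` and `nnodes`, the `NLN` definition, and the temporary `DIAG_DIR` are assumptions for illustration; in the real script they would come from the job environment, and the variable names available there may differ.

```shell
#!/usr/bin/env bash
# Sketch only: link one diagnostic subdirectory per allocated core rather
# than one per MPI task. All values below are illustrative assumptions.
set -eu

npe_node=4        # assumed: cores per node
nnodes=3          # assumed: number of nodes in the allocation
NLN="ln -sf"      # assumed: symbolic-link command

# Work in a throwaway directory so the sketch does not touch real data.
workdir=$(mktemp -d)
DIAG_DIR="${workdir}/diag"
cd "${workdir}"

# Loop over every core in the allocation: 0 .. npe_node * nnodes - 1.
ncpus=$(( npe_node * nnodes ))
for pe in $(seq 0 $(( ncpus - 1 ))); do
  pedir="dir.$(printf %04i "${pe}")"
  mkdir -p "${DIAG_DIR}/${pedir}"
  ${NLN} "${DIAG_DIR}/${pedir}" "${pedir}"
done

# Count the links that were created in the working directory.
nlinks=$(find . -maxdepth 1 -type l -name 'dir.*' | wc -l | tr -d ' ')
echo "created ${nlinks} links for ${ncpus} cores"
```

Whether the per-node core count and node count are exposed to the script under these names is something I have not verified, so the loop bound may need to be computed differently in practice.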

DavidHuber-NOAA added the bug and triage labels Nov 28, 2023

WalterKolczynski-NOAA removed the triage label Nov 29, 2023

DavidHuber-NOAA assigned DavidHuber-NOAA and unassigned DavidHuber-NOAA Jan 8, 2024

DavidHuber-NOAA mentioned this issue May 1, 2024

Implement global-workflow on AWS #2549

Closed

DavidHuber-NOAA added a commit to DavidHuber-NOAA/global-workflow that referenced this issue Jun 10, 2024

Force npe_node_eobs to be set properly NOAA-EMC#2092

8376ee6

DavidHuber-NOAA mentioned this issue Jun 10, 2024

Assign machine- and RUN-specific resources #2672

Merged

17 tasks

WalterKolczynski-NOAA added this to the GFS v17 milestone Jan 27, 2025
