The enkfgdaseobs job can fail to collect all necessary data #2092

Open
DavidHuber-NOAA opened this issue Nov 28, 2023 · 1 comment
Labels
bug Something isn't working

DavidHuber-NOAA commented Nov 28, 2023

What is wrong?

If the enkfgdaseobs job is run with more processors than (MPI tasks) x (threads), some data will be left uncollected, resulting in an incomplete analysis. Kludges have been put in place for S4 and Jet, but new systems with different core/node counts will need similar kludges.

What should have happened?

The enkfgdaseobs job should be able to collect all necessary data regardless of how many cores are used.

What machines are impacted?

All or N/A

Steps to reproduce

  1. Set up a cycled experiment and modify config.resources to use a different number of PEs for the eobs job
  2. Run the enkfgdaseobs job and plot the resulting ingested data points

An example pair of plots from @CoryMartin-NOAA is below:

[Image: MissedData]

Additional information

This was first captured in #154.

Do you have a proposed solution?

I'm not sure if this is a scripting change in the global-workflow or a code change in the GSI. But once it is fixed, the config.resources file should be simplified to use the same number of processes across all systems.

DavidHuber-NOAA added the bug and triage labels Nov 28, 2023
WalterKolczynski-NOAA removed the triage label Nov 29, 2023
DavidHuber-NOAA added a commit to DavidHuber-NOAA/global-workflow that referenced this issue Jun 10, 2024
@DavidHuber-NOAA
Contributor Author

I believe that the problematic code is located here:

    ##############################################################
    # Diagnostic files
    # if requested, link GSI diagnostic file directories for use later
    if [ ${GENDIAG} = "YES" ] ; then
       if [ ${lrun_subdirs} = ".true." ] ; then
          if [ -d ${DIAG_DIR} ]; then
             rm -rf ${DIAG_DIR}
          fi
          npe_m1="$((${npe_gsi}-1))"
          for pe in $(seq 0 ${npe_m1}); do
             pedir="dir."$(printf %04i ${pe})
             mkdir -p ${DIAG_DIR}/${pedir}
             ${NLN} ${DIAG_DIR}/${pedir} ${pedir}
          done
       else
          err_exit "FATAL ERROR: lrun_subdirs must be true. lrun_subdirs=${lrun_subdirs}"
       fi
    fi

Looping from 0 to npe_gsi - 1 will not create all of the necessary links if npe_gsi does not equal ncpus = npe_node * nnodes. To fix this, the loop should instead run from 0 to npe_node * nnodes - 1.
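
The proposed change could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual fix: `link_diag_dirs` is a hypothetical helper that does not exist in global-workflow, `npe_node` and `nnodes` are assumed to be available as plain integers, and `ln -sfn` stands in for `${NLN}`.

```bash
#!/usr/bin/env bash
# Sketch only: link_diag_dirs is a hypothetical helper, not part of global-workflow.
set -eu

# Create and link one dir.NNNN per allocated core (npe_node * nnodes)
# instead of one per GSI task (npe_gsi).
link_diag_dirs() {
    local diag_dir=$1 npe_node=$2 nnodes=$3
    local ncpus=$(( npe_node * nnodes ))
    local pe pedir

    # Start from a clean diagnostic directory, as the original snippet does.
    rm -rf "${diag_dir}"

    for (( pe = 0; pe < ncpus; pe++ )); do
        pedir=$(printf 'dir.%04d' "${pe}")
        mkdir -p "${diag_dir}/${pedir}"
        # Assumption: ln -sfn is used here as a stand-in for ${NLN}.
        ln -sfn "${diag_dir}/${pedir}" "${pedir}"
    done
}
```

Calling, for example, `link_diag_dirs "${DIAG_DIR}" "${npe_node}" "${nnodes}"` from the run directory would create links `dir.0000` through `dir.NNNN`, where NNNN is `npe_node * nnodes - 1`. Whether this alone resolves the missing data would still need to be confirmed against the GSI behavior.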

@WalterKolczynski-NOAA WalterKolczynski-NOAA added this to the GFS v17 milestone Jan 27, 2025