implement the ability to automatically calculate PE bounds and tasking for component models on different platforms and resolutions #320

Closed
DeniseWorthen opened this issue Dec 7, 2020 · 7 comments · Fixed by #1200
Labels
enhancement New feature or request

Comments

@DeniseWorthen
Collaborator

Description

Currently, when a new platform is added, additions need to be made to default vars to specify the tasking for all the components on the new platform. For the coupled model, this requires hand-editing the various tasking variables (e.g. PET bounds) for each component and resolution, which is error-prone.

Solution

Implement a system where the required variables are calculated and set automatically. Each machine would have variables that define that platform (e.g. TPN), and each component+resolution would have its required tasking defined, for example MOM6-mx025=120 tasks and CICE6-mx100=12 tasks.

Using this information for all components (ufsATM, CMEPS, MOM6, CICE6, WW3), the required PET bounds would be calculated automatically and used to set the required variables in nems.configure.
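The lookup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the table name `TASKS_BY_CONFIG`, the helper `lookup_tasks`, and the `TPN` value of 40 are hypothetical and not taken from the regression test scripts. The two table entries are the examples quoted above. Associative arrays require bash 4 or newer.

```bash
#!/bin/bash
set -eu

# Hypothetical per-platform setting: tasks per node (placeholder value, not a real machine).
TPN=40

# Hypothetical lookup table keyed by "component-resolution".
# The two entries are the examples quoted in the issue description.
declare -A TASKS_BY_CONFIG=(
  ["MOM6-mx025"]=120
  ["CICE6-mx100"]=12
)

# Print the task count for a component+resolution key; fail loudly if it is unknown.
function lookup_tasks () {
  local key="$1"
  if [[ -z "${TASKS_BY_CONFIG[$key]:-}" ]]; then
    echo "no tasking defined for ${key}" >&2
    return 1
  fi
  echo "${TASKS_BY_CONFIG[$key]}"
}

OCN_tasks=$(lookup_tasks "MOM6-mx025")
ICE_tasks=$(lookup_tasks "CICE6-mx100")

total=$((OCN_tasks + ICE_tasks))
# Ceiling division: nodes needed to hold all tasks at TPN tasks per node.
NODES=$(( (total + TPN - 1) / TPN ))

echo "OCN_tasks=${OCN_tasks} ICE_tasks=${ICE_tasks} total=${total} nodes=${NODES}"
```

Keeping the per-resolution task counts in one table and the per-machine values (such as TPN) in another means adding a platform would only touch the machine values.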

@junwang-noaa
Collaborator

junwang-noaa commented May 3, 2021

@binli2337 How many tasks does each node have on jet? Since the TPN_cpl_thrd in your PR #545 is 18, it seems jet has >36 tasks/node?

@DeniseWorthen
Collaborator Author

@DusanJovic-NOAA I think you started to work on this issue a couple of months ago after a discussion on a morning tag-up. Did you ever get it working? During the P7 update, I had to once again implement the same PE/tasking changes for all the platforms and it reminded me of why I created this issue in the first place.

@DusanJovic-NOAA
Collaborator

I tested this function (offline) to compute PET bounds for each component, but I didn't have time to test it in rt.sh:

#!/bin/bash
set -eu

function compute_petbounds () {

  local n=0

  # ATM
  if [[ $((ATM_compute_tasks + ATM_io_tasks)) -gt 0 ]]; then
     ATM_petlist_bounds="${n} $((n + ATM_compute_tasks + ATM_io_tasks -1))"
     n=$((n + ATM_compute_tasks + ATM_io_tasks))
  fi

  # CHM
  if [[ ${CHM_tasks:-0} -gt 0 ]]; then
     CHM_petlist_bounds="${n} $((n + CHM_tasks - 1))"
     n=$((n + CHM_tasks))
  fi

  # OCN
  if [[ ${OCN_tasks:-0} -gt 0 ]]; then
     OCN_petlist_bounds="${n} $((n + OCN_tasks - 1))"
     n=$((n + OCN_tasks))
  fi

  # ICE
  if [[ ${ICE_tasks:-0} -gt 0 ]]; then
     ICE_petlist_bounds="${n} $((n + ICE_tasks - 1))"
     n=$((n + ICE_tasks))
  fi

  # WAV
  if [[ ${WAV_tasks:-0} -gt 0 ]]; then
     WAV_petlist_bounds="${n} $((n + WAV_tasks - 1))"
     n=$((n + WAV_tasks))
  fi

  # MED
  MED_petlist_bounds="0 $((ATM_compute_tasks - 1))"

  UFS_tasks=${n}
}


# Each test MUST define ${COMPONENT}_tasks for every component it uses, and MUST either
# leave it undefined or set it to 0 for every component it does not use.

# ATM is a special case: it runs on the sum of the compute and I/O tasks, while the
# mediator runs only on the compute tasks.
ATM_compute_tasks=$((3 * 8 * 6))
ATM_io_tasks=$((1 * 6))
#CHM_tasks=0
OCN_tasks=30
ICE_tasks=12
WAV_tasks=208

compute_petbounds

echo "ATM_petlist_bounds: ${ATM_petlist_bounds:-}"
echo "OCN_petlist_bounds: ${OCN_petlist_bounds:-}"
echo "ICE_petlist_bounds: ${ICE_petlist_bounds:-}"
echo "WAV_petlist_bounds: ${WAV_petlist_bounds:-}"
echo "CHM_petlist_bounds: ${CHM_petlist_bounds:-}"
echo "MED_petlist_bounds: ${MED_petlist_bounds:-}"
echo "UFS_tasks         : ${UFS_tasks:-}"
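Since hand-set bounds are described as error-prone, a small consistency check may be useful alongside the function. The sketch below uses a hypothetical helper, `check_contiguous`, which is not part of rt.sh. It verifies that a sequence of "start end" pairs tiles the PET range without gaps or overlaps. The bounds passed in are the ones the script above yields for its example values (ATM 144+6, OCN 30, ICE 12, WAV 208).

```bash
#!/bin/bash
set -eu

# Return 0 and print the total task count if the given "start end" pairs
# cover 0..N-1 in order with no gap and no overlap; return 1 otherwise.
function check_contiguous () {
  local expected=0 bounds start end
  for bounds in "$@"; do
    read -r start end <<< "${bounds}"
    if [[ ${start} -ne ${expected} || ${end} -lt ${start} ]]; then
      echo "gap or overlap at '${bounds}' (expected start ${expected})" >&2
      return 1
    fi
    expected=$((end + 1))
  done
  echo "${expected}"
}

# ATM, OCN, ICE and WAV bounds for the example values in the script above.
total=$(check_contiguous "0 149" "150 179" "180 191" "192 399")
echo "total tasks: ${total}"
```

The mediator bounds are deliberately left out of the check, because the mediator shares PETs with another component rather than occupying its own range.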

@DusanJovic-NOAA
Collaborator

This function assumes that mediator will run on the same tasks as ATM (compute tasks, not i/o), but this is not the case in all tests. For example in https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/tests/hafs_regional_atm_ocn

export atm_petlist_bounds="0000 0299"
export ocn_petlist_bounds="0300 0359"
export med_petlist_bounds="0300 0359"

Here the mediator runs on the same tasks as the ocean, so the above function will not work for that case.
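One way to cover this case would be to make the mediator's placement a parameter. The following is a minimal sketch, not the approach used in rt.sh: the selector `MED_shares` is a hypothetical name, and the task counts (300 and 60) are inferred from the bounds quoted above. It also zero-pads the bounds to four digits to match the format in that test.

```bash
#!/bin/bash
set -eu

# Hypothetical selector: the component whose PETs the mediator shares.
MED_shares="OCN"

ATM_tasks=300   # inferred from the quoted bounds "0000 0299"
OCN_tasks=60    # inferred from the quoted bounds "0300 0359"

n=0

# ATM occupies the first block of PETs.
atm_petlist_bounds=$(printf '%04d %04d' "${n}" "$((n + ATM_tasks - 1))")
n=$((n + ATM_tasks))

# OCN follows immediately after ATM.
ocn_petlist_bounds=$(printf '%04d %04d' "${n}" "$((n + OCN_tasks - 1))")
n=$((n + OCN_tasks))

# The mediator reuses the bounds of the selected component.
case "${MED_shares}" in
  ATM) med_petlist_bounds="${atm_petlist_bounds}" ;;
  OCN) med_petlist_bounds="${ocn_petlist_bounds}" ;;
  *)   echo "unsupported MED_shares value: ${MED_shares}" >&2; exit 1 ;;
esac

echo "atm_petlist_bounds: ${atm_petlist_bounds}"
echo "ocn_petlist_bounds: ${ocn_petlist_bounds}"
echo "med_petlist_bounds: ${med_petlist_bounds}"
```

In the original function the mediator uses only the ATM compute tasks, so an `ATM` branch in a real implementation would need to exclude the I/O tasks. This sketch ignores that distinction.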

@DeniseWorthen
Collaborator Author

You're right. That test doesn't actually use the default_vars to set the petlist. Maybe there would have to be a logical that would control whether we "auto-set" the petlist.

You were also talking about having a test where the atm didn't always get the first PEs, which makes a function more complicated too.

I do see the trade-off between just brute force setting of the bounds like we do now vs doing something by a function.
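Both ideas in this comment, a switch for auto-setting and a component order in which the atm is not first, could be sketched along these lines. The names `AUTO_PETLIST`, `COMPONENT_ORDER` and `compute_petbounds_ordered` are hypothetical. The task counts reuse the example values from the script earlier in the thread, with ATM as 144 compute plus 6 I/O tasks.

```bash
#!/bin/bash
set -eu

# Hypothetical switch: "true" computes the bounds, anything else leaves
# bounds set by hand in the test untouched.
AUTO_PETLIST="true"

# Hypothetical layout order; ATM is deliberately not first here.
COMPONENT_ORDER="OCN ICE ATM"

ATM_tasks=150
OCN_tasks=30
ICE_tasks=12

function compute_petbounds_ordered () {
  local n=0 comp tasks_var tasks
  for comp in ${COMPONENT_ORDER}; do
    tasks_var="${comp}_tasks"
    # Indirect expansion: read e.g. OCN_tasks, defaulting to 0 when unset.
    tasks="${!tasks_var:-0}"
    if [[ ${tasks} -gt 0 ]]; then
      # Assign e.g. OCN_petlist_bounds="0 29" as a global variable.
      printf -v "${comp}_petlist_bounds" '%s %s' "${n}" "$((n + tasks - 1))"
      n=$((n + tasks))
    fi
  done
  UFS_tasks=${n}
}

if [[ "${AUTO_PETLIST}" == "true" ]]; then
  compute_petbounds_ordered
fi

echo "OCN_petlist_bounds: ${OCN_petlist_bounds:-}"
echo "ICE_petlist_bounds: ${ICE_petlist_bounds:-}"
echo "ATM_petlist_bounds: ${ATM_petlist_bounds:-}"
echo "UFS_tasks         : ${UFS_tasks:-}"
```

A test that sets its bounds by hand, like the HAFS case mentioned earlier, would set the switch to something other than "true" and keep its own values.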

@DeniseWorthen
Collaborator Author

@arunchawla-NOAA @junwang-noaa Of the outstanding issues for UFS, I still believe addressing this one would be very productive.

Currently we have to set values manually across multiple platforms and multiple tests. Each of these represents a potential failure point. I'm also not sure how or whether this might intersect with the implementation of ESMF threading.

@DeniseWorthen
Collaborator Author

bump
