GSI optimization with focus on ensemble input and multiscale setup. #585

Closed
jderber-NOAA opened this issue Jun 16, 2023 · 11 comments · Fixed by #594
@jderber-NOAA (Contributor) commented Jun 16, 2023:

The GSI spends a lot of wall time reading in the ensembles and setting up the multiscale ensembles when these options are used. In these changes, the code is optimized to speed up these two steps of the analysis.

The changes made to the analysis are:

  1. Keep the ensembles in single precision throughout the cplr step (see the sketch after this list). This does not change the results, since the variables were only converted to double precision immediately before the update_halos_ routine, a step that has now been removed. A few operations, such as the conversion from sensible to virtual temperature, were moved later so that the results are not affected by the change in precision.
  2. In the creation of the multiscale ensembles, the conversion from grid to spectral space is now done only once for all scales instead of once per scale, eliminating nsclgrp-1 sub2grid calls and spectral conversions.
  3. In the creation of the multiscale ensembles, the sub2grid and grid2sub transfers were converted from double to single precision, reducing the amount of data passed between processors. There is no loss of information because the data are in single precision immediately before and after the sub2grid and grid2sub routines.
  4. The calculation of the weights for the multiscale ensembles and the spectral filter was optimized.
  5. Specialized sub2grid and grid2sub routines were created to eliminate some unnecessary data movement.
  6. The update_halos routine was removed; the read-ensemble routines and the input parameters for the genex routines were modified so that the boundary points of the subdomains are filled when genex is called. Eliminating the update_halos routine removes many all-to-all communications and significantly simplifies the code. This impacts both multiscale and non-multiscale runs.
  7. Some threading and restructuring of loops was done in hybrid_ensemble_isotropic.F90, especially for the ensemble background errors.
  8. The RH calculation in get_gefs_ensperts_dualres.f90 was simplified and optimized.
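A minimal Fortran sketch of the idea in item 1 (this is not the actual GSI/cplr_ensmod code; the subroutine name, array shapes, and the fv value are illustrative): the ensemble fields stay in real(4) through the read/scatter step, and the precision-sensitive conversion to virtual temperature is deferred until afterwards.

```fortran
subroutine read_ens_member_sketch(npts, nlevs, tsen, q, tv)
  ! Hedged sketch of item 1; not the actual GSI routines.
  implicit none
  integer, intent(in)  :: npts, nlevs
  real(4), intent(in)  :: tsen(npts,nlevs), q(npts,nlevs)  ! ensemble fields stay real(4)
  real(8), intent(out) :: tv(npts,nlevs)                   ! promoted only where needed
  real(8), parameter   :: fv = 0.6078d0                    ! illustrative rv/rd - 1 value
  integer :: i, k

  ! ... the single-precision read and halo-free subdomain scatter would happen here ...

  ! The sensible -> virtual temperature conversion is deferred until after the
  ! single-precision steps, so carrying real(4) data does not change the answer.
  do k = 1, nlevs
     do i = 1, npts
        tv(i,k) = real(tsen(i,k),8)*(1.0d0 + fv*real(q(i,k),8))
     end do
  end do
end subroutine read_ens_member_sketch
```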

All of the above changes produce bit-identical results. However, two minor improvements to the code were also made that do change the results slightly (both are illustrated after this list).
A. In the original code, the moisture is not checked for negative values before the virtual temperature is calculated; the new code removes negative moisture first.
B. To get sensible temperature, the old code converts it to virtual temperature (with the change in A) and then converts back to sensible temperature. The new code uses the original sensible temperature directly, which introduces only minor round-off differences.

These two changes can easily be removed, which restores bit-wise identical results.
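A small, self-contained illustration of changes A and B (variable names and the fv value are assumptions for illustration; this is not the GSI code): negative moisture is zeroed before forming virtual temperature, and the old sensible-to-virtual-to-sensible round trip introduces single-precision round-off.

```fortran
program tv_roundtrip
  ! Hedged sketch of changes A and B above; not the actual cplr_ensmod code.
  implicit none
  real, parameter :: fv = 0.6078        ! illustrative rv/rd - 1 value
  real :: tsen, q, tv, tsen_back        ! default (single) precision, as in the ensembles

  tsen = 287.31
  q    = -1.0e-7

  q  = max(q, 0.0)                      ! change A: remove negative moisture first
  tv = tsen*(1.0 + fv*q)                ! old path: sensible -> virtual ...
  tsen_back = tv/(1.0 + fv*q)           ! ... then virtual -> back to sensible

  ! Change B: keep the original tsen instead of tsen_back; the two can differ
  ! by single-precision round-off from the multiply/divide round trip.
  print *, 'round-trip difference = ', tsen_back - tsen
end program tv_roundtrip
```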

Note that two minor changes to the code were also included which allow the code to pass in debug mode (sketched after this list).
a. In read_iasi, the code was changed so that when the channel number is zero it does not write to that location.
b. In stprad, the abi2km_bc array is now initialized only when abi2km and regional are true; previously the code could write outside the array when the number of channels was < 4.
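For illustration only (the surrounding code, array sizes, and variable names here are assumptions, not the actual read_iasi/stprad source), the two guards look roughly like this:

```fortran
program debug_guard_sketch
  ! Hedged illustration of fixes (a) and (b) above.
  implicit none
  integer, parameter :: maxchan = 4
  real :: data_chan(maxchan), abi2km_bc(maxchan)
  integer :: ichan, nchanl, i
  logical :: abi2km, regional

  data_chan = 0.0
  ichan = 0
  ! (a) skip the store when the channel index is zero, so element 0 is never written
  if (ichan >= 1 .and. ichan <= maxchan) data_chan(ichan) = 1.0

  abi2km = .true.; regional = .true.; nchanl = 2
  ! (b) initialize abi2km_bc only in the abi2km/regional case, and never index
  !     past its bounds when fewer than 4 channels are present
  if (abi2km .and. regional) then
     do i = 1, min(nchanl, maxchan)
        abi2km_bc(i) = 0.0
     end do
  end if

  print *, 'guards applied without out-of-bounds writes'
end program debug_guard_sketch
```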

When these changes were incorporated, the wall times for a multiscale ensemble test I am running went from about 4000-4200 s down to 2800-3100 s. These times are of course variable, but the new run times seemed to be more consistent from run to run. The multiscale test I am running is not necessarily realistic: I am using full-resolution ensembles and only 2 scales.

Test multiscale GSI:

| Run | Run time (old) | Run time (new) | Max mem (old) | Max mem (new) |
| --- | --- | --- | --- | --- |
| 1 | 4384 s | 2980 s | 90605476 MB | 99611360 MB |
| 2 | 4058 s | 3115 s | 98866156 MB | 90699556 MB |
| 3 | 4127 s | 3159 s | 98338100 MB | 91108816 MB |
| 4 | 4179 s | 2743 s | 97168444 MB | 100365004 MB |

With lower resolution the impact should be less. With more scales the impact should be larger. I am working with Cathy to test in a more realistic situation. Some results are posted by Cathy below. Thanks Cathy!

The changes should also impact the run time of the global GSI as it is being run in operations. I do not believe they will impact any of the other current operational runs.

For the global GSI, the run times are:

| Run | Run time (old) | Run time (new) | Max mem (old) | Max mem (new) |
| --- | --- | --- | --- | --- |
| 1 | 1829 s | 1829 s | 75373748 MB | 73627272 MB |
| 2 | 1804 s | 1663 s | 70815928 MB | 69961644 MB |
| 3 | 1781 s | 1767 s | 75953952 MB | 76127660 MB |
| 4 | 1849 s | 1644 s | 69545452 MB | 68708012 MB |

In general the test case shows faster times, but there is still significant run-to-run variability.

Changes are in my optimize3 branch. All testing was done on Hera. All runs are done with --mem=0

jderber-NOAA self-assigned this Jun 16, 2023
@CatherineThomas-NOAA (Collaborator) commented:

I ran some initial tests on Hera with "Working Version 1" at revision a899e45. I used a current experiment (vert_loc_levels) as my baseline, which is at C384/C192/L127 resolution and uses a v16.3-like workflow with the GSI component close to develop.

I ran 5 cases with the new code and found all 5 cases (gold lines) to run faster than all of the previous GSI cycles from the control experiment (~130 cases, red histogram).

[Image: times_hist_set1]

Next, I am running similar tests using 3-band SDL as well as running on WCOSS2. I am also iterating with @jderber-NOAA on additional changes.

@CatherineThomas-NOAA (Collaborator) commented:

I tested some of @jderber-NOAA's additional changes on Hera using the same setup as the previous comment. Here is a new histogram with 5 cases at commit a899e45 and 5 cases at a41b258 (also including more cases in the control):
[Image: times_hist_vertloc]

I ran another 5 cases of both commits, but this time using a 3-band multiscale experiment from Travis Elless as the control:
[Image: times_hist_sdl3]

For the SDL case, the first iteration did not result in significant savings, but the second iteration had a much larger impact.

Next, I will run some cases on WCOSS2.

@jderber-NOAA (Contributor, Author) commented:

Cathy,

Thank you very much for running these experiments. Looks like I am not completely going in the wrong direction.

Somehow you must be looking at what I am doing. This came in just as I am about ready to put out a new version. The big difference from the earlier version is that I am able to get rid of the update_halos routine. This should help with the reading of the ensembles and impact both the multiscale and single scale versions. Will let you know when it is ready. Sorry for making so many changes. I think after I get this done, it would be a good time to get the changes into the trunk.

@jderber-NOAA (Contributor, Author) commented:

Cathy,

New version out there. Best times better!

John

@jderber-NOAA (Contributor, Author) commented:

I mean "Best times ever!"

@jderber-NOAA (Contributor, Author) commented:

Having some problems with my multiscale ensemble test. Occasionally running out of memory. Putting in some changes to slightly reduce memory use. My test uses full resolution ensembles, so this is unlikely to be an issue with Cathy's tests.

@jderber-NOAA (Contributor, Author) commented Jul 22, 2023:

The memory issue is solved by adding "#SBATCH --mem=0" to the beginning of the run deck. This gives the job exclusive use of the node's memory. It is still not clear why this suddenly became an issue. Tests are being rerun to ensure that the results and times hold. The times above have been updated.
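For reference, a minimal run-deck header with this directive might look like the sketch below; only the --mem=0 line comes from the comment above, the rest is a placeholder.

```bash
#!/bin/bash
#SBATCH --mem=0   # give the job exclusive use of each allocated node's memory
# ... remaining #SBATCH directives and the GSI execution as in the existing run deck
```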

It has been verified that control and test give identical results when the 2 minor changes in cplr_ensmod are removed.

@jderber-NOAA (Contributor, Author) commented:

Regression tests passed. The only failures are due to timing and a minor difference for one test using ensembles. The minor ensemble difference went away when the minor changes mentioned above (zeroing q and using sensible temperature) were removed.

@CatherineThomas-NOAA (Collaborator) commented:

Here are another 5 cases for single scale localization on Hera (C384/C192) using commit 4b3aa5b:
[Image: times_hist_vertloc_final]

And another 5 for a 3-band scale dependent localization, also using commit 4b3aa5b:
[Image: times_hist_sdl3_final]

These changes make the prospect of using the scale dependent localization in GFSv17 much more feasible timing-wise.

@CatherineThomas-NOAA (Collaborator) commented:

I ran a full resolution v16-like configuration on WCOSS2 dev. Here are the timings (in seconds):

develop

  • 2010.54
  • 1998.69
  • 1975.95
  • 1991.79
  • 2020.50

optimize3 @ 89c975f

  • 1990.24
  • 1935.61
  • 1965.83
  • 1941.39
  • 1944.87

The improvement doesn't seem to be as dramatic as in my lower resolution tests on Hera, but it's still substantial.

@CatherineThomas-NOAA (Collaborator) commented:

To round out my testing, here are WCOSS2 timings (in seconds) with a full resolution 2-scale configuration:

develop @ be4a3d9

  • 2738.35
  • 2740.69
  • 2747.86
  • 2729.79
  • 2773.25

optimize3 @ 89c975f

  • 2588.17
  • 2671.70
  • 2585.83
  • 2597.86
  • 2647.45

Again, while the reduction is notable, it's not as dramatic as in the half resolution tests on Hera. Comparing the 2-scale SDL with optimization against the single scale without optimization, the average difference is still more than the target of 5 minutes beyond which NCO starts to worry. However, develop is not what's running in operations, and some optimizations were already included there. For completeness, here are some current operational timings (in seconds):

  • 2099.52
  • 2110.14
  • 2101.31
  • 2114.83

So even with develop we've already had some reduction. That said, this PR is still substantial in my opinion, maybe just not as much as Hera indicated.

RussTreadon-NOAA pushed a commit that referenced this issue on Sep 22, 2023 (#594): "This update improves the efficiency of the GSI, especially for multiscale runs. Details can be found in issue #585."