Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of bounds write in w3prosfmd_pdlib.F90 #1013

Merged

Conversation

ukmo-ccbunney
Copy link
Collaborator

@ukmo-ccbunney ukmo-ccbunney commented May 19, 2023

Pull Request Summary

Bugfix for out-of-bounds write to CCOSA and CSINA in w3profsmd_pdlib.F90

Description

There is an out-of-bounds array bug in the w3profsmd_pdlib.F90 module:

CCOSA = FACX * ECOS
CSINA = FACX * ESIN

CCOSA and CSINA are defined locally as size NTH, but ECOS and ESIN are defined in w3gdatmd with size MSPEC+MTH (MSPEC = NK * NTH)

This is causing a potential out of bounds error and clobbering memory adjacent to CCOSA/CSINA

I think this is what was intended:

CCOSA = FACX * ECOS(1:NTH)
CSINA = FACX * ESIN(1:NTH)

which takes the direction based cosine/sine values from the first frequency band (values are identical for each freq band).

Some changes are expected to the PDLIB regtests due to the possibility of memory being overwritten.
-->

Issue(s) addressed

Commit Message

Bugfix to out of bounds array write in w3profsmd_pdlib.f90

Check list

Testing

  • How were these changes tested? Regtests (of PDLIB variant)
  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Yes
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)? Yes; GNU compiler, Cray EX HPC
  • Please indicate the expected changes in the regression test output, (Note the list of known non-identical tests.) Some changes expected.
  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):

Results of PDLIB regtests below. Differences in field outputs are only present in ww3_tp2.21.
Most differences seem to be very small, e.g.:

***
/home/h06/christopher.bunney/WW3/UKMO_WW3/regtests/output/ww3_tp2.21/work_b/ww3.201809_hs.nc_diff.txt
***
65209c65209
<     2.741145, 2.331527, 0.692266, 2.702155, 1.944189, 2.083071, 2.161465,
---
>     2.741145, 2.331527, 0.6922661, 2.702155, 1.944189, 2.083071, 2.161465,

All differences in other regtests are due to solely the netCDF in the bugfix branch not having the "forecast_period" variable...not sure why it is missing. I will investigate.

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
ww3_tp2.17/./work_b                     (2 files differ)
ww3_tp2.17/./work_c                     (2 files differ)
ww3_tp2.19/./work_1C_a                     (10 files differ)
ww3_tp2.19/./work_1B_a                     (10 files differ)
ww3_tp2.21/./work_b                     (4 files differ)
ww3_tp2.21/./work_mb                     (6 files differ)
ww3_tp2.21/./work_b_metis                     (3 files differ)
ww3_tp2.6/./work_pdlib                     (13 files differ)

**********************************************************************
************************ identical cases *****************************
**********************************************************************
ww3_tp2.17/./work_mb
ww3_tp2.17/./work_mc
ww3_tp2.19/./work_1A_a

@ukmo-ccbunney ukmo-ccbunney marked this pull request as draft May 19, 2023 13:49
@ukmo-ccbunney ukmo-ccbunney self-assigned this May 19, 2023
@ukmo-ccbunney ukmo-ccbunney marked this pull request as ready for review May 19, 2023 15:43
@JessicaMeixner-NOAA
Copy link
Collaborator

@ukmo-ccbunney I got no differences when I ran this and compared with develop, so I'm going to re-run to see if I messed something up on my end.

@ukmo-ccbunney
Copy link
Collaborator Author

@ukmo-ccbunney I got no differences when I ran this and compared with develop, so I'm going to re-run to see if I messed something up on my end.

It's the sort of memory bug that might behave differently on different compilers/systems...
It might be that you just don't see any differences.

Interestingly for me, the only difference I saw was in ww3_tp2.21, which suspiciously is the same regtest that was giving differences for you in #1010 ....

@JessicaMeixner-NOAA
Copy link
Collaborator

Okay, I've run this again, and I get no differences. I'm not surprised we're getting different behavior on different systems and I agree this needs to be fixed. I think I was hoping we'd get a difference which would then resolve the other issues which is why I re-ran. This will get merged here shortly.

Copy link
Collaborator

@JessicaMeixner-NOAA JessicaMeixner-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While @ukmo-ccbunney got diffs, I did not see any:

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (13 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (16 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (15 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (6 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)

matrixCompFull.txt
matrixCompSummary.txt
matrixDiff.txt

@JessicaMeixner-NOAA JessicaMeixner-NOAA merged commit 2a2f204 into NOAA-EMC:develop May 23, 2023
@aronroland
Copy link
Collaborator

Hi @JessicaMeixner-NOAA, @ukmo-ccbunney,

as for this syntax I doubt it is a bug and there it would be interesting to see the error and the compiler output that u have found @ukmo-ccbunney.

Here is my opinion on this:

When assigning an array of larger size to an array of smaller size, Fortran only assigns as many elements as fit into the smaller array.

So, if you have:

A = B

And A has size NX, it is equivalent to:

A = B(1:NX)

In other words, the assignment operation in Fortran truncates the larger array to the size of the smaller one. This behavior can be useful but also can lead to unintentional data loss if you are not careful about array sizes. Here it was used intentionally so i would not say that this a bug. I do not see also that this is compiler/environment depended it should fit the FORTRAN standard given in e.g. International Organization for Standardization. Information technology -- Programming languages -- Fortran. ISO/IEC 1539-1:2018, 2018. To my best knowledge this is the latest and most complete reference.

@ukmo-ccbunney
Copy link
Collaborator Author

Hi @aronroland

Thanks for your input on this.

This was bought to my attention when I got the following runtime error (GNU compiler):
Fortran runtime error: Array bound mismatch for dimension 1 of array 'ccosa' (36/1836)

To be honest, I was not aware of that the original syntax was considered valid Fortran, so apologies for raising a fix for valid code! Are you happy if the fix is left in? To me the explicit index range makes things a bit clearer.

Interestingly, I did get small differences in the output when I made the fix, which made me think that some sort of memory corruption was occuring, but this could possibly have been a red herring...

@JessicaMeixner-NOAA
Copy link
Collaborator

@ukmo-ccbunney I think since you got differences in answers, and the fact that this generated a runtime error, we should keep the update in the code. That being said, we should certainly follow fortran standards and I didn't quite understand @aronroland concern about data loss in this particular case, so perhaps I'm missing something obvious here?

@aronroland
Copy link
Collaborator

Hi @JessicaMeixner-NOAA, actually I am quite happy that this happened. Since GNU is saying that this is an "error" i am afraid that I am wrong with what I have said before. I will test that on my side and get back to u.

MatthewMasarik-NOAA added a commit to MatthewMasarik-NOAA/WW3 that referenced this pull request Jun 27, 2023
* origin/develop:
  handle NaN air-sea temperatures from nearest land points (NOAA-EMC#869)
  Increase valid_max for f in ounp (NOAA-EMC#1014)
  Bugfix deallocation of invalid memory in ww3_prnc (NOAA-EMC#1016)
  Update to orion intel module path and two typo corrections. (NOAA-EMC#1011)
  Bugfix to out of bounds array write in w3profsmd_pdlib.f90 (NOAA-EMC#1013)
  Simple logic fix for time interpolation of boundary nodes at the end of W3XYPFSNIMP. (NOAA-EMC#1005)
  in w3iors use NSEA instead of NSEAL in serial write/read of VA (NOAA-EMC#954)
  Update documentation for UNST namelist (NOAA-EMC#986)
  In certain coupled configurations, the piece of code testing the coupling frequency to check if 'receive' coupling exchanges need to take place fail, resulting in an infinite loop causing the integration between time zero and the first time step to repeat indefinitely. This check needs to be rewritten, which fixes also issue NOAA-EMC#816 in a simpler way.  (NOAA-EMC#999)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Out of bounds array access in w3profsmd_pdlib.F90
3 participants