
CRTM-fix differs from WCOSS2, CRTM produces copious errors in the GSI #963

Closed · 22 of 25 tasks
DavidHuber-NOAA opened this issue Jan 23, 2024 · 29 comments · Fixed by JCSDA/spack#399
Labels
bug Something is not working

Comments

@DavidHuber-NOAA
Collaborator

DavidHuber-NOAA commented Jan 23, 2024

LIST OF FIXED INSTALLATIONS

FYI @climbfuji @RatkoVasic-NOAA @ulmononian @AlexanderRichert-NOAA

Describe the bug
The CRTM fix files seem to be incomplete for amsua_metop-c. When comparing the fix files against those on WCOSS2, the following size difference was noted:
Hera:
1252 Jan 4 21:17 amsua_metop-c.SpcCoeff.bin
WCOSS2:
12196 Nov 13 15:03 amsua_metop-c.SpcCoeff.bin

Possibly related, the GSI produces ~60K lines of errors from the CRTM of the form:
Remove_AntCorr(FAILURE) : Input iFOV inconsistent with AC data

Expected behavior
The fix files would match between WCOSS2 and the research systems.

System:
All RDHPCS machines and S4.

Additional context
Found while testing the GSI: NOAA-EMC/GSI#684.
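
For reference, a minimal Python sketch that flags suspiciously small SpcCoeff files in a fix directory; the script itself and its 4 KiB threshold are illustrative assumptions, not part of the original report:

#!/usr/bin/env python3
"""Flag suspiciously small CRTM SpcCoeff files (illustrative helper)."""
import sys
from pathlib import Path

# Assumed threshold: a SpcCoeff file with antenna-correction data is
# ~12 KB in this thread, while the truncated one is ~1.2 KB.
THRESHOLD_BYTES = 4096

def find_small_spccoeff(fix_dir: str) -> None:
    """Print any *.SpcCoeff.bin under fix_dir that looks truncated."""
    for path in sorted(Path(fix_dir).glob("*.SpcCoeff.bin")):
        size = path.stat().st_size
        if size < THRESHOLD_BYTES:
            print(f"SUSPECT {size:8d} {path.name}")

if __name__ == "__main__":
    # e.g. python check_spccoeff.py /path/to/crtm-fix/fix
    find_small_spccoeff(sys.argv[1])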

DavidHuber-NOAA added the bug label on Jan 23, 2024
@ADCollard

These coefficient files should come from https://github.com/JCSDA/crtm tag v2.4.0_emc.3 or from the tar file at https://bin.ssec.wisc.edu/pub/s4/CRTM/fix_REL-2.4.0_emc_07112023.tgz

I checked the coefficient files in both the GitHub and tarred versions and found that in both cases the metop-c coefficient files are small (lacking the antenna correction information). So the WCOSS2 coefficient files appear inconsistent with the official release, but are in fact the correct ones to use.

@Hang-Lei-NOAA worked with NCO to install the coefficient files on the WCOSS machines, so maybe he can explain the difference.

My guess is that, since this was supposed to be only an incremental change (only six files needed to be added), only these files were updated on WCOSS2 (which is definitely the safest way to manage this change).

Ideally the CRTM GitHub tag should be updated to ensure the correct files are present. Adding @BenjaminTJohnson, who is the CRTM manager.
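
One way to repeat this check without fully extracting the archive is to read the member sizes straight from the tarball; a minimal Python sketch, assuming a locally downloaded copy of the file linked above:

#!/usr/bin/env python3
"""List metop-c SpcCoeff member sizes inside a CRTM fix tarball (illustrative)."""
import tarfile

# Assumes a local copy of the tarball linked above.
TARBALL = "fix_REL-2.4.0_emc_07112023.tgz"

# Inspect member sizes without extracting the archive.
with tarfile.open(TARBALL, "r:gz") as tar:
    for member in tar.getmembers():
        if "metop-c" in member.name and member.name.endswith("SpcCoeff.bin"):
            print(f"{member.size:8d} {member.name}")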

@BenjaminTJohnson

BenjaminTJohnson commented Jan 23, 2024 via email

@ADCollard

ADCollard commented Jan 23, 2024

@BenjaminTJohnson Thanks for your reply.

So in the tarball and in the tag we have two amsua_metop-c spectral coefficient files:

Hera(hfe06):/scratch1/NCEPDEV/stmp2/Andrew.Collard/crtm/git/crtm/fix/SpcCoeff/Little_Endian$ cksum amsua_metop-c*bin
352532461 1252 amsua_metop-c.SpcCoeff.bin
1237040622 12196 amsua_metop-c_v2.SpcCoeff.bin

The v2 version has the cksum 1237040622 while the original version seems to be virtually empty.

As you say, on WCOSS, the file with the same cksum has been renamed to amsua_metop-c.SpcCoeff.bin.

andrew.collard@clogin02:/apps/ops/prod/libs/intel/19.1.3.304/crtm/2.4.0/fix> cksum amsua_metop-c*bin
1237040622 12196 amsua_metop-c.SpcCoeff.bin
1692416252 1252 amsua_metop-c.SpcCoeff.noACC.bin

The reason for the v2 files is that (for MetOp-A only) some direct broadcast stations were using a later version of the antenna corrections than most and needed to be corrected differently for consistency (see this issue).

So does amsua_metop-c_v2.SpcCoeff.bin actually contain a version 2 of the antenna correction? Or is it consistent with the processing of the direct broadcast stations that all use v1?

The same argument applies to mhs_metop-c.

Tagging @emilyhcliu for information.

@DavidHuber-NOAA
Collaborator Author

@AlexanderRichert-NOAA @BenjaminTJohnson I believe I see the issue. In the spack recipe, the "Little_Endian" amsua_metop-c_v2.SpcCoeff.bin file is copied over to the fix directory for version 2.4.0_emc, but not for 2.4.0.1_emc:

https://github.com/JCSDA/spack/blob/e7ef791045012b6ef989be97c6eee24ec571c4b6/var/spack/repos/builtin/packages/crtm-fix/package.py#L61-L77

        if "+big_endian" in spec and spec.version == Version("2.4.0_emc"):
            remove_path = join_path(
                os.getcwd(), "fix", "SpcCoeff", "Big_Endian", "amsua_metop-c.SpcCoeff.bin"
            )
            fix_files.remove(remove_path)


            # This file is incorrect, install it as a different name.
            install(
                join_path("fix", "SpcCoeff", "Big_Endian", "amsua_metop-c.SpcCoeff.bin"),
                join_path(self.prefix.fix, "amsua_metop-c.SpcCoeff.noACC.bin"),
            )


            # This "Little_Endian" file is actually the correct one.
            install(
                join_path("fix", "SpcCoeff", "Little_Endian", "amsua_metop-c_v2.SpcCoeff.bin"),
                join_path(self.prefix.fix, "amsua_metop-c.SpcCoeff.bin"),
            )

Is it possible to replace this file in the current installations?
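
For illustration, one possible shape of the recipe fix is to widen the version guard so the corrected Little_Endian v2 file is installed for 2.4.0.1_emc as well. This is only a sketch assuming the same spec, Version, install, and join_path helpers that appear in the excerpt above; the actual change landed via JCSDA/spack#399 and may differ in detail:

        # Sketch: apply the metop-c special case to both EMC releases.
        if "+big_endian" in spec and spec.version in (
            Version("2.4.0_emc"),
            Version("2.4.0.1_emc"),
        ):
            remove_path = join_path(
                os.getcwd(), "fix", "SpcCoeff", "Big_Endian", "amsua_metop-c.SpcCoeff.bin"
            )
            fix_files.remove(remove_path)

            # Keep the AC-less big-endian file under a distinct name.
            install(
                join_path("fix", "SpcCoeff", "Big_Endian", "amsua_metop-c.SpcCoeff.bin"),
                join_path(self.prefix.fix, "amsua_metop-c.SpcCoeff.noACC.bin"),
            )

            # The "Little_Endian" v2 file is the correct, AC-containing one.
            install(
                join_path("fix", "SpcCoeff", "Little_Endian", "amsua_metop-c_v2.SpcCoeff.bin"),
                join_path(self.prefix.fix, "amsua_metop-c.SpcCoeff.bin"),
            )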

@Hang-Lei-NOAA
Collaborator

As David pointed out, that should be the issue.
The file on WCOSS2 installed by NCO was verified by @ADCollard in the past.
Please let me know if the current GSI still has any issue with the NCO installation, and I will coordinate a solution.

spack-stack should use the right CRTM tag for installation, as we previously discussed with Alex. The Little_Endian file should be replaced by the Big_Endian file.

@ADCollard

@BenjaminTJohnson @Hang-Lei-NOAA @AlexanderRichert-NOAA @DavidHuber-NOAA

This appears to be the solution. I am, however, very confused about why this renaming is being done rather than correcting the GitHub tag/tarball itself.

@BenjaminTJohnson Can you confirm what v2 means in this context? All RARS processing of metop-c AMSU-A uses version 1 of the antenna correction, according to the BUFR files. I am not sure that there even is a v2. In other words, are the antenna correction coefficients in amsua_metop-c_v2.SpcCoeff.bin consistent with what is used in the RARS processing?

@BenjaminTJohnson

@ADCollard "v2" was just that version of that coefficient file -- it has nothing to do with the internal antenna correction version. "v2" was used to separate it from the non-corrected version.

I'm more than happy to rename the files; just let me know what your preference is for the ACCoeff versions and the non-ACCoeff versions (presently *_v2.SpcCoeff.bin and *.SpcCoeff.bin, respectively).

@ADCollard

> @ADCollard "v2" was just that version of that coefficient file -- it has nothing to do with the internal antenna correction version. "v2" was used to separate it from the non-corrected version.
>
> I'm more than happy to rename the files; just let me know what your preference is for the ACCoeff versions and the non-ACCoeff versions (presently *_v2.SpcCoeff.bin and *.SpcCoeff.bin, respectively).

Thanks, @BenjaminTJohnson, for clarifying the meaning of v2 here. It should be noted that this is different from the files for metop-a, where v2 does mean version 2 of the AC coefficients.

Is there any need to retain versions of the SpcCoeff files that have no AC coeffs? I would simply rename:
amsua_metop-c_v2.SpcCoeff.bin to amsua_metop-c.SpcCoeff.bin

Again, this should not be done for metop-a, where the v2 file actually reflects a second version of the AC coefficients that are being used at a handful of RARS stations.

Tagging @emilyhcliu for awareness

@BenjaminTJohnson

@ADCollard @emilyhcliu Sure, no problem. In which prior release of CRTM do you want that change made?

Right now I plan to update tag v2.4.0_emc.3 (which would become v2.4.0_emc.4) and update the tar file at https://bin.ssec.wisc.edu/pub/s4/CRTM/fix_REL-2.4.0_emc_07112023.tgz, which would become fix_REL-2.4.0.2_emc_01242024.tgz using the new versioning convention.

Also, any changes would be incorporated into the upcoming v3.1.0-beta release (today), which will have a tarball named fix_REL-3.1.0.1_01242024.tgz. This v3.1.0-beta release is for partner integration testing with GSI/GFS/UPP and JEDI/UFO applications. It uses only CMake for building (no autotools, no ecbuild), so there might be some workflow issues. With the final v3.1.0 release there will be a spack-stack module as well, but I can't impose that requirement on Dom for a beta release. A comprehensive email will be sent later this evening regarding the v3.1.0-beta release.

@ADCollard

@BenjaminTJohnson If we are going to be creating new tags rather than just updating the current one (yes, I know this is the correct practice), I think we should not do this at this time, to avoid further confusion.

But for any future tags we should rename the files as described above.

But I defer to @Hang-Lei-NOAA on how best to proceed from the EMC perspective.

@DavidHuber-NOAA
Collaborator Author

@climbfuji @AlexanderRichert-NOAA The easiest way to resolve this in each installation is to rename the existing spack-stack-1.6.0 amsua_metop-c.SpcCoeff.bin file to amsua_metop-c.SpcCoeff.noACC.bin and then copy the amsua_metop-c.SpcCoeff.bin file from the spack-stack-1.5.1 installations into the 1.6.0 installations. For instance, on Orion:

cd /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env/install/intel/2022.0.2/crtm-fix-2.4.0.1_emc-ezbyzji/fix
mv amsua_metop-c.SpcCoeff.bin amsua_metop-c.SpcCoeff.noACC.bin
cp /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/envs/unified-env/install/intel/2022.0.2/crtm-fix-2.4.0_emc-r5j6suo/fix/amsua_metop-c.SpcCoeff.bin .

I have verified that the cksum of the spack-stack/1.5.1 coefficient file is identical to WCOSS2's copy:
Orion 1237040622 12196 amsua_metop-c.SpcCoeff.bin
WCOSS2 1237040622 12196 amsua_metop-c.SpcCoeff.bin

If this could be done on Orion/intel first, I can verify it works for the GSI before the other systems' intel and gnu installations are updated.
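
As a sanity check after the swap, a short Python sketch could assert the expected file sizes; this is a hypothetical helper, with the path taken from the Orion commands above and the sizes from the cksum output earlier in this thread:

#!/usr/bin/env python3
"""Sanity-check the manual file swap (illustrative)."""
from pathlib import Path

# Orion path quoted above; adjust per system. Expected sizes (12196 and
# 1252 bytes) come from the cksum listings earlier in this thread.
FIX = Path("/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/"
           "unified-env/install/intel/2022.0.2/crtm-fix-2.4.0.1_emc-ezbyzji/fix")

good = FIX / "amsua_metop-c.SpcCoeff.bin"
noacc = FIX / "amsua_metop-c.SpcCoeff.noACC.bin"

assert good.stat().st_size == 12196, "corrected file missing or wrong size"
assert noacc.stat().st_size == 1252, "AC-less backup missing or wrong size"
print("swap looks good")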

@BenjaminRuston

@gmao-yzhu I see you are not part of this repository, but this will make the following issue a duplicate: JCSDA/crtm#46, so I will close it.

@gmao-yzhu

gmao-yzhu commented Jan 25, 2024 via email

@DavidHuber-NOAA
Collaborator Author

@AlexanderRichert-NOAA @climbfuji @ulmononian Just wanted to check in on this and see what you thought of my proposed solution. Also wondering when I can start testing on Orion.

@climbfuji
Collaborator

@DavidHuber-NOAA I just made the manual change on Orion in /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env/install/intel/2022.0.2/crtm-fix-2.4.0.1_emc-ezbyzji/fix only.

@DavidHuber-NOAA
Collaborator Author

@climbfuji Excellent, thank you! I will give that a try today.

@DavidHuber-NOAA
Collaborator Author

@climbfuji The regression tests were successful! Could the same fix be applied across all systems? I can provide specific commands for each system if needed.

@ulmononian
Collaborator

@DavidHuber-NOAA: @RatkoVasic-NOAA and I can handle this on the remainder of the RDHPCS intel/gnu installations. Is it an identical process on each machine, i.e., are the commands the same for each system (of course, with different paths to crtm-fix-2.4.0.1_emc-xxxx)?

@DavidHuber-NOAA
Collaborator Author

@ulmononian @RatkoVasic-NOAA Great, thank you. Yes, the same process will work on each system.

@climbfuji
Collaborator

Let's add a list of machines to the description of this issue. We need to do this for ALL environments and ALL compilers on each of the machines. It would also be good to backport the bug fix for spack to the release branch.

@ulmononian
Collaborator

> Let's add a list of machines to the description of this issue. We need to do this for ALL environments and ALL compilers on each of the machines. It would also be good to backport the bug fix for spack to the release branch.

When you say "all envs", do you mean this should extend to envs beyond the unified env, e.g., gsi-addon?

@climbfuji
Collaborator

> Let's add a list of machines to the description of this issue. We need to do this for ALL environments and ALL compilers on each of the machines. It would also be good to backport the bug fix for spack to the release branch.

> When you say "all envs", do you mean this should extend to envs beyond the unified env, e.g., gsi-addon?

@DavidHuber-NOAA Do we need to apply this bug fix for gsi-addon-env, too? I don't think so, but better to confirm.

@DavidHuber-NOAA
Collaborator Author

@climbfuji You are correct. We do not need to do this in gsi-addon-env, only in unified-env. The gsi-addon environment uses the unified-env crtm-fix.

@climbfuji
Collaborator

@DavidHuber-NOAA We are nearly done here. Someone needs to apply the fix for spack-stack-1.6.0 on the EPIC ParallelWorks systems (@ulmononian FYI), and once Nautilus comes out of maintenance, I will finish that last system on my end.

@climbfuji
Collaborator

Nautilus is done - we just need EPIC to finish ParallelWorks on their end to close this.

@DavidHuber-NOAA
Collaborator Author

@climbfuji Fantastic, thank you. I have tested this on all GSI-RT-supported platforms (Hera, Jet, Orion, Hercules) and results are as expected.

@souopgui

souopgui commented Feb 1, 2024

Any plan for S4?
What can I do to help bring this fix to S4?

@climbfuji
Collaborator

> Any plan for S4? What can I do to help bring this fix to S4?

It's already there; see the list at the top of this issue.

@climbfuji
Collaborator

@DavidHuber-NOAA @ulmononian @AlexanderRichert-NOAA I ported the bug fix in the PR for develop back to the release/1.6.0 branch of our spack fork, retagged, then updated the submodule pointer in our spack-stack release branch and retagged. I then decided to close out this issue - @ulmononian or someone else from EPIC is going to make the remaining changes in the EPIC ParallelWorks installations next week and tick the box in the issue description above.
