E3SM Land Developer Tests Failing in DEBUG Mode #4820
Comments
If I try Note I'm setting g=0 (which should not affect anything here) just as a no-op so that I can see exactly what the compiler is trapping on.
|
Thanks for following up @ndkeen . Something to try is to remove I hope that makes sense -- if not I can try looking at this SMS test specifically rather than the ERS test I used. |
I think in general, we benefit from having floating-point traps. If I understand correctly, it sounds like the fp trapping did catch this error? I would just like to see an example of a floating-point trap that you do not want to stop on. So can I run with updated code that corrects said typo? Or a different test perhaps? And yes, I do think it's a very bad idea to set arrays to NaN on purpose (which I know land does), but that may or may not be the issue here. |
So what I mean by it hindering debugging is that it raises this error but gives a pretty useless backtrace. Removing the fpe flag will give you the exact line number of the bug. If you try what I suggest, you may see what I mean for your SMS test case. |
@ndkeen Sorry, I just realized I missed your direct question. I can make a branch this afternoon with the typo fixed to help you either confirm or refute my tests. |
Can you paste an example here of "NaN evaluation in logical expression" in ELM ? |
@rljacob I worded that incorrectly: it should be "evaluation of a logical expression involving NaNs" |
Could ELM just set spval to a large non-physical number instead of NaN? |
@rljacob It would require some additional changes to filter out by spval or |
@ndkeen The branch peterdschwartz/lnd/cnpbudget-typo-fix fixes the typo, and you can use it. I guess I didn't explicitly mention it in the original post, but every test I listed there is one that FAILs in debug mode with fpe0 turned on. Also, the traceback pointing to the index |
With your branch, I still see the same error. I also tried on cori-knl with Intel I just wonder if there really is an error somewhere. |
What is ELM trying to do by initializing values this way? Yes there should be an E3SM programming standard for this and it would start with never purposefully initializing variables to NaN. |
I don't know. They also overload the assignment operation for entire modules for data types with some special NaN assignment elemental function. |
@ndkeen This is promising that it may just be a few places to fix. A few major land PRs went in around then, but so did the initial implementation of carbon budgets so I will focus in on that first. |
I reminded myself that I created an issue quite a while ago that mentions these failures: #3123. I see that for GNU builds, we have removed the trap on "invalid". If I put that back, then While it may be a bad idea to init arrays to NaN, it doesn't necessarily mean fp-traps will balk. I could write a simple test, but can you have an array set to NaN that is not evaluated/used until after it is reset to a value? I'm just suggesting that it might be good to track/fix this issue, or at least verify that the fp-trap is triggered simply because a land array is set to NaN and there's a good reason why it wasn't reset to a value. |
Ha, hopefully we can get some closure on this 2.5-year-old issue.
I'm struggling to picture exactly what you are describing code-wise. The IEEE standard does allow quiet NaNs and signaling NaNs, and each compiler has a different level of fpe detection. What exactly One of the crop smallville tests raises the exception at this line L373 in FireMod.F90
Using the special intrinsic (commented out here) won't raise an exception, and is why ELM overloads the assignment operation to initialize NaNs too. This is one of the main reasons I think it's NaN triggering the exceptions. But, as you have mentioned, each test may not be failing for the same reason, so it's definitely worth further investigation. |
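To illustrate why a dedicated NaN check can behave differently from an ordinary comparison when traps are enabled, here is a minimal sketch in Python (not Fortran, and not taken from ELM or shr_infnan_mod). The function names are hypothetical. The first check compares the value with itself, which is a floating-point operation; whether that traps depends on the compiler and on whether the NaN is quiet or signaling. The second check only inspects the bit pattern using integer operations, so no floating-point exception can be raised by the test itself.

```python
import struct


def is_nan_selfcompare(x: float) -> bool:
    """NaN is the only value that is not equal to itself.

    This is a floating-point comparison, analogous to (x .ne. x) in Fortran.
    """
    return x != x


def is_nan_bits(x: float) -> bool:
    """Detect NaN from the IEEE 754 double bit pattern using integer ops only.

    A double is NaN when all 11 exponent bits are set and the 52-bit
    mantissa is non-zero (a zero mantissa would mean infinity).
    """
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0x7FF and mantissa != 0


if __name__ == "__main__":
    for value in (float("nan"), float("inf"), 1.0e36, 0.0):
        print(value, is_nan_selfcompare(value), is_nan_bits(value))
```

Both functions agree on every input; the difference is only in which hardware operations they use, which is what matters when the build traps on "invalid".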
This initialize-to-NaN issue came up in atmospheric chemistry too (I think). There was some logic to first initialize to NaN; later the variable is set to a correct value, and still later there is a check for NaN to make sure it was initialized correctly. It's a built-in debugging system for adding new chemistry reactions. That's reasonable, but here's what should be done in the program: initialize to a variable that can be set at compile time to NaN but by default is just a large number. That way the programmer can activate it when checking a new chemistry reaction, but the default is to NOT initialize to NaN so debugging can work. |
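The switchable fill value described above could look roughly like the following. This is a sketch in Python rather than Fortran, and every name in it (LARGE_SPVAL, make_sentinel, is_unset, check_initialized) is hypothetical; the value 1.0e36 is only an illustrative "large non-physical number", not a claim about what ELM uses. In a Fortran build the `debug_nan` switch would presumably be a preprocessor macro or build option rather than a function argument.

```python
import math

# Illustrative large "special value" (an assumption, not ELM's actual constant).
LARGE_SPVAL = 1.0e36


def make_sentinel(debug_nan: bool) -> float:
    """Return the fill value: NaN only when the debug switch is on."""
    return math.nan if debug_nan else LARGE_SPVAL


def is_unset(x: float, sentinel: float) -> bool:
    """True if x still holds the fill value it was initialized with."""
    if math.isnan(sentinel):
        return math.isnan(x)  # NaN never compares equal, so test it explicitly
    return x == sentinel


def check_initialized(values, sentinel):
    """Return the indices of entries that were never overwritten."""
    return [i for i, v in enumerate(values) if is_unset(v, sentinel)]


if __name__ == "__main__":
    sentinel = make_sentinel(debug_nan=False)  # default: no NaNs in the arrays
    rates = [sentinel] * 3
    rates[0] = 0.5
    rates[2] = 1.2
    print(check_initialized(rates, sentinel))  # [1]
```

With the default setting no NaN is ever stored, so a build that traps on "invalid" is not tripped by the fill value; turning the switch on restores the NaN-based check for the developer who wants it.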
@rljacob does E3SM have any examples of this strategy employed yet? if so we can try to reproduce this in FATES |
I don't think so. |
With current master, I ran many of the land developer tests in debug mode (case directory is here: /lcrc/group/e3sm/ac.schwartzpd/master-debug-tests/cases), and it seems that all of the tests that use the chemistry of ELM fail. I believe this has been an issue for a while, so I can't say how many code changes this will require, and not every test failed in the same place. Almost all of them are due to the triggering of a floating-point exception (edit: due to evaluation of logical expressions involving NaNs).
A potential fix is to add shr_infnan_isnan consistently, as that will not trigger an fpe error. A potential downside of shr_infnan_isnan is that I am not sure how to readily port it for use in GPU regions. A GPU-friendly check that I implemented is to use if statements like if(.not. (x .ne. x) ), which trigger an fpe (the FireMod backtraces below). I can make an elm wrapper function that switches between the two at compilation, but I will have to check how adding an extra function call impacts performance.
Currently, many new development runs cannot be run in DEBUG mode without removal of the -fpe0 flag. Since I do not believe it is a good policy that an atm developer should have to contact an elm developer just to enable debug mode, -fpe0 should be temporarily removed for elm files. Tests will also need to be added to catch these issues sooner, as the current ELM debug tests are clearly inadequate.
@rljacob @bishtgautam