PEM_Ld3.ne4_ne4.FC5AV1C-04P2 fails compare on cori-knl, but passes in DEBUG #1477

Closed
ndkeen opened this issue May 1, 2017 · 25 comments

ndkeen commented May 1, 2017

On cori-knl, there might be an issue with compiler flags.
To run: create_test PEM_Ld3.ne4_ne4.FC5AV1C-04P2 --machine=cori-knl

PEM_Ld3.ne4_ne4.FC5AV1C-04P2 fails
PEM_Ld3_PMx1.ne4_ne4.FC5AV1C-04P2 fails (force 1 thread)

PEM_D_Ld3.ne4_ne4.FC5AV1C-04P2 passes (debug)
PEM_D_Ld3_PMx1.ne4_ne4.FC5AV1C-04P2 passes (debug and force 1 thread)

Same thing happens on cori-haswell -- fail compare, but passed in DEBUG.
Tests on Edison have been passing.
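
For reference, a minimal sketch of how these four variants can be launched (mirroring the create_test command above; test IDs and output directories are whatever create_test assigns):

    # Reproduce on cori-knl: optimized builds fail the compare, DEBUG builds pass.
    create_test PEM_Ld3.ne4_ne4.FC5AV1C-04P2        --machine=cori-knl   # fails
    create_test PEM_Ld3_PMx1.ne4_ne4.FC5AV1C-04P2   --machine=cori-knl   # fails (force 1 thread)
    create_test PEM_D_Ld3.ne4_ne4.FC5AV1C-04P2      --machine=cori-knl   # passes (debug)
    create_test PEM_D_Ld3_PMx1.ne4_ne4.FC5AV1C-04P2 --machine=cori-knl   # passes (debug, 1 thread)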

rljacob commented May 2, 2017

If you can't find a way around this with compiler flags, someone from the atmosphere group will have to look at it.

ndkeen commented May 5, 2017

OK. I'm trying again after some changes, to verify. My test on Edison still has not started.

ndkeen commented May 17, 2017

Hmm, perhaps what I'm trying is not ideal. After creating/running this test, I edit Macros.make to adjust compiler flags, then rebuild/rerun. However, since this test creates two executables and does two runs (so it can compare them), it is probably only rebuilding/rerunning one of them. Is there a suggested way to have a test such as this "try again" with an edited Macros.make?

@singhbalwinder

@ndkeen: After you change compiler flags, do you do a clean compile of the case?

ndkeen commented May 18, 2017

Yes, I was doing case.build --clean-all; case.build; case.submit
But this only rebuilds/resubmits in the main case directory, e.g. here:
m33-may12/PEM_Ld3.ne4_ne4.FC5AV1C-04P.cori-knl_intel.20170512_102147_u71tm7

-- but there is also another entire case directory:

m33-may12/PEM_Ld3.ne4_ne4.FC5AV1C-04P.cori-knl_intel.20170512_102147_u71tm7/PEM_Ld3.ne4_ne4.FC5AV1C-04P.cori-knl_intel.20170512_102147_u71tm7.test

I guess I need to also make the same change there and then do case.build --clean-all; case.build; case.submit?
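
A sketch of the workflow being discussed here, assuming the same Macros.make edit has to be applied in both the main case directory and the nested *.test case ($CASEDIR is a placeholder for the directory that create_test produced):

    # Hypothetical sketch: rebuild/resubmit both case directories of the PEM test
    # after editing Macros.make.
    # $CASEDIR ~ m33-may12/PEM_Ld3.ne4_ne4.FC5AV1C-04P.cori-knl_intel.20170512_102147_u71tm7
    for dir in "$CASEDIR" "$CASEDIR"/*.test; do
        (
          cd "$dir"
          # apply the same Macros.make change here first
          ./case.build --clean-all
          ./case.build
          ./case.submit   # whether the nested case needs its own submit is exactly the question above
        )
    done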

@singhbalwinder

Ok. Yes, it seems like you have to do that as well. I recently ran into a lot of issues (unexpected behavior) when I used case.build --clean-all. I think the best thing to do is to completely delete the bld directory and then build the case again.

For this particular problem, we just need to find out which flag in the DEBUG build makes the test PASS. It may be the optimization level as well.
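
A small sketch of that suggestion, with the build-tree location shown as a placeholder (the actual bld directory lives wherever EXEROOT points for this machine):

    # Hypothetical: wipe the build tree instead of relying on case.build --clean-all,
    # then rebuild and resubmit from the case directory.
    cd "$CASEDIR"
    rm -rf /path/to/scratch/<case>/bld   # placeholder for this case's build directory
    ./case.build
    ./case.submit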

ndkeen commented May 18, 2017

Just a quick update. To help diagnose this, I'm changing compiler flags in config_compiler.xml, then running create_test, so I know for sure I'm testing what I think I am.

I'm using intel/17.0.2.174 (i1702), which is the current default.
-O2: still fails
-O0: passes
-O1: passes
-O2 without -qno-opt-dynamic: fails
-O2 with all HOMME special builds set to -O2 instead of -O3: fails
-O2 with i1701: fails
-O2 with the brand-new i1703: fails

Also, all of these tests were on KNL.

@singhbalwinder

So -O2 always fails, but the test passes with -O0 and -O1. It sounds like the compiler is optimizing some code differently for each executable, or there is a bug somewhere in the code.

ndkeen commented May 18, 2017

OK, there is a new-to-me setting for the fp-model flag called "consistent". So instead of -fp-model=source, use -fp-model=consistent. From the man page:

Enables consistent, reproducible results for different optimization levels or between different processors of the same architecture.

Sounds like it's what we want, right? I tried this and the test passed (both forcing 1 thread and without).

I can try this setting for the full acme_developer suite as well as a larger problem, to see if there are any performance issues. Update: the acme_developer tests passed on cori-haswell and cori-knl. I haven't looked closely, and would need to run a larger problem to know better, but I don't see any major perf differences.
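
A minimal sketch of that experiment, assuming the only change is swapping -fp-model=source for -fp-model=consistent in the cori-knl Intel flags of the compiler config before regenerating the tests (the suite invocation is shown the same way as the single tests; adjust to however you normally drive acme_developer):

    # 1. Edit the cori-knl Intel flags: -fp-model=source  ->  -fp-model=consistent
    # 2. Re-create the failing tests and the developer suite from scratch:
    create_test PEM_Ld3.ne4_ne4.FC5AV1C-04P2      --machine=cori-knl
    create_test PEM_Ld3_PMx1.ne4_ne4.FC5AV1C-04P2 --machine=cori-knl
    create_test acme_developer                    --machine=cori-knl   # full developer suite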

@singhbalwinder

Thanks @ndkeen! It's great that this flag solves this problem. I didn't know about this flag either.

ndkeen commented May 19, 2017

Note: I edited my previous comment with the results of those tests. Should we:
a) make another issue specific to changing that flag to -fp-model=consistent, referencing this issue, to see if others have any opinions?
b) go ahead and make the change?
c) continue trying to first measure the performance difference for a hi-res problem (such as an F or G case)?
d) continue looking for a different solution?

@singhbalwinder

I do not anticipate it making much difference performance-wise, but it is always good to check to be sure. I think that's my only concern; otherwise it is a very neat solution to this problem. If there is no major performance penalty, I don't see any other reason for not going with this solution.

ndkeen commented May 24, 2017

OK, I ran a hi-res F compset -- the same type of experiment I was doing before when looking at KNL performance with F compsets. When using fp-model=consistent instead of fp-model=source, I'm seeing a slowdown of about 13%, which is much more than I anticipated.

In /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS.ne120_ne120.m36n0169p10816t01 I ran two runs for 5 days each. First as normal, the second I changed Macros and did a build clean and rebuild.

jgfouca added a commit that referenced this issue Jun 2, 2017
additional fields needed for environment_variables

additional environment attributes needed
Test suite: hand tested
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes cdash on blues and melvin

User interface changes?:

Code review:

ndkeen commented Jul 10, 2017

Reminding myself of this issue. I'm not sure what to do. I could:
a) Continue trying to find other flags/settings that allow the test to pass, besides the slow -fp-model=consistent.
b) Create another build option (debug, release, fast?) where we set -fp-model=consistent for debug/release. I can think of other build options that would allow the code to run faster, but not always BFB.
c) Add -fp-model=consistent to debug/release and just deal with the slower performance.
d) Look for the specific files that need to be compiled with -fp-model=consistent and use the build dependency mechanism for them.
e) Leave as-is for now.

mt5555 commented Jul 11, 2017

My vote is for a (b)-like option: use -fp-model=consistent for testing, but not as the default for production runs. I agree with your suggestion that BFB reproducibility is a critical feature to maintain, but it should not be a requirement for all production runs.

And then, as a low priority, work on (d) so that eventually the production runs can also pass this test.

@worleyph

@ndkeen and @mt5555, just browsing the performance data from

 In /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS.ne120_ne120.m36n0169p10816t01 
 I ran two runs for 5 days each. First as normal, the second I changed Macros and did a build
 clean and rebuild.

Lots of timers appear to be slower, but also

  "a:caar_bexchV"     -    28860    -      25.388393     0.084374     0.000255 ...

vs.

  "a:caar_bexchV"     -    28860    -      57.876507     0.078818     0.000279  ...

so halo exchanges, which should not be directly affected by '-fp-model=consistent', are also much slower. This could be due to an increase in load imbalance preceding the halo exchange (from the larger computational cost), but the global statistics are

 "a:caar_bexchV"      ...   125.239 ( 10788      0)    13.170 (  9408      0)

vs.

 "a:caar_bexchV"      ...   164.555 ( 10790      0)    43.892 (  7671      0)

so even the minimum times are much slower. I'm not sure that I trust that '-fp-model=consistent' is the only source of the performance difference between these two experiments.

Have a question though ... another timer with a large difference is

 "a:eus_2d_advec"     ...     342.087799     0.055931     0.017093         0.014062

vs.

 "a:eus_2d_advec"     ...     400.628357     0.060689     0.019248         0.013781

However, "a:eus_2d_advec" calls

   call t_startf('eus_bexchV')
   call bndry_exchangeV( hybrid , edgeAdvp1 )
   call t_stopf('eus_bexchV')

but I don't see the "eus_bexchV" timer in the timing data, so I don't know how much of the performance difference is in bndry_exchangeV. Any idea why this timer is missing?

ndkeen commented Jul 12, 2017

I'm not familiar with this new flag, so I'm not sure what it might affect.

I ran the experiments again with my favorite June 1st master. This time it's only 4.7% slower.

normal:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS.ne120_ne120.m38n0169p10816t02b

 tStamp_write: model date =    10102       0 wall clock = 2017-07-11 21:50:05 avg dt =   364.31 dt =   364.31
 tStamp_write: model date =    10103       0 wall clock = 2017-07-11 21:56:01 avg dt =   359.98 dt =   355.65
 tStamp_write: model date =    10104       0 wall clock = 2017-07-11 22:01:56 avg dt =   358.28 dt =   354.87
 tStamp_write: model date =    10105       0 wall clock = 2017-07-11 22:07:49 avg dt =   357.14 dt =   353.74
 tStamp_write: model date =    10106       0 wall clock = 2017-07-11 22:13:43 avg dt =   356.44 dt =   353.60

and with -fp-model consistent:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS.ne120_ne120.m38n0169p10816t02fpconsistent

 tStamp_write: model date =    10102       0 wall clock = 2017-07-11 21:53:53 avg dt =   386.75 dt =   386.75
 tStamp_write: model date =    10103       0 wall clock = 2017-07-11 22:00:09 avg dt =   381.37 dt =   375.98
 tStamp_write: model date =    10104       0 wall clock = 2017-07-11 22:06:25 avg dt =   379.70 dt =   376.38
 tStamp_write: model date =    10105       0 wall clock = 2017-07-11 22:12:41 avg dt =   378.69 dt =   375.66
 tStamp_write: model date =    10106       0 wall clock = 2017-07-11 22:18:56 avg dt =   377.99 dt =   375.18

@singhbalwinder

I agree that (b) would be a fine solution until we find a better one. I think (d) will be the most time-consuming to figure out, but it is also the most desirable. Losing BFB reproducibility is okay in production runs as long as we are sure that the non-BFB behavior is roundoff-level noise (which I think is the case here) and not systematic. Non-BFB behavior can also make an issue harder to debug, for the obvious reason that we lose reproducibility.

@worleyph

Note that in the latest test the performance hit is less than 5% (for F case). How much of a performance hit is too much (for option (c))?

@singhbalwinder

Note that in the latest test the performance hit is less than 5% (for F case). How much of a performance hit is too much (for option (c))?

I don't know the answer to that, but I would be fine with a 5% loss for maintaining BFB, which I believe is very critical to maintain and helps a lot if something goes wrong.

ndkeen commented Jul 28, 2017

Quick minor update: after the Edison upgrade, I ran the 4 tests again (force-1-thread/normal and debug/opt) and they all passed.

@singhbalwinder

That's great! It might just be a compiler bug that is fixed in the newest version.

ndkeen commented Oct 10, 2017

Using the most recent Intel compiler on cori-knl (18.0.0.128, which is the current NERSC default), the PEM tests pass. This is with no other change.

ndkeen commented Oct 23, 2017

Adding PEM_Ln9.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel as another example of a test that fails on cori-knl but passes when I add -fp-model consistent.
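
For completeness, a reproduction sketch for this one (same create_test pattern as before; here the machine/compiler suffix is carried in the test name itself):

    # Fails with the default -fp-model source flags; passes after switching to -fp-model consistent.
    create_test PEM_Ln9.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel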

ndkeen commented Nov 10, 2017

I think I have a "solution" here: we use -fp-model consistent, as mentioned above.
Note that I've learned that while this flag is new in version 17, it's really just an abbreviation: -fp-model consistent is equivalent to -fp-model precise -fimf-arch-consistency=true -no-fma.
This allows several tests (I can make a table) to pass that would otherwise fail on KNL, and I don't see any new issues. It does slow things down a bit (which I can quantify) -- however, Intel version 18 has a new flag that gives us some of that lost performance back. The new flag is -fimf-use-svml, explained here:

New option -fimf-use-svml to force the usage of SVML

New option forces use of SVML where currently LIBM is used, for scalar math. This guarantees bitwise-same result of computations made with vectorized code vs computations made with scalar code. With this feature the compiler vectorizes math functions in -fp-model precise FP model and vectorized code produces results consistent with scalar code results.

https://software.intel.com/en-us/articles/intel-fortran-compiler-180-for-linux-release-notes-for-intel-parallel-studio-xe-2018#imf_use_svml

Also of interest:
https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/

I will make a PR (branch: ndk/machinefiles/intel-fp-consistent-use-svml) and provide more detailed data from my experiments, unless there is an objection.
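
To make the proposed change concrete, a schematic of the compile lines implied by the equivalence above (the file name and -O level are illustrative placeholders; the real flags live in the machine compiler config):

    # Intel 17: use -fp-model consistent, which Intel documents as shorthand for
    # -fp-model precise -fimf-arch-consistency=true -no-fma
    ifort -O2 -fp-model consistent foo.F90 -c

    # Intel 18: add -fimf-use-svml so vectorized math matches scalar math,
    # recovering some of the performance lost to -fp-model consistent
    ifort -O2 -fp-model consistent -fimf-use-svml foo.F90 -c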

ndkeen added a commit that referenced this issue Nov 11, 2017
… restrictive and allow more tests to pass.

For Intel 17, instead of -fp-model source, use -fp-model consistent.
For Intel 18, instead of -fp-model source, use -fp-model consistent -fimf-use-svml.
Fixes #1477
ndkeen added a commit that referenced this issue Jan 2, 2018
…t (PR #1919)

For cori-knl and Intel compiler, adjust the compiler flags to be more restrictive and allow more tests to pass.

For Intel 17, instead of -fp-model source, use -fp-model consistent.
For Intel 18, instead of -fp-model source, use -fp-model consistent -fimf-use-svml.
Fixes #1477