PEM_Ld3.ne4_ne4.FC5AV1C-04P2 fails compare on cori-knl, but passes in DEBUG #1477

Closed
ndkeen opened this issue May 1, 2017 · 25 comments

ndkeen commented May 1, 2017

On cori-knl, there might be an issue with compiler flags.
To run: create_test PEM_Ld3.ne4_ne4.FC5AV1C-04P2 --machine=cori-knl

PEM_Ld3.ne4_ne4.FC5AV1C-04P2 fails
PEM_Ld3_PMx1.ne4_ne4.FC5AV1C-04P2 fails (force 1 thread)

PEM_D_Ld3.ne4_ne4.FC5AV1C-04P2 passes (debug)
PEM_D_Ld3_PMx1.ne4_ne4.FC5AV1C-04P2 passes (debug and force 1 thread)

Same thing happens on cori-haswell -- fail compare, but passed in DEBUG.
Tests on Edison have been passing.
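
For reference, a minimal sketch of how these four variants can be launched (mirroring the create_test command above; test IDs and output directories are whatever create_test assigns):

    # Reproduce on cori-knl: optimized builds fail the compare, DEBUG builds pass.
    create_test PEM_Ld3.ne4_ne4.FC5AV1C-04P2        --machine=cori-knl   # fails
    create_test PEM_Ld3_PMx1.ne4_ne4.FC5AV1C-04P2   --machine=cori-knl   # fails (force 1 thread)
    create_test PEM_D_Ld3.ne4_ne4.FC5AV1C-04P2      --machine=cori-knl   # passes (debug)
    create_test PEM_D_Ld3_PMx1.ne4_ne4.FC5AV1C-04P2 --machine=cori-knl   # passes (debug, 1 thread)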

rljacob commented May 2, 2017

If you can't find a way around this with compiler flags, someone from the atmosphere group will have to look at it.

ndkeen commented May 5, 2017

OK. I'm trying again after some changes, to verify. My test on Edison still has not started.

ndkeen commented May 17, 2017

Hmm, perhaps what I'm trying is not ideal. After creating/running this test, I edit Macros.make to adjust compiler flags, then rebuild/rerun. However, since this test creates two executables and does two runs (so it can compare them), it is probably only rebuilding/rerunning one of them. Is there a suggested way to have a test such as this "try again" with an edited Macros.make?

@singhbalwinder

@ndkeen: After you change compiler flags, do you do a clean compile of the case?

ndkeen commented May 18, 2017

Yes, I was doing case.build --clean-all; case.build; case.submit
But this only rebuilds/resubmits in the main case directory, e.g. here:
m33-may12/PEM_Ld3.ne4_ne4.FC5AV1C-04P.cori-knl_intel.20170512_102147_u71tm7

-- but there is also another entire case directory:

m33-may12/PEM_Ld3.ne4_ne4.FC5AV1C-04P.cori-knl_intel.20170512_102147_u71tm7/PEM_Ld3.ne4_ne4.FC5AV1C-04P.cori-knl_intel.20170512_102147_u71tm7.test

I guess I need to also make the same change there and then do case.build --clean-all; case.build; case.submit?
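
A sketch of the workflow being discussed here, assuming the same Macros.make edit has to be applied in both the main case directory and the nested *.test case ($CASEDIR is a placeholder for the directory that create_test produced):

    # Hypothetical sketch: rebuild/resubmit both case directories of the PEM test
    # after editing Macros.make.
    # $CASEDIR ~ m33-may12/PEM_Ld3.ne4_ne4.FC5AV1C-04P.cori-knl_intel.20170512_102147_u71tm7
    for dir in "$CASEDIR" "$CASEDIR"/*.test; do
        (
          cd "$dir"
          # apply the same Macros.make change here first
          ./case.build --clean-all
          ./case.build
          ./case.submit   # whether the nested case needs its own submit is exactly the question above
        )
    done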

@singhbalwinder

Ok. Yes, it seems like you have to do that as well. I recently ran into a lot of issues (unexpected behavior) when I used case.build --clean-all. I think the best thing to do is to completely delete the bld directory and then build the case again.

For this particular problem, we just need to find out which flag in the DEBUG build makes the test PASS. It may be the optimization level as well.
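
A small sketch of that suggestion, with the build-tree location shown as a placeholder (the actual bld directory lives wherever EXEROOT points for this machine):

    # Hypothetical: wipe the build tree instead of relying on case.build --clean-all,
    # then rebuild and resubmit from the case directory.
    cd "$CASEDIR"
    rm -rf /path/to/scratch/<case>/bld   # placeholder for this case's build directory
    ./case.build
    ./case.submit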

ndkeen commented May 18, 2017

Just a quick update. To help diagnose this, I'm changing compiler flags in config_compiler.xml, then running create_test, so I know for sure I'm testing what I think I am.

I'm using intel/17.0.2.174 (i1702), which is the current default.
-O2: still fails
-O0: passes
-O1: passes
-O2 without -qno-opt-dynamic: fails
-O2 with all HOMME special builds set to -O2 instead of -O3: fails
-O2 with i1701: fails
-O2 with the brand-new i1703: fails

Also, all of these tests were on KNL.

@singhbalwinder

So -O2 always fails, but the test passes with -O0 and -O1. It sounds like the compiler is optimizing some code differently for each executable, or there is a bug somewhere in the code.

ndkeen commented May 18, 2017

OK, there is a new-to-me setting for the fp-model flag called "consistent". So instead of -fp-model=source, use -fp-model=consistent. From the man page:

Enables consistent, reproducible results for different optimization levels or between different processors of the same architecture.

Sounds like it's what we want, right? I tried this and the test passed (both forcing 1 thread and without).

I can try this setting for the full acme_developer suite as well as a larger problem, to see if there are any performance issues. Update: the acme_developer tests passed on cori-haswell and cori-knl. I haven't looked closely, and would need to run a larger problem to know better, but I don't see any major perf differences.
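
A minimal sketch of that experiment, assuming the only change is swapping -fp-model=source for -fp-model=consistent in the cori-knl Intel flags of the compiler config before regenerating the tests (the suite invocation is shown the same way as the single tests; adjust to however you normally drive acme_developer):

    # 1. Edit the cori-knl Intel flags: -fp-model=source  ->  -fp-model=consistent
    # 2. Re-create the failing tests and the developer suite from scratch:
    create_test PEM_Ld3.ne4_ne4.FC5AV1C-04P2      --machine=cori-knl
    create_test PEM_Ld3_PMx1.ne4_ne4.FC5AV1C-04P2 --machine=cori-knl
    create_test acme_developer                    --machine=cori-knl   # full developer suite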

@singhbalwinder

Thanks @ndkeen! It's great that this flag solves this problem. I didn't know about this flag either.

ndkeen commented May 19, 2017

Note: I edited my previous comment with the results of those tests. Should we:
a) make another issue specific to changing that flag to -fp-model=consistent, referencing this issue, to see if others have any opinions?
b) go ahead and make the change?
c) continue trying to first measure the performance difference for a hi-res problem (such as an F or G case)?
d) continue looking for a different solution?

@singhbalwinder

I do not anticipate it making much difference performance-wise, but it is always good to check to be sure. I think that's my only concern; otherwise it is a very neat solution to this problem. If there is no major performance penalty, I don't see any other reason for not going with this solution.

ndkeen commented May 24, 2017

OK, I ran a hi-res F compset -- the same type of experiment I was doing before when looking at KNL performance with F compsets. When using fp-model=consistent instead of fp-model=source, I'm seeing a slowdown of about 13%, which is much more than I anticipated.

In /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS.ne120_ne120.m36n0169p10816t01 I ran two runs for 5 days each. First as normal, the second I changed Macros and did a build clean and rebuild.

jgfouca added a commit that referenced this issue Jun 2, 2017
additional fields needed for environment_variables

additional environment attributes needed
Test suite: hand tested
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes cdash on blues and melvin

User interface changes?:

Code review:

ndkeen commented Jul 10, 2017

Reminding myself of this issue. I'm not sure what to do. I could:
a) Continue trying to find other flags/settings that allow the test to pass, besides the slow -fp-model=consistent.
b) Create another build option (debug, release, fast?) where we set -fp-model=consistent for debug/release. I can think of other build options that would allow the code to run faster, but not always BFB.
c) Add -fp-model=consistent to debug/release and just deal with the slower performance.
d) Look for the specific files that need to be compiled with -fp-model=consistent and use the build dependency mechanism for them.
e) Leave as-is for now.

mt5555 commented Jul 11, 2017

My vote is for a (b)-like option: use -fp-model=consistent for testing, but not as the default for production runs. I agree with your suggestion that BFB reproducibility is a critical feature to maintain, but it should not be a requirement for all production runs.

And then, as a low priority, work on (d) so that eventually the production runs can also pass this test.

@worleyph

@ndkeen and @mt5555, just browsing the performance data from

 In /global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS.ne120_ne120.m36n0169p10816t01 
 I ran two runs for 5 days each. First as normal, the second I changed Macros and did a build
 clean and rebuild.

Lots of timers appear to be slower, but also

  "a:caar_bexchV"     -    28860    -      25.388393     0.084374     0.000255 ...

vs.

  "a:caar_bexchV"     -    28860    -      57.876507     0.078818     0.000279  ...

so halo exchanges, which should not be directly affected by '-fp-model=consistent', are also much slower. This could be due to an increase in load imbalance preceding the halo exchange (from the larger computational cost), but the global statistics are

 "a:caar_bexchV"      ...   125.239 ( 10788      0)    13.170 (  9408      0)

vs.

 "a:caar_bexchV"      ...   164.555 ( 10790      0)    43.892 (  7671      0)

so even the minimum times are much slower. I'm not sure that I trust that '-fp-model=consistent' is the only source of the performance difference between these two experiments.

Have a question though ... another timer with a large difference is

 "a:eus_2d_advec"     ...     342.087799     0.055931     0.017093         0.014062

vs.

 "a:eus_2d_advec"     ...     400.628357     0.060689     0.019248         0.013781

However, "a:eus_2d_advec" calls

   call t_startf('eus_bexchV')
   call bndry_exchangeV( hybrid , edgeAdvp1 )
   call t_stopf('eus_bexchV')

but I don't see the "eus_bexchV" timer in the timing data, so I don't know how much of the performance difference is in bndry_exchangeV. Any idea why this timer is missing?

ndkeen commented Jul 12, 2017

I'm not familiar with this new flag, so I'm not sure what it might affect.

I ran the experiments again with my favorite June 1st master. This time it's only 4.7% slower.

normal:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS.ne120_ne120.m38n0169p10816t02b

 tStamp_write: model date =    10102       0 wall clock = 2017-07-11 21:50:05 avg dt =   364.31 dt =   364.31
 tStamp_write: model date =    10103       0 wall clock = 2017-07-11 21:56:01 avg dt =   359.98 dt =   355.65
 tStamp_write: model date =    10104       0 wall clock = 2017-07-11 22:01:56 avg dt =   358.28 dt =   354.87
 tStamp_write: model date =    10105       0 wall clock = 2017-07-11 22:07:49 avg dt =   357.14 dt =   353.74
 tStamp_write: model date =    10106       0 wall clock = 2017-07-11 22:13:43 avg dt =   356.44 dt =   353.60

and with -fp-model consistent:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/SMS.ne120_ne120.m38n0169p10816t02fpconsistent

 tStamp_write: model date =    10102       0 wall clock = 2017-07-11 21:53:53 avg dt =   386.75 dt =   386.75
 tStamp_write: model date =    10103       0 wall clock = 2017-07-11 22:00:09 avg dt =   381.37 dt =   375.98
 tStamp_write: model date =    10104       0 wall clock = 2017-07-11 22:06:25 avg dt =   379.70 dt =   376.38
 tStamp_write: model date =    10105       0 wall clock = 2017-07-11 22:12:41 avg dt =   378.69 dt =   375.66
 tStamp_write: model date =    10106       0 wall clock = 2017-07-11 22:18:56 avg dt =   377.99 dt =   375.18

@singhbalwinder

I agree that (b) would be a fine solution until we find a better one. I think (d) will be the most time-consuming to figure out, but it is also the most desirable. Losing BFB reproducibility is okay in production runs as long as we are sure that the non-BFB behavior is roundoff-level noise (which I think is the case here) and not systematic. Non-BFB behavior can also make an issue harder to debug, for the obvious reason that we lose reproducibility.

@worleyph

Note that in the latest test the performance hit is less than 5% (for F case). How much of a performance hit is too much (for option (c))?

@singhbalwinder

Note that in the latest test the performance hit is less than 5% (for F case). How much of a performance hit is too much (for option (c))?

I don't know the answer to that, but I would be fine with a 5% loss for maintaining BFB, which I believe is very critical to maintain and helps a lot if something goes wrong.

ndkeen commented Jul 28, 2017

Quick minor update: after the Edison upgrade, I ran the 4 tests again (force-1-thread/normal and debug/opt) and they all passed.

@singhbalwinder

That's great! It might just be a compiler bug that is fixed in the newest version.

ndkeen commented Oct 10, 2017

Using the most recent Intel compiler on cori-knl (18.0.0.128, which is the current NERSC default), the PEM tests pass. This is with no other change.

ndkeen commented Oct 23, 2017

Adding PEM_Ln9.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel as another example of a test that fails on cori-knl but passes when I add -fp-model consistent.
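
For completeness, a reproduction sketch for this one (same create_test pattern as before; here the machine/compiler suffix is carried in the test name itself):

    # Fails with the default -fp-model source flags; passes after switching to -fp-model consistent.
    create_test PEM_Ln9.ne30_oECv3_ICG.A_WCYCL1850S.cori-knl_intel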

ndkeen commented Nov 10, 2017

I think I have a "solution" here: we use -fp-model consistent, as mentioned above.
Note that I've learned that while this flag is new in version 17, it's really just an abbreviation: -fp-model consistent is equivalent to -fp-model precise -fimf-arch-consistency=true -no-fma.
This allows several tests (I can make a table) to pass that would otherwise fail on KNL, and I don't see any new issues. It does slow things down a bit (which I can quantify) -- however, Intel version 18 has a new flag that gives us some of that lost performance back. The new flag is -fimf-use-svml, explained here:

New option -fimf-use-svml to force the usage of SVML

New option forces use of SVML where currently LIBM is used, for scalar math. This guarantees bitwise-same result of computations made with vectorized code vs computations made with scalar code. With this feature the compiler vectorizes math functions in -fp-model precise FP model and vectorized code produces results consistent with scalar code results.

https://software.intel.com/en-us/articles/intel-fortran-compiler-180-for-linux-release-notes-for-intel-parallel-studio-xe-2018#imf_use_svml

Also of interest:
https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/

I will make a PR (branch: ndk/machinefiles/intel-fp-consistent-use-svml) and provide more detailed data from my experiments, unless there is an objection.
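
To make the proposed change concrete, a schematic of the compile lines implied by the equivalence above (the file name and -O level are illustrative placeholders; the real flags live in the machine compiler config):

    # Intel 17: use -fp-model consistent, which Intel documents as shorthand for
    # -fp-model precise -fimf-arch-consistency=true -no-fma
    ifort -O2 -fp-model consistent foo.F90 -c

    # Intel 18: add -fimf-use-svml so vectorized math matches scalar math,
    # recovering some of the performance lost to -fp-model consistent
    ifort -O2 -fp-model consistent -fimf-use-svml foo.F90 -c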

ndkeen added a commit that referenced this issue Nov 11, 2017
… restrictive and allow more tests to pass.

For Intel 17, instead of -fp-model source, use -fp-model consistent.
For Intel 18, instead of -fp-model source, use -fp-model consistent -fimf-use-svml.
Fixes #1477
ndkeen added a commit that referenced this issue Jan 2, 2018
…t (PR #1919)

For cori-knl and Intel compiler, adjust the compiler flags to be more restrictive and allow more tests to pass.

For Intel 17, instead of -fp-model source, use -fp-model consistent.
For Intel 18, instead of -fp-model source, use -fp-model consistent -fimf-use-svml.
Fixes #1477