
shift in multithreaded benchmarks due to new setting of CPU scaling governor on phi3 #232

Closed
srlantz opened this issue Aug 7, 2019 · 7 comments

Comments

@srlantz (Collaborator) commented Aug 7, 2019

Summary:

On phi3, sometime between July 9 and 15, root ran the command "cpupower frequency-set -g performance", which sets the CPU frequency scaling governor to "performance"; the previous setting was "powersave". This change caused a shift in the multithreaded benchmark results on phi3: multithreaded timings are now significantly lower (better), and the biggest changes occur at high thread counts.
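
For reference, here is a minimal sketch of how the governor can be inspected and set on these hosts (assuming the cpupower utility and the usual cpufreq sysfs layout):

    # show the governor currently in effect on each core
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    # equivalent query via cpupower (reports the current policy)
    cpupower frequency-info --policy

    # the command root ran on phi3, which switches all cores to "performance"
    sudo cpupower frequency-set -g performance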

Action item:

We need to decide whether to accept this as our new baseline, in which case the new setting should be preserved going forward; and if so, we need to decide whether the same change should be applied to all test machines.

Details:

As noted in PR #229 on July 16, the customary benchmarks had shown an unexplained improvement in multithreaded performance on phi3. This applies only to phi3; performance on phi2 is unchanged. So far everyone has assumed that this is somehow due to the Intel compiler update that took place on July 15, which upgraded icc from 19.0.0 to 19.0.4 on phi3. However, @tresreid found that when he uses icc 19.0.0 to compile the HEAD of devel (currently the same as in PR #229), the timing performance does not return to the previous baseline -

http://areinsvo.web.cern.ch/areinsvo/MkFit/Benchmarks/July2019_AddValOpt/Benchmarks/SKL-SP_CMSSW_TTbar_PU70_TH_time.png (PR #229 = HEAD, with 19.0.4 compiler)
http://mireid.web.cern.ch/mireid/Benchmarks/benchmarks_performanceissues_08_08_19/benchmark_head_19-4/Benchmarks/SKL-SP_CMSSW_TTbar_PU70_TH_time.png (HEAD with 19.0.4)
http://mireid.web.cern.ch/mireid/Benchmarks/benchmarks_performanceissues_08_08_19/benchmark_head_19-0v2/Benchmarks/SKL-SP_CMSSW_TTbar_PU70_TH_time.png (HEAD with 19.0.0)

In fact, on phi3, it currently seems impossible to reproduce any of our older multithreaded benchmark results from on or before July 3 (when phi3 still had icc 19.0.0). Historically, on phi3, the multithreaded benchmark looked more like the following -

http://xrd-cache-1.t2.ucsd.edu/matevz/PKF/208-Optional-HitSorting-Prefetch-Cleanup/Benchmarks/SKL-SP_CMSSW_TTbar_PU70_TH_time.png
http://mireid.web.cern.ch/mireid/Benchmarks/benchmarks_06_07_19/Benchmarks/SKL-SP_CMSSW_TTbar_PU70_TH_time.png
http://mireid.web.cern.ch/mireid/Benchmarks/benchmarks_7_3_19_saferdel/Benchmarks/SKL-SP_CMSSW_TTbar_PU70_TH_time.png (PR #227 = cb3538a, with 19.0.0 compiler)

At present, when @tresreid re-runs the last of these older benchmarks on phi3, the timings again show the same kind of improvement for high thread counts, and again regardless of the compiler version -

http://mireid.web.cern.ch/mireid/Benchmarks/benchmarks_performanceissues_08_08_19/benchmark_cb35381_19-4/Benchmarks/SKL-SP_CMSSW_TTbar_PU70_TH_time.png (cb3538a with 19.0.4)
http://mireid.web.cern.ch/mireid/Benchmarks/benchmarks_performanceissues_08_08_19/benchmark_cb35381_19-0/Benchmarks/SKL-SP_CMSSW_TTbar_PU70_TH_time.png (cb3538a with 19.0.0)

Note that @tresreid switched to the older compiler by editing init-env.sh so it sources the compilervars.sh setup script from the 19.0.0 release. As far as we can tell, this truly does switch both icc and TBB to the earlier release, as expected, when building and running the benchmarks.
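
For the record, the switch amounts to changing which compilervars.sh gets sourced in init-env.sh. The sketch below is illustrative only; the versioned install directory is an assumption about Intel's usual layout, not the actual path on phi3:

    # init-env.sh (illustrative): point the environment at the 19.0.0 release
    # instead of 19.0.4; the versioned directory name below is an assumption
    source /opt/intel/compilers_and_libraries_2019.0.117/linux/bin/compilervars.sh intel64

    # sanity checks after sourcing
    icc --version        # should report 19.0.0
    echo $TBBROOT        # should point into the same 2019.0.x tree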

Today, @dan131riley and I finally sleuthed out what happened on phi3: root issued "cpupower frequency-set -g performance" sometime between July 9 and 15. The timeframe was derived by examining root's command history and comparing it to yum history and to an email thread with @osschar (both of which have dates). The yum history shows that a non-Intel TBB was installed and then uninstalled on July 9; the cpupower change appears after the yum commands. Then, from email, I know that @osschar downloaded Intel Parallel Studio 19 Update 4 on July 15; the cpupower change appears before the corresponding wget.
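
For anyone retracing this, the commands involved are roughly the following (a sketch; the transaction ID is a placeholder):

    # list yum transactions with their timestamps, then inspect one in detail
    sudo yum history list all
    sudo yum history info <transaction-id>

    # root's shell history (no timestamps by default, hence the need to
    # bracket the cpupower command between dated events)
    sudo tail -n 200 /root/.bash_history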

Side note: yum logs on phi3 show that there was a microcode update that came in with the "yum update" on July 1. This "yum update" was associated with the installation of the NVIDIA V100 and CUDA on that date. Furthermore, from the boot logs, a reboot took place on July 1 as well. ("uptime" seems not to know about the reboot, so the OS's state must have been preserved on disk during the reboot.) In spite of these substantial changes, the usual, slower multithreaded performance was still observed in benchmarks for PR #227 on July 3. So the suspicious-looking microcode update turns out to be a red herring.
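
For completeness, the checks behind this side note were along these lines (assuming a systemd-based host; exact output will vary):

    # current microcode revision loaded on the cores
    grep -m1 microcode /proc/cpuinfo

    # reboot history as recorded in wtmp and by systemd
    last reboot | head
    sudo journalctl --list-boots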

To clinch the case that the cpupower command is responsible for the change in behavior on phi3, @tresreid ran the benchmarks twice on lnx7188 (at Cornell): first with the CPU scaling governor set to "powersave", then with it set to "performance". The compiler is icc 19.0.4 and the code version is the current HEAD of devel -

http://mireid.web.cern.ch/mireid/Benchmarks/benchmarks_performanceissues_08_08_19/benchmark_head_19-4/Benchmarks/LNX-G_CMSSW_TTbar_PU70_TH_time.png ("powersave")
http://mireid.web.cern.ch/mireid/Benchmarks/benchmarks_performanceissues_08_08_19/lnx_performance_19-4/Benchmarks/LNX-G_CMSSW_TTbar_PU70_TH_time.png ("performance")

With "performance", the multithreaded timings drop to levels comparable to what is observed on phi3. Note that lnx7188 contains Skylake Gold processors like those in phi3, with the same numbers of cores, but with better clock speed and scaling properties.

I suspect that this improved performance comes about because at high thread counts, each benchmark trial finishes in < 6 msec. In "powersave" mode, the cores have to ramp up from an idle state to reach full clock speed, and this transient affects the overall timing.

Finally, @dan131riley has determined that the following command will make the CPU scaling governor settings persist across reboots: "tuned-adm profile throughput-performance". This command also tweaks some other settings for optimized throughput.
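
For reference, applying and verifying the profile looks like this (the verification commands are standard tuned-adm usage rather than anything specific to our setup):

    # apply the persistent profile Dan found
    sudo tuned-adm profile throughput-performance

    # confirm which profile is active, and what else is available
    tuned-adm active
    tuned-adm list

    # double-check that the governor survived, e.g. after a reboot
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor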

@slava77 (Collaborator) commented Aug 8, 2019

> I suspect that this improved performance comes about because at high thread counts, each benchmark trial finishes in < 6 msec. In "powersave" mode, the cores have to ramp up from an idle state to reach full clock speed, and this transient affects the overall timing.

Would this also imply that longer tests, at least the ones using all cores, should show the same CPU performance with and without powersave?

Could it also be that in powersave mode we don't even get scheduled to run on the second CPU until we fill up the (hyper)threads on the first one?

@dan131riley (Collaborator)

> Would this also imply that longer tests, at least the ones using all cores, should show the same CPU performance with and without powersave?

> Could it also be that in powersave mode we don't even get scheduled to run on the second CPU until we fill up the (hyper)threads on the first one?

Both are interesting questions. It seems at least possible that we could be seeing the clock rate reduced during tests when we aren't fully utilizing the cores. A plot of clock rate vs. time in the different modes might give some answers, but I don't know if we have an easy way to collect that.
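
One low-tech possibility, sketched here under the assumption that sampling /proc/cpuinfo is good enough, would be something like:

    # sample per-core clock rates once per second into a trace file;
    # run this in the background for the duration of a benchmark
    while true; do
        date +%s
        grep "cpu MHz" /proc/cpuinfo
        sleep 1
    done > clock_trace.txt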

In general, the "performance" mode compresses the clock speed range on both ends--it raises the minimum, but also lowers the max because the raised minimum rules out turbo mode. We don't generally enable it because it rules out turbo, but I hadn't revisited that since we decided to turn off turbo on our test platforms.

@srlantz (Collaborator, Author) commented Aug 9, 2019

@slava77 Yes, I think longer tests with the scaling governor set to "powersave" might well approach the same multithreaded timings that we see with the "performance" setting - unless the CPU clocks are ramping down intermittently during quiet intervals of a run, as Dan hypothesizes. Such intervals might happen during serial sections of the code, for instance.

How would we find out? Here's a quote from John McCalpin in the Intel forums: 'I recommend running "perf stat" on a real program with a running time of more than a few seconds to see what the governor actually does.' (https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/742952)
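
Concretely, that might look like the following; the binary name and arguments are placeholders for our usual benchmark invocation, not a prescription:

    # report the average clock rate (the "GHz" figure next to "cycles"),
    # instructions per cycle, and elapsed time for one benchmark run
    perf stat -e task-clock,cycles,instructions ./mkFit <usual benchmark arguments>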

Note that we do have an open issue (#220) to add an option to use more events per thread, at least for when we run benchmarks for conferences. This may reduce edge effects from the scaling governor, as well as other causes.

@dan131riley The situation with the intel_pstate driver (which we are using on all machines) seems to have changed with Skylake. Basically the driver cedes nearly all control over CPU frequencies to the hardware. Here is a good reference: https://www.kernel.org/doc/html/v4.12/admin-guide/pm/intel_pstate.html. I have verified that phi3, lnx7188, and lnx4108 (our Skylake servers) are running in "Active Mode with HWP". Later, if I have time, I can post details of how to check this.

Apparently "performance" does still allow a lower-frequency idle state (you were right). If you do "grep MHz /proc/cpuinfo" on phi3 and lnx7188 while they're quiet, you will see that most cores are sitting at around 1000 MHz, well below their nominal base frequencies (2.1 and 2.6 GHz respectively), Likewise, lnx4108 typically sits at around 800 MHz on most cores (2.1 GHz is the nominal base). All these machines currently have the "performance" setting.

@tresreid (Collaborator) commented Aug 9, 2019

FYI: I've moved all the benchmarks for this here: http://mireid.web.cern.ch/mireid/Benchmarks/benchmarks_performanceissues_08_08_19/

@srlantz (Collaborator, Author) commented Aug 12, 2019

Fixed links in my first post based on the new location of Tres's benchmark plots.

@srlantz (Collaborator, Author) commented Aug 14, 2019

(Reminder, this issue is meant to provide the technical background for the solutions being discussed in issue #233.)

Here is how to check that a server's intel_pstate driver is running in "Active Mode with HWP", based on https://www.kernel.org/doc/html/v4.12/admin-guide/pm/intel_pstate.html#operation-modes and https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt (a consolidated sketch of these checks follows the list):

  1. cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver to confirm that the scaling driver is intel_pstate (assuming here it is the same for all CPUs beyond cpu0)
  2. cat /sys/devices/system/cpu/intel_pstate/status to confirm that intel_pstate is operating in active mode (the default; "passive" can only be chosen on the kernel command line at boot time, and the kernel command line can be inspected at /proc/cmdline)
  3. sudo rdmsr 0x770 to confirm that an MSR known as IA32_PM_ENABLE has the value 1; if so, then HWP (Hardware-managed P-states) is enabled, and many settings in sysfs will have no effect whatsoever.
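
Put together, the three checks look roughly like this (rdmsr comes from the msr-tools package and may require loading the msr kernel module first):

    # (1) scaling driver
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver   # expect: intel_pstate

    # (2) operating mode of intel_pstate
    cat /sys/devices/system/cpu/intel_pstate/status           # expect: active

    # (3) HWP enable bit in the IA32_PM_ENABLE MSR
    sudo modprobe msr
    sudo rdmsr 0x770                                          # expect: 1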

The upshot is that if a processor passes checks (1)-(3) above, then the ONLY setting in sysfs that has any effect on frequency scaling is /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor (for all cores, not just cpu0); and the intel_pstate driver ONLY accepts the values "performance" or "powersave" for the scaling_governor setting on each core.
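
In practice, then, the only knob worth touching by hand is the governor itself, either via cpupower as above or directly through sysfs; a minimal sketch:

    # set the governor on every core via sysfs
    # (equivalent to "cpupower frequency-set -g performance")
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance | sudo tee "$g" > /dev/null
    done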

From what I can gather, HWP = Speed Shift, a feature that was introduced with Skylake. When the processor has this feature, it is enabled by default. Thus in step (3) above, I find that lnx7188 (Skylake) returns 1, but phi1 (Sandy Bridge) returns a "cannot read" message. Further evidence that HWP only applies to Skylake and later is that "grep hwp /proc/cpuinfo" returns the "flags" (features) lines from all the cores on lnx7188, but it returns nothing on phi1.

Reference on the correct hex code for IA32_PM_ENABLE:
https://github.com/tianocore/edk2/blob/master/MdePkg/Include/Register/Intel/ArchitecturalMsr.h

@kmcdermo (Collaborator)

I am going to close this for now, since the cause of this shift has been determined, and "performance" is now the default setting on all machines checked by the scripts from PR #238.
