PAPI_TOT_INS randomly off on AMD EPYC 7352 #160

Open
Flamefire opened this issue Feb 14, 2024 · 9 comments

@Flamefire

We test the installation with the usual make test, which runs ctests/zero. That test fails seemingly at random, but very often.

Passing output is:

PAPI Error: Couldn't open hw_instructions in exclude_guest=0 test
Test case 0: start, stop.
-----------------------------------------------
Default domain is: 1 (PAPI_DOM_USER)
Default granularity is: 1 (PAPI_GRN_THR)
Using 200 iterations 1 million instructions
-------------------------------------------------------------------------
Test type    : 	           1
PAPI_TOT_CYC : 	    140513836
PAPI_TOT_INS : 	    200001620
IPC          : 	         1.42
Real usec    : 	       359225
Real cycles  : 	    826204390
Virt usec    : 	       352658
Virt cycles  : 	    811113400
-------------------------------------------------------------------------
Verification: PAPI_TOT_INS should be roughly 200000000
PASSED

Failing output:

PAPI Error: Couldn't open hw_instructions in exclude_guest=0 test
Test case 0: start, stop.
-----------------------------------------------
Default domain is: 1 (PAPI_DOM_USER)
Default granularity is: 1 (PAPI_GRN_THR)
Using 200 iterations 1 million instructions
-------------------------------------------------------------------------
Test type    : 	           1
PAPI_TOT_CYC : 	    140556202
PAPI_TOT_INS : 	   3200025872
IPC          : 	        22.77
Real usec    : 	       359115
Real cycles  : 	    825951988
Virt usec    : 	       352894
Virt cycles  : 	    811653900
-------------------------------------------------------------------------
Verification: PAPI_TOT_INS should be roughly 200000000
PAPI_TOT_INS Error of 1500.01%
FAILED!!!
Line # 161 Error: Instruction validation

This is with PAPI 7.1.0 on an "AMD EPYC 7352 24-Core Processor" system (4 CPUs).

It looks like it randomly picks up roughly "3000000000" additional instructions. That looks like an error, since the remainder makes sense.
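As a quick sanity check on the numbers above, the following sketch recomputes the deviation from the two sample runs. It assumes the test reports the relative deviation from the expected 200000000 instructions, which matches the printed 1500.01% but is not taken from the test source; `percent_error` is a hypothetical helper name.

```python
# Values copied from the passing and failing ctests/zero outputs above.
EXPECTED = 200_000_000
passing = 200_001_620
failing = 3_200_025_872

def percent_error(measured, expected=EXPECTED):
    """Relative deviation from the expected count, in percent (assumed formula)."""
    return 100.0 * (measured - expected) / expected

print(f"passing run error: {percent_error(passing):.5f}%")
print(f"failing run error: {percent_error(failing):.2f}%")
print(f"extra instructions in failing run: {failing - passing}")
```

The difference between the two runs is about 3.0e9 instructions, but not a round 3000000000.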

Any ideas?

@dbarry9
Contributor

dbarry9 commented Feb 14, 2024

Hi Alexander,

I am looking into this issue and will keep you updated.

Thank you,
Daniel

@dbarry9
Contributor

dbarry9 commented Feb 14, 2024

I am unable to reproduce this issue on our AMD Zen2 testbed. I also do not get the error message:

PAPI Error: Couldn't open hw_instructions in exclude_guest=0 test

Could you please provide the output from the 'papi_component_avail' utility in addition to the value in the file '/proc/sys/kernel/perf_event_paranoid'?

Thank you,
Daniel

@Flamefire
Author

Here is the requested output:

Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version             : 7.1.0.0
Operating system         : Linux 4.18.0-425.19.2.el8_7.x86_64
Vendor string and code   : AuthenticAMD (2, 0x2)
Model string and code    : AMD EPYC 7352 24-Core Processor (49, 0x31)
CPU revision             : 0.000000
CPUID                    : Family/Model/Stepping 23/49/0, 0x17/0x31/0x00
CPU Max MHz              : 2300
CPU Min MHz              : 1500
Total cores              : 96
SMT threads per core     : 2
Cores per socket         : 24
Sockets                  : 2
Cores per NUMA region    : 24
NUMA regions             : 4
Running in a VM          : no
Number Hardware Counters : 5
Max Multiplex Counters   : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------

Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
   \-> Disabled: Insufficient permissions for uncore access.  Set /proc/sys/kernel/perf_event_paranoid to 0 or run as root.
Name:   rapl                    Linux RAPL energy measurements
   \-> Disabled: Can't open fd for cpu0: Permission denied
Name:   sysdetect               System info detection component

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 141, Preset: 17, Counters: 5
                                PMUs supported: perf, perf_raw, amd64_fam17h_zen2

Name:   sysdetect               System info detection component
                                Native: 0, Preset: 0, Counters: 0


--------------------------------------------------------------------------------
$ cat /proc/sys/kernel/perf_event_paranoid
2
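For context on what the value 2 means, here is a small sketch that maps a perf_event_paranoid level to the restrictions it places on unprivileged users, as described in the Linux kernel sysctl documentation. The helper names `describe_paranoid` and `read_paranoid` are made up for illustration, and the wording of the restrictions is paraphrased.

```python
from pathlib import Path

# Each restriction applies to unprivileged users when the setting is at or
# above the given level (paraphrased from the kernel sysctl documentation).
RESTRICTIONS = [
    (0, "no raw or ftrace function tracepoint access"),
    (1, "no CPU event access"),
    (2, "no kernel profiling"),
]

def describe_paranoid(level):
    """List the restrictions active at a given perf_event_paranoid level."""
    return [text for threshold, text in RESTRICTIONS if level >= threshold]

def read_paranoid(path="/proc/sys/kernel/perf_event_paranoid"):
    """Read the current setting, or return None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None

print("current level on this machine:", read_paranoid())
for text in describe_paranoid(2):
    print("level 2:", text)
```

At level 2 user-space counting is still permitted, which would be consistent with the test running at all under the default PAPI_DOM_USER domain shown in the outputs.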

Note that the failure doesn't always happen. After make test had just failed (2 times in a row), I ran ctests/zero manually and it succeeded 17 times before it failed, but then it failed many times in a row until it succeeded again. It doesn't seem to matter whether I wait some time or not. E.g. I just tried for i in {1..20}; do ctests/zero 2>&1 | grep 'PAPI_TOT_INS :'; done and got:

PAPI_TOT_INS : 	   3200025872
PAPI_TOT_INS : 	   3200025872
PAPI_TOT_INS : 	   3200025872
PAPI_TOT_INS : 	   3200025872
PAPI_TOT_INS : 	   3200025984
PAPI_TOT_INS : 	   3200025872
PAPI_TOT_INS : 	   3200025872
PAPI_TOT_INS : 	   3200026016
PAPI_TOT_INS : 	   3200025872
PAPI_TOT_INS : 	   3200026032
PAPI_TOT_INS : 	   3200025872
PAPI_TOT_INS : 	   3200025920
PAPI_TOT_INS : 	   3200025888
PAPI_TOT_INS : 	   3200025952
PAPI_TOT_INS : 	   3200025888
PAPI_TOT_INS : 	   3200025888
PAPI_TOT_INS : 	   3200025920
PAPI_TOT_INS : 	   3200025872
PAPI_TOT_INS : 	   3200025904
PAPI_TOT_INS : 	    200001618

Repeating the same loop twice I got all success (200001617 - 200001627) and then all fails (3200025856 - 3200026016)
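Since the failures seem to come in streaks, a small sketch like the following could summarize the grep output as runs of consecutive passes and failures. The 1% tolerance is an assumption for illustration and is not taken from the PAPI test source; `parse_tot_ins` and `streaks` are hypothetical helper names.

```python
import re
from itertools import groupby

EXPECTED = 200_000_000
TOLERANCE = 0.01  # assumed threshold for illustration only

def parse_tot_ins(output):
    """Extract every PAPI_TOT_INS value from ctests/zero (or grep) output."""
    return [int(v) for v in re.findall(r"PAPI_TOT_INS\s*:\s*(\d+)", output)]

def streaks(values):
    """Collapse consecutive results into (is_ok, run_length) pairs."""
    flags = [abs(v - EXPECTED) / EXPECTED <= TOLERANCE for v in values]
    return [(ok, len(list(group))) for ok, group in groupby(flags)]

# Last four lines of the output listed above.
sample = """PAPI_TOT_INS : \t   3200025920
PAPI_TOT_INS : \t   3200025872
PAPI_TOT_INS : \t   3200025904
PAPI_TOT_INS : \t    200001618
"""
print(streaks(parse_tot_ins(sample)))
```

Feeding it the full output of a longer loop would show how long the passing and failing streaks typically are.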

@dbarry9
Contributor

dbarry9 commented Feb 16, 2024

Hi Alexander,

Could you please change the value in perf_event_paranoid to 0, re-run the ctest, and post the results?

@dbarry9
Contributor

dbarry9 commented Feb 26, 2024

Hi Alexander,

Are there any updates to this issue? Has it been resolved?

@Flamefire
Author

The issue I observe happens on an HPC system, so I don't have sufficient privileges to change that flag. I asked the admins and am waiting for a reply.

@satishskamath

For zen2 and zen4 I can confirm that the test passes even with perf_event_paranoid=2

  • zen4:
$ for i in {1..20}; do ctests/zero 2>&1 | grep 'PAPI_TOT_INS :'; done
PAPI_TOT_INS :      200001300
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001300
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001300
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001300
PAPI_TOT_INS :      200001300
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301
PAPI_TOT_INS :      200001301

  • zen2:
$ for i in {1..20}; do ctests/zero 2>&1 | grep 'PAPI_TOT_INS :'; done
PAPI_TOT_INS :      200001305
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001305
PAPI_TOT_INS :      200001305
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001305
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001305
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001306
PAPI_TOT_INS :      200001305

@Flamefire
Author

perf_event_paranoid doesn't seem to have much influence

Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version             : 7.1.0.0
Operating system         : Linux 4.18.0-425.19.2.el8_7.x86_64
Vendor string and code   : AuthenticAMD (2, 0x2)
Model string and code    : AMD EPYC 7702 64-Core Processor (49, 0x31)
CPU revision             : 0.000000
CPUID                    : Family/Model/Stepping 23/49/0, 0x17/0x31/0x00
CPU Max MHz              : 2183
CPU Min MHz              : 1500
Total cores              : 256
SMT threads per core     : 2
Cores per socket         : 64
Sockets                  : 2
Cores per NUMA region    : 32
NUMA regions             : 8
Running in a VM          : no
Number Hardware Counters : 5
Max Multiplex Counters   : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------

Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
   \-> Disabled: Insufficient permissions for uncore access.  Set /proc/sys/kernel/perf_event_paranoid to 0 or run as root.
Name:   rapl                    Linux RAPL energy measurements
   \-> Disabled: Can't open fd for cpu0: Permission denied
Name:   sysdetect               System info detection component

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 141, Preset: 17, Counters: 5
                                PMUs supported: perf, perf_raw, amd64_fam17h_zen2

Name:   sysdetect               System info detection component
                                Native: 0, Preset: 0, Counters: 0

Tested with the following loop:

ctOK=0; ctERR=0; for i in {1..1000}; do if ctests/zero &>/dev/null; then ((ctOK++)); else ((ctERR++)); fi; done; echo "OK: $ctOK; FAIL: $ctERR; perf_event_paranoid: $(cat /proc/sys/kernel/perf_event_paranoid)"

  • OK: 594; FAIL: 406; perf_event_paranoid: 0
  • OK: 468; FAIL: 532; perf_event_paranoid: 1
  • OK: 581; FAIL: 419; perf_event_paranoid: 2

Even as a user with root privileges, the results don't look much different.
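To put the three results side by side, this sketch computes the failure rate for each perf_event_paranoid level from the counts listed above. Because failures appear to come in streaks, the runs are probably not independent, so the differences between levels should be read with caution.

```python
# (OK, FAIL) counts per perf_event_paranoid level, copied from the list above.
results = {0: (594, 406), 1: (468, 532), 2: (581, 419)}

def failure_rate(ok, fail):
    """Fraction of runs that failed."""
    return fail / (ok + fail)

for level, (ok, fail) in sorted(results.items()):
    print(f"perf_event_paranoid={level}: {failure_rate(ok, fail):.1%} failed")
```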

@Flamefire
Author

Flamefire commented Nov 29, 2024

Any updates here?
I did a quick calculation on my previous observation:

Repeating the same loop twice I got all success (200001617 - 200001627) and then all fails (3200025856 - 3200026016)

The failing ones are off by a factor of almost exactly 16 (15.99999992). A constant factor of ~16 makes me think there is an issue with the measuring logic, and/or the scheduling sometimes does something different.
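The factor can be checked directly against the failing values posted earlier in this thread. The sketch below divides each one by 16 and looks at the remainder; it only tests the arithmetic and says nothing about where such a factor would come from.

```python
# Distinct failing PAPI_TOT_INS values reported earlier in this thread.
failing = [
    3200025856, 3200025872, 3200025888, 3200025904, 3200025920,
    3200025952, 3200025984, 3200026016, 3200026032,
]

# Quotient and remainder of each failing value when divided by 16.
quotients = []
for value in failing:
    q, r = divmod(value, 16)
    quotients.append(q)
    print(f"{value} = 16 * {q} + {r}")

print("all exact multiples of 16:", all(v % 16 == 0 for v in failing))
print("quotient range:", min(quotients), "-", max(quotients))
```

Every listed failing value is an exact multiple of 16, and the quotients (200001616 to 200001627) sit in or right next to the range seen for passing runs (200001617 to 200001627).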
