Negative/large values in topdown perf counters #207

narru19 opened this issue Jul 24, 2024 · 12 comments

narru19 commented Jul 24, 2024

  • System: Intel Sapphire Rapids 8480+
  • OS: Red Hat Enterprise Linux 9.2
  • PAPI Version: 7.1.0

I am trying to read topdown counters that are reported to be available in the system above:

$ papi_native_avail | grep topdown
| perf::topdown-retiring                                                       |
| perf::topdown-bad-spec                                                       |
| perf::topdown-fe-bound                                                       |
| perf::topdown-be-bound                                                       |

Going by the description of these metrics (must be used in a group with the other topdown- events, with slots as leader), I created an EventSet with perf::slots as the first event in the set. However, the values I obtain through PAPI_read are negative. I initially assumed this could be an overflow, given the granularity of the workload.
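For reference, a minimal sketch of that event set (error checking omitted; the event names are the ones listed by papi_native_avail above, and whether adding perf::slots first is enough to make it the perf group leader is part of the question):

#include "papi.h"

int build_topdown_eventset(void)
{
    int evset = PAPI_NULL;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);

    /* perf::slots is added first so it can act as the group leader */
    PAPI_add_named_event(evset, "perf::slots");
    PAPI_add_named_event(evset, "perf::topdown-retiring");
    PAPI_add_named_event(evset, "perf::topdown-bad-spec");
    PAPI_add_named_event(evset, "perf::topdown-fe-bound");
    PAPI_add_named_event(evset, "perf::topdown-be-bound");

    return evset;
}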

Investigating further, I found that the perf utility seems to give reasonable results when polling these counters:

$ perf stat --topdown -I1000 --td-level=1 --no-metric-only
#           time             counts unit events
     1.001028171          670042878      slots
     1.001028171          102988906      topdown-retiring                 #     15.4% Retiring
     1.001028171           18685968      topdown-bad-spec                 #      2.8% Bad Speculation
     1.001028171           49436879      topdown-fe-bound                 #      7.4% Frontend Bound
     1.001028171          499046991      topdown-be-bound                 #     74.5% Backend Bound
     1.001028171           14145047      topdown-heavy-ops
     1.001028171           15961096      topdown-br-mispredict
     1.001028171           28771199      topdown-fetch-lat
     1.001028171          406127949      topdown-mem-bound
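(For reference, the percentages in the right-hand column are just each raw count divided by slots; e.g. topdown-retiring: 102988906 / 670042878 ≈ 0.154, the 15.4% Retiring shown.)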

However, when attempting to read them through the papi_command_line utility, I get results similar to those I obtain when polling them manually through my library. For instance, reading the 5 events at once reports the following:

$ papi_command_line -x perf::slots perf::topdown-bad-spec perf::topdown-retiring perf::topdown-be-bound perf::topdown-fe-bound

Successfully added: perf::slots
Successfully added: perf::topdown-bad-spec
Successfully added: perf::topdown-retiring
Successfully added: perf::topdown-be-bound
Successfully added: perf::topdown-fe-bound

perf::slots : 	0XFFFF8000394F75D3
perf::topdown-bad-spec : 	0X5CA000035
perf::topdown-retiring : 	0X5CA000035
perf::topdown-be-bound : 	0X5CA000035
perf::topdown-fe-bound : 	0X5CA000035
----------------------------------

What's strange is that if I only poll perf::slots, the reported number seems to correspond to the lower bits of the value reported when polling all of them at the same time:

$ papi_command_line -x perf::slots

Successfully added: perf::slots

perf::slots : 	0X3951DCF8

This leads me to believe the counter may be reading extra data. Moreover, the 4 categories consistently report the same value through the utility, and that value seems rather large. Are these values supposed to be percentages, i.e. does PAPI obtain them directly from the processed metrics given by perf? Or are they raw metrics that have to be combined with masks to obtain the percentage of each category?

Based on the event descriptions (for instance, for the perf::topdown-retiring event: "topdown useful slots retiring uops"), I expected integer values such as the ones reported by the perf stat utility that I could then correlate to the slots value to obtain the percentages myself.

narru19 (Author) commented Jul 24, 2024

Delving further into the issue, it seems there may be an underlying problem when copying the values of some perf structures. In src/components/perf_event/perf_helpers.h:

301			/* count is the value of the counter the last time */
302			/* the kernel read it */
303			/* If we don't sign extend it, we get large negative */
304			/* numbers which break if an IOC_RESET is done */
305			width = pc->pmc_width;
306			count = pc->offset;
307			if (user_reset_flag == 1) {
308				count = 0;
309			}

Setting a hardware watchpoint on count and stepping over line 306, where the value of pc->offset is copied into count:

306			count = pc->offset;
(gdb) p count
$11 = 140737488314272
(gdb) p pc.offset 
$12 = 927852
(gdb) n    

Thread 1 "affinity.x" hit Hardware watchpoint 7: count

Old value = 140737488314272
New value = 18446603336222136721
(gdb) p count
$13 = 18446603336222136721
(gdb) p count & 0xfffffff
$14 = 940433

It seems as if count receives a sensible value in the lower 32 bits, while the upper bits are filled with leftover data:

(gdb) p /x count
$15 = 0xffff8000000e5991

jagode (Collaborator) commented Jul 24, 2024

Thanks for reporting and digging into this further. I'm adding Vince Weaver (@deater) to this thread, as he is most familiar with the perf_event core and uncore implementations in PAPI.

jagode (Collaborator) commented Jul 24, 2024

As an alternative, at least as a temporary workaround, you might be able to make some progress using the other topdown events provided through PAPI. The papi_command_line utility lets you add events from the command line to see if they work:

$>./papi_command_line -x TOPDOWN:SLOTS TOPDOWN:SLOTS_P TOPDOWN:MEMORY_BOUND_SLOTS TOPDOWN:BR_MISPREDICT_SLOTS TOPDOWN:BACKEND_BOUND_SLOTS TOPDOWN:BAD_SPEC_SLOTS

Successfully added: TOPDOWN:SLOTS
Successfully added: TOPDOWN:SLOTS_P
Successfully added: TOPDOWN:MEMORY_BOUND_SLOTS
Successfully added: TOPDOWN:BR_MISPREDICT_SLOTS
Successfully added: TOPDOWN:BACKEND_BOUND_SLOTS
Successfully added: TOPDOWN:BAD_SPEC_SLOTS

TOPDOWN:SLOTS : 0X395C8DD0
TOPDOWN:SLOTS_P : 0X395C9BF2
TOPDOWN:MEMORY_BOUND_SLOTS : 0
TOPDOWN:BR_MISPREDICT_SLOTS : 0
TOPDOWN:BACKEND_BOUND_SLOTS : 0
TOPDOWN:BAD_SPEC_SLOTS : 0X9370

deater (Contributor) commented Jul 24, 2024

so do you get results you expect when using the Linux "perf" tool?

it sounds like the top-down events are completely unlike any other events.
I have to admit I haven't really looked at them before. There seems to be some documentation here: https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/topdown.txt

From what I gather, what you are reading isn't a count, but rather four or eight 8-bit values shoved into a counter read? PAPI probably won't handle this well, as we assume the value being read is a single incrementing integer.

Maybe the results you are getting are OK, if you split them up into the component bytes yourself?

It might be the rdpmc code interacting poorly with topdown as well. You can try rebuilding PAPI with --disable-perfevent-rdpmc to see if that helps at all.

narru19 (Author) commented Jul 25, 2024

Thanks for the quick responses!

@jagode: As an alternative, at least as a temporary workaround, you might be able to make some progress using the other topdown events provided through PAPI:

I had already integrated counters into our library to build a top-down profile from lower-level metrics. I am mainly trying to use these newer counters on this architecture to test how accurate our derived profiles are compared to them. With these, we could discard all the computation on our side and simply read the events as-is.

@deater: so do you get results you expect when using the Linux "perf" tool?

Through perf I do indeed see values that make sense, both the raw values I posted above and their percentages (correlating the raw values of each category with the "slots" counter to obtain percentages).

I have to admit I haven't really looked at them before. There seems to be some documentation here: https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/topdown.txt
From what I gather, what you are reading isn't a count, but rather four or eight 8-bit values shoved into a counter read?

As far as I understand, perf allows polling both the "slots" and "metrics" values (per the documentation you posted). As you mentioned, I also understand that the metrics are four 8-bit values shoved into a single read, and that through shifts one can obtain each of the categories.
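For what it's worth, here is a small sketch of that extraction, following the scheme in the kernel's topdown.txt (each 8-bit field is a fraction scaled to 0xff). Note this assumes the 64-bit value handed back really is the raw metrics word, which is exactly what's in question here:

#include <stdint.h>
#include <stdio.h>

/* extract topdown field i (0 = retiring, 1 = bad-spec, 2 = fe-bound,
   3 = be-bound) from a packed metrics word, as a percentage */
static double topdown_field_percent(uint64_t metrics, int i)
{
    return (double)((metrics >> (i * 8)) & 0xff) / 0xff * 100.0;
}

int main(void)
{
    /* the value papi_command_line reported above for the four categories */
    uint64_t metrics = 0x5CA000035ULL;
    for (int i = 0; i < 4; i++)
        printf("field %d: %.1f%%\n", i, topdown_field_percent(metrics, i));
    return 0;
}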

PAPI probably won't handle this well, as we assume the value being read is a single incrementing integer.
Maybe the results you are getting are OK, if you split them up into the component bytes yourself?

I have attempted the following:

  • Discarding the seemingly "corrupted" part of the slots counter by applying a 32-bit mask before assigning pc->offset into count.
  • Applying the shifts from the documentation to the four metrics corresponding to each of the categories.
  • Changing the type of count from int64_t to uint64_t (based on the documentation in perf).

Nonetheless, I am still obtaining large values.

I am not fully aware of exactly which values PAPI reads with the rdpmc call. If PAPI is reading the raw values reported by perf, in theory it should work fine as you mentioned, since they would be integers corresponding to each of the categories. Debugging a few executions, I found that both pc->offset and the values returned by rdpmc seem to be the integer values PAPI would expect, but this is my own assumption as I am not familiar with the internals of the library. Moreover, somewhere in the code these integers seem to overflow, or to read extra data that should not be read.

Otherwise, if PAPI is reading the registers themselves and not the raw values that perf reports, I assume the read values will need bit-handling to turn into valid info.
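For context, and as my own reading rather than a description of PAPI's exact code: a userspace counter read via rdpmc follows the pattern documented in perf_event_open(2), combining the live hardware counter with the kernel-maintained offset in the mmap'd page. Something like this sketch:

#include <stdint.h>
#include <linux/perf_event.h>

/* x86 rdpmc instruction: read performance-monitoring counter 'ecx' */
static inline uint64_t rdpmc(uint32_t ecx)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(ecx));
    return ((uint64_t)hi << 32) | lo;
}

/* simplified userspace read of one counter via its mmap'd
   struct perf_event_mmap_page, following perf_event_open(2) */
static uint64_t mmap_read_sketch(volatile struct perf_event_mmap_page *pc)
{
    uint32_t seq, index;
    uint64_t count;

    do {
        seq = pc->lock;                /* seqlock: retry if the kernel raced us */
        __sync_synchronize();
        index = pc->index;             /* hw counter index + 1; 0 means no rdpmc */
        count = pc->offset;            /* value last accumulated by the kernel */
        if (pc->cap_user_rdpmc && index)
            count += rdpmc(index - 1); /* add the live hardware count */
        __sync_synchronize();
    } while (pc->lock != seq);

    /* PAPI additionally sign-extends 'count' to pc->pmc_width bits */
    return count;
}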

I will attempt rebuilding PAPI with --disable-perfevent-rdpmc and report back.

jagode (Collaborator) commented Jul 25, 2024

It might be the rdpmc code interacting poorly with topdown as well. You can try rebuilding PAPI with --disable-perfevent-rdpmc to see if that helps at all.

Indeed, @deater , disabling rdpmc seems to take care of the issue.

$./papi_command_line perf::slots

Successfully added: perf::slots

perf::slots : 	961989654
$./papi_command_line perf::slots perf::topdown-retiring

Successfully added: perf::slots
Successfully added: perf::topdown-retiring

perf::slots : 	962042148
perf::topdown-retiring : 	199953920

narru19 (Author) commented Jul 26, 2024

(Edit: after having read the mentioned paper further.)

Confirmed: configuring with --disable-perfevent-rdpmc gets rid of the issue for me as well.

I have been trying to understand the implications of this. As far as I understand (correct me if I'm wrong), it comes down to counters being read quickly with rdpmc from user space rather than through kernel reads. Based on this paper, I understand I should expect worse read performance when configuring PAPI without this option, roughly five times worse.

If this has been identified before, could you give some insight into the underlying issue? I don't yet understand why this code fails to obtain correct values.

deater (Contributor) commented Jul 26, 2024

yes, with that disabled there will be higher overhead when doing measurements. How much that affects your analysis is going to vary with your workload.

I haven't had time to fully analyze what's going on here. Using rdpmc is always a bit fragile because PAPI is one of the few tools that uses the interface and so it is not maintained well by the Linux kernel developers.

I'm guessing the problem here might be the sign-extension that is done in the rdpmc code. In theory the Linux kernel provides a "libperf" that is the official code for doing rdpmc reads, but we possibly can't include it in PAPI because it's GPL-licensed. You can see if you can find that code and check whether it does things differently.
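For illustration, the sign-extension in question looks roughly like the sketch below (based on the perf_helpers.h excerpt earlier in this thread, and assuming a pmc_width of 48 bits). For an ordinary counter this restores negative deltas correctly, but for topdown values, where the high bytes carry packed data rather than a sign, it would smear 0xFFFF into the top bits, matching the 0XFFFF8000394F75D3 reading above:

#include <stdint.h>

/* sketch: sign-extend a raw 'width'-bit counter value to 64 bits */
static int64_t sign_extend(uint64_t raw, int width)
{
    int shift = 64 - width;
    return ((int64_t)(raw << shift)) >> shift;
}

/* with width = 48, a value with bit 47 set, e.g. 0x8000394F75D3,
   becomes 0xFFFF8000394F75D3 - the corrupt perf::slots value above */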

This top-down stuff is complicated, including the fact that you're supposed to do non-rdpmc reads about once a second, otherwise the results will start to drift. PAPI doesn't really have a mechanism for doing that kind of thing.

For now you might be better off just disabling rdpmc until we have time to figure out exactly what is going on. Ideally we'd also modify PAPI to automatically disable things if top-down is being used so that other people don't run into the issues you have, but that's tricky and I can't think of a clean way to do it without hardcoding a lot of special cases in the PAPI code.

narru19 (Author) commented Jul 29, 2024

Fair enough, thank you both for the help and detailed answers!

narru19 (Author) commented Jul 29, 2024

Regarding this: should you delve into it any further and want me to test anything else on our system, let me know. I will assess the overhead by comparing both modes and decide whether it is worth using, or revert to using lower-level events to build the categories manually.

willowec (Contributor) commented Aug 8, 2024

So I've been taking a look at the level one topdown events on a raptorlake machine (TOPDOWN:SLOTS, TOPDOWN:RETIRING_SLOTS, TOPDOWN:BACKEND_BOUND_SLOTS, and TOPDOWN:BAD_SPEC_SLOTS). The raptorlake architecture appears to only support topdown events on the performance cores, so I have been running all of my tests with taskset -c 0 ... to make sure they aren't broken by being scheduled on the wrong core type. At least in my case, I can get these topdown events to work just fine both with rdpmc disabled and with rdpmc enabled.

When rdpmc is enabled, I am able to extract sane values using the methods described in perf's topdown documentation. Be aware that they forget to convert their values into percentages in the pseudocode. On the other hand, when PAPI is configured with --disable-perfevent-rdpmc each event returns a fraction of the TOPDOWN:SLOTS counter value. This means that you need to handle topdown events differently when rdpmc is enabled than when it isn't.

Here's a summary of the code I wrote to figure this out:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#include "papi.h"
#include "papi_test.h"
#include "testcode.h"

float rdpmc_get_metric(uint64_t m, int i)
{
    /* each metric is an 8-bit fraction of 0xff; convert to a percentage */
    return (float)(((m) >> (i * 8)) & 0xff) / 0xff * 100.0;
}

void rdpmc_print_metrics(uint64_t m)
{
    printf("Metrics:\n\tretiring:\t%02f\n\tbadspec:\t%02f\n\tfrontend bound:\t%02f\n\tbackend bound:\t%02f\n",
           rdpmc_get_metric(m, 0), rdpmc_get_metric(m, 1), rdpmc_get_metric(m, 2), rdpmc_get_metric(m, 3));
}

void non_rdpmc_print_percentages(uint64_t slots, uint64_t retired_val, uint64_t backend_val, uint64_t badspec_val)
{
    double rep, bep, bsp, fep, sum;
    rep = (float)retired_val / slots * 100.0;
    bep = (float)backend_val / slots * 100.0;
    bsp = (float)badspec_val / slots * 100.0;
    sum = rep + bep + bsp;
    fep = 100 - sum;
    /* careful: the argument order below (rep, bep, bsp, fep) does not match
       the label order (retiring, bad spec, fe bound, be bound) */
    printf("Non-rdpmc topdown percentages:\n\tretiring:\t%02f\n\tbad spec:\t%02f\n\tfe bound:\t%02f\n\tbe bound:\t%02f\n",
           rep, bep, bsp, fep);
}

int main()
{
    long long values[4];

    // set up PAPI...

    // create an event set and add TOPDOWN:SLOTS, TOPDOWN:RETIRING_SLOTS,
    //    TOPDOWN:BACKEND_BOUND_SLOTS, and TOPDOWN:BAD_SPEC_SLOTS, in that
    //    order, so values[0..3] line up with the arguments below...

    // instrument your code. I just did a for loop calling the instructions_million()
    //    function of the testcode.h file in src/validation_tests

    // now look at the results from both perspectives
    printf("interpret as if rdpmc were disabled:\n");
    non_rdpmc_print_percentages(values[0], values[1], values[2], values[3]);
    printf("interpret as if rdpmc were enabled:\n");
    rdpmc_print_metrics(values[1]);

    return 0;
}

With rdpmc enabled I (usually) get sane results with the perf documentation's method - more on the 'usually' later:

interpret as if rdpmc were disabled:
Non-rdpmc topdown percentages:
	retiring:	1624.960327
	bad spec:	1624.960327
	fe bound:	1624.960327
	be bound:	-4774.880981
interpret as if rdpmc were enabled:
Metrics:
	retiring:	16.862745
	badspec:	0.000000
	frontend bound:	17.254902
	backend bound:	65.882355

And with rdpmc disabled I get sane results with the other method, although the labels are notably swapped around a bit (consistent with the printf argument order in non_rdpmc_print_percentages not matching its labels; relabeled, these values agree with the rdpmc-enabled metrics above):

interpret as if rdpmc were disabled:
Non-rdpmc topdown percentages:
	retiring:	16.470589
	bad spec:	65.882355
	fe bound:	0.000000
	be bound:	17.647056
interpret as if rdpmc were enabled:
Metrics:
	retiring:	99.607841
	badspec:	96.470589
	frontend bound:	94.509804
	backend bound:	0.784314

I haven't yet looked deeply enough into the PAPI source to understand why the events are handled so differently depending on whether rdpmc is enabled or not. In summary, at least on my machine, I can get good results with rdpmc both enabled and disabled; you just need to apply a different method in each case.

Which method is preferred? The way the events are handled with rdpmc disabled seems more intuitive. I plan on looking into this further and maybe trying to find a fix.

On the rdpmc method only usually working when rdpmc is enabled: in 8 of the 100 times I ran the test with rdpmc enabled, the results were nonsensical anyway. There is clearly something off with this.

willowec (Contributor) commented Aug 15, 2024

Alright, I've done more digging and have identified some problems, along with some answers and potential solutions.

Problems:

  1. With RDPMC disabled in PAPI, topdown event metrics do not agree with perf stat and _rdpmc().
    I have not yet explored this problem in depth.
  2. With RDPMC enabled in PAPI, topdown event metrics only agree with perf stat and _rdpmc() for small code blocks.
    I have found a potential solution for this problem and will expand on it below.
  3. With RDPMC enabled, topdown percentages should be extracted with percent = (float)(value >> (i * 8) & 0xff) / 0xff * 100.0;, while with RDPMC disabled they should be extracted with percent = (float)value / (float)slots * 100.0;.
    If we want the event values to be interpreted the same way whether RDPMC is enabled or not, this may require some code to explicitly handle the edge case of topdown events.

Potential solution for problem 2:

After a lot of debugging, I tracked down the reason why PAPI's values disagree with _rdpmc() occasionally on small code blocks, and all the time on large code blocks. It appears that in the cases where PAPI fails to get good values, it is because the counter values were not successfully reset between reads.

When PAPI_start() is called, PAPI's perf_event component's _pe_start() function calls _pe_reset(), which sends the PERF_EVENT_IOC_RESET ioctl and sets pe_ctl->reset_flag = 1:

	// perf_event.c line 1064 in _pe_reset()
	for( i = 0; i < pe_ctl->num_events; i++ ) {
		if (_perf_event_vector.cmp_info.fast_counter_read) {
			ret = ioctl( pe_ctl->events[i].event_fd, 
					PERF_EVENT_IOC_RESET, NULL );
			pe_ctl->reset_counts[i] = mmap_read_reset_count(
					pe_ctl->events[i].mmap_buf);
			pe_ctl->reset_flag = 1;

However, right after that in _pe_start(), pe_ctl->reset_flag is set back to 0:

        // perf_event.c line 1464 in _pe_start()
	for( i = 0; i < pe_ctl->num_events; i++ ) {
		if (pe_ctl->events[i].group_leader_fd == -1) {
			SUBDBG("ioctl(enable): fd: %d\n",
				pe_ctl->events[i].event_fd);
			ret=ioctl( pe_ctl->events[i].event_fd,
				PERF_EVENT_IOC_ENABLE, NULL) ;
			if (_perf_event_vector.cmp_info.fast_counter_read) {
				pe_ctl->reset_counts[i] = 0LL;
				pe_ctl->reset_flag = 0;
			}

This means that later in mmap_read_self(), the count value is never reset:

                // perf_event.c line 308 in mmap_read_self()
		count = pc->offset;
		if (user_reset_flag == 1) {
			count = 0;        // this line never executes because user_reset_flag is always 0
		}

I guess for small periods of time between PAPI_start() and PAPI_stop(), there is not enough time after the PERF_EVENT_IOC_RESET ioctl call is made for the counter to start accumulating again, so the additional reset is not needed? Regardless of the exact cause, commenting out the line pe_ctl->reset_flag = 0; in _pe_start() completely fixes the problem, at least for topdown metrics.

I still need to test if removing this line breaks anything, and that may take a while to verify. I also still need to look into why the counter values always disagree with perf stat and _rdpmc() when RDPMC is enabled in PAPI. But, this is some progress.

Edit: Good news! The disagreement between perf stat / _rdpmc() and PAPI was user error on my end.
