
@Raimo33 commented Sep 2, 2025

Goal

This PR refactors the benchmarking functions as per #1701, in order to make benchmarks more deterministic and less influenced by the environment.

This is achieved by replacing the wall-clock timer with a per-process CPU timer where possible.

@Raimo33 marked this pull request as draft September 2, 2025 14:58
@Raimo33 mentioned this pull request Sep 2, 2025
@real-or-random (Contributor) commented Sep 2, 2025

Just some quick comments:

> [x] remove the number of runs (count) in favor of simpler cleaner approach with just number of iterations (iter).

I think there's a reason to have this. Some benchmarks take much longer than others, so it probably makes sense to run fewer iters for these.

> [x] remove min and max statistics in favor of simpler approach with just avg.

I think min and max are useful. For constant-time code, you can also compare min. And max gives you an idea if there were fluctuations or not.

> [x] remove needless fixed point conversion in favor of simpler floating point divisions.

Well, okay, that has a history; see #689. It's debatable if it makes sense to avoid floating point math, but as long as it doesn't get in your way here, it's a cool thing to keep it. :D

@real-or-random (Contributor)

It would be useful to split your changes into meaningful, separate commits; see https://github.com/bitcoin/bitcoin/blob/master/CONTRIBUTING.md#committing-patches.

@Raimo33 (Author) commented Sep 2, 2025

I think min and max just complicate things. Let me explain:
First of all, as implemented right now, they don't even measure the true min and max; they measure the min/max of the averages of all runs, not the absolute extremes. Furthermore, having them requires running all the iterations 10 times over. Benchmarks are already slow, and this min/max slows them down 10-fold. IMHO it's completely unnecessary. @real-or-random

@sipa (Contributor) commented Sep 2, 2025

If we're going to rework this, I'd suggest using the stabilized quartiles approach from https://cr.yp.to/papers/rsrst-20250727.pdf:

  • StQ1: the average of all samples between 1st and 3rd octile
  • StQ2: the average of all samples between 3rd and 5th octile
  • StQ3: the average of all samples between 5th and 7th octile

@Raimo33 (Author) commented Sep 2, 2025

> I think there's a reason to have this. Some benchmarks take much longer than others, so it probably makes sense to run fewer iters for these.

Right now all benchmarks are run with count=10 and a fixed number of iters (apart from ecmult_multi, which adjusts the number of iters, not count).

Therefore count is only useful for extrapolating min and max.

@Raimo33 (Author) commented Sep 2, 2025

> Well, okay, that has a history; see #689. It's debatable if it makes sense to avoid floating point math, but as long as it doesn't get in your way here, it's a cool thing to keep it. :D

I disagree with #689. It overcomplicates things for the sake of avoiding floating-point math. Those divisions aren't even on the hot path; they happen outside the benchmark loops.

@sipa (Contributor) commented Sep 2, 2025

Concept NACK on removing any ability to observe variance in timing. The current min/avg/max are far from perfect, but they work fairly well in practice. Improving is welcome, but removing them is a step backwards.

@Raimo33 (Author) commented Sep 2, 2025

> Concept NACK on removing any ability to observe variance in timing. The current min/avg/max are far from perfect, but they work fairly well in practice. Improving is welcome, but removing them is a step backwards.

What is the usefulness of measuring min/max when we take OS interference and thermal throttling out of the equation? Min/max will be extremely close to the avg no matter how badly the benchmarked function behaves.

@Raimo33 force-pushed the benchmark-precise branch 3 times, most recently from 97e5264 to 254a014 on September 2, 2025 16:08
@Raimo33 changed the title from "[WIP] Refactor benchmark" to "[WIP] refactor: remove system interference from benchmarks" Sep 2, 2025
@Raimo33 force-pushed the benchmark-precise branch 2 times, most recently from 1d9d6d0 to 4c9a074 on September 2, 2025 16:26
@Raimo33 (Author) commented Sep 2, 2025

By the way, gettimeofday() has been officially discouraged since 2008 in favor of clock_gettime(). The POSIX standard marks it as obsolescent but still provides the API for backward compatibility.

@Raimo33 force-pushed the benchmark-precise branch 2 times, most recently from ddeaede to 71dff3f on September 2, 2025 17:45
@Raimo33 (Author) commented Sep 2, 2025

Even though the manual says that CLOCK_PROCESS_CPUTIME_ID is only useful if the process is locked to a core, modern CPUs have largely addressed this issue. So I think it is fair to compile with CLOCK_PROCESS_CPUTIME_ID even though we have no guarantee that the user has pinned the benchmarking process to a core. The worst-case scenario is an unreliable benchmark, which the current repo has anyway.

I added a line to README.md with best practices for running the benchmarks.

I also tried adding a function to pin the process to a core directly in C, but there's no standard POSIX-compliant way to do so. There is pthread_setaffinity_np() on Linux, where 'np' stands for 'non-portable'.

@Raimo33 force-pushed the benchmark-precise branch 4 times, most recently from 3e43c75 to ef9e40e on September 2, 2025 20:31
@real-or-random (Contributor)

>> Concept NACK on removing any ability to observe variance in timing. The current min/avg/max are far from perfect, but they work fairly well in practice. Improving is welcome, but removing them is a step backwards.
>
> what is the usefulness of measuring min/max when we are removing OS interference & thermal throttling out of the equation? min/max will be extremely close to the avg no matter how bad the benchmarked function is.

The point is exactly having a simple way of verifying that there's indeed no interference. Getting rid of sources of variance is hard to get right, and it's impossible to get a perfect solution. (This discussion shows this!) So we better have a way of spotting if something is off.

I like the stabilized quartiles idea.

@Raimo33 (Author) commented Sep 2, 2025

> I like the stabilized quartiles idea.

TBH it scares me a bit; I'll see what I can do. Maybe in a future PR.

@Raimo33 marked this pull request as ready for review September 2, 2025 23:56
@hebasto (Member) left a comment

According to https://www.man7.org/linux/man-pages/man3/clock_gettime.3.html:

> Link with -lrt (only for glibc versions before 2.17).

I think this check, and adding the -lrt flag if necessary, should be included in both build systems.

@Raimo33 (Author) commented Sep 22, 2025

> I think this check, and adding the -lrt flag if necessary, should be included in both build systems.

Agreed, good catch! What do you think about something like this?

include(CheckFunctionExists)
check_function_exists(clock_gettime HAVE_CLOCK_GETTIME)
if(NOT HAVE_CLOCK_GETTIME)
    # On some platforms, clock_gettime requires librt
    target_link_libraries(your_target PRIVATE rt)
endif()

@hebasto (Member) commented Sep 22, 2025

Friendly ping @martinus, benchmark connoisseur :)

Could you please give a rough estimate of these changes at the concept level?

@real-or-random (Contributor)

> Could you please give a rough estimate of these changes at the concept level?

And what's your take on using nanobench instead, even though this is a pure C library instead of C++?

@hebasto (Member) commented Sep 22, 2025

> Goal
>
> This PR refactors the benchmarking functions as per #1701, in order to make benchmarks more deterministic and less influenced by the environment.
>
> This is achieved by replacing the wall-clock timer with a per-process CPU timer where possible.

I've tested 8925b95 by running it alongside stress --cpu $(nproc) and observed the same effect on benchmark average values as on the master branch.

Could you please clarify how one would measure or observe the claimed "less influenced by the environment" behaviour?

@Raimo33 (Author) commented Sep 25, 2025

stress --cpu spawns workers that use 100% of the CPU: your test pegs every core to 100% utilization, and each worker probably gets pinned to its own physical core, so you won't see any difference in terms of variance.

A better test would be starting several threads with random usage spikes at random times on the same core the benchmark is running on.

Basically, if no other processes are running on that CPU, this PR won't show any improvement.

This PR helps in the more realistic scenario where a user has background processes running and the scheduler assigns several of them to the same CPU that is running the benchmark, putting the benchmark thread to sleep, which the wall-clock timer doesn't account for.

Also, this PR is not merely a replacement of wall-clock time with CPU time; it also modernizes the clock function as per the Unix standard.

@hebasto (Member) commented Sep 25, 2025

>> I think this check, and adding the -lrt flag if necessary, should be included in both build systems.
>
> Agreed, good catch! What do you think about something like this?
>
> include(CheckFunctionExists)
> check_function_exists(clock_gettime HAVE_CLOCK_GETTIME)
> if(NOT HAVE_CLOCK_GETTIME)
>     # On some platforms, clock_gettime requires librt
>     target_link_libraries(your_target PRIVATE rt)
> endif()

Feel free to grab f59c45f from https://github.com/hebasto/secp256k1/commits/pr1732/0925.cmake.

UPD: @real-or-random, do we need a CI job that builds on some old system with an old glibc?

Comment on lines 193 to 200
static void print_clock_info(void) {
#if defined(CLOCK_PROCESS_CPUTIME_ID)
    printf("INFO: Using per-process CPU timer\n\n");
#else
    printf("WARN: Using wall-clock timer instead of per-process CPU timer.\n\n");
#endif
}

Member:

5fe7c01:

Do these messages make sense on Windows?

Author:

While on Windows there's no option for a per-process CPU clock (or at least not a high-precision one), I still think issuing the warning is fine.

Member:

It seems a bit of a stretch to call the native Windows QPC framework a "wall-clock timer".

Author:

What about this?
"WARN: Using global timer instead of per-process CPU timer."

This commit improves the reliability of benchmarks by removing some of the influence of other processes running in the background. This is achieved by using CPU-bound clocks that aren't influenced by interrupts, sleeps, blocked I/O, etc.
@Raimo33 (Author) commented Sep 25, 2025

> Feel free to grab f59c45f from https://github.com/hebasto/secp256k1/commits/pr1732/0925.cmake.

This seems overcomplicated and convoluted, but if you manage to simplify it I'll include your commit.

@hebasto (Member) left a comment

Approach NACK cce0147.

The current implementation breaks compatibility with systems using glibc versions prior to 2.17:

$ ldd --version | head -1
ldd (Ubuntu EGLIBC 2.15-0ubuntu10.23) 2.15
$ cmake -B build
$ cmake --build build -t bench
<snip>
[100%] Linking C executable ../bin/bench
CMakeFiles/bench.dir/bench.c.o: In function `gettime_us':
/secp256k1/src/bench.h:45: undefined reference to `clock_gettime'
collect2: ld returned 1 exit status
make[3]: *** [bin/bench] Error 1
make[2]: *** [src/CMakeFiles/bench.dir/all] Error 2
make[1]: *** [src/CMakeFiles/bench.dir/rule] Error 2
make: *** [bench] Error 2

@Raimo33 (Author) commented Sep 28, 2025

Fixed by unconditionally linking -lrt whenever it is available.

root@5997b2bd1aca:/workspace# ldd --version | head -1
ldd (Ubuntu EGLIBC 2.15-0ubuntu10.23) 2.15
root@5997b2bd1aca:/workspace# cmake --build build -t bench
[ 33%] Built target secp256k1_precomputed
[ 66%] Built target secp256k1
[100%] Built target bench

Comment on lines +128 to +132
find_library(RT_LIBRARY rt)
add_library(optional_rt INTERFACE)
if(RT_LIBRARY)
target_link_libraries(optional_rt INTERFACE rt)
endif()
Contributor:

Could you please extract this logic into a find-module, like FindRT.cmake? That module should provide an IMPORTED target with a namespace in the name.

Have a look at the FindIconv.cmake module as an example. This module also provides an interface library for functionality that may be found in an actual library or built into the C standard library.

Author:

This looks complicated, almost overcomplicated, without much benefit; I won't be able to replicate FindIconv.cmake on my own. But you're welcome to provide a commit, and I'll see if it should be cherry-picked.

@Raimo33 (Author) commented Oct 1, 2025

As per @purpleKarrot's suggestion, benchmarks can now be run via ctest as well (as opposed to ./bench_name). This is the preferable way to run them, as it automatically handles CPU pinning and affinity. This means that, if both the SECP256K1_BUILD_BENCHMARK and SECP256K1_BUILD_TESTS flags are set, a plain ctest invocation will run benchmarks and tests together.
This can be avoided by selecting which "tests" to run with the following ctest syntax:
ctest -R bench or ctest -R test (the -R option takes a regex).

Otherwise we can evaluate the possibility of using labels for groups of tests.

@Raimo33 force-pushed the benchmark-precise branch 6 times, most recently from ed8a799 to 7df6023 on October 1, 2025 19:17
@Raimo33 force-pushed the benchmark-precise branch from 7df6023 to 6f0e5d4 on October 2, 2025 12:53