Benchmarking guidelines #606

Closed
DarkDimius opened this issue Mar 29, 2017 · 16 comments

Comments

@DarkDimius

I'm writing this as a guide on how to make contributed benchmarks "more valid", so that they actually measure what they are supposed to. Note that, in their current state, the benchmarks in this repo don't follow these guidelines.

  • make sure to benchmark small collections (< 4 elements); they are the most common
  • make sure to include sizes which are not "magical" for your collection (e.g. powers of 2 are magical). I propose to standardise on these sizes: 0, 1, 2, 3, 4, 7, 8, 15, 16, 17, 39, 282, 73121, 7312102.
  • use Blackhole.consume instead of returning a value from a method
  • use Blackhole.consume to ensure that the data you are accessing is actually accessed
  • don't use foreach as a way to run multiple iterations of a benchmark. Use a hand-written while loop and @OperationsPerInvocation (see the sketch after this list), e.g. https://github.com/DarkDimius/DelayingArrays/blob/rewrite/src/main/scala/me/d_d/delaying/Benchmarks.scala#L151
  • if you are accessing the same data in every benchmark iteration, e.g. list.head, make sure to have a set-up run to elide it from memory. This is hard to get right, which is why I recommend not including such benchmarks.
  • if you are accessing random data, make sure not to use the random number generator provided by Java. It's slow; you'll be benchmarking the RNG instead of your collection. A good idea is to pre-fill an array with the indexes that you are going to access (suggested by @Ichoran).
  • make sure you're not measuring boxing/string allocation. It may be slower than the actual operation being performed in the benchmark.
  • the best way to ensure this is to read the assembly resulting from JIT-ting your code.
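
A minimal sketch (not from the thread) of a benchmark following these guidelines, assuming sbt-jmh; the class name and the collection under test are illustrative, and size 0 is omitted only to keep the sketch simple:

```scala
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.NANOSECONDS)
class VectorApplyBench {
  // proposed standard sizes, mixing "magical" and non-"magical" values
  @Param(Array("1", "2", "3", "4", "7", "8", "15", "16", "17", "39", "282", "73121"))
  var size: Int = _

  var xs: Vector[Int] = _
  var indices: Array[Int] = _ // pre-filled indexes, so the RNG itself is not measured

  @Setup(Level.Trial)
  def setup(): Unit = {
    xs = Vector.tabulate(size)(identity)
    val rnd = new scala.util.Random(42)
    indices = Array.fill(1000)(rnd.nextInt(size))
  }

  @Benchmark
  @OperationsPerInvocation(1000)
  def applyRandom(bh: Blackhole): Unit = {
    // hand-written while loop instead of foreach; one consume per operation
    var i = 0
    while (i < 1000) {
      bh.consume(xs(indices(i)))
      i += 1
    }
  }
}
```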

If you want to get reliable numbers when running on your machine, make sure that you have:

  • a physical machine. Getting reliable numbers in a virtualized environment is hard;
  • everything that can do background computation disabled, including:
    • cron;
    • ntpd;
    • OS updates, not only installation, but even checks for updates;
    • web browsers / music players;
    • filesystems and SSD drives can sometimes perform operations that are similar to garbage collection in a VM. Make sure not to use them (I recommend running the entire OS from a ramdisk).
  • the CPU fixed to a single speed:
    • turbo-boost disabled;
    • the frequency governor pinned to a fixed frequency;
    • to ensure that the CPU doesn't clock down due to overheating, pin it to a lower frequency;
    • HPET disabled.
  • no memory swapping.

Additions and suggestions are welcome.

@Ichoran

Ichoran commented Mar 29, 2017

Mostly I think this is good advice. A few quibbles:

  • Some collections are heavily used with small numbers of elements (e.g. List). Others only make sense if you plan to have a larger number.
  • There's nothing wrong with returning values from methods. Most of the time I find this gives me lower error than using Blackhole.consume. But the value must depend on the element in the collection in a nontrivial way. For instance, .length on a String is a fine thing to sum up in a benchmark.
  • Instead of repeatedly accessing head or trying to elide it from memory (a process that usually makes timing too unreliable to be worth anything), create an array and benchmark accessing the head of each thing. (For bonus points: you can choose which cache level you're going to pull from based on the size of the array.)
  • Don't use the RNG listed for anything important. It's fine for "kind of random" random access. The distribution is awful, though. Better is to scramble an index array in random order as part of setup, then benchmark how long it takes to walk the index array in order as a correction. (Generally: pretty close to zero.) This also has the advantage of letting you set up particular patterns if you want (e.g. walk up by 10s, etc.). See the sketch after this list.
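
A sketch (with assumed names, not the repo's actual code) of the scrambled-index-array technique, plus a benchmark whose returned value depends nontrivially on every element:

```scala
import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole

@State(Scope.Benchmark)
class ScrambledAccessBench {
  @Param(Array("16", "282", "73121"))
  var size: Int = _

  var xs: Array[String] = _
  var order: Array[Int] = _ // indices 0 until size, scrambled once during setup

  @Setup(Level.Trial)
  def setup(): Unit = {
    xs = Array.tabulate(size)(i => "elem" + i)
    order = new scala.util.Random(42).shuffle((0 until size).toVector).toArray
  }

  // Correction term: how long walking the index array alone takes.
  @Benchmark
  def walkIndicesOnly(bh: Blackhole): Unit = {
    var i = 0
    while (i < order.length) { bh.consume(order(i)); i += 1 }
  }

  // The returned sum depends on each element's length, so the
  // lookups cannot be trivially eliminated.
  @Benchmark
  def sumLengthsScrambled(): Int = {
    var sum = 0
    var i = 0
    while (i < order.length) { sum += xs(order(i)).length; i += 1 }
    sum
  }
}
```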

@DarkDimius
Author

DarkDimius commented Mar 29, 2017

There's nothing wrong with returning values from methods. Most of the time I find this gives me lower error than using Blackhole.consume. But the value must depend on the element in the collection in a nontrivial way. For instance, .length on a String is a fine thing to sum up in a benchmark.

There's a plethora of ways to get this wrong. The notion of "depends" changes across VM implementations and even across versions of the same VM. That's why I propose using a (more) reliable way to make sure computations are actually performed, one which doesn't require reading assembly.

For instance, .length on a String is fine thing to sum up in a benchmark.

I know a VM for which this might be false depending on the benchmark, as it can do deep speculation on values: Graal.

It's fine for "kind of random" random access. The distribution is awful, though

I'm not convinced. Would you elaborate?
I have nothing against your alternative, though.

@Ichoran

Ichoran commented Mar 29, 2017

There's lots of stuff to get wrong with microbenchmarking, including trying to benchmark extremely inexpensive operations with overhead that dominates. Blackhole.consume used to, at least, take a few ns, as opposed to the < 1 ns you usually get from a field lookup plus an arithmetic operation.

So it's not exactly more reliable; it just shifts what you need to worry about, and renders more things unobservable. I suppose that it's a better default, but I wouldn't select it when the person writing the benchmark has some reason to think that some real operation is reasonably appropriate.

Also, if Graal is really good at deeply speculating on values but Blackhole.consume prevents the speculation, is that actually more representative of running real code? Again, I'm not sure it's really better, just different.

Most of the time these things are going to matter much less than your microbenchmark being run in the wrong optimization context and without the normal GC load.

As far as the RNG goes, uhm, have you checked what it does mod 8, for instance?

@Ichoran

Ichoran commented Mar 29, 2017

(I mean--you haven't even implemented it right. If you do implement it right, it's still pretty lousy, especially with differences which matter due to caching effects at least on array lookups.)

@DarkDimius
Author

There's lots of stuff to get wrong with microbenchmarking, including trying to benchmark extremely inexpensive operations with overhead that dominates. Blackhole.consume used to, at least, take a few ns, as opposed to the < 1 ns you usually get from a field lookup plus an arithmetic operation.

My benchmarks suggest that Blackhole.consume takes measurably less time than a single memory access. If we're benchmarking collections, we're mostly benchmarking memory accesses.

Also, if Graal is really good at deeply speculating on values but Blackhole.consume prevents the speculation, is that actually more representative of running real code?

Because we shouldn't be benchmarking the rare "good" case of our collections. We should provide collections that are guaranteed to perform well in the common case, when optimizations fail.

@Ichoran

Ichoran commented Mar 29, 2017

Fair enough. Maybe consume used to be slower? As long as it's fast, there's no reason not to use it. (Also--benchmarking in-cache and out-of-cache small collections are two different things. Both come up a lot in real-world applications.)

@DarkDimius
Author

(I mean--you haven't even implemented it right. If you do implement it right, it's still pretty lousy, especially with differences which matter due to caching effects at least on array lookups.)

If you mean that I didn't go into longs as the benchmark suggests, then yes. I'm not arguing that my RNG is the best RNG. I'm trying to see the reasoning behind your critique of using fast RNGs here. Or did you argue that this particular one is a bad one?

@Ichoran

Ichoran commented Mar 30, 2017

This is a particularly bad choice given its behavior modulo small powers of two. A fast RNG is in principle fine; the trick is to get one that doesn't make you measure something other than what you think you're measuring. If you care about random stride length, none of the LCGs are safe. If you just care about hitting arbitrary elements, some LCGs are safe, but the particular implementation you chose (mod 2^32 instead of 2^32-1) is really not good for small powers of two.
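
A quick illustration (not from the thread) of the low-bits problem: in any LCG with a power-of-two modulus, the lowest k bits themselves form an LCG mod 2^k, so indexes taken mod 8 repeat with a period of at most 8:

```scala
object LcgLowBits extends App {
  // Knuth's MMIX LCG constants (illustrative; any full-period LCG mod 2^64
  // shows the same behavior in its low bits).
  var state = 42L
  def next(): Long = { state = state * 6364136223846793005L + 1442695040888963407L; state }

  // "Random" indexes into a size-8 array: the same 8-value pattern, repeated.
  val mod8 = Seq.fill(24)(java.lang.Math.floorMod(next(), 8L))
  println(mod8.mkString(" "))
}
```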

@DarkDimius
Author

DarkDimius commented Mar 30, 2017 via email

@biboudis

biboudis commented Apr 20, 2017

Very nice recommendations by @DarkDimius and @Ichoran!

I'd like to build upon this list with some more points/questions (just food for thought):

  1. use for-loops for the initialization of collections/streams, because higher-order combinators used there may turn megamorphic and skew the actual results, e.g. [0] (see the setup sketch after this list)
  2. Blackhole is indeed not needed when returning from a method in principle, and although I haven't witnessed a case of weird value dependencies myself, I have to accept @DarkDimius's good point
  3. size does matter [1]: for functional arrays, we have to use large arrays and measure iteration costs; for some collection ops (instantiations) use small sizes, and use large ones to measure copying costs
  4. use both specialized and reference types, and control the JVM's autobox-elimination optimization (-XX:-EliminateAutoBox) to distinguish between Scala specialization and JVM autoboxing (just in case)
  5. measure on the millisecond scale, or the nanosecond scale if needed. What about flops/sec? Would it be better to make hypotheses about flops/sec and then test them via benchmarking? [7]
  6. should we use macro-benchmarks? e.g., port various benchmarks from RiverTrail [2] as Niko does in Rayon [3]
  7. use variable-sized data to stress the TLAB allocation ratio (also relevant: -XX:MinTLABSize) [8]
  8. we should be able to do detailed cache analysis; how do we investigate data alignment to multiples of the cache line size?
  9. what do you think about a series of benchmarks with escalating difficulty, e.g., in Views: sum, sum_squares, sum_squares_even, cartesian_product, many_maps, many_filters, dot_product, flatMap_after_zipWith, zipWith_after_flatMap, flat_map_take, zipWith_flat_mapped (I have drafted an example in a new repo here [9], and in this chart I present avg ms/op for several of the above benchmarks, just to motivate comparisons between the new and the old design - just for lazy views in this repo at the moment)
  10. being able to benchmark the same collections for various versions of scalac and/or dotty and/or the JVM, as Aleksey does [4], would be a nice starting point for a benchmark suite in the CI
  11. benchmark (extensively) for any "dynamically-typed" optimizations, e.g., stream characteristics as in Java 8 Streams [5]
  12. should we pollute the type profile for worst-case scenarios? what information will this give to collections' developers? [6] (this will be needed in the future to test context-dependency/call-site/link-time optimizations in dotty)
  13. for memory footprint investigations, measure using Memory Analyzer (MAT), VisualVM, or Solaris Studio Performance Analyzer; the latter has different characteristics concerning non-biased Java/native profiling
  14. what could be other directions for microbenchmarking collection operations, apart from lookups, concatenation, construction, deconstruction and transformations from one datatype to another?
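
A sketch of point 1, assuming JMH (names illustrative): initialize benchmark data with a plain loop in @Setup rather than with higher-order combinators, so the combinators' call sites are not disturbed before measurement:

```scala
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class InitState {
  @Param(Array("10000"))
  var size: Int = _

  var data: Array[Long] = _

  @Setup(Level.Trial)
  def setup(): Unit = {
    // plain while loop instead of Array.tabulate / map during setup
    data = new Array[Long](size)
    var i = 0
    while (i < size) {
      data(i) = i.toLong
      i += 1
    }
  }
}
```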

BTW, do you know whether we have support for JMH's -perfasm (printing assembly) profiler on macOS?

WDYT @DarkDimius?

References

  0. PaulSandoz/lambda-devoxx13
  1. Arrays of Wisdom of the Ancients
  2. RiverTrail
  3. Rayon
  4. Arrays of Wisdom of the Ancients: Historical Perspective
  5. Characteristics
  6. MethodData and Profile pollution
  7. Performance in Numerical Computing
  8. JVM Anatomy Park #4: TLAB allocation
  9. biboudis/stream-benchmarks

@DarkDimius
Author

use for-loops for initialization of collections/streams, because higher-order combinators used there may turn megamorphic and skew the actual results, e.g., [0]

That's the exact opposite of what you want. You actually want it to become megamorphic, because that is the common case.
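
A hypothetical sketch of what "wanting it megamorphic" can look like: feed several distinct function classes through the same combinator during warm-up, so the measured call site sees the type profile it would see in real applications:

```scala
object TypeProfilePollution {
  // four lambdas compile to four distinct classes
  val fs: Array[Int => Int] = Array(_ + 1, _ * 2, x => x - 3, x => x ^ 5)

  // run this during warm-up/setup: map's call site for f.apply turns megamorphic
  def pollute(xs: List[Int]): Int = {
    var acc = 0
    for (f <- fs) acc += xs.map(f).sum
    acc
  }
}
```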

should we pollute the type profile for worst case scenarios? what information this will give to collections' developers?

I'd say a strong "yes"

BTW, do you know whether we have support for JMH's -perfasm (printing assembly) profiler on macOS?

Linux only.

@lrytz
Member

lrytz commented Apr 26, 2017

Linux only.

Doesn't it work if you build hsdis https://github.com/AdoptOpenJDK/jitwatch/wiki/Building-hsdis ?

@DarkDimius
Author

@lrytz, there is no perf on macOS. You can get an assembly printout, but it won't include hardware counters.
Related: https://groups.google.com/forum/#!topic/mechanical-sympathy/am3EB0SrBFg https://twitter.com/shipilev/status/690298532058808320

@lrytz
Member

lrytz commented Apr 27, 2017

Ah yes, thanks. I remember now seeing this tweet :)


@SethTisue SethTisue transferred this issue from scala/collection-strawman Feb 20, 2019
@SethTisue SethTisue modified the milestones: 2.13.0, 2.13.1 Feb 20, 2019
@szeiger szeiger modified the milestones: 2.13.1, 2.13.2 Sep 9, 2019
@SethTisue SethTisue modified the milestones: 2.13.2, 2.13.3 Feb 6, 2020
@SethTisue SethTisue modified the milestones: 2.13.3, 2.13.4 May 15, 2020
@dwijnand
Member

Closing, as we're not going to directly take action from this - but it's good info and it's still reachable through search.
