
parallel-letter-frequency: Add benchmark #542

Merged

Conversation

@unmanbearpig (Contributor) commented May 11, 2017

Fixes #519

I've added a benchmark test that makes sure the frequency function runs at least 10% faster (an arbitrary number) with multiple workers than with a single worker.

It increases the time it takes to run the tests by about 5 seconds on my machine; I'm not sure if that's acceptable. It might be possible to make it faster by tweaking Criterion settings.

Another concern is that the benchmark test would only work correctly on a machine with at least 2 cores.
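
One way to sidestep that (a minimal sketch, not part of this PR; the actual 10% comparison is passed in as check) would be to skip the assertion when fewer than 2 cores are available:

    import GHC.Conc (getNumProcessors)
    import Test.Hspec

    -- Skip the speed assertion on machines that cannot run in parallel.
    speedSpec :: Expectation -> Spec
    speedSpec check = it "multiple workers run faster than 1" $ do
      cores <- getNumProcessors
      if cores < 2
        then pendingWith "this check needs at least 2 cores"
        else check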

It prints the benchmark results into the console:

frequency
  no texts mean no letters
  one letter
  case insensitivity
  many empty texts still mean no letters
  many times the same text gives a predictable result
  punctuation doesn't count
  numbers don't count
  all three anthems, together, 1 worker
  all three anthems, together, 4 workers
Run with a single worker
benchmarking...
time                 4.487 ms   (4.398 ms .. 4.544 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 4.566 ms   (4.557 ms .. 4.577 ms)
std dev              17.05 μs   (11.71 μs .. 25.01 μs)

Run with 4 workers
benchmarking...
time                 1.425 ms   (1.345 ms .. 1.490 ms)
                     0.977 R²   (0.951 R² .. 0.992 R²)
mean                 1.556 ms   (1.539 ms .. 1.585 ms)
std dev              50.19 μs   (33.35 μs .. 71.29 μs)
variance introduced by outliers: 13% (moderately inflated)

  multiple workers run faster than 1

Finished in 4.9143 seconds
10 examples, 0 failures

I guess another option would be to add a separate benchmark and have users check the speed themselves; I'm not sure which option is better. If this looks good, I'm going to add some information about it to HINTS.md.

Maybe we could make it optional somehow?
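
The core of that test, for reference (a sketch mirroring the PR's snippet; it assumes criterion's benchmark' runner, the exercise's frequency :: Int -> [Text] -> Map Char Int from the Frequency module, and that estPoint lives in Statistics.Types, which varies between statistics versions):

    import Criterion.Main (benchmark', nf)
    import Criterion.Types (Report(..), SampleAnalysis(..))
    import qualified Statistics.Types as B (estPoint)
    import Data.Text (Text)
    import Test.Hspec

    import Frequency (frequency)  -- the exercise solution

    -- the mean run time, extracted from criterion's report
    measuredMetric = B.estPoint . anMean . reportAnalysis

    speedSpec :: [Text] -> Spec
    speedSpec texts = it "multiple workers run faster than 1" $ do
      one  <- measuredMetric <$> benchmark' (nf (frequency 1) texts)
      four <- measuredMetric <$> benchmark' (nf (frequency 4) texts)
      four `shouldSatisfy` (< one * 0.9)  -- at least 10% faster (arbitrary margin)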

@unmanbearpig (Contributor, Author) commented:

I'm not certain whether I should use nf or whnf here: https://github.com/exercism/xhaskell/pull/542/files#diff-63fa4ff5bfd229a3aed31a4232a2f3f8R110

@rbasso (Contributor) commented May 11, 2017

Thanks for the PR!

I think we should make a clear distinction here between what belongs to tests and what belongs to benchmarks. I usually divide it like this:

  • Tests give us a reasonable assurance that our functions have the correct behavior.
  • Benchmarks give feedback about how fast the code runs.

I guess another option would be to add a separate benchmark and have users check the speed themselves, not sure which option is better.

I don't know what the other maintainers think about it, but I definitely think it would be a better solution.

Maybe we could make it optional somehow?

If it were not inside the test suite, but in a separate benchmark suite in package.yaml, it would be automatically optional.

Because we still need to decide where the benchmark code belongs, I'll wait for the other maintainers' positions and not review the benchmark code at the moment.

@unmanbearpig (Contributor, Author) commented:

@rbasso, I get what you're saying. My reasoning for having it as a test instead of a separate benchmark is: since the exercise is to write parallel code, the tests should check that it runs in parallel. Sequential code shouldn't pass the tests, because users might think that their solution is correct when it actually is not. A benchmark is the only way to test that in Haskell, as far as I know.
It might also be more straightforward for users, as they wouldn't have to think about running a separate stack bench command. Instead they would be able to keep running stack test until it passes, like in any other exercise. That seems like a bit less friction and a slightly better user experience.

I agree that we should wait for other people's opinions on it.

@petertseng (Member) commented:

For any other exercise I would suggest the benchmarks go in stack bench. Convince me why it should be otherwise here! Ah, because the exercise is specifically about parallelism? That does in fact make the exercise slightly unique. I have noted it as such in x-common.

@rbasso (Contributor) commented May 16, 2017

Another concern is the benchmark test would only work correctly on a machine with at least 2 cores.

This is really problematic, because there are multiple situations where the tests could be run on a single core:

  • Users with single-core machines (uncommon nowadays)
  • Development environments inside a virtual machine (some developers like that)
  • Continuous Integration on single-core machines (I guess Travis-CI gives us two cores at the moment)

Even if the number of cores were not a problem, I think that testing for speed improvements misses the point:

  • It is in general impossible, because parallel execution isn't a property of the code. Haskell is non-strict, so it is up to the compiler and run-time system to decide when/how to evaluate expressions.
  • Testing speed is a proxy for testing parallel execution, not the same thing.
  • A busy host/guest OS could make a normally faster parallel benchmark be scored as slower than the non-parallel version.
  • A slower, parallel solution is correct IMHO, so it shouldn't be refused in any circumstance.
  • It is a really smart hack to measure the variation in speed, but hacks tend to compound into projects that are exponentially harder to manage.

While I'm willing to accept a benchmark suite, I strongly oppose using the tests suite to check for speed or parallel execution.

-- the mean run time, extracted from criterion's report
measuredMetric = B.estPoint . anMean . reportAnalysis

makeBench :: [Text] -> Int -> Benchmarkable
makeBench texts n = nf (frequency n) texts -- am I right that I should use nf and not whnf?

A Contributor replied:

nf seems preferable, but it accepts a smaller set of types than whnf.

In this case, because the returned value is a lazy Map, I guess that using whnf could give artificially lower and incorrect benchmark values.

I didn't test it, though.
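
A minimal standalone illustration of the difference (not the PR's code; with a lazy Map, whnf stops at the outermost constructor while nf forces every key and value via NFData):

    import Criterion.Main (bench, defaultMain, nf, whnf)
    import qualified Data.Map as Map  -- the lazy Map

    letterCounts :: String -> Map.Map Char Int
    letterCounts s = Map.fromListWith (+) [(c, 1) | c <- s]

    main :: IO ()
    main = defaultMain
      [ bench "whnf" $ whnf letterCounts text  -- counts stay as unevaluated thunks
      , bench "nf"   $ nf   letterCounts text  -- forces every count as well
      ]
      where text = concat (replicate 1000 "the quick brown fox")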

@unmanbearpig force-pushed the parallel-letter-frequency-benchmark branch 4 times, most recently from 222d359 to a976182 on May 17, 2017

@unmanbearpig (Contributor, Author) commented May 17, 2017

I've added a benchmark with cases for running it on 4 and 500 anthems. Each number of anthems runs with a single worker and with as many workers as there are physical cores. All cases run first on a single core and then on all available cores, as I found that the results are often different.

It takes about 45 seconds on my machine. It's probably too many cases, but I'm not sure which cases we could get rid of.

Here are the results of the sample solution.
Note that it runs faster on a single core in all cases except for 500 anthems with 8 workers.

And here are the results of my solution.
I've added a case with 256 workers, as it's sometimes faster than even the best case of the sample solution, even on 1 core, which is really surprising to me. The situation is similar with 1 core being faster in some cases regardless of the number of workers.

@rbasso (Contributor) left a comment:

Thanks again for all the work in this PR @unmanbearpig. Performance is a hard subject in Haskell, so we really appreciate the effort. 👍

Sorry for taking so long to review it. I'm not being a very consistent maintainer here 😁, and I have very little experience with the criterion package.

The code seems great, but some fixes are still needed. I got some really interesting results playing with it here. 😄

I wrote a lot of comments, but please try not to take them too hard. I know very little about criterion, so my comments should be read more as questions than as criticism.


@exercism/haskell, we need to decide if we should compile - not run - the test suites as part of the Travis-CI check. This PR fails when building the benchmarks, but it passes Travis because we are ignoring them.

@@ -1,4 +1,5 @@
name: parallel-letter-frequency
version: 0.1.0.3

A Contributor commented:

I guess we haven't documented it anywhere yet, but maybe we should avoid adding the version to the example's package.yaml. Doing that would reduce the number of file updates needed when bumping a version.

For now I think it is fine to leave it like that. This will probably be checked after we add the documentation in #538.

A Member commented:

I recommend not adding it. You never know who might look upon this as an example, so we should be consistent if we're not adding it.


benchmarks:
  bench:
    main: Benchmarks.hs

@rbasso (Contributor) commented May 21, 2017:

Cabal-simple...: can't find source for Benchmarks in ...

See the comments below in package.yaml.


benchmarks:
  bench:
    main: Benchmarks.hs

A Contributor commented:

The filename here doesn't match the Benchmark.hs file in bench/. This causes an error when running the benchmark:

Cabal-simple...: can't find source for Benchmarks in ...

benchmarks:
  bench:
    main: Benchmarks.hs
    source-dirs: bench

A Contributor commented:

See the comments below in package.yaml concerning ghc-options.

benchmarks:
  bench:
    main: Benchmarks.hs
    source-dirs: bench

A Contributor commented:

Here I had to add...

    ghc-options: -threaded -with-rtsopts=-N

... to make it run multi-threaded and with multiple processors.
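
For reference, the benchmarks stanza might then look something like this (a sketch; the dependencies list is assumed, not taken from the PR):

    benchmarks:
      bench:
        main: Benchmark.hs
        source-dirs: bench
        ghc-options: -threaded -with-rtsopts=-N
        dependencies:
          - parallel-letter-frequency
          - criterion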


-- run on 1 core
setNumCapabilities 1
defaultMain $ benchGroup 1 numsOfWorkers <$> numsOfAnthems

@rbasso (Contributor) commented May 21, 2017:

It runs about 45 seconds on my machine. It's probably too many cases, but I'm not sure which we cases could get rid of.

I would drop this block for two reasons:

  1. The second defaultMain would overwrite the HTML output file from the first one. I guess Criterion was not designed to be run like that.
  2. IMHO, the performance variation caused by numOfWorkers and numsOfAnthems should give us enough information to see how performance changes depending on the size of the task and the number of threads.

Also, consider these ideas:

  • Instead of nub $ sort [1, processors], consider [1..processors]. This would increase the number of benchmarks a lot, but would give a nice curve showing how performance changes with each additional processor (see the sketch after this list). The last one is usually an interesting case.
  • Do we get any additional information from the 4 anthems cases?
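
Something along those lines, reusing the PR's benchGroup helper (a sketch under that assumption; getNumProcessors comes from GHC.Conc):

    import Control.Concurrent (setNumCapabilities)
    import GHC.Conc (getNumProcessors)
    import Criterion.Main (defaultMain)

    main :: IO ()
    main = do
      processors <- getNumProcessors
      setNumCapabilities processors          -- one run, on all available cores
      let numsOfWorkers = [1 .. processors]  -- instead of nub $ sort [1, processors]
          numsOfAnthems = [500]              -- possibly dropping the 4-anthem cases
      defaultMain $ benchGroup processors numsOfWorkers <$> numsOfAnthems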

defaultMain $ benchGroup 1 numsOfWorkers <$> numsOfAnthems

-- run on all cores
setNumCapabilities processors

A Contributor commented:

I have almost no experience with criterion, so I'm not sure of this, but I guess that, if you decide to remove the first defaultMain group, we wouldn't need this line anymore, right?

@rbasso (Contributor) commented May 21, 2017

Just as a reference for anyone interested in reviewing this or another PR with benchmarks, the following commands may help:

    # Clean the cache
    stack clean

    # Build everything with -Wall -Werror
    stack build --pedantic --test --no-run-tests --bench --no-run-benchmarks

    # Run the tests
    stack test

    # Run the benchmarks and generate an HTML report
    stack bench --benchmark-arguments '--output=benchmark.html'

    # Run hlint on all the sources
    hlint .

If you are using Stack with Nix, you may also need to add --no-nix-pure to the stack bench command.

@unmanbearpig force-pushed the parallel-letter-frequency-benchmark branch 2 times, most recently from a8705b8 to d00ad33 on May 22, 2017
bin/test-example Outdated
runstack "test"
if ! runstack "test"; then
  exit 1
fi

@rbasso (Contributor) commented May 24, 2017:

@petertseng, could you please take a look at these changes in the test scripts when you get some time? I have some kind of fever right now, so my brain isn't at 100%, and I don't want to merge anything that could break the test scripts you wrote.

@petertseng (Member) commented May 24, 2017:

Well, it has preserved the intent, but in that case the comments above need to be removed since now it's no longer an implicit exit value.

bin/test-example Outdated
@@ -50,7 +58,9 @@ if [ "$exampletype" = "success" ]; then
# Implicit exit value: this is last thing in this path,

A Member commented:

These comments now must be removed, since they are otherwise inaccurate.

@unmanbearpig (Contributor, Author) commented:

I think I've fixed all of the issues. I've also added a note about the benchmark to HINTS.md. What do you think?

@petertseng (Member) commented:

I'm personally thinking things are OK. At least I don't see anything that I would ask to change. Haven't run it yet to try it out, that'll have to happen tomorrow. But because of this, no need to wait for me if someone beats me to it.

@rbasso (Contributor) left a comment:

Seems great!

I'll just ask for a final change... 😁

I was having problems with Travis-CI while reviewing this PR, so I took some time to fix a few things with #550 and #551. While reading the logs, I noticed that you didn't change test-stub to check whether the benchmarks would compile with it, but it is desirable to assert that, so that we can be more confident that users will be able to compile and run the benchmarks.

Considering that enabling the tests in Travis-CI wasn't the primary objective of this PR, I created #552 to solve that, and now we don't need more changes to the scripts in this PR.

Sorry for all the trouble, but I'll have to ask you to rebase this PR on master, removing any changes to test-example. Besides that, I think we are good to go! 👍

bin/test-example Outdated
else
  BENCH=false
fi
@rbasso (Contributor) commented May 25, 2017:

The changes to enable running the benchmarks were generating the following warnings when testing the stub solution for parallel-letter-frequency in Travis-CI:

WARNING: Specified source-dir "bench" does not exist
Warning: Directory listed in parallel-letter-frequency.cabal file does not exist: bench

This was happening because the test-stub shell script wasn't changed to check the stub solutions against the benchmarks.


### Benchmark

Check how changing the number of workers affects the performance of your solution by running the benchmark. Use `stack bench` to run it. Feel free to modify `bench/Benchmark.hs` to explore your solution's performance on different inputs.

A Contributor commented:

Seems great! 👍

@@ -1,5 +1,7 @@
name: parallel-letter-frequency

ghc-options: -threaded -with-rtsopts=-N -O2

A Contributor commented:

See the comment for package.yaml.

version: 0.1.0.2
version: 0.1.0.3

ghc-options: -threaded -with-rtsopts=-N -O2

@rbasso (Contributor) commented May 25, 2017:

This line causes the following warnings:

Configuring parallel-letter-frequency-0.1.0.3...
Warning: 'ghc-options: -threaded' has no effect for libraries. It should only be used for executables.
Warning: 'ghc-options: -with-rtsopts' has no effect for libraries. It should only be used for executables.

Besides the warnings, the only real side effect of this is enabling -threaded -with-rtsopts=-N -O2 for the test suite, which doesn't need it, but the slower compilation from -O2 probably has an insignificant effect on the overall build time.

@rbasso changed the title from "WIP: parallel-letter-frequency: Test parallelism speed increase" to "parallel-letter-frequency: Add benchmark" on May 25, 2017
I've added cases for running it on 4 and 500 anthems. Each number of
anthems would run with a single worker and physical-number-of-cores
number of workers. All cases run first on a single core and then on
all available cores, as I found that results are often different.
@unmanbearpig force-pushed the parallel-letter-frequency-benchmark branch from b5b452c to 145b7b0 on May 25, 2017

@unmanbearpig (Contributor, Author) commented:

@rbasso, I've just rebased it on top of master, deleted my script changes, and moved ghc-options into the bench section.
Should I clean up and squash some commits before we do the merge into master or should we keep the history as it is?

@rbasso (Contributor) commented May 26, 2017

Should I clean up and squash some commits before we do the merge into master or should we keep the history as it is?

We usually don't care about a PR's history, and leave only a few commits to keep the repository's history nice and clean. In this case, I guess all the changes belong together, so a single commit seems a good idea.

To avoid giving you more trouble, I'll squash-merge it here. 👍

Thanks a lot!

@rbasso merged commit af5f05b into exercism:master on May 26, 2017