Speed up Rust #1

bugaevc · 2021-07-28T12:26:14Z

Hello!

I've done some optimizations on the Rust implementation.

On a data set consisting of 500 000 phone numbers, the running time went from 7.167 s ± 0.239 s down to 2.237 s ± 0.033 s (as measured with Hyperfine — I didn't bother to port benchmark_runner to work on non-macOS).

The largest speed-up is from dropping BigUints, which are slow. Please see individual commits for details.

I also tried to make the code a little more idiomatic Rust, but there's still a long way to go.

This makes it a little bit nicer to check whether the last thing we added was a digit or not, and also lets us skip allocating an extra string when adding a digit.

It never actually returned any errors; so why return anything at all.

Don't just check if something worked fine, propagate the error otherwise.

This gives us a slight, but noticeable speed-up, since storing a char requires four times as much space.

Instead of cloning the vector to create a new partial_solutions vector, reuse the very same vector, and simply pop the item after the recursive call. This gives us another noticeable speed-up.

Instead of passing around the reference to the whole vector of digits and the start index separately, make use of a Rust idiom: slices. We can pass around a &[u8], which is a slice of the original vector. We can slice it further, such as passing &digits[i+1..] to a recursive invocation.

BigUints are slow. Since we don't actually have to do any arithmetics with them, we can get away with simply using string slices (or actually byte slices). This gives us a *huge* speed-up, because we can now directly slice a digit sequence out of a phone number instead of doing costly BigUint arithmetics.

This was clearly meant to ignore the *files* at the root of the tree, not the whole source directories. In particular, ./src/rust/phone_encoder/ is no longer ignored, while ./phone_encoder (in the project root directory) still is.

bugaevc · 2021-07-28T13:06:41Z

It can be easily made 2x faster still by switching the hasher from the default supposedly cryptographically secure one to something like aHash:

git diff

diff --git a/src/rust/phone_encoder/Cargo.toml b/src/rust/phone_encoder/Cargo.toml
index 6a9390c..8cccb62 100644
--- a/src/rust/phone_encoder/Cargo.toml
+++ b/src/rust/phone_encoder/Cargo.toml
@@ -4,3 +4,6 @@ version = "0.1.0"
 edition = "2018"
 authors = ["Renato Athaydes <renato@athaydes.com>"]
 license = ""
+
+[dependencies]
+ahash = "0.7.3"
diff --git a/src/rust/phone_encoder/src/main.rs b/src/rust/phone_encoder/src/main.rs
index 9c755ea..fff6997 100644
--- a/src/rust/phone_encoder/src/main.rs
+++ b/src/rust/phone_encoder/src/main.rs
@@ -5,7 +5,7 @@ use std::io::{self, BufRead};
 use std::path::Path;
 use std::fmt::{self, Display, Formatter};
 
-type Dictionary = HashMap<Vec<u8>, Vec<String>>;
+type Dictionary = HashMap<Vec<u8>, Vec<String>, ahash::RandomState>;
 
 #[derive(Debug, Copy, Clone)]
 enum WordOrDigit<'a> {
@@ -101,7 +101,10 @@ fn print_solution(num: &str, words: &[WordOrDigit<'_>]) {
 }
 
 fn load_dict(words_file: String) -> io::Result<Dictionary> {
-    let mut dict: Dictionary = HashMap::with_capacity(100);
+    let mut dict: Dictionary = HashMap::with_capacity_and_hasher(
+        100,
+        ahash::RandomState::default()
+    );
     for line in read_lines(words_file)? {
         let word = line?;
         let key = word_to_number(&word);

I also tried buffering output & getting rid of constant locking:

git diff

diff --git a/src/rust/phone_encoder/src/main.rs b/src/rust/phone_encoder/src/main.rs
index fff6997..6841cf9 100644
--- a/src/rust/phone_encoder/src/main.rs
+++ b/src/rust/phone_encoder/src/main.rs
@@ -1,7 +1,7 @@
 use std::collections::HashMap;
 use std::env::args;
 use std::fs::File;
-use std::io::{self, BufRead};
+use std::io::{self, Write, BufRead, BufWriter, StdoutLock};
 use std::path::Path;
 use std::fmt::{self, Display, Formatter};
 
@@ -35,13 +35,16 @@ fn main() -> io::Result<()> {
 
     let dict = load_dict(words_file)?;
 
+    let stdout = io::stdout();
+    let mut writer = BufWriter::new(stdout.lock());
+
     for line in read_lines(input_file)? {
         let num = line?;
         let digits: Vec<u8> = num.chars()
             .filter(char::is_ascii_digit)
             .map(|ch| ch as u8)
             .collect();
-        print_translations(&num, &digits, &mut Vec::new(), &dict);
+        print_translations(&num, &digits, &mut Vec::new(), &dict, &mut writer);
     }
     Ok(())
 }
@@ -51,9 +54,10 @@ fn print_translations<'a>(
     digits: &[u8],
     words: &mut Vec<WordOrDigit<'a>>,
     dict: &'a Dictionary,
+    writer: &mut BufWriter<StdoutLock>,
 ) {
     if digits.is_empty() {
-        print_solution(num, &words);
+        print_solution(num, &words, writer);
         return;
     }
     let mut found_word = false;
@@ -63,7 +67,7 @@ fn print_translations<'a>(
             for word in found_words {
                 found_word = true;
                 words.push(WordOrDigit::Word(word));
-                print_translations(num, rest_of_digits, words, dict);
+                print_translations(num, rest_of_digits, words, dict, writer);
                 words.pop();
             }
         }
@@ -78,25 +82,29 @@ fn print_translations<'a>(
     if !last_is_digit {
         let digit = digits[0] - b'0';
         words.push(WordOrDigit::Digit(digit));
-        print_translations(num, &digits[1..], words, dict);
+        print_translations(num, &digits[1..], words, dict, writer);
         words.pop();
     }
 }
 
-fn print_solution(num: &str, words: &[WordOrDigit<'_>]) {
+fn print_solution(
+    num: &str,
+    words: &[WordOrDigit<'_>],
+    writer: &mut BufWriter<StdoutLock>
+) {
     // do a little gymnastics here to avoid allocating a big string just for printing it
-    print!("{}", num);
+    write!(writer, "{}", num).unwrap();
     if words.is_empty() {
-        println!(":");
+        write!(writer, ":").unwrap();
         return;
     }
-    print!(": ");
+    write!(writer, ": ").unwrap();
     let (head, tail) = words.split_at(words.len() - 1);
     for word in head {
-        print!("{} ", word);
+        write!(writer, "{} ", word).unwrap();
     }
     for word in tail { // only last word in tail
-        println!("{}", word);
+        write!(writer, "{}", word).unwrap();
     }
 }

This brings me down to 1.126 s ± 0.021 s. If you'd like me to, I can push these patches as commits as well.

sstadick · 2021-07-28T13:50:31Z

@bugaevc you could also try using https://docs.rs/grep-cli/0.1.6/grep_cli/ in place of stdout to avoid line-buffering as seen here: rust-lang/rust#60673

Adding the following to Cargo.toml generally gets another 2-5% bump:

[profile.release]
lto = "fat"
codegen-units = 1

bugaevc · 2021-07-28T14:01:17Z

Using grep-cli + fat LTO + 1 codegen unit gives me 1.138 s ± 0.016 s, which is... not that different from the timings I got without those, and actually slightly slower. Can you reproduce it?

sstadick · 2021-07-28T14:11:28Z

I will give it a try later today, but that could just be the case for this! Thanks for trying it at all 👍

OvermindDL1 · 2021-07-29T16:42:09Z

lto = "thin" will often be faster than fat interestingly (depends on the code), so that might be why.

And yeah, using rust-bigint is a naive bigint implementation, built for correctness but it is slow, unsure why that was chosen over rug, which is the usual bigint implementation used in rust programs if the need it, but that still gets to a bigint isn't needed anyway as rust as native integers up to 128 bits in size and in this case a vector of some form is better.

And second, lazy_static was entirely useless here, even if rust-bigint was kept the values were being cloned anyway, so unsure why that was done?

And yeah, the benchmark runner won't compile because it seems to be using some mac'isms? So that's very odd?
I edited the benchmark.sh to use hyperfine and ran the variations here on my system after copying the rust phone_encoder with this PR to a phone_encoder_pr1 version (with lto thin and codegen 1 as well, as well as my system-wide native compiler flags, which is just a user-wide defined RUSTFLAGS="-Ctarget-cpu=native" environment variable), here's the hyperfine summary reports first:

Summary
  './phone_encoder_pr1 dictionary.txt phones_1000.txt' ran
    1.34 ± 0.29 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_1000.txt'
    1.83 ± 0.37 times faster than './phone_encoder dictionary.txt phones_1000.txt'
    5.16 ± 1.08 times faster than 'java -cp build/java Main dictionary.txt phones_1000.txt'
    5.91 ± 1.10 times faster than 'java -cp build/java Main2 dictionary.txt phones_1000.txt'

Summary
  './phone_encoder_pr1 dictionary.txt phones_10_000.txt' ran
    1.70 ± 0.25 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_10_000.txt'
    2.44 ± 0.34 times faster than './phone_encoder dictionary.txt phones_10_000.txt'
    4.09 ± 0.57 times faster than 'java -cp build/java Main dictionary.txt phones_10_000.txt'
    5.75 ± 0.79 times faster than 'java -cp build/java Main2 dictionary.txt phones_10_000.txt'

Summary
  './phone_encoder_pr1 dictionary.txt phones_50_000.txt' ran
    2.04 Â± 0.08 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_50_000.txt'
    2.57 Â± 0.21 times faster than 'java -cp build/java Main dictionary.txt phones_50_000.txt'
    3.16 Â± 0.13 times faster than './phone_encoder dictionary.txt phones_50_000.txt'
    4.77 Â± 0.23 times faster than 'java -cp build/java Main2 dictionary.txt phones_50_000.txt'

Summary
  './phone_encoder_pr1 dictionary.txt phones_100_000_with_empty.txt' ran
    2.16 ± 0.22 times faster than 'java -cp build/java Main dictionary.txt phones_100_000_with_empty.txt'
    2.24 ± 0.07 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_100_000_with_empty.txt'
    3.40 ± 0.11 times faster than './phone_encoder dictionary.txt phones_100_000_with_empty.txt'
    4.15 ± 0.14 times faster than 'java -cp build/java Main2 dictionary.txt phones_100_000_with_empty.txt'

And the hyperfine exported markdown format (unsorted):

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`java -cp build/java Main dictionary.txt phones_1000.txt`	370.9 ± 41.2	313.1	436.3	5.16 ± 1.08
`java -cp build/java Main2 dictionary.txt phones_1000.txt`	425.2 ± 24.0	381.8	455.6	5.91 ± 1.10
`sbcl --script src/lisp/main.lisp dictionary.txt phones_1000.txt`	96.2 ± 12.5	79.0	109.9	1.34 ± 0.29
`./phone_encoder dictionary.txt phones_1000.txt`	131.8 ± 12.8	110.9	143.9	1.83 ± 0.37
`./phone_encoder_pr1 dictionary.txt phones_1000.txt`	71.9 ± 12.8	57.7	87.2	1.00

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`java -cp build/java Main dictionary.txt phones_10_000.txt`	431.9 ± 24.9	403.7	478.5	4.09 ± 0.57
`java -cp build/java Main2 dictionary.txt phones_10_000.txt`	607.2 ± 32.6	562.5	653.5	5.75 ± 0.79
`sbcl --script src/lisp/main.lisp dictionary.txt phones_10_000.txt`	179.4 ± 12.7	169.5	200.4	1.70 ± 0.25
`./phone_encoder dictionary.txt phones_10_000.txt`	257.2 ± 14.7	244.1	274.7	2.44 ± 0.34
`./phone_encoder_pr1 dictionary.txt phones_10_000.txt`	105.6 ± 13.3	92.9	123.2	1.00

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`java -cp build/java Main dictionary.txt phones_50_000.txt`	734.8 ± 52.7	662.1	822.3	2.57 ± 0.21
`java -cp build/java Main2 dictionary.txt phones_50_000.txt`	1362.4 ± 39.9	1319.1	1424.9	4.77 ± 0.23
`sbcl --script src/lisp/main.lisp dictionary.txt phones_50_000.txt`	581.3 ± 7.8	575.2	601.7	2.04 ± 0.08
`./phone_encoder dictionary.txt phones_50_000.txt`	901.8 ± 15.1	881.8	919.3	3.16 ± 0.13
`./phone_encoder_pr1 dictionary.txt phones_50_000.txt`	285.6 ± 11.1	267.1	299.4	1.00

Command	Mean [s]	Min [s]	Max [s]	Relative
`java -cp build/java Main dictionary.txt phones_100_000_with_empty.txt`	1.052 ± 0.100	0.891	1.245	2.16 ± 0.22
`java -cp build/java Main2 dictionary.txt phones_100_000_with_empty.txt`	2.023 ± 0.029	1.986	2.059	4.15 ± 0.14
`sbcl --script src/lisp/main.lisp dictionary.txt phones_100_000_with_empty.txt`	1.092 ± 0.010	1.076	1.107	2.24 ± 0.07
`./phone_encoder dictionary.txt phones_100_000_with_empty.txt`	1.656 ± 0.023	1.623	1.707	3.40 ± 0.11
`./phone_encoder_pr1 dictionary.txt phones_100_000_with_empty.txt`	0.487 ± 0.015	0.469	0.508	1.00

The the benchmark script I just added the dictionary file to each command along with a {file} argument at the end as so:

CHECK_FILE="proc_out.txt"
DEFAULT_INPUT="input.txt"
DEFAULT_OUTPUT="output.txt"
DICTIONARY="dictionary.txt"

COMMANDS=(
  "java -cp build/java Main ${DICTIONARY} {file}"          # Java 1
  "java -cp build/java Main2 ${DICTIONARY} {file}"         # Java 2
  "sbcl --script src/lisp/main.lisp ${DICTIONARY} {file}"  # Common Lisp
  "./phone_encoder ${DICTIONARY} {file}"                   # Rust
  "./phone_encoder_pr1 ${DICTIONARY} {file}"               # Rust PR1
)

And updated the correctness block to handle that by changing it to:

echo "Checking all programs for correctness"
for CMD in "${COMMANDS[@]}"
do
  echo "Checking: $CMD"
  ${CMD//{file\}/${DEFAULT_INPUT}} > $CHECK_FILE
  diff -q <(sort $CHECK_FILE) <(sort $DEFAULT_OUTPUT)
  echo "OK"
done

And replaced that big benchmarking loop with looping over just the files instead to compare how they do on different inputs:

echo "Benchmarking..."
for f in "${INPUTS[@]}"; do
        echo "==> $f"
        hyperfine --export-markdown "results-$f.md" -L file "$f" "${COMMANDS[@]}"
done

And also added phone_encoder_pr1 to the rm section of course.

And of course there easily can be more optimizations done, such as reducing memory allocations and all such things, jemalloc maybe depending on the use load (or an arena), and other things, etc...

bugaevc · 2021-07-29T16:53:11Z

And of course there easily can be more optimizations done, such as reducing memory allocations and all such things

Which remaining allocations can easily be optimized away?

The remaining allocations are:

The strings read from the dictionary (these could be direct slices if the file was mmaped)
The strings read from the dictionary converted to numbers
The hash table storing them
The words vector for building answers
The digits vector for input
The buffered writer for output

Also, does it seem to you that the remaining allocations are the bottleneck? From the flamechart, it appears that the majority of time is spent looping in print_translations(), so that would be what I'd try to optimize next.

renatoathaydes · 2021-07-30T15:59:55Z

@bugaevc

The largest speed-up is from dropping BigUints

That's unfortunate. I would like to keep BigUint to keep the algorithms comparable at all.

I also tried to make the code a little more idiomatic Rust, but there's still a long way to go.

I've been doing Rust for a few years.... are you sure it's that bad??

BTW I won't merge the Pull Request, will just keep it open for future reference by people who are interested in see where Rust problems were originally... that might help them know what NOT to do in Rust to not make their programs slow.

Thanks for all the effort! Very interesting to see how people manage to squeeze performance out of Rust :)

@OvermindDL1

unsure why that was chosen over rug, which is the usual bigint implementation used in rust programs if the need it...

Where do you find this kind of information? If you look around for BigInt Rust implementation, num is what you find, with num-bigint apparently being the version containing only big-int?

And second, lazy_static was entirely useless here, even if rust-bigint was kept the values were being cloned anyway, so unsure why that was done?

The value was not cloned, it was passed as a reference to the BigUint methods:

n = &n * (&*TEN) + &nth_digit(digits, i);

And yeah, the benchmark runner won't compile because it seems to be using some mac'isms?

Wow that's very unexpected, I thought it was hard to get problems like this in Rust. Can you tell me exactly where the problem is?

@sstadick

you could also try using https://docs.rs/grep-cli/0.1.6/grep_cli/ in place of stdout to avoid line-buffering

Wow, you seriously want to win :D what's the point of comparing with other languages when you go to the effort of replacing stdout?

In my benchmark_runner I threw away output to stdout, so I think it's fair to keep the programs using stdout as a real app would've done... it's pointless to remove that kind of thing.

sstadick · 2021-07-30T16:07:55Z

RE grep_cli, it's a known Rust bug that it will write a line at a time even when not writing to a TTY, so replacing stdout just brings it in line with what other languages are doing. Apparently it's not a slowdown here anyways 🤷

bugaevc · 2021-07-30T16:21:09Z

That's unfortunate. I would like to keep BigUint to keep the algorithms comparable at all.

I went out of my way to simply improve the algorithm implementation and not change the actual algorithm 🙂

As you can see, the change is very small. The algorithm still works the exact same way, it's just that it now stores digit sequences as b'3542' instead of BigUint(3542), because getting the former is far more efficient in Rust. If it's also more efficient in the other languages to store digit sequences as byte strings and not big ints (which I doubt), then by all means, make the change there too.

I also tried to make the code a little more idiomatic Rust, but there's still a long way to go.

I've been doing Rust for a few years.... are you sure it's that bad??

Well... here are a few things I consider unidiomatic:

Vec<char>
All the cloning, like here:

prechelt-phone-number-encoding/src/rust/phone_encoder/src/main.rs

Lines 58 to 59 in 29019dd

let mut partial_solution = words.clone();

partial_solution.push(word);
Swallowing I/O errors, for instance writing

prechelt-phone-number-encoding/src/rust/phone_encoder/src/main.rs

Line 95 in 29019dd

if let Ok(word) = line {

and never handling the error case.
Returning io::Result<()> when you're not actually propagating any errors

Parsing arguments with

prechelt-phone-number-encoding/src/rust/phone_encoder/src/main.rs

Lines 22 to 25 in 29019dd

    
           // drop itself from args 
        
           let mut args: Vec<_> = args().skip(1).collect(); 
        
           let words_file: String = if !args.is_empty() { args.remove(0) } else { "tests/words.txt".into() }; 
        
           let input_file: String = if !args.is_empty() { args.remove(0) } else { "tests/numbers.txt".into() };

If you don't want to use a dedicated args parser such as clap or structopt or argh, the idiomatic way to do this would be to call .next().unwrap_or("default value") instead of if !args.is_empty() { args.remove(0) } else { "default value" }

BTW I won't merge the Pull Request

I don't really understand your motivation here; but it's of course up to you. But please update your blog post with new measurements, charts, and conclusions 🙂

vks · 2021-07-30T16:58:35Z

@renatoathaydes

Where do you find this kind of information? If you look around for BigInt Rust implementation, num is what you find, with num-bigint apparently being the version containing only big-int?

num-bigint is a decent implementation, especially if you are looking for something that does not use any unsafe Rust, in which case it's the only option AFAIK. It is however one of the slowest bigint implementations. I don't think this information is easy to find, it is tribal knowledge of people who compared the different bigint implementations in Rust.

If you are willing to use an implementation in unsafe Rust, ibig is currently the best choice. This would be more comparable to e.g. Python's native bigints, which are implemented in (unsafe) C. All Rust implementations faster than ibig are using bindings to GMP, which is implemented in C and the "state of the art" for large integers. (It takes great care to provide asymptotically fast algorithms, which is overkill for most use cases.)

Edit: I have to correct myself: num-bigint has some unsafe code nowadays, although significantly less than the other contenders.

renatoathaydes · 2021-07-30T19:07:40Z

I don't really understand your motivation here; but it's of course up to you. But please update your blog post with new measurements, charts, and conclusions 🙂

@bugaevc thanks for your review. I will respond to your objections soon, but just want to let you know that the article's intention was never to see how fast a language could be, but how fast it would be when using a certain algorithm and written in a certain amount of time (I had to assume the same algorithm would take roughly the same amount of time in any language, which IMO is pretty close to reality).

Using BigInt or equivalent is absolutely core to the solution, a language that doesn't implement big int well is not going to be appropriate in a lot of domains (I do agree that it was not the optimal solution for this particular problem, but that's beyond the point I was trying to make).

It's great to know there are many library implementations of big int in Rust, but to be honest, this is a weakness in Rust: you don't have a big standard library, and knowning which crate to use is next to impossible. You need to either believe other's suggestions or spend lots of time yourself to figure it out: both of which are not good in a business environment (not to mention that most Rust libraries are 0.x, have dozens of 0.x dependencies, and belong to personal accounts, which is really hard to trust).

I don't think my blog post misrepresented Rust at all, in fact it was unfair to other languages in that it was the only one that used libraries (later, I also used libraries in Dart, but in that case it was only for some ASCII constants, not to optimise anything).

Your suggestion that my Rust code is not perfect might stand, but do you think every Rust developer is perfect and will write highly optimised code? Do you think the code in the other languages was perfect? That's not in the spirit of my article to assume that is the case.

The only purpose of measuring speed and memory was to see how a port of Norvig's solution would perform in other languages, thus discounting the big differences that would certainly arise due to different programs arriving at a solution in completely different ways ... I don't know why some people think it's useful to compare languages that way :) it's just not useful at all... any language can implement any algorithm, so if we want to update the Rust code with optimisations like this, fair enough, but I will also have to optimise the other languages and spend as much effort at least as you have spent on this for Rust. I will also have to update how long it took people to come up with those optimisations... Only then may I update anything on my results. But unfortunately I don't have enough free time (I already spent far too much on this problem) to engage in that right now.

But I will definitely update the spreadsheet with the optimisations in all languages once I am able to collect them (there's a few improvements for Julia, and I have several ideas to make Java and Dart much faster as well).

Finally, I said clearly in the README that I wouldn't accept pull requests: https://github.com/renatoathaydes/prechelt-phone-number-encoding#adding-a-new-program-to-the-benchmarks

Will post them here as soon as I can.

Thanks again for taking the time to make your points.

OvermindDL1 · 2021-07-30T20:18:14Z

Also, does it seem to you that the remaining allocations are the bottleneck? From the flamechart, it appears that the majority of time is spent looping in print_translations(), so that would be what I'd try to optimize next.

+1

Where do you find this kind of information? If you look around for BigInt Rust implementation, num is what you find, with num-bigint apparently being the version containing only big-int?

I actually didn't run across num and num-bigint until long after I'd been using rug. rug was the first result I got when I searched arbitrary precision integer on crates.io a while back, though since then I've seen num-bigint defined as a 'correct-but-not-fast' implementation of arbitrary precision integers useful if you need a platform where gmp is not supported. gmp-mpfr-sys is useful if you just want low level gmp bindings, but rug is much nicer to use (it can use gmp-mpfr-sys as a dependent if so wanted but its optional).

I would like to keep BigUint to keep the algorithms comparable at all.

They aren't really comparable as it is though, num-bigint doesn't do any of those low level performance enhancements that lisp, python, gmp, etc... do in their arbitrary precision integer implementations.

BTW I won't merge the Pull Request, will just keep it open for future reference by people who are interested in see where Rust problems were originally... that might help them know what NOT to do in Rust to not make their programs slow.

Regardless, having the implementation in the repo using num-bigint is not being fair to the rust implementation as it is not an implementation focused on performance but instead focused on readability and correctness to the detriment of performance.

The value was not cloned, it was passed as a reference to the BigUint methods:

It clones it internally to make a new extended copy. You can save that cost even within it by just using platform native integers with the num-bigint libary.

Wow that's very unexpected, I thought it was hard to get problems like this in Rust. Can you tell me exactly where the problem is?

It's because it's using pidrusage, which as seen from the docs is specifically a Mac OS X 10.9+ binding using the rusage subsystem on mac's. The libproc libary is a set of low level binding to some low level interfaces to linux and mac systems, and there's not a lot of overlap between them.

Well... here are a few things I consider unidiomatic:

+1

num-bigint is a decent implementation, especially if you are looking for something that does not use any unsafe Rust, in which case it's the only option AFAIK. It is however one of the slowest bigint implementations. I don't think this information is easy to find, it is tribal knowledge of people who compared the different bigint implementations in Rust.

I don't think it was tribal in my case, I just grabbed the first result on crates for my use, which was rug, I only found out about the others (and their performance issues) later... ^.^;

I don't think my blog post misrepresented Rust at all, in fact it was unfair to other languages in that it was the only one that used libraries (later, I also used libraries in Dart, but in that case it was only for some ASCII constants, not to optimise anything).

It can be done perfectly fine in rust without libraries however. Using a bigint seems a bit odd to store the individual numbers in.

Your suggestion that my Rust code is not perfect might stand, but do you think every Rust developer is perfect and will write highly optimised code? Do you think the code in the other languages was perfect? That's not in the spirit of my article to assume that is the case.

Not highly optimized no, but more traditionally rust, reaching for an arbitrary precision integer implementation is never something I would have looked for for a problem such as this. In lisp that is more expected due to how it's built, but not every language is built the same and its generally best to compare implementations written as if they were natively written in a given language, not something highly optimized, but something readable and "expected" for the language. I'd agree with the same regardless of the language (like even though java has arbitrary precision integers that's still not something a java program would reach for for a task such as this as it's just not the "java way").

bugaevc · 2021-07-30T20:34:44Z

I see, thank you for explaining.

the article's intention was never to see how fast a language could be, but how fast it would be when using a certain algorithm and written in a certain amount of time

That wasn't clear to me from the article at all.

While it undoubtedly talks about developer productivity and the time it took for the studies' participants to prepare their submissions, the actual implementations being benchmarked, introduced in the Benchmarks section, are described as:

To see what would’ve happened if I had chosen Norvig’s strategy instead of using a Trie, I decided to port his CL code as closely as possible to Java.

I did the same thing targeting Rust as well in order to get a baseline for how fast the algorithm could go (or so I thought).

You can see the source code for each implementation on GitHub:

Java 1 - my original trie-based implementation.

Java 2 - port of Norvig’s solution to Java.

Lisp - Norvig’s solution written in Common Lisp.

Rust - port of Norvig’s solution to Rust.

i.e. it says "port of Norvig’s solution to Rust" "to get a baseline for how fast the algorithm could go", not "as good of a Rust implementation as I managed to write in 3.5 hours, at which point I stopped in order to be fair to other languages". A bit later you even talk about spending time to improve the Rust version:

I asked Rust developers on the Rust Discord Channel for help figuring out why it was so slow, but after several suggestions, the results did not improve significantly (the result shown above includes some suggestions). I even generated a flamegraph to profile the Rust code, but was unable to find any quick fix.

Of course, it's your post, so you know better; it's just that my understanding was incorrect; and thank you again for explaining it. But that also shows that it's easy to mistake your post for a usual post that compares the performance of idiomatic, reasonably optimized versions of the same program in different languages. So perhaps it wouldn't hurt to add a note along these lines:

"Note: This Rust implementation is not as fast as it could be: with some small changes it can be made 3 to 6 times faster. But this article's intention is not to see how fast a language could be, but how fast it would be when using a certain algorithm and written in a certain amount of time."

OvermindDL1 · 2021-07-30T20:54:27Z

I am very curious at looking at the discussion of people that looked at that code as I'm curious how such obvious things were missed, I hopped in the rust discord and searched for "norvig" (last result 2 years ago), "lisp" (waaaay too many results, as well as a few variations of lisp), "bigint" (found a lot of people complaining about num-bigint being slow and that they should use a gmp based library instead like rug, mostly sql related messages though), "phone" (whoa way too many results), "phone number" (lot of complaints about advertisers taking people's phone numbers and how bad it is for 2fa), "prechelt" (no results at all), and a few other things including looking at many messages of accounts of a similar name to the github name here (though none identical), but not finding anything related to anything like this at all?

renatoathaydes · 2021-07-30T21:52:48Z

This discussion is actually great.

Every language community that has discussed my blog post so far has had some similar arguments (which is expected): the code should be optimised, it shouldn't use BigInt, it's not fair to our language because blah blah...

But this is the first time anyone has come up with significant improvements.

As I mentioned before, removing BigInt is a problem for me, but I am still interested in learning what makes Rust programs fast or slow, and this was very illuminating for that.

Here's the results I got after running the Rust code in this Pull Request VS a Java 2 implementation that uses the same techniques for optimisation:

Rust is now much closer to what everyone would expect: the fastest language and the lowest memory consumption :)

But I have to note that CL is still, amazingly, very close.

I must say I still do believe that a student in the same circumstances as the other students in the study might write sloppy Rust code (yes, I admit my Rust code was sloppy!) and that BitInt is slow in Rust, and because of that I think my article's points remain mostly accurate (Rust was just a "side" story in the article anyway, I barely discussed it, as the purpose for which I tried to use it - as a proxy for the C/C++ fastest implementations - didn't work, as you know). Specially because Rust is so hard to write (well) compared to other languages (opinions may differ, but I know a lot of people that agree with that, and very few who don't - and this very discussion kind of proves the point).

The Java 2 code has been pushed to the fast-java branch, and you can see it here.

The Java 1 and CL codes are the exact same as in the original benchmark. I tried to apply some optimisations to the Java 1 implementation but it backfired :D ... sometimes Java can be tricky to optimise because the JIT "likes" certain code better than others that might look more peformant. However, it still does have a lot of room for improvement, maybe some other time I will look again into it.

I am very curious at looking at the discussion of people that looked at that code as I'm curious how such obvious things were missed

What may be obvious to you is not obvious to everyone. The only "obvious" thing, I think, was replacing BigInt which was something I intentionally didn't want to do!

... implementation focused on performance but instead focused on readability and correctness to the detriment of performance.

This is actually a great comment: it shows that Rust is not magically fast, and not expected to be by its proponents, and that it still requires you to make a tradeoff. It takes effort to make Rust fast. Which is something very important to consider when choosing a language in a professional setting where time is of the essence.

I've already published the data in the spreadsheet (linked above) and I will add a note to the original article mentioning a small summary of this discussion as I think readers may indeed arrive at the wrong conclusions based on what the article currently says about Rust (notice that I am a Rust proponent myself and use it at work, although as only a third or fourth language, and a few hobby projects).

I will post the update tomorrow as it's late now!

Thanks everyone for taking the time to scrutinize my article.

philipc · 2021-07-30T22:02:17Z

See https://github.com/philipc/prechelt-phone-number-encoding/blob/rust/src/rust/phone_encoder/src/main.rs if you want a version that still uses num-bigint and is reasonably fast. These are the changes I previously discussed on reddit. And interestingly, I tried swapping to rug and ibig in that and it was slower.

renatoathaydes · 2021-07-30T22:26:36Z

@philipc that's amazing. It's nearly as performant as the highly optimised code in this PR, but much closer to the approach of the original!

Your program:

===> ./phone_encoder
Proc,Memory(bytes),Time(ms)
./phone_encoder,24567808,105
./phone_encoder,24600576,176
./phone_encoder,24596480,428
./phone_encoder,24580096,745

This PR code:

===> ./phone_encoder
Proc,Memory(bytes),Time(ms)
./phone_encoder,23511040,101
./phone_encoder,23515136,153
./phone_encoder,23539712,332
./phone_encoder,23543808,568

Very well done... I wish I had used that in the study :(

renatoathaydes · 2021-07-31T14:47:04Z

Hey guys, here I am again probably going to get in trouble!

I decided to lose one more day of my life on this problem, and wrote a whole new article about this:

https://renato.athaydes.com/posts/how-to-write-slow-rust-code.html

I also edited the original with a note right next to where the Rust chart is shown for the first time.

The conclusions are NOT what anyone expected, including me, again?! Did I do something wrong?

I hope not... anyway, let me know if you would like me to change anything, or give more credit to you guys or something...

Thanks!

renatoathaydes · 2021-08-01T14:08:50Z

Please note that the version of my changes you appeared to benchmark (the fast-rust-no-bigint branch) does not include the two changes I proposed in this comment.

Oh,that's true... If I applied that, I would need to do the same in Java as Java obviously also has faster hash-map implementations. I don't think that doing that would be reasonable of me! There's always going to be something else that coulde be modified in both versions (and let's not forget Common Lisp is still not optimised at all, not really fair to them).

I also tried buffering output & getting rid of constant locking:

That's fair enough, I am going to update the article with results when both Java and Rust buffer stdout (currently, neither seems to do as once I use buffering, both get a big speed up on the larger duration problems).

I will keep you posted once results are in.

renatoathaydes · 2021-08-01T14:17:15Z

This reduced the run to further from 1 minute to... 1.6 seconds. Yes, really.

I don't know, to me it looks like the whole thing was just optimized away... there seems to be no plausible way it was spending all the rest of that minute just trying to print (none of the flamegraphs I've seen shows anything like that).

renatoathaydes · 2021-08-01T14:19:31Z

An acceptable alternative in my opinion, would to consume the output of the program on a separate thread, and let the second thread print the results using the buffer... you can then see if the first thread terminates much earlier than the printer thread does, in which case we can count only the time taken up until the first thread terminates (as we don't really care as much about how long it takes to push to stdout).

bugaevc · 2021-08-01T14:28:50Z

Java obviously also has faster hash-map implementations

Please note that:

The thing I changed in Rust was use a different hasher (hashing algorithm), not a different hash map implementation. I believe in Java a specific hashing algorithm is fixed (or rather, each object's hashCode() decides on its own scheme and returns a plain integer); but in Rust Hash and Hasher implementations are orthogonal, which is why you can easily replace the hash algorithm.

The default hasher used in Rust is designed to not degrade in case of attacker-controlled data (so the attacker would be unable to force all keys to hash to the same value). That is, generally, a good choice, but it means the default hasher is not as fast as it could be. I don't think the hash functions implemented in Java for the standard types do that.
The Apache Commons hash map you've linked is explicitly documented to be optimized for mostly-read multithread workflows. Furthermore, it states:

If you are creating and accessing a HashMap only within a single thread, you should use java.util.HashMap directly (with no synchronization), for maximum performance.

so I don't think it would be an improvement to use it here.

I don't know, to me it looks like the whole thing was just optimized away...

Are you saying that having looked at the generated assembly and the flamechart, or just from the general idea that this much of a speed-up is unexpected?

An acceptable alternative in my opinion, would to consume the output of the program on a separate thread, and let the second thread print the results using the buffer...

See, even simply buffering all that output takes a lot of time (apparently most of the 1 minute run time), which was clearly visible on the flamechart when using BufWriter, which is exactly why I wondered what would happen if we don't do the buffering either.

I'm no JVM expert, but a quick web search turned up Blackhole from JMH as Java's counterpart to Rust's black_box() (as far as consuming the value goes). Can you benchmark the Java version with that instead of outputting results?

renatoathaydes · 2021-08-01T15:33:47Z

@bugaevc if the solution to the problem is not visible, it's useless to measure anything, so not printing, or even at least keeping all found solutions for later printing, is not done, there's no point measuring anything.

Are you saying that having looked at the generated assembly and the flamechart, or just from the general idea that this much of a speed-up is unexpected?

I don't care at all about the assembly as even that is impossible to profile in your head. I care about the results being found and used. Also, figuring out the solutions should take a substantial amount of time as the number of combinations becomes enormous with longer phone numbers... so yes, it's very unrealistic the problem is actually being solved in just 1 second.

Can you try to make the buffering/printing run on a separate Thread or not?

In Java it's trivial to do that, I did it, it appears that even for very large inputs, the actual printing takes up less than 1/3 of the time even without any buffering. That's substantial, but I wouldn't say that's making the benchmark measure the entirely wrong thing because it is absolutely part of the problem that you need to finish outputing all solutions.

bugaevc · 2021-08-01T16:28:37Z

if the solution to the problem is not visible, it's useless to measure anything, so not printing, or even at least keeping all found solutions for later printing, is not done, there's no point measuring anything.

How so? It makes a ton of sense to me to measure the time it takes to compute the solutions, not output them — especially in case outputting the solutions far exceeds the time to compute them. In other words, what you're currently benchmarking seems to be output performance, not the actual Prechelt’s problem or Norvig's solution.

The only issue with this is ensuring that the steps to compute the solutions are actually taken if we don't print them out, because the optimizers are so damn smart. That's what the black_box() gives us. So now, we're computing the solutions and throwing them away immediately, but to the optimizer it still appears we're using them somehow, so it cannot optimize anything away. And the measured time reflects how long it take to actually compute the solutions (≈1.6s), not how much time it takes to output/buffer them (≈59s).

Unless, of course, I'm entirely wrong, and the optimizer did optimize the whole thing away without me noticing. I don't rule out this possibility, but as of now there's way more evidence that it works as I expect it to, than otherwise.

I don't care at all about the assembly as even that is impossible to profile in your head. I care about the results being found and used.

You don't have to profile the assembly in your head; you simply have to verify that the calls to print_translations(), hashing, etc. are all still there, not optimized away. Here's a Godbolt (with some slight modifications to make the code build on Godbolt; don't actually benchmark that version) where you can do just that.

Making the results "used" is precisely what black_box() does. (The other thing it does is it prevents the optimizer from seeing its output is the same value as its input, but I'm not using that here.) It by itself produces no assembly code, but makes the optimizer believe the value is used and so cannot be optimized away (and must be materialized in a register or in memory).

Also, figuring out the solutions should take a substantial amount of time as the number of combinations becomes enormous with longer phone numbers... so yes, it's very unrealistic the problem is actually being solved in just 1 second.

There seems to be specifically 71 998 979 solutions in the words.txt phones_2000.txt case; it's not that unrealistic to check 72 million values in about a second. For comparison, (each core of) my CPU runs at 1.60GHz, so it can do 1.6 billion cycles per second.

Outputting 72 million values, on the other hand, can be quite slow.

Can you try to make the buffering/printing run on a separate Thread or not?

I sure can, but as said, that would mostly measure the time it takes to buffer the results, not the time it takes to compute them. It doesn't matter much that there would be two threads, as it would still be the first thread that does the buffering.

So I've done that, tried to run it, and... ran into a little problem: the first thread generated solutions far faster than the second thread is able to consume and print them, so the process keeps allocating memory and eventually gets OOM-killed. Therefore I was unable to run it to the end, but runs for at least 30 seconds for sure.

The code, in case you want to try this yourself

use std::collections::HashMap;
use std::env::args;
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;
use std::fmt::{self, Display, Formatter};
use std::sync::mpsc::{Sender, channel};

type Dictionary = HashMap<Vec<u8>, Vec<String>, ahash::RandomState>;

#[derive(Debug, Copy, Clone)]
enum WordOrDigit<'a> {
    Word(&'a str),
    Digit(u8),
}

impl Display for WordOrDigit<'_> {
    fn fmt(&self, formatter: &mut Formatter<'_>) -> fmt::Result {
        match self {
            WordOrDigit::Word(s) => s.fmt(formatter),
            WordOrDigit::Digit(d) => d.fmt(formatter),
        }
    }
}


/// Port of Peter Norvig's Lisp solution to the Prechelt phone-encoding problem.
///
/// Even though this is intended as a port, it deviates quite a bit from it
/// due to the very different natures of Lisp and Rust.
fn main() -> io::Result<()> {
    // drop itself from args
    let mut args: Vec<_> = args().skip(1).collect();
    let words_file: String = if !args.is_empty() { args.remove(0) } else { "tests/words.txt".into() };
    let input_file: String = if !args.is_empty() { args.remove(0) } else { "tests/numbers.txt".into() };

    let dict = load_dict(words_file)?;

    let (sender, receiver) = channel();
    std::thread::spawn(|| {
        for value in receiver {
            println!("{}", value);
        }
    });

    for line in read_lines(input_file)? {
        let num = line?;
        let digits: Vec<u8> = num.chars()
            .filter(char::is_ascii_digit)
            .map(|ch| ch as u8)
            .collect();
        print_translations(&num, &digits, &mut Vec::new(), &dict, &sender);
    }
    Ok(())
}

fn print_translations<'a>(
    num: &str,
    digits: &[u8],
    words: &mut Vec<WordOrDigit<'a>>,
    dict: &'a Dictionary,
    sender: &Sender<String>,
) {
    if digits.is_empty() {
        print_solution(num, &words, sender);
        return;
    }
    let mut found_word = false;
    for i in 0..digits.len() {
        let (key, rest_of_digits) = digits.split_at(i + 1);
        if let Some(found_words) = dict.get(key) {
            for word in found_words {
                found_word = true;
                words.push(WordOrDigit::Word(word));
                print_translations(num, rest_of_digits, words, dict, sender);
                words.pop();
            }
        }
    }
    if found_word {
        return;
    }
    let last_is_digit = match words.last() {
        Some(WordOrDigit::Digit(_)) => true,
        _ => false,
    };
    if !last_is_digit {
        let digit = digits[0] - b'0';
        words.push(WordOrDigit::Digit(digit));
        print_translations(num, &digits[1..], words, dict, sender);
        words.pop();
    }
}

fn print_solution(num: &str, words: &[WordOrDigit<'_>], sender: &Sender<String>) {
    sender.send(num.to_owned()).unwrap();
    for word in words {
        sender.send(word.to_string()).unwrap();
    }
}

fn load_dict(words_file: String) -> io::Result<Dictionary> {
    let mut dict: Dictionary = HashMap::with_capacity_and_hasher(
        100,
        ahash::RandomState::default()
    );
    for line in read_lines(words_file)? {
        let word = line?;
        let key = word_to_number(&word);
        let words = dict.entry(key).or_default();
        words.push(word);
    }
    Ok(dict)
}

// The output is wrapped in a Result to allow matching on errors
// Returns an Iterator to the Reader of the lines of the file.
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
    where P: AsRef<Path>, {
    let file = File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}

fn word_to_number(word: &str) -> Vec<u8> {
    word.chars()
        .filter(char::is_ascii_alphabetic)
        .map(char_to_digit)
        .map(|d| d + b'0')
        .collect()
}

fn char_to_digit(ch: char) -> u8 {
    match ch.to_ascii_lowercase() {
        'e' => 0,
        'j' | 'n' | 'q' => 1,
        'r' | 'w' | 'x' => 2,
        'd' | 's' | 'y' => 3,
        'f' | 't' => 4,
        'a' | 'm' => 5,
        'c' | 'i' | 'v' => 6,
        'b' | 'k' | 'u' => 7,
        'l' | 'o' | 'p' => 8,
        'g' | 'h' | 'z' => 9,
        _ => panic!("invalid input: not a digit: {}", ch)
    }
}

This could be solved by using a bounded buffer, but that again would massively slow down the first thread once it exhausts the buffer size and will have to wait for the second thread to consume more values.

Here's a suggestion in case you don't believe in black_box(). Let's compute some sort of hash of all the solutions and print that (single hash value) in the end. That ensures that the solutions are obviously used, without you having to believe in back_box(). (The specific details of hashing algorithm don't matter and it's OK if Java and Rust output different values — what matters is that the output is obviously used.)

I've implemented that for Rust, here's the code:

use std::collections::HashMap;
use std::env::args;
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;
use std::fmt::{self, Display, Formatter};
use std::hash::{Hash, Hasher};

type Dictionary = HashMap<Vec<u8>, Vec<String>, ahash::RandomState>;

#[derive(Debug, Copy, Clone, Hash)]
enum WordOrDigit<'a> {
    Word(&'a str),
    Digit(u8),
}

impl Display for WordOrDigit<'_> {
    fn fmt(&self, formatter: &mut Formatter<'_>) -> fmt::Result {
        match self {
            WordOrDigit::Word(s) => s.fmt(formatter),
            WordOrDigit::Digit(d) => d.fmt(formatter),
        }
    }
}


/// Port of Peter Norvig's Lisp solution to the Prechelt phone-encoding problem.
///
/// Even though this is intended as a port, it deviates quite a bit from it
/// due to the very different natures of Lisp and Rust.
fn main() -> io::Result<()> {
    // drop itself from args
    let mut args: Vec<_> = args().skip(1).collect();
    let words_file: String = if !args.is_empty() { args.remove(0) } else { "tests/words.txt".into() };
    let input_file: String = if !args.is_empty() { args.remove(0) } else { "tests/numbers.txt".into() };

    let dict = load_dict(words_file)?;
    let mut hasher = ahash::AHasher::default();

    for line in read_lines(input_file)? {
        let num = line?;
        let digits: Vec<u8> = num.chars()
            .filter(char::is_ascii_digit)
            .map(|ch| ch as u8)
            .collect();
        print_translations(&num, &digits, &mut Vec::new(), &dict, &mut hasher);
    }
    println!("{}", hasher.finish());
    Ok(())
}

fn print_translations<'a>(
    num: &str,
    digits: &[u8],
    words: &mut Vec<WordOrDigit<'a>>,
    dict: &'a Dictionary,
    hasher: &mut impl Hasher,
) {
    if digits.is_empty() {
        print_solution(num, &words, hasher);
        return;
    }
    let mut found_word = false;
    for i in 0..digits.len() {
        let (key, rest_of_digits) = digits.split_at(i + 1);
        if let Some(found_words) = dict.get(key) {
            for word in found_words {
                found_word = true;
                words.push(WordOrDigit::Word(word));
                print_translations(num, rest_of_digits, words, dict, hasher);
                words.pop();
            }
        }
    }
    if found_word {
        return;
    }
    let last_is_digit = match words.last() {
        Some(WordOrDigit::Digit(_)) => true,
        _ => false,
    };
    if !last_is_digit {
        let digit = digits[0] - b'0';
        words.push(WordOrDigit::Digit(digit));
        print_translations(num, &digits[1..], words, dict, hasher);
        words.pop();
    }
}

fn print_solution(
    num: &str,
    words: &[WordOrDigit<'_>],
    hasher: &mut impl Hasher
) {
    num.hash(hasher);
    for word in words {
        word.hash(hasher);
    }
}

fn load_dict(words_file: String) -> io::Result<Dictionary> {
    let mut dict: Dictionary = HashMap::with_capacity_and_hasher(
        100,
        ahash::RandomState::default()
    );
    for line in read_lines(words_file)? {
        let word = line?;
        let key = word_to_number(&word);
        let words = dict.entry(key).or_default();
        words.push(word);
    }
    Ok(dict)
}

// The output is wrapped in a Result to allow matching on errors
// Returns an Iterator to the Reader of the lines of the file.
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
    where P: AsRef<Path>, {
    let file = File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}

fn word_to_number(word: &str) -> Vec<u8> {
    word.chars()
        .filter(char::is_ascii_alphabetic)
        .map(char_to_digit)
        .map(|d| d + b'0')
        .collect()
}

fn char_to_digit(ch: char) -> u8 {
    match ch.to_ascii_lowercase() {
        'e' => 0,
        'j' | 'n' | 'q' => 1,
        'r' | 'w' | 'x' => 2,
        'd' | 's' | 'y' => 3,
        'f' | 't' => 4,
        'a' | 'm' => 5,
        'c' | 'i' | 'v' => 6,
        'b' | 'k' | 'u' => 7,
        'l' | 'o' | 'p' => 8,
        'g' | 'h' | 'z' => 9,
        _ => panic!("invalid input: not a digit: {}", ch)
    }
}

This runs in 5.7s, which is 6× slower than 1s (clearly, computing additional hashes takes time), but still 10× faster than the original 1 minute (which also confirms that the version with black_box() is legit).

renatoathaydes · 2021-08-01T16:51:08Z

You have good points but I think they are mostly of limited interest to the readers of my blog post.

I think what you're suggesting is that Java may be much slower to solve the problems, but so much faster to print them that it still wins. I don't see how that matters at all if that's the case. The user of the program wants to use the results of the program, they don't care where the time is being spent.

Feel free to leave a criticism of my blog post here if you still feel like I was unfair on my analysis. I don't think I was.

The problem was: find and print all solutions.

Now you're solving a different problem entirely which I think doesn't make sense.

vks · 2021-08-01T22:08:47Z

@bugaevc I agree that it does not make much sense to use black box, because you still need to output your results somewhere. You could try to push your results to a string (or a Vec if you want to keep the data binary) instead of printing them, and print the string at the end.

bugaevc · 2021-08-02T15:56:51Z

I think what you're suggesting is that Java may be much slower to solve the problems, but so much faster to print them that it still wins.

That's not at all what I'm saying.

In fact, I don't think I have made any claims in this thread on relative performance of the different languages. All I've been doing is benchmarking different variants of the Rust implementation against each other, and suggesting that some changes may or may not make sense to do in the Java version too, based on what I have observed about the Rust version.

My main finding has been that in the words.txt phones_2000.txt benchmark, the time it takes the Rust implementation to compute the 72m solutions is absolutely dwarfed by the time it takes it to buffer and/or output the resulting 2.8 Gb of data. I'm speculating that the Java version also faces a similar situation. I absolutely don't know whether Java is faster or slower to compute the solutions, which is exactly why I was suggesting that you benchmark that.

I might be biased by my background in competitive programming, but it's a very well known fact that outputting large amounts of data is slow. Problems that generate a huge number of data ("Find all possible xxx that...") typically require the solutions to only output the number of items found and not the items themselves; this is done specifically to ensure that the solution has to find all the valid items, but does not have to print them.

If I was to publish a benchmark that was meant to compare the performance of a few solutions to an algorithmic problem but in fact turned out to mostly measure the time it takes to output the answers, I'd be greatly embarrassed, and would rush to publish a rebuttal & fix the benchmark.

That is not to suggest that it never makes sense to benchmark output performance/throughput of some system; such benchmarks can also be useful for their own purposes. It's just that it sounds very wrong to me to claim that you're benchmarking some solution to some problem when in fact what you're measuring is how long it takes to output 2.8 Gb of data.

But of course, it's your blog, you make the rules. If you say that

The problem was: find and print all solutions. Now you're solving a different problem entirely which I think doesn't make sense.

then OK, let's see how fast we can output the 2.8 Gb of data.

Increasing the buffer size to a megabyte seems to help a lot, bringing down the run time to ≈29s. (Increasing the buffer size further doesn't seem to improve the run time further — meaning the time is already dominated by buffering and not the actual write() syscalls.) Whether we use a fast hashing algorithm or a slow one doesn't matter much at this point.

diff relative to 7f1e5ee

diff --git a/src/rust/phone_encoder/src/main.rs b/src/rust/phone_encoder/src/main.rs
index 9c755ea..f5b184f 100644
--- a/src/rust/phone_encoder/src/main.rs
+++ b/src/rust/phone_encoder/src/main.rs
@@ -1,7 +1,7 @@
 use std::collections::HashMap;
 use std::env::args;
 use std::fs::File;
-use std::io::{self, BufRead};
+use std::io::{self, Write, BufRead, BufWriter, StdoutLock};
 use std::path::Path;
 use std::fmt::{self, Display, Formatter};
 
@@ -35,13 +35,20 @@ fn main() -> io::Result<()> {
 
     let dict = load_dict(words_file)?;
 
+
+    let stdout = io::stdout();
+    let mut writer = BufWriter::with_capacity(
+        1024 * 1024,
+        stdout.lock(),
+    );
+
     for line in read_lines(input_file)? {
         let num = line?;
         let digits: Vec<u8> = num.chars()
             .filter(char::is_ascii_digit)
             .map(|ch| ch as u8)
             .collect();
-        print_translations(&num, &digits, &mut Vec::new(), &dict);
+        print_translations(&num, &digits, &mut Vec::new(), &dict, &mut writer);
     }
     Ok(())
 }
@@ -51,9 +58,10 @@ fn print_translations<'a>(
     digits: &[u8],
     words: &mut Vec<WordOrDigit<'a>>,
     dict: &'a Dictionary,
+    writer: &mut BufWriter<StdoutLock>,
 ) {
     if digits.is_empty() {
-        print_solution(num, &words);
+        print_solution(num, &words, writer);
         return;
     }
     let mut found_word = false;
@@ -63,7 +71,7 @@ fn print_translations<'a>(
             for word in found_words {
                 found_word = true;
                 words.push(WordOrDigit::Word(word));
-                print_translations(num, rest_of_digits, words, dict);
+                print_translations(num, rest_of_digits, words, dict, writer);
                 words.pop();
             }
         }
@@ -78,25 +86,25 @@ fn print_translations<'a>(
     if !last_is_digit {
         let digit = digits[0] - b'0';
         words.push(WordOrDigit::Digit(digit));
-        print_translations(num, &digits[1..], words, dict);
+        print_translations(num, &digits[1..], words, dict, writer);
         words.pop();
     }
 }
 
-fn print_solution(num: &str, words: &[WordOrDigit<'_>]) {
+fn print_solution(num: &str, words: &[WordOrDigit<'_>], writer: &mut BufWriter<StdoutLock>) {
     // do a little gymnastics here to avoid allocating a big string just for printing it
-    print!("{}", num);
+    write!(writer, "{}", num).unwrap();
     if words.is_empty() {
-        println!(":");
+        writeln!(writer, ":").unwrap();
         return;
     }
-    print!(": ");
+    write!(writer, ": ").unwrap();
     let (head, tail) = words.split_at(words.len() - 1);
     for word in head {
-        print!("{} ", word);
+        write!(writer, "{} ", word).unwrap();
     }
     for word in tail { // only last word in tail
-        println!("{}", word);
+        writeln!(writer, "{}", word).unwrap();
     }
 }

You could try to push your results to a string (or a Vec if you want to keep the data binary) instead of printing them, and print the string at the end.

That is exactly what I mean by buffering (except you don't have to hold off printing all the way till the very end, doing it a few times in the middle is fine), and as you can see this is also very slow compared to actually finding the solutions. If we are indeed interested in the combined time it takes to find the solutions and output them (which would be reasonable if the output part wasn't this disproportionally slow), then yes, this seems to be our best shot.

OvermindDL1 · 2021-08-02T16:40:51Z

The Java 2 code has been pushed to the fast-java branch, and you can see it here.

Ooooo nice! Java's BigInt is also pretty simple and not at all nor designed to be fast, just accurate for business use but as this was a Rust thread I didn't bother mentioning a few things about Java that I noticed.

Unlike Rust, Java doesn't have a good binding with gmp or other such faster bigint libaries because the JNI just has so much overhead and I've never seen a native-java library that focuses on gmp-like performance.

What may be obvious to you is not obvious to everyone.

I like to read such conversations because it shows me how the more average people think and reason through things, it's a huge learning opportunity for me. I've very biased from working in C++ for 30+ years so I often can't really see things from a more new person perspective, hence why I like reading articles like yours and discussions like that. :-)

And interestingly, I tried swapping to rug and ibig in that and it was slower.

Yeah I had that fear that using it as a dynamic library (which rug does by default) would prevent a lot of inline optimizations that would be a huge help. I've never used ibig before, only found out about it because of this thread, may need to experiment sometime. ^.^

I really like that it got rid of the internal clone, I was thinking that would be one of the biggest speed hits of it and this does seem to corrolate considering its being created via .into() from a native integer. I wonder if there's someway to precalc the size it needs, hmm...

I decided to lose one more day of my life on this problem, and wrote a whole new article about this:

Oh hey, I quite enjoy reading your articles, awesome!

Perhaps not including those changes was intentional on your part (perhaps you consider it unfair to use an external hasher and additional output buffering?), but it sounds like this deserves at least a mention in the new post. It's going to, once again, affect the outcome pretty significantly 😄
Hmm, it might actually be the other way around. Correct me if I'm wrong; but it sounds like Java does both System.out buffering and non-DoS-proof hashing by default.

Java indeed does not have a DoS resistent hasher, Rust is one of the very very few languages I've ever run across that defaults to a DoS resistent hasher, but it's designed to be extremely simple to swap out the hasher as they know that, unlike most languages, Rust programmers often like to optimize for speed, but they want the default to be 'safe' unlike other languages hashers. ^.^

Of course, it wouldn't be fair to Java (or other languages) to make the Rust version do no output while they still do it, especially since there's now enough evidence to believe that the time to output the results takes up the absolute majority of the total execution time. It would be fair to give the Java version the same treatment, i.e. make it not print the results, but "use" them in some way that would ensure the JIT will not optimize the whole computation away.

Java doesn't have a good black_box equivilant function though, I've tried very hard and found nothing that is reliable. Even some supposed JVM-specific calls have failed and been optimized out...

Oh,that's true... If I applied that, I would need to do the same in Java as Java obviously also has faster hash-map implementations. I don't think that doing that would be reasonable of me! There's always going to be something else that coulde be modified in both versions (and let's not forget Common Lisp is still not optimised at all, not really fair to them).

Java does have faster hashmaps yes, but it's default hashmap is not DoS safe, so it should in general be faster than DoS resistent hashers regardless. But yeah, using the same hashing algorithm wouldn't be a bad idea, right now Rust and Java (and CL) all use different algorithms.

I don't know, to me it looks like the whole thing was just optimized away... there seems to be no plausible way it was spending all the rest of that minute just trying to print (none of the flamegraphs I've seen shows anything like that).

I'm curious about that as well, but the black_box usage does look correct, and if using nightly rust the black_box should work unless there's been a (very unlikely) regression in LLVM...

Hmm, let me run that above code (hyperfine again as before, the pr1 version is as before):

Summary
  './phone_encoder dictionary.txt phones_1000.txt' ran
    1.32 ± 0.34 times faster than './phone_encoder_pr1 dictionary.txt phones_1000.txt'
    1.66 ± 0.38 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_1000.txt'
    6.66 ± 1.25 times faster than 'java -cp build/java Main dictionary.txt phones_1000.txt'
    7.77 ± 1.49 times faster than 'java -cp build/java Main2 dictionary.txt phones_1000.txt'

Summary
  './phone_encoder dictionary.txt phones_10_000.txt' ran
    1.29 ± 0.26 times faster than './phone_encoder_pr1 dictionary.txt phones_10_000.txt'
    2.21 ± 0.39 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_10_000.txt'
    5.68 ± 0.99 times faster than 'java -cp build/java Main dictionary.txt phones_10_000.txt'
    7.34 ± 1.22 times faster than 'java -cp build/java Main2 dictionary.txt phones_10_000.txt'
	
Summary
  './phone_encoder dictionary.txt phones_10_000.txt' ran
    1.29 ± 0.26 times faster than './phone_encoder_pr1 dictionary.txt phones_10_000.txt'
    2.21 ± 0.39 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_10_000.txt'
    5.68 ± 0.99 times faster than 'java -cp build/java Main dictionary.txt phones_10_000.txt'
    7.34 ± 1.22 times faster than 'java -cp build/java Main2 dictionary.txt phones_10_000.txt'

Summary
  './phone_encoder dictionary.txt phones_100_000_with_empty.txt' ran
    1.91 ± 0.11 times faster than './phone_encoder_pr1 dictionary.txt phones_100_000_with_empty.txt'
    3.69 ± 0.28 times faster than 'java -cp build/java Main dictionary.txt phones_100_000_with_empty.txt'
    4.50 ± 0.22 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_100_000_with_empty.txt'
    8.47 ± 0.43 times faster than 'java -cp build/java Main2 dictionary.txt phones_100_000_with_empty.txt'

I'm not sure I'm seeing the minute sized difference, here it is with 300k entries:

> time ./phone_encoder_pr1 dictionary.txt <(cat ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt) >/dev/null
./phone_encoder_pr1 dictionary.txt  > /dev/null  1.27s user 0.03s system 99% cpu 1.299 total

> time ./phone_encoder dictionary.txt <(cat ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt) >/dev/null 
./phone_encoder dictionary.txt  > /dev/null  0.65s user 0.00s system 99% cpu 0.657 total

It's faster than the PR1 as given but not magnitudes slower, so it seems like black_box is working correctly?

Looking at the generated ASM (via cargo-asm) they also look expectedly similar except for the actual missing calls to the output stream.

The thing I changed in Rust was use a different hasher (hashing algorithm), not a different hash map implementation. I believe in Java a specific hashing algorithm is fixed (or rather, each object's hashCode() decides on its own scheme and returns a plain integer); but in Rust Hash and Hasher implementations are orthogonal, which is why you can easily replace the hash algorithm.

The default hasher used in Rust is designed to not degrade in case of attacker-controlled data (so the attacker would be unable to force all keys to hash to the same value). That is, generally, a good choice, but it means the default hasher is not as fast as it could be. I don't think the hash functions implemented in Java for the standard types do that.

Exactly yeah.

I'm no JVM expert, but a quick web search turned up Blackhole from JMH as Java's counterpart to Rust's black_box() (as far as consuming the value goes). Can you benchmark the Java version with that instead of outputting results?

Not reliable, it's one I've tried and I've had it optimized out in many cases... :-(

It is the 'most' reliable I've found though since it seems to work sometimes and I haven't found a better one yet short of just outputting data to a JNI call (which has its own overhead, but is doable if you want an actual reliable one). Although... you could just pass the entire calculated data to a single JNI call to black hole the entire thing? It won't be serialized or anything, it's direct memory access over the JNI call.

if the solution to the problem is not visible, it's useless to measure anything, so not printing, or even at least keeping all found solutions for later printing, is not done, there's no point measuring anything.

I do agree, printing should still be done, just need to be sure in every language to lock stdout and buffer the output to try to reduce its overhead as much as possible.

I'm not sure I'd start touching threads in programs, the way they synchronize can be extremely different (and honestly Rust would have an unnatural advantage here I would think...).

I do see the advantage to only benchmarking the data calculations, but that starts getting into interesting things that are hard to resolve between native and JIT languages. Like if it were all native then you could make dynamic libaries between all and have a program benchmark across them, but instead with java and such you need to do something like trust that they benchmarking themselves accurately and report the results correctly all without optimizing huge chunks out. This could be minimized perhaps by, say, using jni in a rust benchmarker or so to dynamically link in native libaries and use JNI to call across to the java version and so forth and that overhead would be barely measurable (and you can indeed measure it if you so wish), that would prevent things from being optimized out at that point and allow it to be pretty easily reproducible across them all, but that's getting into some more work (although less than one might initially think).

But yeah, in general benchmarking is really really really really hard to get right, that's why I dive into the assembly when I benchmark things so often (if anyone recognizes my benchmarking things randomly around the 'net), that way to compare the actual generated machine code, but you have to make sure you are testing what you actually want to test (generated code with identical algorithms? generated code from 'traditionally programmed' code for the given language? Etc...).

bugaevc · 2021-08-02T16:48:36Z

I'm not sure I'm seeing the minute sized difference

The specific benchmark I have been talking about in the few last posts is words.txt phones_2000.txt:

prechelt-phone-number-encoding/benchmark.sh

Line 74 in 7f1e5ee

./benchmark_runner $CMD words.txt phones_2000.txt

The others output a lot less data, which is why the effect is less pronounced and swapping the output for black box doesn't improve timing as much.

OvermindDL1 · 2021-08-02T17:00:21Z

As an aside, I wonder how the rust code would look when written in a more, at least what I consider, a traditionally rust style, with iterators and so forth rather than return variables, hmmm... I can't really see much if any performance advantage except perhaps in one place, but it would definitely read more clean (in my opinion anyway, but I'm a fan of functional styles)...

The others output a lot less data, which is why the effect is less pronounced and swapping the output for black box doesn't improve timing as much.

Even still it doesn't seem to be that huge, though let me duplicate it to 50 million:

> rm -f ./phones_50m.txt; for (( i=0; i<1000; ++i)); do cat ./phones_50000.txt >> ./phones_50m.txt; done

> wc phones_50m.txt # It is indeed 50 million lines, though each line has 1000 duplicates each split by 50k lines
  50000000   50000000 1328570000 phones_50m.txt

And run those again...

> time ./phone_encoder dictionary.txt phones_50m.txt > /dev/null
./phone_encoder dictionary.txt phones_50m.txt > /dev/null  96.31s user 0.28s system 99% cpu 1:36.59 total

? time ./phone_encoder_pr1 dictionary.txt phones_50m.txt > /dev/null
./phone_encoder_pr1 dictionary.txt phones_50m.txt > /dev/null  199.11s user 2.38s system 99% cpu 3:21.50 total

So yeah, that took a very long time to run but the black box seems to be working fine, no multiple magnitudes of difference, was twice as fast as the base PR1 version. :-)

bugaevc · 2021-08-02T17:01:52Z

You seem to have missed the words.txt part 🙂

OvermindDL1 · 2021-08-02T17:06:49Z

The only words.txt I have here is in tests/words.txt, and it only has 22 lines / 23 words, where dictionary is substantially larger:

> wc dictionary.txt 
 73113  73113 938582 dictionary.txt

> wc tests/words.txt 
 22  23 103 tests/words.txt

> find . -name words.txt
./tests/words.txt

Though running it with that file instead of dictionary results in:

> time ./phone_encoder tests/words.txt phones_50m.txt > /dev/null
./phone_encoder tests/words.txt phones_50m.txt > /dev/null  36.53s user 0.16s system 99% cpu 36.688 total

> time ./phone_encoder_pr1 tests/words.txt phones_50m.txt > /dev/null
./phone_encoder_pr1 tests/words.txt phones_50m.txt > /dev/null  72.81s user 0.50s system 99% cpu 1:13.32 total

So it is still the same roughly double speed of pr1.

Or did you generate a new words.txt and if so what generated it? Especially to generate your significantly huge difference in times?

bugaevc · 2021-08-02T17:10:17Z

prechelt-phone-number-encoding/benchmark.sh

Lines 11 to 14 in 7f1e5ee

    
           if [ ! -f words.txt ]; then 
        
             echo "Downloading English Words Dictionary." 
        
             ./download_english_words.sh 
        
           fi

prechelt-phone-number-encoding/download_english_words.sh

Lines 1 to 6 in 7f1e5ee

    
           #!/bin/bash 
        
           set -e 
        
           curl https://raw.githubusercontent.com/dwyl/english-words/master/words.txt --output tmp.txt 
        
           tail -n 100000 tmp.txt > words.txt 
        
           rm tmp.txt

OvermindDL1 · 2021-08-02T17:39:34Z

I don't have that in benchmark.sh in my copy here, was that added recently?! o.O

Ah yep, very recently.

Cool, grabbed it and other updates:

> curl https://raw.githubusercontent.com/dwyl/english-words/master/words.txt | tail -n 100000 > words.txt

And I can't get anything to run in a reasonable time (waited at least 10 minutes) on any of the programs...
Yet they do with 50k lines, so:

> head -n 50000 words.txt > words-50k.txt

And thus with 50k lines of words:

> time sbcl --script src/lisp/main.lisp words-50k.txt phones_2000.txt >/dev/null
sbcl --script src/lisp/main.lisp words-50k.txt phones_2000.txt > /dev/null  26.94s user 0.12s system 99% cpu 27.056 total

> time ./phone_encoder_pr1 words-50k.txt phones_2000.txt >/dev/null
./phone_encoder_pr1 words-50k.txt phones_2000.txt > /dev/null  12.79s user 0.12s system 99% cpu 12.903 tota

> time ./phone_encoder words-50k.txt phones_2000.txt >/dev/null 
./phone_encoder words-50k.txt phones_2000.txt > /dev/null  4.71s user 0.01s system 99% cpu 4.727 total

> time java -cp build/java Main words-50k.txt phones_2000.txt >/dev/null
java -cp build/java Main words-50k.txt phones_2000.txt > /dev/null  5.51s user 0.82s system 144% cpu 4.393 total

> time java -cp build/java Main2 words-50k.txt phones_2000.txt >/dev/null
java -cp build/java Main2 words-50k.txt phones_2000.txt > /dev/null  34.23s user 1.03s system 107% cpu 32.880 total

And these still seem reasonable. I'm guessing some really bad hashing is happening in all languages or something as all of their running times start going exponentially up with a bigger words.txt file (adding another 10k lines to 60k increases the running time by ~10 times, and adding another 10k I wasn't able to complete anything in less than 10 minutes). Nice to see java's Main doing well here. ^.^

bugaevc · 2021-08-02T17:57:32Z

And I can't get anything to run in a reasonable time (waited at least 10 minutes) on any of the programs...

I wonder if it would run in a reasonable time for you with the black box.

OvermindDL1 · 2021-08-02T17:58:49Z

I wonder if it would run in a reasonable time for you with the black box.

The phone_encoder version (no _pr1 on the name) is the black_box one you posted above, copied verbatim (and with my rust set to nightly of course).

EDIT: The PR1 one is this PR's code.

OvermindDL1 · 2021-08-02T18:05:41Z

Also, just for completion, my rust version is: rustc 1.56.0-nightly (2827db2b1 2021-08-01) with the nightly-x86_64-unknown-linux-gnu toolchain.

OvermindDL1 · 2021-08-02T20:21:41Z

So I had some time so I took a look at 'what' is so slow. Well print_translations (recursively as it is) is the slowest bit by far, but it's slowness varies. Apparently on the set that this java thing generated for the phones_2000 random phone numbers (generated with java -cp "build/util" util.GeneratePhoneNumbers 2000 false 12 > phones_2000.txt as per the benchmark script) managed to be... interesting. A single one of the calls took just over 3 seconds (out of the little over 5 seconds total running time in the current random set generated), the rest of the calls varied a whole lot, most were pretty instant, but a few were exceptionally slow, though not 1s kind of slow other than one of them. The slowest one was:

[src/main.rs:43] ("NewSlow", took, num) = (
    "NewSlow",
    3.122004442s,
    "364943-7643536938/--33734-89290451119/2875-949180",
)

So yeah, that's incredibly slow for just a single thing lookup, but considering it's exponential with lots of loops for every single character it seems expected. The order of the slowest ones relate directly to the length of the phones_2000.txt line, so that's expected as well.

I notice the new benchmark script generates the random 2000 numbers via java -cp "build/util" util.GeneratePhoneNumbers 2000 false 12 > phones_2000.txt but in the GeneratePhoneNumbers.java file doesn't use the third argument, was that supposed to cap the line length or so?

imron · 2021-08-03T01:24:11Z

So I had some time so I took a look at 'what' is so slow. Well print_translations (recursively as it is) is the slowest bit by far

A lot of the slowness comes from using fmt. Writing directly to the stream gave me a 3-4x performance improvement. See this PR for details.

After that, print_translations is still taking up the bulk of the flamegraph and not including recursion, about half the time is taken up by flushing the buffer, a quarter by make_hash, and a quarter by memmove.

It'd be interesting to see the difference between an iterative vs a recursive approach.

bugaevc · 2021-08-03T02:11:22Z

A lot of the slowness comes from using fmt. Writing directly to the stream gave me a 3-4x performance improvement. See this PR for details.

Great find!

about half the time is taken up by flushing the buffer, a quarter by make_hash, and a quarter by memmove.

Does the black box cut down the time significantly on your machine?

imron · 2021-08-03T02:34:22Z

DIdn't test with the black_box yet.

OvermindDL1 · 2021-08-03T14:05:11Z

A lot of the slowness comes from using fmt. Writing directly to the stream gave me a 3-4x performance improvement. See this PR for details.

Oh, for clarity I was testing on the black_box'd version, so there was no string formatting cost or flushing cost. The formatting and flushing adds a linear cost, but that's not the exponential cost by far.

Does the black box cut down the time significantly on your machine?

On mine it did a little bit, that's basically the difference between the pr1 version of mine and the non-pr1 version of mine, so it was noticeable, but that's not the heaviest cost by far, though still work working on.

bugaevc · 2021-08-03T14:09:55Z

On mine it did a little bit, that's basically the difference between the pr1 version of mine and the non-pr1 version of mine, so it was noticeable, but that's not the heaviest cost by far

As I understand it, you weren't running the full words.txt phones_2000.txt benchmark?

OvermindDL1 · 2021-08-03T14:15:25Z

Actually I ended up making a PFH version that used the full 100k words.txt version, but I was benchmarking and profiling with the 50k version since it would actually run to completion in less than 10 minutes.

I'm still unsure why it's taking so long on this system compared to your timings above, this is not a low end server by any stretch, an AMD Ryzen 7 1700 8/16 with 32gigs of DDR4 3000 ram. The file is running off the HDD and not the SSD but I timed the I/O and it was barely registerable compared to the rest so it didn't seem related.

OvermindDL1 · 2021-08-03T14:17:16Z

For note, the only part about it that takes so long is the print_translations (whether black_box'd or not), the phones_2000.txt list is not limited to 12 characters as I'm guessing was implied by the third argument to the java program (that the source code doesn't touch) but can be exceptionally long lines, and each extra 'useful' character in line makes it take about twice as long on the same size word list (and increasing the size of that word list, so increasing the number of words in a match, makes it take even much longer).

renatoathaydes · 2021-08-08T16:32:27Z

I finally got the time to write an extensive blog post about this: https://renato.athaydes.com/posts/how-to-write-slow-rust-code-part-2.html

Let me know what you think!

Summary: yes, you were correct: Rust runs twice as fast as Java when printouts do not become a bottleneck. Java only wins when printouts are the bottleneck, which is something less interesting than actually finding solutions.

bugaevc · 2021-08-08T16:53:35Z

Nice, thanks for the update!

Let me know what you think!

I, for one, think that it's cool that you've included the benchmarks with a trick to reduce output, like I suggested, although you have chosen a different way to do it than what I proposed.

Then, you've said:

Simply not printing out any solutions at all was just not acceptable to me because modern compilers are incredibly good at not doing any work if no one can observe the results anyway.

<...> while not simplifying the problem to something that would allow optimisers to avoid having to first calculate them all.

You're presenting this as a problem that no solution has been suggested for. I think it would be appropriate to mention the black box here, even if you're not convinced it works as I claim it does. For instance, something like the following would make sense to me: "@bugaevc has suggested that Rust's black_box() function could be used to ensure just that, but I was unconvinced that it actually prevents the compiler from optimizing the calculations away; in addition we have not found a reliable way to do the same trick in Java." So, making it clear that a solution has been proposed, but you're dissatisfied with it.

Finally, I hope the "rant" at the end was not directed at me. I never said or implied such things about you 🙂

renatoathaydes · 2021-08-08T17:06:37Z

@bugaevc sorry I didn't mention your suggestion :) but I did mention there were many "other ways" to do it and I just did something that I was confident myself that would work well.

Finally, I hope the "rant" at the end was not directed at me. I never said or implied such things about you 🙂

No, not you. My post generated a lot of heated conversations, much worse than this one :D on HackerNews specially, of all places, people just didn't bother reading most of what I had to say and just trashed me and my code.

bugaevc added 7 commits July 28, 2021 15:09

Use a WordOrDigit enum

f4a7fc0

This makes it a little bit nicer to check whether the last thing we added was a digit or not, and also lets us skip allocating an extra string when adding a digit.

Drop return value of print_translations()

62dc9b0

It never actually returned any errors; so why return anything at all.

Don't swallow I/O errors

8f49294

Don't just check if something worked fine, propagate the error otherwise.

Store digits as Vec<u8>, not Vec<char>

83e2b58

This gives us a slight, but noticeable speed-up, since storing a char requires four times as much space.

Reuse the words vector

50ead3d

Instead of cloning the vector to create a new partial_solutions vector, reuse the very same vector, and simply pop the item after the recursive call. This gives us another noticeable speed-up.

Small tweaks

bf9cabf

bugaevc force-pushed the speed-up-rust branch from ccd8b1b to 507dc15 Compare July 28, 2021 12:30

bugaevc added 3 commits July 28, 2021 15:32

Remove now-unused dependencies

d4158d4

Unignore phone_encoder and benchmark_runner

454db15

This was clearly meant to ignore the *files* at the root of the tree, not the whole source directories. In particular, ./src/rust/phone_encoder/ is no longer ignored, while ./phone_encoder (in the project root directory) still is.

bugaevc force-pushed the speed-up-rust branch from 507dc15 to 454db15 Compare July 28, 2021 12:35

Speed up Rust #1

Are you sure you want to change the base?

Speed up Rust #1

Conversation

bugaevc commented Jul 28, 2021

bugaevc commented Jul 28, 2021

sstadick commented Jul 28, 2021 • edited Loading

bugaevc commented Jul 28, 2021

sstadick commented Jul 28, 2021

OvermindDL1 commented Jul 29, 2021 • edited Loading

bugaevc commented Jul 29, 2021

renatoathaydes commented Jul 30, 2021

sstadick commented Jul 30, 2021

bugaevc commented Jul 30, 2021

vks commented Jul 30, 2021 • edited Loading

renatoathaydes commented Jul 30, 2021

OvermindDL1 commented Jul 30, 2021 • edited Loading

bugaevc commented Jul 30, 2021

OvermindDL1 commented Jul 30, 2021

renatoathaydes commented Jul 30, 2021

philipc commented Jul 30, 2021

renatoathaydes commented Jul 30, 2021

renatoathaydes commented Jul 31, 2021

renatoathaydes commented Aug 1, 2021

renatoathaydes commented Aug 1, 2021

renatoathaydes commented Aug 1, 2021

bugaevc commented Aug 1, 2021 • edited Loading

renatoathaydes commented Aug 1, 2021

bugaevc commented Aug 1, 2021

renatoathaydes commented Aug 1, 2021

vks commented Aug 1, 2021

bugaevc commented Aug 2, 2021 • edited Loading

OvermindDL1 commented Aug 2, 2021

bugaevc commented Aug 2, 2021

OvermindDL1 commented Aug 2, 2021

bugaevc commented Aug 2, 2021

OvermindDL1 commented Aug 2, 2021 • edited Loading

bugaevc commented Aug 2, 2021

OvermindDL1 commented Aug 2, 2021

bugaevc commented Aug 2, 2021

OvermindDL1 commented Aug 2, 2021 • edited Loading

OvermindDL1 commented Aug 2, 2021 • edited Loading

OvermindDL1 commented Aug 2, 2021

imron commented Aug 3, 2021

bugaevc commented Aug 3, 2021

imron commented Aug 3, 2021

OvermindDL1 commented Aug 3, 2021

bugaevc commented Aug 3, 2021

OvermindDL1 commented Aug 3, 2021

OvermindDL1 commented Aug 3, 2021 • edited Loading

renatoathaydes commented Aug 8, 2021 • edited Loading

bugaevc commented Aug 8, 2021

renatoathaydes commented Aug 8, 2021

sstadick commented Jul 28, 2021 •

edited

Loading

OvermindDL1 commented Jul 29, 2021 •

edited

Loading

vks commented Jul 30, 2021 •

edited

Loading

OvermindDL1 commented Jul 30, 2021 •

edited

Loading

bugaevc commented Aug 1, 2021 •

edited

Loading

bugaevc commented Aug 2, 2021 •

edited

Loading

OvermindDL1 commented Aug 2, 2021 •

edited

Loading

OvermindDL1 commented Aug 2, 2021 •

edited

Loading

OvermindDL1 commented Aug 2, 2021 •

edited

Loading

OvermindDL1 commented Aug 3, 2021 •

edited

Loading

renatoathaydes commented Aug 8, 2021 •

edited

Loading