Load models slightly more eagerly and reuse for all ngrams during detection. #82
The library puts emphasis on lazy loading in order to be efficient in certain serverless environments. My use case requires maximum performance, and I can sacrifice a slower startup for better runtime performance. The `LanguageDetectorBuilder` has a flag `with_preloaded_language_models` that forces eager loading of models. However, during detection the `LanguageDetector` loops through every ngram and calls the function `load_language_models`, which has to take a read lock to check whether the model is loaded. Read locks are cheap, but not free. As an experiment I completely eliminated lazy loading and stored the models in the `LanguageDetector`. The single-threaded benchmark improved by 1.2x-2x and the multi-threaded one by 2x-4x, depending on the system.
Patch
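For context, the per-ngram locking cost described above can be sketched as follows. This is a minimal, hypothetical model, not the library's actual code: `LazyModel`, `probability`, and the `load` closure are illustrative stand-ins.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-in for a loaded ngram model (ngram -> probability).
type Model = HashMap<String, f64>;

// Sketch of the lazy-loading design: the model slot is guarded by a
// RwLock, and every per-ngram lookup first takes a read lock to check
// whether the model has been loaded yet.
struct LazyModel {
    slot: RwLock<Option<Model>>,
}

impl LazyModel {
    fn new() -> Self {
        LazyModel { slot: RwLock::new(None) }
    }

    // Called once per ngram in the detection hot loop; the read lock
    // is cheap but not free, and it is taken on every single call.
    fn probability(&self, ngram: &str, load: impl FnOnce() -> Model) -> f64 {
        {
            let guard = self.slot.read().unwrap();
            if let Some(model) = guard.as_ref() {
                return model.get(ngram).copied().unwrap_or(0.0);
            }
        } // read guard dropped before upgrading to a write lock
        let mut guard = self.slot.write().unwrap();
        let model = guard.get_or_insert_with(load);
        model.get(ngram).copied().unwrap_or(0.0)
    }
}
```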
This patch keeps the lazy loading design. Instead of lazy-loading the model for every ngram, I load the models slightly more eagerly in `compute_sum_of_ngram_probabilities` and in `count_unigrams` for the specified language, and reuse them for all ngrams in the loop. For this to work I had to use `Arc` instead of `Box` in `BoxedLanguageModel`.
Benchmark
For the benchmark I used the accuracy reports test data. The benchmark code is here. I tested both single-threaded and multi-threaded (parallel) modes.
Results
I tested the patch on two machines.
The numbers in the Before and After columns are throughput in detections per second.
Single-threaded benchmark

```
cargo run --release --bin bench -- --max-examples 30000
```

Multi-threaded benchmark

```
cargo run --release --bin bench -- --max-examples 50000 --parallel
```
I am really surprised by the numbers on the M1 chip. Before the patch, the single-threaded benchmark ran faster than the multi-threaded one, which suggests that RwLocks on the M1 Mac are slow.
I ran the accuracy reports and checked them against the current main branch:

```
diff -r accuracy-reports/lingua/ ../lingua-rs-main/accuracy-reports/lingua/
```

The command found no differences.