NgramRef is used instead of Ngram during the detection phase #148
The model stores `Ngram`s in a `HashMap`, where `Ngram` is just a wrapper around `String`. `TestDataLanguageModel::from` and the `Iterator` of `NgramRange` allocate a lot of small temporary `String`s on the heap during detection. I introduced `NgramRef`, which is just like `Ngram` but holds a `&str` instead of a `String`. Now `NgramRange` iterates over slices of the input string, which is more efficient. I used the `Borrow` trait trick to borrow `Ngram` as `&str`. Initially I tried borrowing `Ngram` as `NgramRef`, only to discover that this is not possible. However, the API of `LanguageModel` is still type safe: `fn get_relative_frequency<'a>(&self, ngram: &NgramRef<'a>)`.
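For readers unfamiliar with the `Borrow` trick, here is a minimal sketch of the idea, not the code from this patch. `Ngram`, `NgramRef` and `get_relative_frequency` follow the names used above; the `Model` struct, its `frequencies` field, the `f64` return type and the `ngram_refs` helper are hypothetical stand-ins for illustration.

```rust
use std::borrow::Borrow;
use std::collections::HashMap;

// Owned n-gram: the key type stored in the model.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Ngram(String);

// Borrowed n-gram: a slice of the input text, so no String is allocated.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct NgramRef<'a>(&'a str);

// Allows a HashMap<Ngram, _> to be queried with a &str.
// The derived Hash/Eq of Ngram delegate to the inner String,
// which hashes and compares exactly like str, as Borrow requires.
impl Borrow<str> for Ngram {
    fn borrow(&self) -> &str {
        &self.0
    }
}

// Hypothetical stand-in for the real language model type.
struct Model {
    frequencies: HashMap<Ngram, f64>,
}

impl Model {
    // The public API still takes the typed wrapper, not a bare &str.
    fn get_relative_frequency<'a>(&self, ngram: &NgramRef<'a>) -> f64 {
        // HashMap::get accepts &str here because Ngram: Borrow<str>.
        self.frequencies.get(ngram.0).copied().unwrap_or(0.0)
    }
}

// Hypothetical helper: all n-grams of `n` chars as slices of `text`.
// It allocates one Vec of byte offsets, but no per-n-gram Strings.
fn ngram_refs(text: &str, n: usize) -> Vec<NgramRef<'_>> {
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    bounds
        .windows(n + 1)
        .map(|w| NgramRef(&text[w[0]..w[n]]))
        .collect()
}
```

With a model containing only `Ngram("ab")`, calling `ngram_refs("abc", 2)` yields the slices `"ab"` and `"bc"`, and each lookup hashes the slice directly instead of building a temporary `String` first.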
Benchmark against current main branch
For the benchmark I used the accuracy reports test data. The benchmark code is here. I tested both single-threaded and multi-threaded/parallel mode.
Results
I tested the patch on two machines
The numbers in the columns Before and After are throughput as detections per second.
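As a rough illustration of the metric, throughput here means the number of detections divided by the elapsed wall-clock time. The sketch below is not the linked benchmark code; `throughput`, `measure` and the `detect` closure are hypothetical names.

```rust
use std::time::{Duration, Instant};

// Detections per second for a given count and elapsed time.
fn throughput(detections: usize, elapsed: Duration) -> f64 {
    detections as f64 / elapsed.as_secs_f64()
}

// Runs `detect` once per example and reports detections per second.
// `detect` stands in for whatever detection call is being benchmarked.
fn measure<F: FnMut(&str)>(examples: &[&str], mut detect: F) -> f64 {
    let start = Instant::now();
    for &example in examples {
        detect(example);
    }
    throughput(examples.len(), start.elapsed())
}
```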
Single threaded benchmark
cargo run --release --bin bench -- --max-examples 30000
Multi threaded benchmark
cargo run --release --bin bench -- --max-examples 50000 --parallel
The numbers for the multi-threaded benchmark are much higher if the change is applied on top of #82.
Multi threaded benchmark
cargo run --release --bin bench -- --max-examples 50000 --parallel