
Performance optimizations (up to 3518% faster language detection) #177

Closed
koute wants to merge 5 commits

Conversation


@koute koute commented May 8, 2023

I'm using Lingua for one of my projects and I've noticed that it was, well, extremely slow. After profiling it looked like my program was essentially spending 99% of its time in Lingua.

So here's a branch that I had lying around which speeds things up a little. With this branch my program's throughput increases from ~410 documents/s to ~14427 documents/s (a ~3518% increase in throughput!).

Changes made:

  • The models are now loaded once before iterating over the ngrams, instead of trying to load them for each and every processed ngram.
  • The ngrams are now constructed from words without any extra allocations.
  • Matching the alphabet doesn't use the regex crate anymore; instead it just uses a HashSet of characters for that. (The script.rs used for this was directly copied from the regex_syntax crate, so it should behave exactly the same, just faster.)
  • In places where performance matters I've switched the hash map hasher to ahash, which is faster than the default one in the standard library.

@@ -474,15 +475,15 @@ impl LanguageDetector {
-        let char_str = character.encode_utf8(&mut buffer);
-
         for (alphabet, language) in self.one_language_alphabets.iter() {
-            if alphabet.matches(char_str) {
+            if alphabet.matches_char(character) {
                 self.increment_counter(&mut word_language_counts, language.clone());
                 is_match = true;
                 break;


Not your code, but doesn't breaking here bias the result toward the list order?

@pemistahl
Owner

Hi @koute, thank you very much for these optimizations. I've been lagging behind on updating the Rust version of Lingua due to a lack of spare time. The Go and Python versions already include further optimizations that are still missing in the Rust version.

After I have performed some long-planned performance optimizations myself, I will carefully evaluate your changes and gladly merge them if they live up to their promise.

@koute
Author

koute commented May 8, 2023

The Go and Python versions already include further optimizations that are still missing in the Rust version.

Somewhat offtopic, but have you considered just using the Rust version as the Python version? With pyo3 it's fairly trivial to expose a Rust library as a Python library; then you wouldn't have to maintain the same thing twice, and well-optimized Rust code is always going to be faster than equivalent Python code.

@koute
Author

koute commented May 8, 2023

I was curious so I compared the performance to the Python version (I quickly made a Python extension with pyo3 to use the Rust Lingua from Python and benchmarked them using exactly the same Python script); here are the results:

Rust (before this PR): 0.50MB/s
Rust (this PR): 0.70MB/s
Python: 0.03MB/s

(In each case all languages were enabled. Python had all of the models preloaded while Rust didn't.)

The performance gain here was not as great because this was a simple single-threaded benchmark. For my use case I'm running 64 parallel threads, which is where this PR shines. The current Rust version of Lingua suffers from a huge amount of lock contention, which only shows up when the work is spread across many threads; that's why this simple benchmark doesn't show the 3518% performance increase I've seen in my program.

@serega

serega commented May 8, 2023

The commits look similar to two other pull requests.

I didn't get as much improvement, but I don't have a machine capable of running 64 parallel threads.

Another optimization that worked for me is using SmolStr for Ngram
serega@9eaa49f

@koute
Author

koute commented May 8, 2023

The commits look similar to two other pull requests.

I didn't get as much improvement, but I don't have a machine capable of running 64 parallel threads.

Yeah, sorry, I only noticed those PRs after I'd made all of my optimizations. :P This PR should essentially be a superset of those two.

Another optimization that worked for me is using SmolStr for Ngram serega@9eaa49f

Hm, that probably shouldn't be necessary now, as I've made it use a &str, which is almost certainly going to be faster (since it's zero-copy).

@serega

serega commented May 8, 2023

No problem. If this PR gets merged it would be very helpful for me. I think you can still try SmolStr for storing Ngram in memory during detector construction, and NgramRef during detection. Because ngrams are small strings, we can store them inline without additional allocation per string. With SmolStr the detection uses less memory and works faster because it avoids additional indirection.

@pemistahl
Owner

Somewhat offtopic, but have you considered just using the Rust version as the Python version? With pyo3 it's fairly trivial to expose a Rust library as a Python library; then you wouldn't have to maintain the same thing twice, and a well optimized Rust code is always going to be faster than equivalent Python code.

Yes, I have considered it, but there are still issues with converting Rust enums to Python enums that I want to see fixed before I make another attempt. See #154, for instance.

@koute
Author

koute commented May 8, 2023

No problem. If this PR gets merged it would be very helpful for me. I think you can still try SmolStr for storing Ngram in memory during detector construction, and NgramRef during detection. Because ngrams are small strings, we can store them inline without additional allocation per string. With SmolStr the detection uses less memory and works faster because it avoids additional indirection.

Yeah, during construction it could potentially speed things up. For now I only touched detection. There are still other performance optimizations that one could do, but I didn't want to rewrite the whole crate. (: I might consider doing some more work if/when this PR gets merged. (e.g. model loading is something that's also ripe for improvement and I could easily make it significantly more efficient)

Somewhat offtopic, but have you considered just using the Rust version as the Python version? With pyo3 it's fairly trivial to expose a Rust library as a Python library; then you wouldn't have to maintain the same thing twice, and a well optimized Rust code is always going to be faster than equivalent Python code.

Yes, I have considered it, but there are still issues with converting Rust enums to Python enums that I want to see fixed before I make another attempt. See #154, for instance.

Hmm... this might be a silly question, so please forgive me if I'm missing something, but in that case have you considered a thin Python shim around it? (: If you can't export a nice API from Rust directly, you could write a thin Python wrapper that uses the clumsy pyo3 API and exposes a nice Python API. Once the issues with pyo3 are fixed, you could get rid of it. Wouldn't that be significantly less effort than maintaining a whole separate reimplementation? (: (Especially since, even without my optimizations, the Rust implementation is 10x faster than the Python one.)

@pemistahl
Owner

Hmm... this might be a silly question, so please forgive me if I'm missing something, but in that case have you considered a thin Python shim around it? (:

A silly question? Not at all. :) No, I haven't thought about it yet because, on the one hand, I wanted to practice my Python skills and, on the other hand, some people were happy to find a pure Python implementation, which makes compilation and deployment easier for them. There are still some Python environments out there (Android and iOS come to mind) that accept only pure Python packages.

@pemistahl
Owner

@koute I have applied most of your improvements now except for removing the regex crate in favor of a hash set of characters. I could not identify a significant performance improvement for this change. Additionally, I don't want to copy and maintain the script.rs file manually. So I leave that one out. But apart from this, your changes are great.

So thank you very much! (-:

@pemistahl pemistahl closed this May 26, 2023
@koute
Author

koute commented May 27, 2023

I have applied most of your improvements now except for removing the regex crate in favor of a hash set of characters. I could not identify a significant performance improvement for this change.

That's actually the most important optimization of all of them! (: You didn't detect any changes because you were most likely testing only with one thread. It's an easy trap to fall into.

Here's a benchmark that will show you the difference:

  1. Set up a reproduction and download some test data (this will use my PR without the commit that replaced regex uses):
$ cargo new --bin repro
$ cd repro
$ cargo add lingua --git https://github.com/koute/lingua-rs.git --rev 92f8d5d1e078cfafcbc60d52b6f6821a65ded422
$ cargo add rayon
$ curl "https://www.gutenberg.org/cache/epub/345/pg345.txt" -o src/dracula.txt
  2. Paste this into src/main.rs:
use rayon::prelude::*;

fn main() {
    let data = include_str!("dracula.txt");
    let det = lingua::LanguageDetectorBuilder::from_all_languages().build();
    let xs: Vec<_> = std::iter::repeat(data).take(100000).collect();
    xs.into_par_iter().for_each(|s| {
        let len = s.chars().map(|ch| ch.len_utf8()).take(2048).sum::<usize>();
        std::hint::black_box(det.detect_language_of(&s[..len]));
    });
}
  3. Run it:
$ cargo build --release && time cargo run --release

real	2m0.792s
user	23m49.315s
sys	96m4.691s
  4. Now replace the Lingua dependency with one where regex is not used:
$ cargo remove lingua
$ cargo add lingua --git https://github.com/koute/lingua-rs.git --rev 304cb5a113ddbb968fab2ef45e41b686e42e5aa8
  5. Rerun it:
$ cargo build --release && time cargo run --release

real	0m5.737s
user	5m23.120s
sys	0m0.247s

As you can see, that commit saves over an hour and a half of CPU time on this benchmark.

@pemistahl
Owner

@koute I've just done my own benchmark on the alphabet matching and I'm surprised that it's indeed significantly faster in multithreaded code. So I've decided to include your improvements. The regex crate performance docs mention the following:

Synchronization implies some amount of overhead. When a Regex is used from a single thread, this overhead is negligible. When a Regex is used from multiple threads simultaneously, it is possible for the overhead of synchronization from contention to impact performance.

I did not imagine that this would have such a significant impact on performance. Learning never stops. So again, thank you very much.

@pemistahl pemistahl added this to the Lingua 1.5.0 milestone Jun 10, 2023
4 participants