
NodeJS Language Detection Benchmark 🚀

  • No benchmark of this kind is perfect and the percentages can vary over time, but it gives a good idea of overall performance
  • Languages evaluated in this benchmark:
    • Asia: jpn, cmn, kor, hin
    • Europe: fra, spa, por, ita, nld, eng, deu, fin, rus
    • Middle East: tur, heb, ara
  • This page and its graphs are auto-generated from the code

Libraries

Here is the list of libraries in this benchmark:

| Library | Script | Languages | Properly identified | Improperly identified | Not identified | Avg execution time | Disk size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TinyLD Heavy | yarn bench:tinyld-heavy | 64 | 99.249% | 0.7478% | 0.0032% | 0.096ms | 2.0MB |
| TinyLD | yarn bench:tinyld | 64 | 98.5231% | 1.3712% | 0.1057% | 0.1191ms | 580KB |
| TinyLD Light | yarn bench:tinyld-light | 24 | 97.8778% | 1.9842% | 0.138% | 0.0947ms | 68KB |
| langdetect | yarn bench:langdetect | 53 | 95.675% | 4.325% | 0% | 0.3647ms | 1.8MB |
| node-cld | yarn bench:cld | 160 | 92.3654% | 1.6213% | 6.0133% | 0.0711ms | > 10MB |
| franc | yarn bench:franc | 187 | 74.2577% | 25.7423% | 0% | 0.2242ms | 267KB |
| franc-min | yarn bench:franc-min | 82 | 70.3891% | 23.1888% | 6.422% | 0.084ms | 119KB |
| franc-all | yarn bench:franc-all | 403 | 66.7081% | 33.2919% | 0% | 0.4763ms | 509KB |
| languagedetect | yarn bench:languagedetect | 52 | 65.2835% | 11.2808% | 23.4357% | 0.1896ms | 240KB |

Global Accuracy

(graph: global accuracy per library)

We see two groups of libraries:

  • tinyld, langdetect and cld: over 90% accuracy
  • franc and languagedetect: under 75% accuracy

Per Language

(graph: accuracy per language)

We see big differences between languages:

  • Japanese and Korean are detected at almost 100% by every library (they use many unique characters)
  • Spanish and Portuguese are really close to each other, which causes more false positives and a higher error rate
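The contrast above can be illustrated with a minimal sketch: scripts with their own Unicode ranges (Korean hangul, Japanese kana) are trivially separable by character counting, while Spanish and Portuguese share essentially the same Latin alphabet and need deeper statistical analysis. The ranges and labels below are illustrative, not any benchmarked library's actual tables.

```javascript
// Count which Unicode script dominates a string.
// Scripts like hangul or kana immediately identify the language family,
// while "latin" covers dozens of easily-confused European languages.
function dominantScript(text) {
  const counts = { hangul: 0, kana: 0, han: 0, latin: 0, other: 0 };
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xac00 && cp <= 0xd7a3) counts.hangul++;        // Hangul syllables
    else if (cp >= 0x3040 && cp <= 0x30ff) counts.kana++;     // Hiragana + Katakana
    else if (cp >= 0x4e00 && cp <= 0x9fff) counts.han++;      // CJK ideographs
    else if (/[a-zA-Z\u00c0-\u024f]/.test(ch)) counts.latin++; // Latin incl. accents
    else counts.other++;
  }
  // Return the script with the highest count
  return Object.entries(counts).sort((a, b) => b[1] - a[1])[0][0];
}
```

For Korean or Japanese input a single character is often enough; for `"olá, tudo bem"` this only tells us "latin", which is why Latin-script languages dominate the error rates.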

Accuracy By Text length

Most libraries use statistical analysis, so the longer the input text, the better the detection. This is why the documentation of those libraries often contains advice like:

"Make sure to pass it big documents to get reliable results."

Let's see if this statement is true, and how those libraries behave for different input sizes (from small to long).

(graph: accuracy by text length)

So the previous quote is right: beyond 512 characters, all the libraries become accurate enough.

But for a ~95% accuracy threshold:

  • tinyld (green) reaches it around 24 characters
  • langdetect (cyan) and cld (orange) reach it around 48 characters
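The statistical approach discussed above can be sketched as a character trigram comparison: each language gets a frequency profile, and the input is ranked against every profile. With only a handful of trigrams in a short input, scores are noisy, which is exactly why accuracy climbs with text length. The tiny training samples and `detect` helper below are illustrative stand-ins; real libraries train on large corpora.

```javascript
// Build a frequency map of character trigrams (padded with spaces
// so word boundaries produce distinctive grams like " th" or "te ").
function trigrams(text) {
  const t = ` ${text.toLowerCase()} `;
  const grams = new Map();
  for (let i = 0; i <= t.length - 3; i++) {
    const g = t.slice(i, i + 3);
    grams.set(g, (grams.get(g) || 0) + 1);
  }
  return grams;
}

// Score: how many of the input's trigram occurrences appear in a profile.
function score(input, profile) {
  let s = 0;
  for (const [g, n] of trigrams(input)) {
    if (profile.has(g)) s += n;
  }
  return s;
}

// Toy profiles built from one sentence each (real profiles use whole corpora).
const profiles = {
  eng: trigrams('the quick brown fox jumps over the lazy dog and then the cat'),
  fra: trigrams('le renard brun rapide saute par dessus le chien paresseux'),
};

// Pick the best-scoring language.
function detect(input) {
  return Object.entries(profiles)
    .map(([lang, p]) => [lang, score(input, p)])
    .sort((a, b) => b[1] - a[1])[0][0];
}
```

A one-word input shares very few trigrams with any profile, so ties and misfires are common; a full sentence matches dozens of grams and the right language pulls clearly ahead.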

Execution Time

(graph: execution time by input size)

Here we can notice a few things about performance:

  • langdetect (cyan) and franc (pink) seem to slow down at a similar rate as input grows
  • tinyld (green) slows down too, but at a much flatter rate
  • cld (orange) is definitely the fastest and doesn't show any apparent slowdown

But we've seen previously that some of those libraries need more than 256 characters to be accurate. This means they start to slow down at the same point where they start to give decent results.
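Per-call timings like the "Avg execution time" column can be gathered with a simple harness: run the detector many times over the same input and average the elapsed wall-clock time. The `detectStub` below is a placeholder for any of the benchmarked libraries; this is a sketch of the measurement idea, not the benchmark's actual code.

```javascript
// Placeholder detector standing in for tinyld/franc/etc.
function detectStub(text) {
  return text.length % 2 === 0 ? 'eng' : 'fra';
}

// Average execution time per call, in milliseconds.
// process.hrtime.bigint() gives a monotonic nanosecond timestamp.
function avgExecutionMs(fn, input, runs = 1000) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < runs; i++) fn(input);
  const elapsedNs = process.hrtime.bigint() - start;
  return Number(elapsedNs) / runs / 1e6; // ns per run -> ms per call
}
```

Averaging over many runs smooths out timer resolution and JIT warm-up, which matters when single calls take well under a millisecond, as in the table above.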


Conclusion

Recommended 👍

- By platform 💻

  • For NodeJS: TinyLD, langdetect or node-cld (fast and accurate)
  • For Browser: TinyLD Light or franc-min (small, decent accuracy; franc is less accurate but supports more languages)

- By usage 💬

  • Short text (chatbot, keywords, database, ...): TinyLD or langdetect
  • Long text (documents, webpage): node-cld or TinyLD

Not recommended 👎

  • franc-all is the worst in terms of accuracy, which is no surprise because it tries to detect 400+ languages with only trigrams. It is a technical demo that puts up big numbers but is useless for real usage; even a language like English barely reaches a ~45% detection rate.
  • languagedetect is light but just not accurate enough

Last word 🙋

Thanks for reading this article. These metrics are really helpful for the development of tinyld: they are used during development to see the impact of every modification and feature.

If you want to contribute, or would like to see another library in this benchmark, open an issue.