- This kind of benchmark is not perfect and percentages can vary over time, but it gives a good idea of overall performance
- Languages evaluated in this benchmark:
  - Asia: `jpn`, `cmn`, `kor`, `hin`
  - Europe: `fra`, `spa`, `por`, `ita`, `nld`, `eng`, `deu`, `fin`, `rus`
  - Middle East: `tur`, `heb`, `ara`
- This page and its graphs are auto-generated from the code

Here is the list of libraries in this benchmark:
Library | Script | Languages | Properly identified | Improperly identified | Not identified | Avg execution time | Disk size |
---|---|---|---|---|---|---|---|
TinyLD Heavy | `yarn bench:tinyld-heavy` | 64 | 99.249% | 0.7478% | 0.0032% | 0.096ms | 2.0MB |
TinyLD | `yarn bench:tinyld` | 64 | 98.5231% | 1.3712% | 0.1057% | 0.1191ms | 580KB |
TinyLD Light | `yarn bench:tinyld-light` | 24 | 97.8778% | 1.9842% | 0.138% | 0.0947ms | 68KB |
langdetect | `yarn bench:langdetect` | 53 | 95.675% | 4.325% | 0% | 0.3647ms | 1.8MB |
node-cld | `yarn bench:cld` | 160 | 92.3654% | 1.6213% | 6.0133% | 0.0711ms | > 10MB |
franc | `yarn bench:franc` | 187 | 74.2577% | 25.7423% | 0% | 0.2242ms | 267KB |
franc-min | `yarn bench:franc-min` | 82 | 70.3891% | 23.1888% | 6.422% | 0.084ms | 119KB |
franc-all | `yarn bench:franc-all` | 403 | 66.7081% | 33.2919% | 0% | 0.4763ms | 509KB |
languagedetect | `yarn bench:languagedetect` | 52 | 65.2835% | 11.2808% | 23.4357% | 0.1896ms | 240KB |
We see two groups of libraries:

- `tinyld`, `langdetect` and `cld` are over 90% accuracy
- `franc` and `languagedetect` are under 75% accuracy
We see big differences between languages:
- Japanese and Korean are almost at 100% for every library (they use a lot of unique characters)
- Spanish and Portuguese are really close to each other, which causes more false positives and a higher error rate
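The near-perfect scores for Japanese and Korean come from their unique scripts: a detector can often decide from Unicode character ranges alone, before any statistical analysis. Here is a hedged sketch of that idea (not code from any of the benchmarked libraries; the ranges shown are a small illustrative subset):

```javascript
// Sketch: languages written in a unique script can be spotted from
// Unicode ranges alone. Latin-script languages (Spanish vs Portuguese)
// cannot, which is why they need statistical analysis instead.
function scriptHint(text) {
  if (/[\u3040-\u30ff]/.test(text)) return 'jpn' // Hiragana / Katakana
  if (/[\uac00-\ud7af]/.test(text)) return 'kor' // Hangul syllables
  if (/[\u0590-\u05ff]/.test(text)) return 'heb' // Hebrew
  if (/[\u0600-\u06ff]/.test(text)) return 'ara' // Arabic
  return null // no unique script detected: fall back to n-gram statistics
}

console.log(scriptHint('こんにちは')) // jpn
console.log(scriptHint('안녕하세요')) // kor
console.log(scriptHint('olá mundo')) // null
```

This is also why libraries differ so much on Latin-script pairs while agreeing on CJK input.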
Most libraries use statistical analysis, so the longer the input text, the better the detection. That is why we often see quotes like this in those libraries' documentation:

> Make sure to pass it big documents to get reliable results.
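To see why input length matters, here is a minimal sketch of the trigram-scoring approach most of these libraries share (the profiles and scoring below are made up for illustration, not taken from any real library):

```javascript
// Toy trigram profiles: real libraries ship thousands of weighted n-grams
// per language; these few are illustrative only.
const profiles = {
  eng: new Set([' th', 'the', 'he ', 'ing', 'ng ', 'and']),
  fra: new Set([' le', 'le ', 'es ', ' de', 'de ', 'ent']),
}

function detect(text) {
  const t = text.toLowerCase()
  let best = null
  let bestScore = 0
  for (const [lang, grams] of Object.entries(profiles)) {
    let score = 0
    for (let i = 0; i + 3 <= t.length; i++) {
      if (grams.has(t.slice(i, i + 3))) score++
    }
    if (score > bestScore) {
      bestScore = score
      best = lang
    }
  }
  return best
}

// Longer input produces more trigram hits, so the decision gets more reliable.
console.log(detect('the quick brown fox jumps over the lazy dog')) // eng
console.log(detect('le petit chat est dans le jardin')) // fra
```

With only a handful of characters, few trigrams match and the scores are too close to call, which is exactly the weakness the quote above is warning about.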
Let's see if this statement is true, and how those libraries behave for different input sizes (from short to long)
So the previous quote is right: over 512 characters, all the libraries become accurate enough.

But for a ~95% accuracy threshold:

- `tinyld` (green) reaches it around 24 characters
- `langdetect` (cyan) and `cld` (orange) reach it around 48 characters
Here we can notice a few things about performance:

- `langdetect` (cyan) and `franc` (pink) seem to slow down at a similar rate
- `tinyld` (green) slows down too, but at a really flat rate
- `cld` (orange) is definitely the fastest and doesn't show any apparent slowdown
But we've seen previously that some of those libraries need more than 256 characters to be accurate. It means they start to slow down at the same time they start to give decent results.
- For NodeJS: `TinyLD`, `langdetect` or `node-cld` (fast and accurate)
- For browsers: `TinyLD Light` or `franc-min` (small, decent accuracy; franc is less accurate but supports more languages)
- Short text (chatbot, keywords, database, ...): `TinyLD` or `langdetect`
- Long text (documents, webpages): `node-cld` or `TinyLD`
`franc-all` is the worst in terms of accuracy, which is not a surprise because it tries to detect 400+ languages with only 3-grams. It's a technical demo that puts up big numbers but is useless for real usage: even a language like English barely reaches a ~45% detection rate.

`languagedetect` is light but just not accurate enough.
Thanks for reading this article. These metrics are really helpful for the development of `tinyld`: they are used during development to see the impact of every modification and feature.

If you want to contribute, or want to see another library added to this benchmark, open an issue.