TinyLD
Tiny Language Detector, simply detect the language of a unicode UTF-8 text:
- pure javascript, no api call, and no dependency (node and browser compatible)
- alternative to libraries like CLD
- blazing fast and low memory footprint (unlike ML methods)
- support 62 languages (30 for the web version)
- format ISO 639-1
Extra
- Playground - Try the library
- Getting Started
- API
- CLI
- TinyLD Web
- Supported Language
- Algorithm
- Developer
Getting Started
Install
yarn add tinyld # or npm install --save tinyld
API
import { detect, detectAll } from 'tinyld'
// Detect
detect('これは日本語です.') // ja
detect('and this is english.') // en
// DetectAll
detectAll('ceci est un text en francais.')
// [ { lang: 'fr', accuracy: 0.5238 }, { lang: 'ro', accuracy: 0.3802 }, ... ]
TinyLD CLI
tinyld This is the text that I want to check
# [ { lang: 'en', accuracy: 1 } ]
Benchmark
Benchmark done on tatoeba dataset (~9M sentences) on 16 of the most common languages.
Library | Script | Properly Identified | Improperly identified | Not identified | Avg Execution Time | Disk Size |
---|---|---|---|---|---|---|
TinyLD | yarn bench:tinyld |
96.1747% | 2.6938% | 1.1315% | 0.1315ms. | 778KB |
TinyLD Web | yarn bench:tinyld-light |
92.1169% | 3.9536% | 3.9295% | 0.0616ms. | 89KB |
node-cld | yarn bench:cld |
88.9148% | 1.7489% | 9.3363% | 0.0612ms. | > 10MB |
node-lingua | yarn bench:lingua |
82.3157% | 0.2158% | 17.4685% | 0.7085ms. | ~100MB |
franc | yarn bench:franc |
68.7783% | 26.3432% | 4.8785% | 0.1381ms. | 267KB |
franc-min | yarn bench:franc-min |
65.5163% | 23.5794% | 10.9044% | 0.0614ms. | 119KB |
languagedetect | yarn bench:languagedetect |
61.6068% | 12.295% | 26.0982% | 0.1585ms. | 240KB |
Remark
- For each category, top3 results are in Bold
- Language evaluated in this benchmark:
- Asia:
jpn
,cmn
,kor
,hin
- Europe:
fra
,spa
,por
,ita
,nld
,eng
,deu
,fin
,rus
- Middle east: ,
tur
,heb
,ara
- Asia:
- This kind of benchmark is not perfect and % can vary over time, but it gives a good idea of overall performances
Conclusion
Recommended
- For NodeJS:
TinyLD
ornode-cld
(fast and accurate) - For Browser:
TinyLD Light
orfranc-min
(small, decent accuracy, franc is less accurate but support more languages)
Not recommended
-
node-lingua
is just too big and slow -
languagedetect
is light but just not accurate enough, really focused on indo-european languages (support kazakh but not chinese, korean or japanese)