ICU Character Set Detection for Node.js
Character set detection is the process of determining the character set, or encoding, of character data in an unknown format.
A simple binding of ICU character set detection (http://userguide.icu-project.org/conversion/detection) for Node.js.
Installation
First, install libicu on your system (see the Installing ICU section below for details).
Then install node-icu-charset-detector from npm.
npm install node-icu-charset-detector
Installing ICU
Linux
- Debian (Ubuntu)
apt-get install libicu-dev
- Gentoo
emerge icu
- Fedora/CentOS
yum install libicu-devel
OSX
- MacPorts
port install icu +devel
- Homebrew
brew install icu4c
brew link icu4c --force
If Homebrew has trouble installing version 50.1 of icu4c, try the following:
brew search icu4c
brew tap homebrew/versions
brew versions icu4c
cd $(brew --prefix) && git pull --rebase
git checkout c25fd2f $(brew --prefix)/Library/Formula/icu4c.rb
brew install icu4c
- From source
curl -O http://download.icu-project.org/files/icu4c/52.1/icu4c-52_1-src.tgz
tar xzvf icu4c-52_1-src.tgz
cd icu/source
chmod +x runConfigureICU configure install-sh
./runConfigureICU MacOSX
make
sudo make install
xcode-
Usage
Simple usage
node-icu-charset-detector provides a function detectCharset(buffer), where buffer is an instance of Buffer whose charset should be detected.
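var fs = require("fs");
var charsetDetector = require("node-icu-charset-detector");

// Placeholder path: replace it with the file you want to inspect.
var buffer = fs.readFileSync("/path/to/the/file");
var charset = charsetDetector.detectCharset(buffer);

console.log("charset name: " + charset.toString());
console.log("language: " + charset.language);
console.log("detection confidence: " + charset.confidence);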
detectCharset(buffer) returns the detected charset name for buffer. The returned charset name has two extra properties, language and confidence:
- charset.language - language name for the detected character set.
- charset.confidence - confidence of the charset detection for charset.
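The confidence property can be used to decide whether a detection result is worth trusting. The following is a minimal sketch, not part of node-icu-charset-detector: chooseCharset, minConfidence and fallback are hypothetical names, and it assumes that confidence is a number where a higher value means a more reliable guess (ICU reports confidence on a 0 to 100 scale).

```javascript
// Hypothetical helper, not provided by node-icu-charset-detector.
// `detected` is the value returned by detectCharset(buffer): a charset
// name that also carries `language` and `confidence` properties.
function chooseCharset(detected, minConfidence, fallback) {
  // Treat a missing result or a non-numeric confidence as "not trusted".
  if (!detected || typeof detected.confidence !== "number") {
    return fallback;
  }
  // Assumption: higher confidence means a more reliable detection.
  if (detected.confidence < minConfidence) {
    return fallback;
  }
  // Return the plain charset name as a primitive string.
  return String(detected);
}
```

For example, chooseCharset(charsetDetector.detectCharset(buffer), 50, "utf8") would fall back to "utf8" whenever the detector reports a confidence below 50. The threshold of 50 is arbitrary and should be tuned for your data.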
Leveraging node-iconv
This module only detects character sets; it does not convert them. To convert, you may need node-iconv (https://github.com/bnoordhuis/node-iconv), which supports a wide range of character sets.
Here is a simple example that uses node-iconv to convert character sets not supported by Node itself.
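var fs = require("fs");

// The function name bufferToString is reconstructed; the original name was lost.
function bufferToString(buffer) {
  var charsetDetector = require("node-icu-charset-detector");
  var charset = charsetDetector.detectCharset(buffer).toString();

  try {
    // Works only for encodings that Node supports natively.
    return buffer.toString(charset);
  } catch (x) {
    // Otherwise, convert to UTF-8 with node-iconv.
    var Iconv = require("iconv").Iconv;
    var charsetConverter = new Iconv(charset, "utf8");
    return charsetConverter.convert(buffer).toString();
  }
}

// Placeholder path: replace it with the file you want to read.
var buffer = fs.readFileSync("/path/to/the/file");
var bufferString = bufferToString(buffer);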
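As a variation on the try/catch approach, Node's Buffer.isEncoding can be used to check up front whether a charset name is one Node can decode by itself. The sketch below is only an illustration: decodeBuffer and fallbackConvert are hypothetical names, and the converter is passed in as a function so that the example does not depend on any native module.

```javascript
// Hypothetical helper, not part of node-icu-charset-detector or node-iconv.
// `fallbackConvert(buffer, charsetName)` must return a string; it is only
// called when Node cannot decode `charsetName` natively.
function decodeBuffer(buffer, charsetName, fallbackConvert) {
  if (Buffer.isEncoding(charsetName)) {
    // Node supports this encoding natively (for example "UTF-8").
    return buffer.toString(charsetName);
  }
  return fallbackConvert(buffer, charsetName);
}

// A fallback built on node-iconv could look like this (assumes the
// "iconv" package is installed):
//
//   var Iconv = require("iconv").Iconv;
//   function iconvToString(buffer, charsetName) {
//     return new Iconv(charsetName, "UTF-8").convert(buffer).toString("utf8");
//   }
```

With this split, decodeBuffer(buffer, String(charset), iconvToString) decodes natively when possible and only loads the conversion path for charsets such as Shift_JIS that Node does not handle by itself.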