multibyte
multibyte provides common string functions that respect multibyte Unicode characters.
npm install multibyte
The problem and the solution
JavaScript strings are encoded as UTF-16, yet they behave like an array of 16-bit code units rather than code points. Characters outside the Basic Multilingual Plane (like newer emoji) need two code units, called a surrogate pair, and get split in half in many situations.
If you display Unicode text from a UTF-8 source, you need these multibyte
functions. They take advantage of the fact that Array.from()
iterates a string by code point rather than by UTF-16 code unit.
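To illustrate the idea, here is a minimal sketch of code-point-aware helpers built on Array.from(). The names `toCodePoints`, `codePointLength`, `codePointCharAt`, and `codePointSlice` are hypothetical and exist only for this example; the library's actual implementation may differ.

```javascript
// Hypothetical helpers for illustration, not the library's source.
// Array.from() uses the string iterator, which yields whole code points,
// so a surrogate pair such as 🚀 stays in one array element.
const toCodePoints = (str) => Array.from(str);

const codePointLength = (str) => toCodePoints(str).length;

// Returns '' when the index is out of range, like String.prototype.charAt().
const codePointCharAt = (str, index) => toCodePoints(str)[index] ?? '';

const codePointSlice = (str, start, end) =>
  toCodePoints(str).slice(start, end).join('');

codePointLength('a🚀c'); // 3
codePointCharAt('a🚀c', 1); // "🚀"
codePointSlice('a🚀cdef', 2, 3); // "c"
```

Note that this approach works per code point. Sequences made of several code points, such as emoji joined with a zero-width joiner, still count as more than one element.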
import {
charAt,
codePointAt,
length,
slice,
split,
truncateBytes,
} from 'multibyte';
// JavaScript String.prototype.charAt() is not Unicode aware
'a🚀c'.charAt(1); // "\ud83d" ❌
charAt('a🚀c', 1); // "🚀" ✅
// JavaScript String.prototype.codePointAt() does not skip a leading BOM (U+FEFF)
'\uFEFFa🚀c'.codePointAt(1); // 97 ❌
codePointAt('\uFEFFa🚀c', 1); // 128640 ✅
// JavaScript returns length in UTF-16 code units, not Unicode characters
'a🚀c'.length; // 4 ❌
length('a🚀c'); // 3 ✅
// JavaScript slices along UTF-16 code units, not Unicode characters
'a🚀cdef'.slice(2, 3); // "\ude80" ❌
slice('a🚀cdef', 2, 3); // "c" ✅
// JavaScript splits along UTF-16 code units, not Unicode characters
'a🚀c'.split(''); // ["a", "\ud83d", "\ude80", "c"] ❌
split('a🚀c', ''); // ["a", "🚀", "c"] ✅
// JavaScript string indices count UTF-16 code units, not UTF-8 bytes
'a🚀cdef'.slice(0, 2); // "a\ud83d" ❌
truncateBytes('a🚀cdef', 2); // "a" ✅
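Truncating to a UTF-8 byte budget can be sketched as follows. This is only an assumption about how such a function might work, not the library's actual code; `truncateToUtf8Bytes` is a hypothetical name, and the sketch relies on the standard TextEncoder, which always encodes to UTF-8.

```javascript
// Hypothetical sketch; the library's real truncateBytes may differ.
const encoder = new TextEncoder();

function truncateToUtf8Bytes(str, maxBytes) {
  let result = '';
  let usedBytes = 0;
  // for...of iterates by code point, so a character is never cut in half.
  for (const char of str) {
    const size = encoder.encode(char).length;
    if (usedBytes + size > maxBytes) break;
    result += char;
    usedBytes += size;
  }
  return result;
}

truncateToUtf8Bytes('a🚀cdef', 2); // "a" (🚀 needs 4 bytes in UTF-8)
truncateToUtf8Bytes('a🚀cdef', 5); // "a🚀"
```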
BOM (Byte order mark) - U+FEFF
Under the hood, all these functions strip a leading BOM if present.
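A leading BOM can be removed with a one-line check before any other processing. The sketch below assumes a hypothetical helper named `stripBom`; it is not taken from the library's source.

```javascript
// Hypothetical helper for illustration, not the library's source.
// U+FEFF fits in a single UTF-16 code unit, so charCodeAt(0) is enough.
const stripBom = (str) =>
  str.charCodeAt(0) === 0xfeff ? str.slice(1) : str;

stripBom('\uFEFFa🚀c'); // "a🚀c"
stripBom('a🚀c'); // "a🚀c" (unchanged)

// With the BOM gone, index 1 is the rocket again.
Array.from(stripBom('\uFEFFa🚀c'))[1].codePointAt(0); // 128640
```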