ngram-fingerprint
A JavaScript implementation of the n-gram fingerprint algorithm from the OpenRefine project, described here.
Algorithm
The algorithm differs slightly from the one in OpenRefine (formerly Google Refine): extended western characters are replaced in the third step rather than in the last step. This is done mainly so that the sorting works properly.
- change all characters to their lowercase representation
- remove all punctuation, whitespace, and control characters
- normalize extended western characters to their ASCII representation
- obtain all the string n-grams
- sort the n-grams and remove duplicates
- join the sorted n-grams back together
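
The steps above can be sketched roughly as follows. This is an illustration only, not this package's source: the function name `ngramFingerprintSketch` and its `(n, str)` parameter order are hypothetical, the punctuation class `\p{P}` may be narrower than what the real implementation strips, and ASCII folding is approximated with Unicode NFD decomposition, which does not cover characters without a decomposition such as `ß` or `ø`.

```javascript
// Hypothetical helper for illustration; not the package's actual export or signature.
function ngramFingerprintSketch(n, str) {
  // Step 1: change all characters to lowercase.
  let s = str.toLowerCase();

  // Step 2: remove punctuation (\p{P}), separators and whitespace (\p{Z}, \s),
  // and control/format characters (\p{C}).
  s = s.replace(/[\p{P}\p{Z}\p{C}\s]/gu, '');

  // Step 3 (approximation): decompose accented characters and drop the
  // combining marks, so that "è" becomes "e".
  s = s.normalize('NFD').replace(/\p{M}/gu, '');

  // Step 4: collect every substring of length n.
  const grams = [];
  for (let i = 0; i + n <= s.length; i++) {
    grams.push(s.slice(i, i + n));
  }

  // Step 5: remove duplicates, then sort (default code-unit order).
  const unique = Array.from(new Set(grams)).sort();

  // Step 6: join the sorted n-grams back together.
  return unique.join('');
}

console.log(ngramFingerprintSketch(2, 'Paris')); // arispari
```

For `'Paris'` with `n = 2`, the n-grams are `pa`, `ar`, `ri`, `is`; sorted and joined they give `arispari`. Folding the accents before building the n-grams means that accented and unaccented spellings produce identical n-grams and therefore sort identically.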
Usage
var fingerprint = // returns arispari