# simfiles

Score files within a folder based on the similarity of their contents. Works as a CLI tool or library! Uses an algorithm similar to `comm` for speed.

Use this tool to determine where, in your large code base, there might be excessive duplication of code.
Scores range from 0 (no lines match between the files) to 1 (all lines, after normalization, match). When comparing files, this tool normalizes each line by trimming whitespace, removing empty lines, and sorting. This means two files could receive a score of 1 (completely equal) while still differing in whitespace or line order. Generally this isn't a problem.
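As an illustrative sketch (the function names and internals here are assumed, not the library's actual code), normalization and scoring might look like the following. The scoring formula — the mean of each file's shared-line fraction — is inferred from the example output later in this README (6 common lines across two 7-line files gives (6/7 + 6/7) / 2 = 0.857…):

```javascript
// Illustrative sketch only -- not the library's actual internals.
function normalize(text) {
  return text
    .split('\n')
    .map((line) => line.trim())        // trim whitespace
    .filter((line) => line.length > 0) // remove empty lines
    .sort();                           // sort, so line order is ignored
}

// Multiset intersection size: how many lines the two files share,
// counting duplicates pair-wise.
function countCommon(a, b) {
  const counts = new Map();
  for (const line of a) counts.set(line, (counts.get(line) || 0) + 1);
  let common = 0;
  for (const line of b) {
    const n = counts.get(line) || 0;
    if (n > 0) {
      common++;
      counts.set(line, n - 1);
    }
  }
  return common;
}

// Score inferred from the example output below: the mean of the
// shared fraction of each file.
function score(text0, text1) {
  const a = normalize(text0);
  const b = normalize(text1);
  const common = countCommon(a, b);
  return (common / a.length + common / b.length) / 2;
}
```

Two identical files (after normalization) score 1; files with no lines in common score 0.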
The CLI defaults are geared towards JS/TS projects, but this tool is language agnostic: it only cares about lines in files.
## Usage (library)

```sh
yarn add simfiles
```

```js
// Import shape and defaults assumed where the original snippet was garbled.
const { simfiles } = require('simfiles');

const opts = {
  root: process.cwd(),
  ext: ['js', 'ts'],
  ignore: ['**/node_modules/*'],
  output: null,
};

// Inside an async context (or with top-level await):
const result = await simfiles(opts);
```
## Usage (CLI)

```
$ npx simfiles --help

  Determine which files within your project are most similar

  Usage
    $ simfiles [options]

  Global Options
    --root  Use this directory as root
```
Some example output using the react codebase:
```sh
$ git clone git@github.com:facebook/react.git
Cloning into 'react'...
$ cd react/packages/react/src
$ npx simfiles --output similarity.json
$ jq . ./similarity.json | head -20
[
  {
    "filePath0": "forks/ReactCurrentDispatcher.www.js",
    "filePath1": "forks/ReactCurrentOwner.www.js",
    "lineCount0": 7,
    "lineCount1": 7,
    "commonLines": 6,
    "score": 0.8571428571428571
  },
  {
    "filePath0": "__tests__/testDefinitions/PropTypes.d.ts",
    "filePath1": "__tests__/testDefinitions/ReactDOM.d.ts",
    "lineCount0": 15,
    "lineCount1": 17,
    "commonLines": 13,
    "score": 0.8156862745098039
  },
  {
    "filePath0": "forks/ReactSharedInternals.umd.js",
    "filePath1": "ReactSharedInternals.js",
```
## Viewing the Output

The output file is just JSON, so it can be queried using something like `jq` (or you could write a script).
Some nice `jq` recipes:

Print each pair of similar files "nicely":

```sh
jq -r '.[] | "\(.filePath0)\n\(.filePath1):\n \(.score)"' similarity.json
```
Which files are more than 75% similar but not exactly the same?

```sh
jq '.[] | select(.score > 0.75) | select(.score < 1)' similarity.json
```
How many files are more than 75% similar?

```sh
jq '[.[] | select(.score > 0.75)] | length' similarity.json
```
What percentage of files are more than 75% similar?

```sh
total=$(jq '. | length' similarity.json); \
morethan=$(jq '[.[] | select(.score > 0.75)] | length' similarity.json); \
echo "scale=5 ; $morethan / $total" | bc
```
## Speed

This library uses a string comparison algorithm that relies on the input being sorted. For this use case, that is drastically faster than the Levenshtein distance commonly used for computing string differences. This library initially tried Levenshtein distance libraries, but on the code bases tested, those algorithms took anywhere from 30 minutes to several hours. The current algorithm takes less than 5 minutes on the same code bases.
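The `comm`-style comparison can be sketched as a single linear pass over two pre-sorted line arrays — O(n + m) per pair once sorting is done, versus the O(n × m) dynamic programming table Levenshtein distance requires (a sketch, not the library's actual code):

```javascript
// Count shared lines between two SORTED arrays in a single pass,
// the same way comm(1) walks two sorted files. Sketch only.
function commonSorted(a, b) {
  let i = 0;
  let j = 0;
  let common = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      common++;
      i++;
      j++;
    } else if (a[i] < b[j]) {
      i++; // advance whichever side is smaller
    } else {
      j++;
    }
  }
  return common;
}
```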
```sh
# Using react's git repo: https://github.com/facebook/react/
time npx simfiles --output similarity.json

real    0m56.109s
user    0m45.827s
sys     0m1.381s
```
## Contributing

This library uses web-scripts. When committing, please use `yarn commit` to get semantic commit messages for releasing.
Note: if, when writing tests, you get a mysterious error:

```
ENOENT, no such file or directory '.../node_modules/callsites'
```

it's probably due to this mock-fs issue. The workaround is to call `console.log` before mocking the file system.
## License
MIT