
Node.js-based tool for finding file duplicates

Installation

Install dependencies: yarn install

Build (continuously): yarn run start

You may run the tests: npx jest

Now you can run: node dist/js/index.js --dir=source_directory --result=resultfile

Note: pass the --verbose flag if you want to see more information.

Usage

Before using the tool you must build it, as the compiled source is not provided. Some example commands:

node dist/js/index.js

node dist/js/index.js --dir=./node_modules --result=dedup-result1.txt --types=json,md --verbose

Note: if a previous operation was canceled, the user will be prompted to either resume it or restart from scratch.
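
For illustration only, such a resume prompt could look like the sketch below. The state file name .dedup-state.json and its shape are assumptions made for the example, not the tool's actual format.

import { existsSync, readFileSync } from "node:fs";
import { createInterface } from "node:readline/promises";
import { stdin, stdout } from "node:process";

// Hypothetical state file; the real tool may use a different name and format.
const STATE_FILE = ".dedup-state.json";

// Returns the file index to resume from, or 0 to start from scratch.
async function askToResume(): Promise<number> {
  if (!existsSync(STATE_FILE)) return 0; // nothing to resume
  const state = JSON.parse(readFileSync(STATE_FILE, "utf8"));
  const rl = createInterface({ input: stdin, output: stdout });
  const answer = await rl.question(
    `Previous run stopped at file #${state.index}. Resume? (y/n) `
  );
  rl.close();
  return answer.trim().toLowerCase() === "y" ? state.index : 0;
}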

Options

  • help - print this list of options
  • verbose - display intermediate information while processing files
  • dir - source directory which will be scanned recursively. If omitted, the current directory will be used
  • result - filename of the result file. If not provided, {deduplicate-results-todaysDate.txt} will be used
  • types - comma-separated list of file extensions to include
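
As a sketch of how the flags above might be handled, recent Node versions ship util.parseArgs. This is illustrative only, not the tool's actual implementation; the defaults simply mirror the behaviour described in the list.

import { parseArgs } from "node:util";

// Parse the CLI flags described above (sketch; the real parsing may differ).
const { values } = parseArgs({
  options: {
    help:    { type: "boolean" },
    verbose: { type: "boolean" },
    dir:     { type: "string" },
    result:  { type: "string" },
    types:   { type: "string" },
  },
});

const dir = values.dir ?? process.cwd(); // default: current directory
const result =
  values.result ?? `deduplicate-results-${new Date().toISOString().slice(0, 10)}.txt`;
const types = values.types ? values.types.split(",") : []; // e.g. "json,md" -> ["json", "md"]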

How it works

This tool works asynchronously: it first builds a file list and saves it. Building the file list is crucial for resuming an operation after it has been canceled; the tool supports canceling and writes the index of the current file so a later run can pick up where it left off. MD5 hashes are calculated asynchronously, with a limit on how many calculations run concurrently.
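
A minimal sketch of that idea, assuming a fixed concurrency limit and illustrative helper names (not taken from the tool's source), might look like this:

import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

const CONCURRENCY = 4; // illustrative limit on parallel hash calculations

// Hash one file by streaming it through MD5 so large files are not read into memory at once.
function md5File(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("md5");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}

// Hash all files, at most CONCURRENCY at a time, and group paths by hash to find duplicates.
async function findDuplicates(files: string[]): Promise<Map<string, string[]>> {
  const byHash = new Map<string, string[]>();
  for (let i = 0; i < files.length; i += CONCURRENCY) {
    const batch = files.slice(i, i + CONCURRENCY);
    const hashes = await Promise.all(batch.map(md5File));
    hashes.forEach((h, j) => {
      const group = byHash.get(h) ?? [];
      group.push(batch[j]);
      byHash.set(h, group);
    });
    // A resumable run would persist the current index (i + batch.length) here.
  }
  return byHash;
}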

Output can be printed to the screen or written to the result file. The tool does not remove files; it only finds duplicates, since removal is a sensitive operation that is left up to the user. However, once you have the result file, removal may be as simple as:

while IFS= read -r p; do
  rm -- "$p"
done < results.txt

Limitations

This tool does not offer a perfect solution if files change while processing, or if some files are deleted or altered before resuming. Reading a file and calculating its hash are the most expensive parts, so if the resumed process rehashed every file to check whether anything had changed, resuming would be pointless: it would take almost as long as starting from scratch.
