Sherlock was written by Rob Pike and Loki. Unfortunately, their website (http://rp-www.cs.usyd.edu.au/~scilect/sherlock/) went offline sometime in 2018. This is a copy of the source code from 2016, kept because I still actively use this algorithm.
Parts of this readme are copied from the archive.org snapshot of that website.
On Linux / Unix, use the provided makefiles to compile the program. On Windows, you will need to compile the source code file with your preferred compiler.
Sherlock internally generates a signature for each selected file, compares the signatures against each other, and calculates a similarity / overlap score for each pair of files.
- sherlock.c - Source code file of Sherlock.
- Makefile - Makefile for sig and comp combined.
Sherlock is a command-line program: you run it from a terminal such as an xterm or a DOS window, and it has no graphical user interface. Use it like this:
sherlock *.txt
That will compare all the text files in the current directory and produce a listing of the most similar files, together with a percentage similarity index.
To compare source files, you might use it like this:
sherlock *.java
Actually, it's a good idea to redirect the output into a file, so you can examine it in detail. Otherwise it'll just flash past very quickly. To redirect the output into a file, you use the > symbol:
sherlock *.java > results.txt
This creates a file called "results.txt" which contains the results.
There are several command-line options to Sherlock:
- -t threshold% This controls how similar files must be before they are reported. Increase this to 50% or higher if you only want to see very similar files. The default is 20%.
- -z zerobits This controls the "granularity" of the comparison. The higher this number, the cruder the comparison but the faster it will proceed. The lower this number, the more exact the comparison, but it will be slower, and it may be harder to detect plagiarism because small changes will fool the program into thinking the files are different. The default is 4, but the number can range from 0 to 31.
- -n number_of_words This controls how many words are used to form one digital signature. This also contributes to the granularity of the comparison. A higher number is slower while a lower number is less exact. The default is 3 words, which works fine in most cases.
- -o outfile If using Windows it may be difficult to specify an output file on the command line. Use this option to specify the output file.
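The way -n and -z interact can be pictured with a small Python sketch. This is not the code from sherlock.c: the CRC-32 hash, the whitespace tokenization, and the function name `signature` are assumptions made for illustration. It also assumes, as the option name suggests, that a hash is kept only when its lowest zerobits bits are all zero, so each extra bit discards roughly half of the remaining hashes.

```python
import zlib

def signature(text, number_of_words=3, zerobits=4):
    """Return a sorted list of the hashes kept for `text` (illustrative only)."""
    words = text.split()
    mask = (1 << zerobits) - 1  # selects the lowest `zerobits` bits
    kept = set()
    for i in range(len(words) - number_of_words + 1):
        chunk = " ".join(words[i:i + number_of_words])
        h = zlib.crc32(chunk.encode("utf-8"))  # stand-in hash, an assumption
        if (h & mask) == 0:  # keep only hashes whose low bits are all zero
            kept.add(h)
    return sorted(kept)

text = "the quick brown fox jumps over the lazy dog " * 3
full = signature(text, zerobits=0)    # zerobits=0 keeps every distinct chunk hash
coarse = signature(text, zerobits=4)  # keeps only hashes that are multiples of 16
print(len(full), len(coarse))
```

With a larger zerobits value the signature shrinks, which is why the comparison gets faster but cruder.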
Examples:
sherlock -t 80% -z 3 -n 2 -o results.txt *.java
sherlock -t 50% -o results.txt *.txt
sherlock -t 0% *.java # reports all similarity indexes
Sherlock performs an N² comparison between all the files, so every file is compared with every other file.
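For N files this amounts to N(N-1)/2 distinct pairs, so doubling the number of files roughly quadruples the work. A short Python sketch that enumerates the pairs (the file names are placeholders, and `file_pairs` is a name made up for this example):

```python
from itertools import combinations

def file_pairs(files):
    """Return every unordered pair of distinct files, each pair once."""
    return list(combinations(files, 2))

files = ["a.txt", "b.txt", "c.txt", "d.txt"]  # placeholder names
pairs = file_pairs(files)
print(len(pairs))  # 4 * 3 / 2 = 6 pairs
```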
The output lists the similarity indexes between each pair of files. This index is a percentage, where 0% means no similarity and 100% means there is a very high chance of a lot of similarity. 100% does not mean that the files are exactly the same, since the Sherlock program randomly throws away some data in order to perform a faster match.
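One way to picture the index is as the share of signature hashes that two files have in common. The sketch below is an assumption, not the formula used in sherlock.c: it treats each signature as a set of integers and reports the overlap as a percentage of all distinct hashes. The name `similarity` and the sample hash values are hypothetical.

```python
def similarity(sig_a, sig_b):
    """Percentage (0-100) of distinct hashes shared by two signatures."""
    a, b = set(sig_a), set(sig_b)
    if not a and not b:
        return 0  # two empty signatures: nothing to compare
    return 100 * len(a & b) // len(a | b)

print(similarity([16, 32, 48], [16, 32, 48]))          # 100
print(similarity([16, 32, 48], [64, 80]))              # 0
print(similarity([16, 32, 48, 64], [16, 32, 80, 96]))  # 2 shared of 6 distinct: 33
```

Because a signature holds only the hashes that survive the filtering step, two files that differ only in discarded material would still score 100% under this measure, which is consistent with the caveat above.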
The output of the program might look like this:
README and index.html: 5%
README and makefile: 1%
README and sherlock: 0%
README and sherlock.c: 2%
index.html and makefile: 8%
index.html and sherlock: 0%
index.html and sherlock.c: 10%
makefile and sherlock: 0%
makefile and sherlock.c: 6%
sherlock and sherlock.c: 0%
(The threshold is normally 20%, so the above output would not ordinarily be shown. The example used a threshold lowered to 0% in order to see all similarities. The numbers are made up, however, and are purely illustrative.)
In the example, the most similar files are index.html and sherlock.c with a rating of 10%. This means that approximately 10% of the material in those two files might be overlapping (i.e. appear in both files).
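When the results are redirected to a file, a few lines of Python can rank the pairs. This sketch assumes every line has the form `name1 and name2: N%`, as in the sample output above; the helper name `parse_results` is made up for illustration, and file names that themselves contain " and " would be split at the last occurrence.

```python
import re

# Matches lines such as "index.html and sherlock.c: 10%"
LINE = re.compile(r"^(.+) and (.+): (\d+)%$")

def parse_results(text):
    """Return (file_a, file_b, percent) tuples, most similar first."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:  # skip blank or unrecognized lines
            rows.append((m.group(1), m.group(2), int(m.group(3))))
    return sorted(rows, key=lambda r: r[2], reverse=True)

sample = """README and index.html: 5%
index.html and makefile: 8%
index.html and sherlock.c: 10%"""
for a, b, pct in parse_results(sample):
    print(f"{pct:3d}%  {a}  {b}")
```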
sig will generate a single signature file.
- sig.c - sig will generate a single signature file.
- comp.c - comp will compare any number of signature files against each other and calculate their similarity / overlapping score.
- Makefile - Makefile for sig and comp combined.
Rob Pike created the original version of this program. There were actually two programs, called sig and comp. The sig program generated the digital signatures and stored them in a file. The comp program would then be used to compare the signature files and report the similarities.
Loki combined the two programs into a single program, called Sherlock. This has some advantages and disadvantages.
The main advantage is that no intermediate files need be created. Intermediate files require disk space and a management strategy. For example, you need to decide what suffix the signature files will use (e.g. ".sig") and where they will be stored (with the data, in a parallel directory structure, or other place).
Sherlock avoids these issues but requires all files to be compared every time you want a comparison. For example, if you are looking for similarities in essays handed in by students, Sherlock may work well because you should have all the essays handed in before you begin the plagiarism detection.
On the other hand, Sherlock is not so well suited to detecting duplicate email messages, since email arrives continually, and you can never have a finished set of email to work on. Intermediate ".sig" files might reduce the time needed to compare a new email item to older items, since the signatures for old items will already be computed and stored. Sherlock would have to read and compute the signatures for all of those older files, every time you need a comparison, because it does not use ".sig" files.
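The email scenario can be sketched in Python as follows. Everything here is hypothetical: `signature` is a simplified stand-in for what sig computes, `similarity` is an assumed overlap measure, and the in-memory `store` dictionary plays the role of the ".sig" files. The point is only that each new item is hashed once and then compared against signatures that already exist.

```python
import zlib

def signature(text, n=3):
    """Simplified stand-in for sig: hash every run of n words."""
    words = text.split()
    return {zlib.crc32(" ".join(words[i:i + n]).encode("utf-8"))
            for i in range(len(words) - n + 1)}

def similarity(a, b):
    """Assumed overlap measure: shared hashes as a percentage."""
    return 100 * len(a & b) // len(a | b) if (a or b) else 0

store = {}  # name -> signature, standing in for stored ".sig" files

def add_and_compare(name, text, threshold=20):
    """Hash the new item once, compare it to stored signatures, then store it."""
    sig = signature(text)
    hits = [(old, similarity(sig, s)) for old, s in store.items()]
    store[name] = sig
    return [(old, pct) for old, pct in hits if pct >= threshold]

add_and_compare("mail1", "please review the attached report before friday")
print(add_and_compare("mail2", "please review the attached report before monday"))
```

Older items are never re-read or re-hashed here, which is the saving the separate sig and comp programs make possible.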
sherlock_signature source.txt
sherlock_signature source.txt > signature_file.txt # or directly write the output into a file
You can use the -z zerobits, -n number_of_words and -o outfile options.
sherlock_compare signature_file_1.txt signature_file_2.txt # to compare 2 files directly
sherlock_compare *.txt > result.txt # compare any files and write output into a result file.
You can use the -t threshold% option.
The output is the same as that of the normal Sherlock program.
Download them by saving the following files: