Document Distance Problem
The algorithms folder contains 8 algorithms that compute the distance between two documents. The distance between two documents is a measure of how different they are.
The approach treats each document as a word-frequency vector with two parts: the first part is the list of distinct words, and the second part is how many times each word occurs. To measure the similarity between two documents, we use the standard angle formula between two vectors:

distance = arccos((L1 · L2) / (|L1| * |L2|))

where L1 is the first document's vector, L2 is the second document's vector, L1 · L2 is their inner product, and |L| denotes a vector's length.
Obviously, the smaller the angle between them, the more similar the documents are. Identical documents therefore yield 0 radians, while documents that share no words yield π/2 radians.
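As a minimal sketch of this vector-angle computation (not the repository's code; the function names below, such as `word_counts` and `document_distance`, are chosen here purely for illustration):

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count occurrences of each lowercase alphanumeric word in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def inner_product(c1, c2):
    """Dot product of two sparse word-frequency vectors."""
    return sum(count * c2[word] for word, count in c1.items() if word in c2)


def document_distance(text1, text2):
    """Angle in radians between the word-frequency vectors of two texts."""
    c1, c2 = word_counts(text1), word_counts(text2)
    numerator = inner_product(c1, c2)
    denominator = math.sqrt(inner_product(c1, c1) * inner_product(c2, c2))
    # Clamp to guard against floating-point values slightly above 1.0.
    return math.acos(min(1.0, numerator / denominator))


print(document_distance("the quick brown fox", "the quick brown fox"))  # identical -> 0.0
print(document_distance("apples and oranges", "cars and trains"))       # one shared word -> between 0 and pi/2
print(document_distance("apples oranges", "cars trains"))               # no shared words -> pi/2 (about 1.5708)
```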
There are three large test files and two small test files in the folder. The small test files run in 0-1 seconds, depending on the algorithm. However, the real beauty of the algorithms is revealed when running the large documents, which are about 0.9 MB each. Here's the analysis:
1-3: Did not finish; I stopped them after around 5 hours of running
4: 222 seconds
5: 267 seconds
6: <2 seconds (not a bad jump, huh?)
7: <1 second
8: <1 second
You can see a discrepancy between the 4th and 5th algorithms: the supposed improvement actually made the run time longer. I cannot explain that; possibly I made a mistake in the implementation.
P.S. This was run on a virtual machine, which is why the results are so slow, especially for the first three algorithms. On a standalone computer, they would probably finish in less than 5 hours.
None of these algorithms are mine; they are courtesy of MIT under a Creative Commons license. I recommend reading their explanation of the improvements, as well as the asymptotic complexity analysis of the algorithms, here.
However, I converted their implementation to an object-oriented approach. That should not affect performance, as I kept the algorithms as close to the originals as possible.
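For a rough idea of what such an object-oriented restructuring might look like (a hypothetical sketch under my own naming, not the repository's actual classes):

```python
import math
import re
from collections import Counter


class Document:
    """Hypothetical wrapper holding one text file's word-frequency vector."""

    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            self.counts = Counter(re.findall(r"[a-z0-9]+", f.read().lower()))

    def inner_product(self, other):
        """Dot product of the two sparse frequency vectors."""
        return sum(n * other.counts[w] for w, n in self.counts.items() if w in other.counts)

    def distance_to(self, other):
        """Angle in radians between this document's vector and another's."""
        numerator = self.inner_product(other)
        denominator = math.sqrt(self.inner_product(self) * other.inner_product(other))
        return math.acos(min(1.0, numerator / denominator))


# Hypothetical usage, assuming two test files exist at these paths:
# print(Document("test1.txt").distance_to(Document("test2.txt")))
```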
Please take a look at both implementations, and let me know if you find any bugs.