Using jiebaR package (SimHash algorithm) #66

remibacha · 2018-10-22T15:14:14Z

Hello

Here are two texts I would like to check for near-duplication using the SimHash algorithm (jiebaR package):

 library(jiebaR)
 coder <- "Simhash detects near duplicates and not exact duplicates"
 codel <- "SimHash is a technique for quickly detect near duplicates"

I created a worker called "simhasher":

 simhasher = worker("simhash", topn = 5)
 simhasher <= codel

Then I computed the distance:

 distance(codel, coder, simhasher)

Here is the result:

 $distance
 [1] 22

 $lhs
 11.7392      11.7392      11.7392      11.7392      11.7392 
 "duplicates"  "technique"    "SimHash"     "detect"    "quickly" 

 $rhs
 23.4784      11.7392      11.7392      11.7392 
 "duplicates"    "Simhash"    "detects"      "exact"

I need your help with 3 things:

1. The distance is 22. The bigger the distance, the more the two texts differ. Here the texts seem REALLY close, so I was expecting the distance to be smaller... Can you please explain this result?
2. What are the figures above the words in lhs and rhs? (e.g. 11.7392, 23.4784)
3. I also checked the worker I created:

 simhasher <= codel

And here is the result I discovered:

 $simhash
 [1] "12382334418040220206"

 $keyword
 11.7392      11.7392      11.7392      11.7392      11.7392 
 "duplicates"  "technique"    "SimHash"     "detect"    "quickly"

What is the simhash here, and why do I need to create it before running the distance function? This part is not really clear to me and not really explained in the package documentation.

Can you please help me? This package seems really powerful, but I feel like I only understand 5% of it.

BruceZhaoR · 2018-10-23T07:49:01Z

@remibacha
jiebaR::distance first uses TF-IDF to extract the keywords, then uses those keywords to generate a 64-bit hash code, and finally calculates the Hamming distance between the hash codes.
Here is an example:

library(jiebaR)
#> Loading required package: jiebaRD
simhasher_5 = worker("simhash", topn = 5)
keyword_1 <- c("Simhash", "duplicates")
keyword_2 <- c("Simhash", "quickly")
simhash_1 <- vector_simhash(keyword_1, simhasher_5)
simhash_1
#> $simhash
#> [1] "144150442997195320"
#> 
#> $keyword
#>      11.7392      11.7392 
#>    "Simhash" "duplicates"
simhash_2 <- vector_simhash(keyword_2, simhasher_5)
simhash_2
#> $simhash
#> [1] "1730138795753340968"
#> 
#> $keyword
#>   11.7392   11.7392 
#> "Simhash" "quickly"

tobin(simhash_1$simhash)
#> [1] "0000001000000000001000000001000001101101000100000010001000111000"
tobin(simhash_2$simhash)
#> [1] "0001100000000010101100000001000101101101000000000000000000101000"
# hamming-distance
simhash_dist(simhash_1$simhash, simhash_2$simhash)
#> [1] 11
vector_distance(keyword_1, keyword_2, simhasher_5)
#> $distance
#> [1] 11
#> 
#> $lhs
#>      11.7392      11.7392 
#>    "Simhash" "duplicates" 
#> 
#> $rhs
#>   11.7392   11.7392 
#> "Simhash" "quickly"

# only one keyword "Simhash"
simhasher_1 <- worker("simhash", topn = 1)
simhash_1 <- vector_simhash(keyword_1, simhasher_1)
simhash_1
#> $simhash
#> [1] "1883542797686548280"
#> 
#> $keyword
#>   11.7392 
#> "Simhash"

simhash_2 <- vector_simhash(keyword_2, simhasher_1)
simhash_2
#> $simhash
#> [1] "1883542797686548280"
#> 
#> $keyword
#>   11.7392 
#> "Simhash"

tobin(simhash_1$simhash)
#> [1] "0001101000100011101100000011000111101111010110100010011100111000"
tobin(simhash_2$simhash)
#> [1] "0001101000100011101100000011000111101111010110100010011100111000"
# hamming-distance
simhash_dist(simhash_1$simhash, simhash_2$simhash)
#> [1] 0

vector_distance(keyword_1, keyword_2, simhasher_1)
#> $distance
#> [1] 0
#> 
#> $lhs
#>   11.7392 
#> "Simhash" 
#> 
#> $rhs
#>   11.7392 
#> "Simhash"

Created on 2018-10-23 by the reprex package (v0.2.0).

hamming_distance: https://en.wikipedia.org/wiki/Hamming_distance
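
As a rough cross-check, the distance of 11 reported above can be recomputed by hand from the two binary strings that tobin() printed. The sketch below uses only base R; `hamming_bits` is a hypothetical helper written for this illustration, not a jiebaR function.

```r
# Hypothetical helper (not part of jiebaR): count the positions at which
# two equal-length bit strings differ.
hamming_bits <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Bit strings copied from the tobin() output above.
bits_1 <- "0000001000000000001000000001000001101101000100000010001000111000"
bits_2 <- "0001100000000010101100000001000101101101000000000000000000101000"

hamming_bits(bits_1, bits_2)
#> [1] 11
```

With 64-bit codes the distance ranges from 0 (identical fingerprints) to 64, so a value of 11 means 11 of the 64 bits differ.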

You can modify the user dictionary in jiebaRD (see ?USERPATH and ?edit_dict), which can change a word's TF-IDF weight.

remibacha · 2018-10-24T07:27:37Z

Thanks for this example, really helpful! But I still don't get what the figures above the words in lhs and rhs are (e.g. 11.7392). Can you please explain them?

BruceZhaoR · 2018-10-25T03:39:36Z

@remibacha jiebaR is designed for Chinese text segmentation, and its default IDF dictionary contains only Chinese words. The default IDF weight for an English word is probably 11.7392, and the figure shown is tf-idf = tf * idf. In the example below, "Simhash" appears twice, so its weight is 2 * 11.7392 = 23.4784:

IDFPATH
#> [1] "E:/R/R-3.5-library/jiebaRD/dict/idf.utf8"
keys = worker("keywords", topn = 2)
keys <= "Simhash is quick, Simhash ia fast"
#> 23.4784   11.7392 
#> "Simhash"    "fast"
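
The same arithmetic explains the 23.4784 in the original question, where "duplicates" occurs twice in the second text. Below is a minimal base R sketch of that calculation. It assumes every word falls back to the IDF of 11.7392 seen in this thread, and it splits on spaces only; jiebaR's own segmentation and stop-word filtering are not reproduced, so the sketch also lists words such as "and" that jiebaR dropped.

```r
# Assumption: fallback IDF for words missing from the default dictionary,
# taken from the output shown in this thread.
default_idf <- 11.7392

text  <- "Simhash detects near duplicates and not exact duplicates"
words <- strsplit(text, " ", fixed = TRUE)[[1]]

# Term frequency: how often each distinct word occurs in the text.
tf <- vapply(unique(words), function(w) sum(words == w), numeric(1))

# tf-idf weight per word, largest first.
weights <- sort(tf * default_idf, decreasing = TRUE)

weights[["duplicates"]]
#> [1] 23.4784
weights[["Simhash"]]
#> [1] 11.7392
```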

If you want more accurate TF-IDF weights, you need to train the IDF values on your own corpus. The get_idf function may help. Then you can use worker("keywords", idf = "path to your idf.dict", ....)

Suppose you have a large English corpus: you can use it to train the IDF values, then use worker("simhash", ...) to generate a simhash value for every document, and finally use simhash_dist_mat to get the distances between the documents.
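
To illustrate the last step of that workflow, here is a base R sketch that builds a pairwise Hamming distance matrix and flags near-duplicate pairs. It is not the simhash_dist_mat implementation. The fingerprints are the bit strings printed earlier in this thread, the names `kw_1` to `kw_4` are made up, and the threshold of 3 is an arbitrary choice for the example.

```r
# Fingerprints as bit strings (copied from the tobin() output in this thread;
# kw_4 repeats kw_3 to stand in for an exact duplicate).
fingerprints <- c(
  kw_1 = "0000001000000000001000000001000001101101000100000010001000111000",
  kw_2 = "0001100000000010101100000001000101101101000000000000000000101000",
  kw_3 = "0001101000100011101100000011000111101111010110100010011100111000",
  kw_4 = "0001101000100011101100000011000111101111010110100010011100111000"
)

# Hypothetical helper: number of differing bit positions.
hamming_bits <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Fill a symmetric matrix of pairwise distances.
n <- length(fingerprints)
dist_mat <- matrix(0L, n, n,
                   dimnames = list(names(fingerprints), names(fingerprints)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_mat[i, j] <- hamming_bits(fingerprints[[i]], fingerprints[[j]])
  }
}

# Assumed threshold: pairs at or below it are treated as near-duplicates.
threshold <- 3L
near_dup <- which(dist_mat <= threshold & upper.tri(dist_mat), arr.ind = TRUE)
near_dup
```

Only the upper triangle is inspected so that each pair is reported once and a document is never compared with itself.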

There is also the stringdist package, which can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding, or between integer vectors representing generic sequences. The package is built for speed and runs in parallel using OpenMP. An API for C or C++ is exposed as well.
I think the main trick is hashing the keywords and their weights into the simhash code. Calculating the Hamming distance between such codes is very fast, which makes it useful for de-duplicating documents. For more, you can read https://github.com/yanyiwu/simhash/blob/master/README_EN.md (the author's cppjieba is the source of jiebaR). Some introductions: https://github.com/seomoz/simhash-cpp/#architecture and https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html
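
The general simhash idea of folding weighted keywords into one fingerprint can be sketched in a few lines of base R. This is a toy illustration under stated assumptions, not jiebaR's implementation: `toy_hash_bits` is a made-up 16-bit polynomial hash (jiebaR produces 64-bit codes with a different hash function), and `toy_simhash` is a hypothetical name.

```r
# Toy hash (assumption, NOT the hash jiebaR uses): map a word to n_bits bits.
toy_hash_bits <- function(word, n_bits = 16L) {
  h <- 0
  for (code in utf8ToInt(word)) {
    h <- (h * 31 + code) %% 2^n_bits
  }
  # Extract the bits of h, most significant first.
  (h %/% 2^((n_bits - 1):0)) %% 2
}

# Simhash: each keyword votes +weight on its 1 bits and -weight on its 0 bits;
# a fingerprint bit is 1 where the summed vote is positive (ties give 0).
toy_simhash <- function(words, weights, n_bits = 16L) {
  votes <- numeric(n_bits)
  for (i in seq_along(words)) {
    bits  <- toy_hash_bits(words[i], n_bits)
    votes <- votes + ifelse(bits == 1, weights[i], -weights[i])
  }
  as.integer(votes > 0)
}

# With a single keyword the fingerprint is just that keyword's hash,
# which mirrors the topn = 1 case earlier in the thread.
fp_single <- toy_simhash("Simhash", 11.7392)
identical(fp_single, as.integer(toy_hash_bits("Simhash")))
#> [1] TRUE

# Two keyword sets that share one word; count the differing bits.
fp_a <- toy_simhash(c("Simhash", "duplicates"), c(11.7392, 11.7392))
fp_b <- toy_simhash(c("Simhash", "quickly"), c(11.7392, 11.7392))
sum(fp_a != fp_b)
```

Because every keyword contributes to every bit, a short text with only a handful of keywords can change many bits when a single keyword differs, which is one reason distances on very short texts can look large.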
