
consider adding soft-cosine distance #21

Open
behrica opened this issue Oct 10, 2022 · 4 comments


behrica commented Oct 10, 2022

Useful for comparing TF-IDF text representations, instead of plain cosine.

https://en.wikipedia.org/wiki/Cosine_similarity#Soft_cosine_measure

The similarity function s_ij should be pluggable (passed as an input to the function).
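
For reference, a minimal sketch of that definition, with s_ij as a pluggable function over vocabulary indices (all names are illustrative, not a proposed API):

(defn soft-cosine
  "Minimal sketch of the soft cosine measure from the Wikipedia page.
  `a` and `b` are equal-length numeric vectors over the same vocabulary;
  `sim` is the pluggable s_ij similarity over vocabulary indices."
  [a b sim]
  (let [soft-dot (fn [u v]
                   ;; sum over all index pairs i,j of s_ij * u_i * v_j
                   (reduce + (for [i (range (count u))
                                   j (range (count v))]
                               (* (sim i j) (nth u i) (nth v j)))))]
    (/ (soft-dot a b)
       (* (Math/sqrt (soft-dot a a))
          (Math/sqrt (soft-dot b b))))))

;; with (fn [i j] (if (= i j) 1 0)) as `sim` this reduces to plain cosine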

@genmeblog commented

Thanks for the idea! I think I need your support here. I understand the definition; however, I have no idea how to build a convenient API for that. A set of examples would be helpful.


behrica commented Oct 14, 2022

I think it should simply allow plugging in any (distance) function which takes two values and returns a float.

(soft-cosine [1 2 3 4] [2 3 5 6]
             (fn [x y] ...)) ;; ... = do something to calculate the distance of x and y

A concrete case comes from NLP.

A language-aware function:

(defn word-dist [token-1 token-2]
  ...)

with this spec:

(word-dist "I" "I") = 1
(word-dist "like" "like") = 1
(word-dist "I" "like") =
(word-dist "fruits" "banana") = 0.5

(soft-cosine ["I" "like" "fruits"] ["I" "like" "banana"] word-dist) = .... > 0.6 (not sure about the concrete number)

It would compare:
"I" -> "I" = 1
"like" -> "like" = 1
"fruits" -> "banana" = 0.5

In practice we would map all tokens to numbers first (this makes the vocabulary), so soft-cosine would be called with vectors of ints (if token frequency is used) or floats (if TF-IDF is used).
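
A sketch of how that mapping could work on top of the soft-cosine sketch above (the vocabulary construction and all names are illustrative assumptions):

(defn soft-cosine-tokens
  "Hypothetical glue: build a shared vocabulary from both token vectors,
  turn each into a term-frequency vector, and lift `word-dist` from
  tokens to vocabulary indices before calling `soft-cosine`."
  [tokens-1 tokens-2 word-dist]
  (let [vocab (vec (distinct (concat tokens-1 tokens-2)))
        freqs (fn [tokens]
                (let [fs (frequencies tokens)]
                  (mapv #(get fs % 0) vocab)))
        sim   (fn [i j] (word-dist (vocab i) (vocab j)))]
    (soft-cosine (freqs tokens-1) (freqs tokens-2) sim)))

(soft-cosine-tokens ["I" "like" "fruits"] ["I" "like" "banana"] word-dist)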


behrica commented Oct 14, 2022

I found here:
https://github.com/TeamCohen/secondstring/blob/master/src/com/wcohen/ss/SoftTFIDF.java

an old Java implementation which combines TF-IDF and soft-cosine.

I would prefer to have this separated.

The TF-IDF part we already have here:
https://github.com/scicloj/scicloj.ml.smile/blob/main/src/scicloj/ml/smile/nlp.clj#L285

This gives me the two vectors above that I want to get the distance for.

The "classical" way is to use simple cosine distance, but this is then not able to deal with "similarity of tokens".
The only way to do hat would be to "normalize" the vocabulary before, and somehow say that "fruits" and "banana" is the same thing, and remove one. But his is a too strict normalisation.
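
For illustration, that strict normalization would look something like this (the synonym map is a made-up example):

(def normalize-token
  ;; made-up synonym map: collapse "banana" into "fruits"
  {"banana" "fruits"})

(mapv #(get normalize-token % %) ["I" "like" "banana"])
;; => ["I" "like" "fruits"], now identical to the other document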

Soft-cosine should be better, hopefully.


behrica commented Oct 14, 2022

Another example would be to plug in text embeddings (word2vec). They can also calculate a semantic distance between any two words.

There is a Java implementation here, so I would plug in this concrete function:
https://javadoc.io/static/org.deeplearning4j/deeplearning4j-nlp/1.0.0-M2.1/org/deeplearning4j/models/embeddings/wordvectors/WordVectors.html#similarity(java.lang.String,java.lang.String)

(just doing the token<->index vocabulary mapping first)
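
A sketch of that wiring, assuming a word2vec model file is available and reusing the soft-cosine-tokens sketch from the previous comment (everything besides .similarity and the serializer class is an illustrative assumption):

(import '[org.deeplearning4j.models.embeddings.loader WordVectorSerializer])

;; assumption: a word2vec model file available locally
(def word-vectors
  (WordVectorSerializer/readWord2VecModel "path/to/word2vec.bin"))

(defn w2v-sim [token-1 token-2]
  ;; .similarity(String, String) -> double, per the javadoc linked above
  (.similarity word-vectors token-1 token-2))

(soft-cosine-tokens ["I" "like" "fruits"] ["I" "like" "banana"] w2v-sim)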
