Jaccard Similarity Biased on Bipartite Graphs #8

AlexMRuch · 2019-08-27T16:51:31Z

You should consider using Standardized Co-incident Ratio (SCR) as a measure of similarity instead of the Jaccard Similarity, as the latter is biased in large bipartite graphs whereas the former is not: https://yongrenshi.weebly.com/uploads/5/7/8/6/57861243/ssr_structuralsimilarity.pdf. Should be a simple switch.

anvaka · 2019-08-29T16:58:15Z

Thank you Alex. I'll check this out in more details.

I briefly skimmed over the article, got puzzled by their definition of Jaccard Index - they say

Jij = d / (b + c + d)

Where

d - is number of shared members between
b - number of i's neighbors
c - number of j's neighbors.

I'd expect denominator to be (b + c - d) not (b + c + d). It is just a typo or they used incorrect Jaccard similarity definition?

anvaka · 2019-08-29T17:02:38Z

Also a quick follow up question, in case I'm missing it. How do you compute E(d) - the number expected in an otherwise identical randomized network?

AlexMRuch · 2019-08-29T18:08:11Z

If you’re using the affiliation matrix, then it is d/(b+c+d) because c is nodes that are in both sets and is already removed from the “or” combination. See the “Similarity of asymmetric binary attributes” figure in https://en.wikipedia.org/wiki/Jaccard_index <https://en.wikipedia.org/wiki/Jaccard_index>. For E(d), I’m guessing they either used a random graph generator (e.g., https://networkx.github.io/documentation/networkx-1.10/reference/generators.html <https://networkx.github.io/documentation/networkx-1.10/reference/generators.html>) with parameters equal to the empirical graph or they used one of the methods listed in the following: “The difference is the denominator. Instead of comparing the number of co-followers with the number of followers (measured either as the union or the geometric mean), SCR controls for the expected increase in co-following among high-degree followers by taking as a baseline the number of co-followers expected by chance in an otherwise identical network. The randomization procedure preserves the original in-degree and outdegree of every node (Zweig and Kauffman, 2011; Gionis et al., 2007; Neal, 2014).” Hope that helps!

…

On Aug 29, 2019, at 1:02 PM, Andrei Kashcha ***@***.***> wrote: Also a quick follow up question, in case I'm missing it. How do you compute E(d) - the number expected in an otherwise identical randomized network? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#8?email_source=notifications&email_token=AFIYWOI6ZT222BPOOBAMR3DQG76LBA5CNFSM4IQGW7HKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5PFDRQ#issuecomment-526275014>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFIYWOPFUU7BSLI5BSYVDBDQG76LBANCNFSM4IQGW7HA>.

anvaka · 2019-08-29T19:23:12Z

The difference is the denominator

It is very interesting. Any chance you can ELI5 this to me? Given the reddit network of posts and comments, what should be the E(d) function? How do I compute it efficiently?

AlexMRuch · 2019-08-29T20:18:52Z

Sure thing! So social networks often follow what’s called a power law distribution, which basically means that lots of nodes in the network have a small/moderate degree of connectivity to other nodes, but a small number of nodes have a very high degree. If you think of Reddit, for example, a lot of people only post 1-2 comments a month; however, some people comment on hundreds or even thousands of subreddits. Because the distribution of these degrees of connectivity is highly skewed with a heavy right tail, they throw off the estimate of the network’s average degree (i.e., the estimated average level of connectivity in the graph will be higher than it really should be). This is the same reason why we often report the median of these distributions, as it’s not affected by skewed distributions. The paper shows that the heavily skewed degree of connectivity in large networks biases many measures of similarity in networks (e.g., Euclidean, Pearson, Cosine, Jaccard, etc.). The bias happens because the out degree of the very highly connected nodes overemphasizes some of the weight in the denominators of these measures. The standardized co-incidence ratio, however, excludes this bias as it compares the degrees to what is expected at random. Fortunately, if you want to get similarity measures for your graph, you’ll only need to generate one random graph. You can use this tutorial and adjust the number of nodes (n) and edges (m) to the number of nodes in your network: https://networkx.github.io/documentation/stable/auto_examples/graph/plot_erdos_renyi.html <https://networkx.github.io/documentation/stable/auto_examples/graph/plot_erdos_renyi.html>. After you’ve generated your random network, you’ll only have to calculate E(d) once, which will return a number that equals the average number of nodes that any two pairs of nodes both share edges to (i.e., how many nodes on average do any pair of nodes in the network share as common neighbors).This will give you a number, like 10.23 or 5.81 or 27.22, depending on how sparse or dense the network is. Again, this number just represents the average number of common neighbors that any two nodes share. Once you have this number, you can update your similarity estimate script to set E(d) to the number you estimated and then re-estimate the similarities between each of your nodes using the d/E(d) SCR measure. It should give you a matrix just like your Jaccard method gives you.

…

On Aug 29, 2019, at 3:23 PM, Andrei Kashcha ***@***.***> wrote: The difference is the denominator It is very interesting. Any chance you can ELI5 this to me? Given the reddit network of posts and comments, what should be the E(d) function? How do I compute it efficiently? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#8?email_source=notifications&email_token=AFIYWOLKEJHI5RMMSQAEZK3QHAO2BA5CNFSM4IQGW7HKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5PRSLI#issuecomment-526326061>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFIYWOM4ZF2OMYWVHEIASJDQHAO2BANCNFSM4IQGW7HA>.

no-identd · 2020-09-02T22:36:52Z

Any update on this? Would be interesting to see how this differs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jaccard Similarity Biased on Bipartite Graphs #8

Jaccard Similarity Biased on Bipartite Graphs #8

AlexMRuch commented Aug 27, 2019

anvaka commented Aug 29, 2019

anvaka commented Aug 29, 2019

AlexMRuch commented Aug 29, 2019 via email

anvaka commented Aug 29, 2019

AlexMRuch commented Aug 29, 2019 via email

no-identd commented Sep 2, 2020

Jaccard Similarity Biased on Bipartite Graphs #8

Jaccard Similarity Biased on Bipartite Graphs #8

Comments

AlexMRuch commented Aug 27, 2019

anvaka commented Aug 29, 2019

anvaka commented Aug 29, 2019

AlexMRuch commented Aug 29, 2019 via email

anvaka commented Aug 29, 2019

AlexMRuch commented Aug 29, 2019 via email

no-identd commented Sep 2, 2020