-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Jaccard Similarity Biased on Bipartite Graphs #8
Comments
Thank you Alex. I'll check this out in more details. I briefly skimmed over the article, got puzzled by their definition of Jaccard Index - they say
Where
I'd expect denominator to be |
Also a quick follow up question, in case I'm missing it. How do you compute |
If you’re using the affiliation matrix, then it is d/(b+c+d) because c is nodes that are in both sets and is already removed from the “or” combination. See the “Similarity of asymmetric binary attributes” figure in https://en.wikipedia.org/wiki/Jaccard_index <https://en.wikipedia.org/wiki/Jaccard_index>.
For E(d), I’m guessing they either used a random graph generator (e.g., https://networkx.github.io/documentation/networkx-1.10/reference/generators.html <https://networkx.github.io/documentation/networkx-1.10/reference/generators.html>) with parameters equal to the empirical graph or they used one of the methods listed in the following:
“The difference is the denominator. Instead of comparing the number of co-followers with the number of followers (measured either as the union or the geometric mean), SCR controls for the expected increase in co-following among high-degree followers by taking as a baseline the number of co-followers expected by chance in an otherwise identical network. The randomization procedure preserves the original in-degree and outdegree of every node (Zweig and Kauffman, 2011; Gionis et al., 2007; Neal, 2014).”
Hope that helps!
… On Aug 29, 2019, at 1:02 PM, Andrei Kashcha ***@***.***> wrote:
Also a quick follow up question, in case I'm missing it. How do you compute E(d) - the number expected in an otherwise identical randomized network?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub <#8?email_source=notifications&email_token=AFIYWOI6ZT222BPOOBAMR3DQG76LBA5CNFSM4IQGW7HKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5PFDRQ#issuecomment-526275014>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFIYWOPFUU7BSLI5BSYVDBDQG76LBANCNFSM4IQGW7HA>.
|
It is very interesting. Any chance you can ELI5 this to me? Given the reddit network of posts and comments, what should be the |
Sure thing!
So social networks often follow what’s called a power law distribution, which basically means that lots of nodes in the network have a small/moderate degree of connectivity to other nodes, but a small number of nodes have a very high degree. If you think of Reddit, for example, a lot of people only post 1-2 comments a month; however, some people comment on hundreds or even thousands of subreddits. Because the distribution of these degrees of connectivity is highly skewed with a heavy right tail, they throw off the estimate of the network’s average degree (i.e., the estimated average level of connectivity in the graph will be higher than it really should be). This is the same reason why we often report the median of these distributions, as it’s not affected by skewed distributions.
The paper shows that the heavily skewed degree of connectivity in large networks biases many measures of similarity in networks (e.g., Euclidean, Pearson, Cosine, Jaccard, etc.). The bias happens because the out degree of the very highly connected nodes overemphasizes some of the weight in the denominators of these measures. The standardized co-incidence ratio, however, excludes this bias as it compares the degrees to what is expected at random.
Fortunately, if you want to get similarity measures for your graph, you’ll only need to generate one random graph. You can use this tutorial and adjust the number of nodes (n) and edges (m) to the number of nodes in your network: https://networkx.github.io/documentation/stable/auto_examples/graph/plot_erdos_renyi.html <https://networkx.github.io/documentation/stable/auto_examples/graph/plot_erdos_renyi.html>. After you’ve generated your random network, you’ll only have to calculate E(d) once, which will return a number that equals the average number of nodes that any two pairs of nodes both share edges to (i.e., how many nodes on average do any pair of nodes in the network share as common neighbors).This will give you a number, like 10.23 or 5.81 or 27.22, depending on how sparse or dense the network is. Again, this number just represents the average number of common neighbors that any two nodes share.
Once you have this number, you can update your similarity estimate script to set E(d) to the number you estimated and then re-estimate the similarities between each of your nodes using the d/E(d) SCR measure. It should give you a matrix just like your Jaccard method gives you.
… On Aug 29, 2019, at 3:23 PM, Andrei Kashcha ***@***.***> wrote:
The difference is the denominator
It is very interesting. Any chance you can ELI5 this to me? Given the reddit network of posts and comments, what should be the E(d) function? How do I compute it efficiently?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub <#8?email_source=notifications&email_token=AFIYWOLKEJHI5RMMSQAEZK3QHAO2BA5CNFSM4IQGW7HKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5PRSLI#issuecomment-526326061>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFIYWOM4ZF2OMYWVHEIASJDQHAO2BANCNFSM4IQGW7HA>.
|
Any update on this? Would be interesting to see how this differs. |
You should consider using Standardized Co-incident Ratio (SCR) as a measure of similarity instead of the Jaccard Similarity, as the latter is biased in large bipartite graphs whereas the former is not: https://yongrenshi.weebly.com/uploads/5/7/8/6/57861243/ssr_structuralsimilarity.pdf. Should be a simple switch.
The text was updated successfully, but these errors were encountered: