Point out gephi as a debugger #1096
On Sep 14, 2022, at 5:03 PM, Nick Crews ***@***.***> wrote:
This isn't a bug; I just wanted to point out a tool that has been really useful to me for debugging and analyzing the performance of my dedupe.
My workflow is:
1. Generate pairs, score them, and cluster them.
2. Create a networkx graph. The records are nodes in the graph, and the scores are the edges. I save the fields of each record as node attributes, and I also include the label from cluster() for each node. I try different runs with different threshold values to see what changes, or compare against my own clustering implementation.
3. Filter the network down to a manageable size. Too big and my computer can't handle the following analysis. What I've done is keep only nodes that are in large components, as those are some of the more interesting ones.
4. Export the graph to the .gexf format using networkx.write_gexf(my_graph, path).
5. Open the .gexf file in gephi and analyze.
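The graph-building, filtering, and export steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for this issue: `records` (id to field dict), `scored_pairs` (pairs of ids with a score), and `labels` (id to cluster label) are hypothetical, simplified inputs rather than dedupe's actual return formats, and `build_graph` and `keep_large_components` are made-up helper names. The only networkx calls used are `Graph`, `connected_components`, `subgraph`, and `write_gexf`.

```python
import os
import tempfile

import networkx as nx


def build_graph(records, scored_pairs, labels, threshold=0.5):
    """Records become nodes; pairs scoring at or above `threshold` become weighted edges."""
    graph = nx.Graph()
    for record_id, fields in records.items():
        graph.add_node(record_id)
        # Replace None with "" so every value of an attribute has a simple,
        # consistent type when the graph is later written to GEXF.
        for key, value in fields.items():
            graph.nodes[record_id][key] = "" if value is None else value
        # Assumption: -1 marks a record that was not assigned to any cluster.
        graph.nodes[record_id]["cluster"] = labels.get(record_id, -1)
    for (left, right), score in scored_pairs:
        if score >= threshold:
            graph.add_edge(left, right, weight=float(score))
    return graph


def keep_large_components(graph, min_size):
    """Return a copy of the graph restricted to components with at least `min_size` nodes."""
    keep = set()
    for component in nx.connected_components(graph):
        if len(component) >= min_size:
            keep.update(component)
    return graph.subgraph(keep).copy()


# Tiny made-up example data (hypothetical, for illustration only).
records = {
    1: {"name": "Ann Lee", "city": "Chicago"},
    2: {"name": "Anne Lee", "city": "Chicago"},
    3: {"name": "A. Lee", "city": None},
    4: {"name": "Bob Ray", "city": "Boston"},
}
scored_pairs = [((1, 2), 0.93), ((2, 3), 0.71), ((3, 4), 0.08)]
labels = {1: 0, 2: 0, 3: 0, 4: 1}

graph = build_graph(records, scored_pairs, labels, threshold=0.5)
large = keep_large_components(graph, min_size=3)

# write_gexf needs a destination path (or file object) as its second argument.
out_path = os.path.join(tempfile.mkdtemp(), "dedupe_debug.gexf")
nx.write_gexf(large, out_path)
```

Storing the cluster label as a plain node attribute means it should show up in gephi alongside the record fields, so it can be used to color nodes.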
I can color each node by its label to see how clustering does:
and hover over individual nodes to see what the fields were for each node:
This has been indispensable for figuring out:
- Is my scoring sane? Are there some pairs that are obviously wrong, so that I need to adjust my metrics?
- How does the clustering perform on these scores? What should I set my threshold to?
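One rough way to get a feel for the threshold question before opening gephi is to sweep a few thresholds and watch how the graph of above-threshold pairs breaks apart. The sketch below is an assumption-laden illustration: it uses plain connected components (via a small union-find), not dedupe's own clustering, and `scored_pairs`, `component_sizes`, and `sweep` are hypothetical names with made-up data.

```python
from collections import Counter


def component_sizes(scored_pairs, threshold):
    """Sizes (largest first) of connected components formed by pairs scoring >= threshold.

    Records that appear in no above-threshold pair are not counted.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for (left, right), score in scored_pairs:
        if score >= threshold:
            root_left, root_right = find(left), find(right)
            parent[root_left] = root_right
    counts = Counter(find(node) for node in list(parent))
    return sorted(counts.values(), reverse=True)


def sweep(scored_pairs, thresholds):
    """For each threshold, return (threshold, number of components, largest component size)."""
    rows = []
    for threshold in thresholds:
        sizes = component_sizes(scored_pairs, threshold)
        rows.append((threshold, len(sizes), sizes[0] if sizes else 0))
    return rows


# Made-up scores, for illustration only.
scored_pairs = [((1, 2), 0.93), ((2, 3), 0.71), ((3, 4), 0.40), ((5, 6), 0.55)]
for row in sweep(scored_pairs, [0.3, 0.5, 0.8, 0.95]):
    print(row)
```

A sudden jump in the largest component size as the threshold drops is a hint that a few bad edges are chaining unrelated records together, which is the kind of thing worth inspecting visually.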
Here is the .gexf file from the screenshot if you want to download gephi and play around. This is from publicly available campaign donation data; the records are the donors for individual donations. I thresholded the scores to be either 0 or 1 to make things simpler. This only includes records in components larger than 140.
components_larger_than_140.gexf.zip
Nick - Thanks for writing up this post. Long-time gephi/dedupe user here, and kudos on a brilliant use case. Using gephi to look at scoring and clustering is brilliant - glad to see third-party tools like this used to strengthen dedupe.
Thanks again.
-tom
Thanks Tom. Glad that someone found it useful! Depending on what you or Forest or others think about how universally valuable this is, perhaps we add support for this sort of debugging? eg a ... I'm not sure how else this issue is actionable, so it should get closed if we can't come up with something we want to change.