How to create training file #91
Comments
Noob user here. I had the same question but then figured out the structure: first there's a "distinct" set with all pairs known to be distinct, and then a "match" set with all pairs you've identified as being the same.

I have a list of publications by people with similar names, and my goal is to identify those authored by one particular person. So I need that one cluster to be good, and I don't care much about deduplicating other authors. I have a list of publications authored by the correct person and a list of some that are definitely not by her.

Here's the training file I created. Say set A is the list of publications by the person I am interested in, and set B is the set of those not by her. The "distinct" part is the Cartesian product of A and B. The "match" part is the set of pairs {a[i], a[j]} where a[i] and a[j] belong to A and i != j, together with a similar set for elements in B. The rest is making sure that the file has the correct format (double quotes instead of single quotes, the correct brackets, etc.). I'll add the code below.

The code I used is for a dataset with one variable called "match" that identifies the person I am interested in. match==1 means it is the correct person, match==0 means it is someone else, and match=='' means we don't know (these are the cases I want dedupe to help me with).

Like I said, noob user here, so I'm not sure if this is the best or even a correct way to go about it. But I hope it helps.
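A minimal sketch of the construction described above, not necessarily the code the commenter used. The function name `build_training`, the `match_within_b` flag, and the sample records are made up for illustration. It assumes the training file is a JSON object with "distinct" and "match" keys, each holding a list of record pairs; the exact serialization may differ between dedupe versions, so compare against a training file written by your own install. Pairs within B are behind a flag because they are only true matches if every record in B refers to the same entity.

```python
import json
from itertools import combinations, product


def build_training(records, label_field="match", match_within_b=False):
    """Build a training dict from records labelled 1 (target), 0 (other) or '' (unknown)."""

    def strip_label(rec):
        # The label column is bookkeeping only; keep it out of the training records.
        return {k: v for k, v in rec.items() if k != label_field}

    set_a = [strip_label(r) for r in records if str(r.get(label_field, "")) == "1"]
    set_b = [strip_label(r) for r in records if str(r.get(label_field, "")) == "0"]

    # "distinct": every record in A paired with every record in B.
    distinct = [list(pair) for pair in product(set_a, set_b)]

    # "match": every unordered pair of different records within A.
    match = [list(pair) for pair in combinations(set_a, 2)]
    if match_within_b:
        # Assumption: only valid when all of B is one and the same entity.
        match += [list(pair) for pair in combinations(set_b, 2)]

    return {"distinct": distinct, "match": match}


# Hypothetical sample data; unknown rows (match == '') are left out of training.
records = [
    {"title": "Paper one", "author": "J. Smith", "match": "1"},
    {"title": "Paper two", "author": "Jane Smith", "match": "1"},
    {"title": "Paper three", "author": "John Smith", "match": "0"},
    {"title": "Paper four", "author": "J Smith", "match": ""},
]

training = build_training(records)
# json.dumps takes care of the double quotes and brackets.
training_json = json.dumps(training, indent=2)
print(training_json)
```

To use it, write `training_json` to a file (for example training.json) and point dedupe or csvdedupe at that file as its training file.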
I've used csvdedupe to try to match up a list of ~77,000 unmapped entries to a master list of ~141,000 known things. It worked and gave me a list of ~30,000 matches.
I've since done a bunch of manual work to check not only the ML mapping from csvdedupe but also mappings from some other sources, so I now have a definitive list of matches that I'd like to feed back to csvdedupe before rerunning it to refine and improve my results. I can't figure out how to do that.
The format of the training.json file seems pretty straightforward, but I can't tell what marks something as a positive or negative match ... or even if that is what that file is about. Can anyone help me?