How to create training file #91

cviebrock · 2019-06-19T18:53:32Z

I've used csvdedupe to try to match a list of ~77,000 unmapped entries against a master list of ~141,000 known entries. It worked and produced a list of ~30,000 matches.

I've since done a lot of manual work to check not only the ML mapping from csvdedupe but also mappings from some other sources, so I now have a definitive list of matches that I'd like to feed back to csvdedupe before rerunning it to refine and improve my results. I can't figure out how to do that.

The format of the training.json file seems pretty straightforward, but I can't tell what marks something as a positive or negative match ... or even whether that is what the file is for. Can anyone help me?

zambrana98 · 2019-10-29T18:01:50Z

Noob user here. I had the same question, but then figured out the structure: first there's a "distinct" set with all pairs known to be distinct, and then a "match" set with all pairs you've identified as referring to the same thing.
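For illustration, here is a minimal sketch of that layout, built in Python and serialized with json. The record fields (title, year) are made up, and the "__class__"/"__value__" wrapper is simply the one that shows up in the training file I generated; I'm assuming it matches what your dedupe version expects, and other versions may serialize pairs differently.

```python
import json

# Hypothetical records; the field names are placeholders.
rec_a = {"title": "Paper one", "year": "2001"}
rec_b = {"title": "Paper 1", "year": "2001"}
rec_c = {"title": "Unrelated paper", "year": "1987"}


def tagged_pair(left, right):
    """Wrap two records in the tagged-tuple form seen in the training file."""
    return {"__class__": "tuple", "__value__": [left, right]}


training = {
    "distinct": [tagged_pair(rec_a, rec_c)],  # pairs known to be different entities
    "match": [tagged_pair(rec_a, rec_b)],     # pairs known to be the same entity
}

training_json = json.dumps(training, indent=2)
print(training_json)
```

So a pair's label comes only from which of the two lists it sits in; there is no per-pair flag.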

I have a list of publications by people with similar names and my goal is to identify those authored by one particular person, so I need that one cluster to be good and I don't care much about deduplicating other authors. I have a list of publications authored by the correct person and a list of some that are definitely not by her.

Here's the training file I created: say set A is the list of publications of the person I am interested in, and set B is the set of those not by her. The "distinct" part is the Cartesian product of A and B, and the "match" part is the set of pairs {a[i], a[j]} where a[i] and a[j] belong to A and i != j, together with the analogous set of pairs for elements of B. The rest is making sure that the file has the correct format (double quotes instead of single quotes, the correct brackets, etc.). I'll add the code below.

The code I used is for a dataset with one variable called "match" that identifies the person I am interested in. The values are strings: match=='1' means it is the correct person, match=='0' means it is someone else, and match=='' means we don't know (these are the cases I want dedupe to help me with). Like I said, noob user here, so I'm not sure if this is the best or even correct way to go about it. But I hope it helps.
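The code assumes data_d is a dict of records keyed by record ID. As a rough sketch of how such a dict could be built from a CSV, assuming hypothetical column names (Id, title, match) and an in-memory stand-in for the real file:

```python
import csv
import io

# Stand-in for an open CSV file; the column names are hypothetical.
sample_csv = io.StringIO(
    "Id,title,match\n"
    "1,Paper one,1\n"
    "2,Paper 1,1\n"
    "3,Unrelated paper,0\n"
    "4,Unknown paper,\n"
)

data_d = {}
for row in csv.DictReader(sample_csv):
    record_id = row.pop("Id")
    # csv yields strings, so 'match' is '1', '0' or '' (unknown)
    data_d[record_id] = dict(row)

print(len(data_d), "records loaded")
```

Because the csv module returns every value as a string, the comparisons against '0' and '1' need to be string comparisons.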

import itertools
import json

# data_d maps record IDs to record dicts; training_file is the output path.
matched_0 = [rec for rec in data_d.values() if rec['match'] == '0']
matched_1 = [rec for rec in data_d.values() if rec['match'] == '1']

def as_tuple(pair):
    # Each pair is stored as a tagged tuple in the training file.
    return {"__class__": "tuple", "__value__": list(pair)}

# "distinct": every record of someone else paired with every record of the right person.
distinct = [as_tuple(p) for p in itertools.product(matched_0, matched_1)]

# "match": every unordered pair within each group (each pair written once).
# Note: pairing up matched_0 assumes those records all belong to one other person.
matches = [as_tuple(p)
           for group in (matched_0, matched_1)
           for p in itertools.combinations(group, 2)]

# json.dump takes care of quoting and brackets, and copes with empty lists.
with open(training_file, "w") as f:
    json.dump({"distinct": distinct, "match": matches}, f)
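Once the file is written, it may be worth reading it back to confirm it is valid JSON and to see how many pairs ended up under each key. A minimal sketch, assuming the tagged-tuple layout used above; summarize_training and the demo records are names invented for this example:

```python
import json
import os
import tempfile


def summarize_training(path):
    """Load a training file and return the number of pairs under each key."""
    with open(path) as f:
        training = json.load(f)  # raises json.JSONDecodeError if malformed
    counts = {}
    for key in ("distinct", "match"):
        pairs = training.get(key, [])
        for pair in pairs:
            # Each entry should wrap exactly two records.
            assert len(pair["__value__"]) == 2, f"bad pair under {key!r}"
        counts[key] = len(pairs)
    return counts


# Demo on a tiny hand-written file with hypothetical records.
demo = {
    "distinct": [{"__class__": "tuple", "__value__": [{"id": "1"}, {"id": "2"}]}],
    "match": [],
}
demo_path = os.path.join(tempfile.mkdtemp(), "training.json")
with open(demo_path, "w") as f:
    json.dump(demo, f)

demo_counts = summarize_training(demo_path)
print(demo_counts)
```

With n records in A and m in B, the counts should come out as n*m distinct pairs and n*(n-1)/2 + m*(m-1)/2 match pairs if each pair is written once.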
