TuningGuide
The first time you run Duke, you may want to start it like this:
java no.priv.garshol.duke.Duke --showmatches config.xml
This will show you the matches produced by Duke as they are found, and will give you a chance to tweak the configuration until the Duke records and the matches between them start to make sense.
Because of the use of Bayes's theorem, 0.5 (or 50%) is the base probability. Before the records are compared, this is the probability they start at, and if any value is set to 50% it will have no effect at all. When probabilities from different properties are combined, probabilities above 50% push the overall likelihood up toward 100%, while probabilities below 50% push it down toward 0%. As the probability gets further away from 50%, stronger probabilities are needed to move it further.
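The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard naive Bayes rule for combining two independent probabilities; the function names `combine` and `combine_all` are hypothetical and are not part of Duke.

```python
def combine(p1: float, p2: float) -> float:
    """Combine two match probabilities with the naive Bayes rule.

    Assumption: the two pieces of evidence are treated as independent.
    Note that combining 1.0 with 0.0 divides by zero, one more reason
    to avoid those extreme values.
    """
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def combine_all(probabilities) -> float:
    """Start at the base probability 0.5 and fold in each property."""
    overall = 0.5
    for p in probabilities:
        overall = combine(overall, p)
    return overall


# 0.5 is neutral: it leaves the other probability unchanged.
neutral = combine(0.5, 0.8)

# Two pieces of evidence above 0.5 reinforce each other.
reinforced = combine(0.8, 0.8)

# Equally strong evidence for and against cancels out to 0.5.
cancelled = combine(0.8, 0.2)

print(neutral, reinforced, cancelled)
```

Here `neutral` stays at 0.8, `reinforced` rises to roughly 0.94, and `cancelled` falls back to 0.5.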
In general, be careful about the probabilities 0.0 and 1.0. Both of these claim absolute certainty, but there is really no such thing with real-world data. Even probabilities like 0.01 and 0.99 are probably too extreme in most cases.
When setting a probability, try to picture what the odds really are. For example, let's say two people have the same name. Exactly the same name. What are the odds it's the same person? It will depend on the context, but in reality, the odds are likely to be below 95%. Similarly, if two people have the same zip code, the odds that they are the same person are probably not higher than 51%, given how many people have the same zip code.
Note that the high and low probabilities are not necessarily symmetric. If two people have the same email address, the odds they're the same person are pretty good. Let's say 85%. However, if the addresses are different, that doesn't mean the chances they're the same are as low as 15%. Most people have more than one email address, so if one source has the home email and the other the work email you've basically learned nothing. So for different email addresses 45% is more reasonable.
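To see why the asymmetry matters, here is a small sketch that again assumes the naive Bayes combination rule. The 0.85 and 0.45 values are the email probabilities from the paragraph above; the 0.9 for a matching name is a hypothetical value chosen only for illustration, and the sketch simplifies by treating each property as either an exact match or a mismatch.

```python
def combine(p1: float, p2: float) -> float:
    """Naive Bayes combination of two probabilities (assumed rule)."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


NAME_MATCH = 0.9    # hypothetical probability for identical names
EMAIL_HIGH = 0.85   # same email address
EMAIL_LOW = 0.45    # different email addresses (asymmetric low)
SYMMETRIC_LOW = 0.15  # what a naive "1 - high" setting would give

same_email = combine(NAME_MATCH, EMAIL_HIGH)
different_email = combine(NAME_MATCH, EMAIL_LOW)
overly_harsh = combine(NAME_MATCH, SYMMETRIC_LOW)

print(f"same name, same email:            {same_email:.3f}")
print(f"same name, different email (0.45): {different_email:.3f}")
print(f"same name, different email (0.15): {overly_harsh:.3f}")
```

With the low value at 0.45, a differing email address only nudges the result down (to about 0.88), whereas a symmetric 0.15 would drag it to about 0.61 on evidence that says very little.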
Once you have something that's beginning to make sense you may want to make a test file. This is a file that tells Duke which matches are correct and which are incorrect. Basically it takes the following form:
+,id1,id2,0.97
-,id1,id3,0.85
+,id4,id2,0.92
Lines beginning with + indicate correct matches; those beginning with - indicate incorrect matches.
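If you want to inspect or post-process a test file yourself, the format is simple enough to read with a few lines of Python. This is a sketch, not Duke's own parser: the `Link` type and `parse_test_file` function are hypothetical names, and the fourth field is assumed to be the confidence of the match.

```python
import csv
from typing import Iterable, List, NamedTuple


class Link(NamedTuple):
    correct: bool      # True for "+" lines, False for "-" lines
    id1: str
    id2: str
    confidence: float  # assumed meaning of the fourth field


def parse_test_file(lines: Iterable[str]) -> List[Link]:
    """Parse lines of the form '+,id1,id2,0.97' into Link tuples."""
    links = []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        sign, id1, id2, confidence = row
        if sign not in ("+", "-"):
            raise ValueError(f"unexpected marker: {sign!r}")
        links.append(Link(sign == "+", id1, id2, float(confidence)))
    return links


sample = ["+,id1,id2,0.97", "-,id1,id3,0.85", "+,id4,id2,0.92"]
links = parse_test_file(sample)
for link in links:
    print(link)
```

To read a real file, pass an open file object instead of the `sample` list, for example `parse_test_file(open("test.txt", newline=""))`.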
You can get Duke to help you create a test file. Generally, it's best to do this for only a subset of your data, since manually checking every match can take a long time. You may also want to lower the threshold for this run, to make sure that you find all matches. Run Duke like this to have it ask you whether each match is correct and write the test file for you:
java no.priv.garshol.duke.Duke --interactive --linkfile=test.txt --showmatches config.xml
Running Duke again with
java no.priv.garshol.duke.Duke --testfile=test.txt --testdebug config.xml
will display only matches which are incorrect or unknown. At the end you also get statistics on the quality of the matching.
This should help you see the effect of configuration changes. It also makes the results easier to work with, since only new results (i.e. those not in the test file) are displayed, which removes a lot of cognitive noise.
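The idea of scoring a run against a test file can be sketched as follows. This is an illustration only: it assumes that pairs are unordered and that quality is summarised as precision and recall, which may differ from the exact statistics Duke prints. The `pair` and `evaluate` names are hypothetical.

```python
def pair(id1, id2):
    """Treat a match as an unordered pair of record IDs (assumption)."""
    return frozenset((id1, id2))


def evaluate(found, test_links):
    """Compare found matches with a test file.

    found:      iterable of (id1, id2) tuples produced by a run
    test_links: iterable of (is_correct, id1, id2) from the test file
    """
    found = {pair(a, b) for a, b in found}
    known = {pair(a, b): ok for ok, a, b in test_links}

    correct = {p for p in found if known.get(p) is True}
    wrong = {p for p in found if known.get(p) is False}
    unknown = {p for p in found if p not in known}  # new results to review
    missed = {p for p, ok in known.items() if ok and p not in found}

    judged = len(correct) + len(wrong)
    expected = len(correct) + len(missed)
    return {
        "correct": len(correct),
        "wrong": len(wrong),
        "unknown": len(unknown),
        "missed": len(missed),
        "precision": len(correct) / judged if judged else 0.0,
        "recall": len(correct) / expected if expected else 0.0,
    }


test_links = [(True, "id1", "id2"), (False, "id1", "id3"), (True, "id4", "id2")]
found = [("id2", "id1"), ("id1", "id3"), ("id5", "id6")]
stats = evaluate(found, test_links)
print(stats)
```

In this example one found match is correct, one is known to be wrong, one is unknown and would need manual review, and one correct match from the test file was missed, so precision and recall are both 0.5.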
If you want to know why two records compared the way they did you can use this command:
java no.priv.garshol.duke.DebugCompare <xmlfile> <id1> <id2>
It will show the similarity scores for each property and the overall probability computed from them.
Note that for DebugCompare to work you must have a Lucene index on disk (use the <path> element in the config file), so that DebugCompare has an index in which to look the records up.
If you find it hard to build a configuration yourself, you can try using the GeneticAlgorithm. If you have a test file, you can use that as input to the algorithm. If not, you can use active learning and have the algorithm ask you questions instead.