-
Notifications
You must be signed in to change notification settings - Fork 551
Home
Dedupe helps when you are not sure whether two records are about the same thing or not. As different research communities encountered this problem they each gave it a new name -- deduplication, author disambiguation, entity resolution, record linkage, and others -- but, ultimately, its all about trying to figure out what records are about the same thing.
We can break down our problem into two parts. First we need to figure out some rule that we can use to decide when two records are about the same thing or not. Ideally, we want a rule that always puts together records that belong together and always separates records that are about different things. Usually, we don't have enough information to do that, so then we would like a rule that does as good a job as possible and indicates how uncertain we should be about whether two records really belong together.
Second, because good rules are often very complicated and take time to execute, we cannot compare each record to every other record because it would take much too long. We have to figure out how to make only a small fraction of possible comparisons, yet have those comparisons be the right ones.
The approach that we take is to find good solutions to both of these problems using machine learning. The work is based upon the synthesizing 2006 dissertation of Mikhail Bilenko, /Learnable Similarity Functions and Their Application to Record Linkage and Clustering./
A typical process follows:
-
The user supplies a random sample of the dataset she wants to deduplicate. She also provides a 'data model' that declares what kind of data each field is -- 'String', 'Lat-Long', 'Name', etc.
Depending upon the field type, the program will use an appropriate distance measure. Right now, we have only implemented the 'String' type which uses an affine-gap Levenshtein distance. Other field types and distances are easily added. -
Using active learning, the program provides a pair of records for the user to label as referring to the same thing, different things, or unclear. For each pair in the sample, the program calculates a a distance for each field based upon the data model. Using labeled examples, a regularized logistic classifier finds weights for those field distances so that the weighted sum of these distances
represents the uncertainty of whether the pairs are duplicates or distinct.After a user has labeled an example, this example is added to the training set. Weights are relearned, and
the record that the program is now most unsure of is given to the user to label. -
Once the user has finished training the classifier, the training data is used again to find a good blocking of the data. Blocking is about finding a set of features which true duplicates tend to share and which distinct records do not, and then only comparing those records that share that feature. For example, we might split the data into blocks that all have the same first five characters in an address field. Since most of the data do not share this feature, we will avoid making most comparisons.
We would like to find a way to block the data so that we have an opportunity to compare every true duplicate, but make as few comparisons of true distinct records as possible. Unfortunately, the only way to find the guaranteed optimal blocking is through exhaustive search. Since we would like to consider an large number of possible ways of blocking, we use a greedy algorithm, particularly Chvatal's Greedy Set-Cover algorithm.
-
At the end of training we learned a classification rule for comparing records and a good blocking procedure for deciding which fraction of comparisons to make. We can then apply these to our full dataset. First we define the blocks, which is usually a O(n) operation. Then, based on the blocks, produce the set of candidate record pairs to classify. Once we have that set, we could easily split it into chunks and distribute the relatively expensive comparisons across many machines.
-
The classification gives us a score for each candidate pair that represents how certain we are that a pair is a duplicate pair or distinct. However, more than two records can refer to the same thing, and we want to compare these pairwise scores into duplicate clusters. Right now we do this with a fast implementation of agglomerative hierarchical clustering, which is O(n^2) where n is the number of candidate pairs that have a duplicate certainty score above some threshold. At some point, this may get to slow, but since a small fraction of records are duplicates, this algorithm will likely work for very large data.
The program selects the pair that it is currently least sure of. Once the user has labeled the pair, the pair is added to a set of training data and the program relearns a set of weights to discriminate between duplicate and distinct pairs. The learner is a regularized logistic regression, and