Change default sample size or change sampling scheme based on Tahamont et al.'s findings #980
Comments
Here is the "replication" code for the paper. I think this should have what you need. @mmcneill developed the code here and will know more details. (I say "replication" in quotes because the underlying data can't be released publicly due to PII, but hopefully this will be helpful.) We're happy to answer any clarification questions on this.
Definitely. And yes, finding public data to do record linkage experiments is a huge challenge. We came across this active learning paper by @tedenamorado which mentions a Brazilian election dataset that has ground truth which could be harnessed for experiments. I believe this is the link to the data, but it's been a while since I looked closely:
For the dataset with 200k rows, changing …
yikes. that's kind of what i would expect.
@zjelveh, @mmcneill, and their co-authors wrote up a nice paper on dedupe with a very interesting finding: increasing the size of the training sample significantly improved the recall of dedupe, even holding the number of labeled pairs constant.
Zubin's twitter thread on the paper: https://twitter.com/zubinjelveh/status/1501978665839734790
unfortunately, the paper suggests you only really get strongly better results if the training sample is 100x larger than the default of 1,500 (for dedupe). a sample size of 150,000 records is going to make the active learning routine very slow.
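for anyone who wants to experiment with this, here is a minimal sketch of bumping the sample size, assuming dedupe 2.x's `prepare_training()` and its `sample_size` keyword (default 1,500). the field definitions and records are placeholders, not data from the paper:

```python
import dedupe

# placeholder field definitions -- a real project would match its own schema
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

# placeholder records; a real run would pass the full dataset here
data_d = {
    0: {"name": "jane doe", "address": "12 main st"},
    1: {"name": "jane dow", "address": "12 main street"},
}

deduper = dedupe.Dedupe(fields)

# the paper's setup: keep the labeling budget fixed, but draw a much larger
# candidate pool (~100x the 1,500 default) for the active learner to pick from
deduper.prepare_training(data_d, sample_size=150_000)

dedupe.console_label(deduper)  # label roughly the same number of pairs as before
deduper.train()
```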
given that the labeling budget is the same, the improved performance from the larger sample must be because the larger sample contains more informative record pairs for the active learner to take advantage of.
i've thought about overhauling our pretty unprincipled sampling scheme for a while #845. It seems possible that a better sampling scheme could achieve Zubin and Melissa's results with a much smaller sample size than 150,000 records.
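one direction, purely as a sketch (not dedupe's current behavior or the paper's method): bias the sample toward pairs that share a cheap blocking key, so that a much smaller pool still contains plenty of near-boundary pairs for the active learner. `block_key`, the field names, and the `blocked_proportion` split below are made up for illustration:

```python
import random
from itertools import combinations


def block_key(record):
    # hypothetical cheap blocking key, e.g. first three characters of the name
    return record["name"][:3].lower()


def informative_sample(records, n_pairs, blocked_proportion=0.9):
    """Sample candidate pairs, favoring pairs that share a blocking key."""
    ids = list(records)

    # group record ids by blocking key
    blocks = {}
    for rid in ids:
        blocks.setdefault(block_key(records[rid]), []).append(rid)

    # all within-block pairs -- these are the "informative" candidates
    blocked_pairs = [
        pair
        for members in blocks.values() if len(members) > 1
        for pair in combinations(members, 2)
    ]

    n_blocked = min(int(n_pairs * blocked_proportion), len(blocked_pairs))
    sample = random.sample(blocked_pairs, n_blocked)

    # top up with uniform random pairs so the learner also sees clear non-matches
    while len(sample) < n_pairs:
        a, b = random.sample(ids, 2)
        sample.append((a, b))

    return sample
```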