[FEAT] Remove need for `target_rows` arg from `estimate_u_using_random_sampling`, use convergence instead #1060
Comments
A few thoughts on this:
Overall I'm not sure. The strategy you suggest definitely sounds like a sensible way of working out what target rows should be. It might be faster to run some analysis on the cardinality of the columns to go 'straight to' the answer. The exact algorithm for estimating this would take a bit of thought, but it should be possible. A final consideration is that in practice a lot of users are constrained by their compute, so they end up using the highest value of target rows which they can tolerate from a compute-time perspective, rather than the ideal value from a precision-of-parameter-estimation point of view. Does that make sense? Curious if you have any further thoughts in light of the above.
Those are great thoughts, thanks for the explanation. I think I have a vague sense of how cardinality could be used to estimate how many rows we need, but I'm nervous that any algorithm we come up with isn't going to be principled enough, and there is going to be some edge case that doesn't work for someone and then they get a bad estimation.
Hmm, that's good to know. I don't see any way around that, though. If we are trying to capture a very rare event, I think you just have to try a lot of samples. We can't somehow "stack the deck" so that the event happens more frequently, because we can't correct the odds for "stacking the deck" without knowing the base probability, which is what we're trying to find! IDK, maybe I'm not thinking of that correctly.

I think the best we can do is reduce the number of samples/compute time for the comparisons which don't need many samples (like month of birth, as you say). I think the additional code complexity would be worth it so we can find u values for each Comparison separately. Some comparisons, like Levenshtein, will have a much larger performance penalty than == checking, so say someone is using Levenshtein on month-of-birth (OK, that's a bad example, but...), then reducing the number of samples for that column will have a really big effect.

**Convergence based on uncertainty**

I'm thinking more on convergence criteria, and it might be principled to treat this like a counting experiment: if you take S samples, and T of them are true, then the rate is of course r = T/S. But the interesting thing is that the uncertainty of T is sqrt(T), so the uncertainty in r is sqrt(T)/S (source: page 245 of this book).

For our scenario, for one comparison level: say we sample P pairs, and our level is true T times. Then our estimate for u is u = T/P and the uncertainty in u is sigma_u = sqrt(T)/P. But since we are using u as a multiplier, what we really care about is not the absolute uncertainty in u, but the fractional uncertainty, E = sigma_u/u. This is merely E = sqrt(T)/T = 1/sqrt(T). Solving for T we get T = 1/E^2. So, for example, if we wanted our fractional uncertainty in u to be less than 0.1 ("68% (one sigma) of the time, our prediction is within 10% of its true value"), then we would need T = 1/0.1^2 = 100 true samples.
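To make the T = 1/E^2 relation concrete, here's a minimal sketch (the function names are illustrative, not Splink API):

```python
import math

def required_true_count(target_fractional_error: float) -> int:
    """Number of 'true' samples T needed so that the fractional
    uncertainty E = 1/sqrt(T) is at most the target."""
    return math.ceil(1.0 / target_fractional_error**2)

def fractional_uncertainty(true_count: int) -> float:
    """E = sigma_u / u = sqrt(T)/T = 1/sqrt(T) for a Poisson-like
    counting experiment with T observed true samples."""
    return 1.0 / math.sqrt(true_count)
```

So a 10% target gives `required_true_count(0.1) == 100`, and a 5% target already requires 400 true samples, which shows how quickly the cost grows as you tighten E.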
So we could keep random sampling in batches of, say, 10k, with either of two stopping criteria: either we get 100 true samples, or, in the case of a really rare event where we may never see 100, a max_pairs parameter caps the total.
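The two-criterion stopping rule could be sketched roughly like this; `is_match` and `sample_pair` are stand-in callables for illustration, not Splink's actual internals, which operate on SQL tables rather than Python callbacks:

```python
import random

def estimate_u_by_batches(is_match, sample_pair, batch_size=10_000,
                          required_trues=100, max_pairs=1_000_000):
    """Sample pairs in batches until we have seen `required_trues`
    level-true pairs, or until `max_pairs` pairs have been drawn."""
    trues = total = 0
    while trues < required_trues and total < max_pairs:
        for _ in range(batch_size):
            if is_match(sample_pair()):
                trues += 1
        total += batch_size
    return trues / total  # estimate of u

# Hypothetical usage: simulate a comparison level that is true
# for about 1 pair in 50.
rng = random.Random(42)
u_hat = estimate_u_by_batches(
    is_match=lambda x: x < 0.02,
    sample_pair=lambda: rng.random(),
)
```

With a true rate of 0.02, the first 10k batch should already yield well over 100 trues, so sampling stops after one batch; for a much rarer event, `max_pairs` is what stops the loop.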
The counting idea makes sense and I think has the potential to make the calculation quite a bit faster. One challenge here is how the algorithm would work in Spark (or other distributed systems). In Spark, if a single large query parallelises well, then this is often faster than an ostensibly more efficient algorithm that requires several queries. Often it just requires quite a bit of testing to work out what works better in practice.

Another aspect worth keeping in mind is that there's (arguably) less need to optimise the methods in Splink that are used for training. At least in our use, models are trained once and used many times, so the performance of training is less critical.

One other idea we've had for improving speed is that the exact match u values can be calculated without taking the cartesian product. The idea is that if a column has cardinality of 12 with an equal distribution, the u value is 1/12. If the distribution is unequal, the computation is a little more complex, but still doesn't require the cartesian product to be computed. So you could possibly get a speed up by calculating this directly. We've never done this because you still need the other u values (e.g. for a Levenshtein match). And these are more tricky than they initially appear, because they are 'conditional' in the sense that you're looking for 'not an exact match but a Levenshtein match < 2', as opposed to just 'a Levenshtein match < 2'.
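The exact-match shortcut described above can be sketched from value frequencies alone: the probability that two randomly chosen records agree on a column is the sum of the squared value shares (this uses the with-replacement approximation; the function name is illustrative, not Splink API):

```python
from collections import Counter

def exact_match_u(values):
    """Probability that two randomly chosen records share a value on
    this column: u = sum over values v of p_v^2, where p_v is the
    share of records with value v. No cartesian product required."""
    n = len(values)
    counts = Counter(values)
    return sum((c / n) ** 2 for c in counts.values())
```

For 12 equally common values this gives exactly 1/12, matching the comment above; for a skewed column like three "a"s and one "b" it gives (3/4)^2 + (1/4)^2 = 0.625, showing how an unequal distribution pushes u up.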
**Is your proposal related to a problem?**
It would be nice to not have to supply this hyperparameter.
**Describe the solution you'd like**
Similar to EM training, which iterates to convergence, we could run the random sampling on iteratively larger batches until the u params converge
perhaps steps of
If I understand correctly, the runtime of the initial runs will be dwarfed by the runtime of the final run, similar to iterative deepening, so it shouldn't take much longer than if you had divined the ideal number of samples from the beginning.
IDK if you would still want some sort of convergence criterion, so that during exploratory runs you would stop iterating sooner, and during final training you would train longer. But this criterion of "percentage change since last run" might be more intuitive than "number of rows", because I would want to be able to say "ehh, during my exploratory phase I just care about being within 5% of the ideal" and set the convergence criterion to ~.03 or something.
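The iterative-deepening idea with a relative-change stopping rule could look roughly like this; `is_match`, `sample_pair`, and `rel_tol` are hypothetical names for illustration, not Splink's actual interface:

```python
import random

def estimate_u_until_converged(is_match, sample_pair,
                               initial_pairs=10_000, rel_tol=0.03,
                               max_pairs=10_000_000):
    """Keep doubling the cumulative sample until the u estimate
    changes by less than rel_tol between rounds (or max_pairs is
    reached). Returns (u_estimate, total_pairs_sampled)."""
    trues = total = 0
    n_next = initial_pairs
    prev = None
    while total < max_pairs:
        for _ in range(n_next):
            if is_match(sample_pair()):
                trues += 1
        total += n_next
        u = trues / total
        if prev is not None and prev > 0 and abs(u - prev) / prev < rel_tol:
            return u, total
        prev = u
        n_next = total  # double the cumulative sample each round
    return trues / total, total

# Hypothetical usage: a level that is true for ~1 pair in 10.
rng = random.Random(0)
u_hat, n_used = estimate_u_until_converged(
    is_match=lambda x: x < 0.1,
    sample_pair=lambda: rng.random(),
)
```

Because each round doubles the cumulative sample, the total work is at most twice the final round's, which matches the iterative-deepening intuition in the paragraph above.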