
[FEAT] Remove need for target_rows arg from estimate_u_using_random_sampling, use convergence instead #1060

Open
NickCrews opened this issue Feb 16, 2023 · 3 comments
@NickCrews
Contributor

NickCrews commented Feb 16, 2023

Is your proposal related to a problem?

It would be nice to not have to supply this hyperparameter.

Describe the solution you'd like

Similar to EM training, which iterates to convergence, we could run the random sampling on iteratively larger batches until the u params converge.

perhaps steps of

  • 10e5
  • 5e6
  • 10e6
  • 5e7
  • 10e7
  • etc

If I understand correctly, the runtime of the initial runs will be dwarfed by the runtime of the final run, similar to iterative deepening, so it shouldn't take much longer than if you had divined the ideal number of samples from the beginning.

IDK if you would still want some sort of convergence criterion, so that during exploratory runs you would stop iterating sooner, and during final training you would train longer. But this criterion of "percentage change since last run" might be more intuitive than "number of rows", because I would want to be able to say "ehh, during my exploratory phase I just care about being within 5% of the ideal" and set the convergence criterion to ~.05 or something.
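
To make that concrete, the loop I'm imagining is roughly the sketch below. `estimate_u_from_sample` is just a stand-in for whatever Splink does internally to estimate u values from a sample of a given size (not an existing function), and the fixed growth factor is a simplification of the step sizes listed above.

```python
# Rough sketch only. `estimate_u_from_sample` is a hypothetical callable:
# n_rows -> {comparison_level: u}, standing in for splink's internal sampling.

def estimate_u_until_converged(
    estimate_u_from_sample,
    rel_tol=0.05,        # "within 5% of the ideal"
    start_rows=int(1e6),
    growth=5,
    max_rows=int(1e8),
):
    n_rows = start_rows
    prev = estimate_u_from_sample(n_rows)
    while n_rows * growth <= max_rows:
        n_rows *= growth
        curr = estimate_u_from_sample(n_rows)
        # Largest relative change in any u value since the previous run
        worst = max(
            (abs(curr[k] - prev[k]) / prev[k] for k in curr if prev[k] > 0),
            default=float("inf"),
        )
        if worst < rel_tol:
            return curr
        prev = curr
    return prev
```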

@NickCrews NickCrews added the enhancement label Feb 16, 2023
@RobinL
Member

RobinL commented Feb 28, 2023

A few thoughts on this:

  • To some extent, it is possible to 'directly estimate' what target rows needs to be from the cardinality of the input columns. For example, if the input column with the highest cardinality has just 100 distinct values, target rows could be relatively small.
  • Or, another way of saying the same thing, is that many of the u values can be estimated fairly precisely from a relatively small target rows. When we're setting target rows high, that's usually because we need a precise estimate of a very low probability event (e.g. two people having the same full name).
  • That does suggest that, at the cost of additional code complexity, you could probably speed the code up by running estimates separately for different comparison columns (high target rows for full name, low target rows for month of birth)

Overall I'm not sure. The strategy you suggest definitely sounds like a sensible way of working out what target rows should be. It might be faster to run some analysis on the cardinality of the columns to go 'straight to' the answer. The exact algorithm for estimating this would take a bit of thought, but it should be possible.
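
For illustration, a 'straight to the answer' heuristic might look something like the sketch below. `suggest_target_rows` is a made-up name, the threshold of 100 expected events is arbitrary, and it treats target rows as roughly the number of sampled pairwise comparisons, which glosses over exactly how splink defines the parameter.

```python
import pandas as pd

def suggest_target_rows(df: pd.DataFrame, columns, min_expected_events=100):
    """Heuristic only: under a roughly equal distribution, the exact-match u
    value for a column is about 1 / cardinality, so the highest-cardinality
    column gives the rarest event we need to observe.  Ask for enough sampled
    pairs that we'd still expect to see that event min_expected_events times.
    """
    max_cardinality = max(df[col].nunique(dropna=True) for col in columns)
    rarest_u = 1 / max_cardinality
    return int(min_expected_events / rarest_u)

# e.g. if full_name has 50,000 distinct values, this suggests ~5 million pairs:
# suggest_target_rows(df, ["full_name", "month_of_birth"])
```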

A final consideration is that in practice a lot of users are constrained by their compute. So they end up using the highest value of target rows that they can tolerate from a compute-time perspective, rather than the ideal value from a precision-of-parameter-estimation point of view.

Does that make sense? Curious if you have any further thoughts in light of the above

@NickCrews
Contributor Author

Those are great thoughts, thanks for the explanation.

I think I have a vague sense of how cardinality could be used to estimate how many rows we need, but I'm nervous that any algorithm we come up with isn't going to be principled enough, and there is going to be some edge case that doesn't work for someone and then they get a bad estimation.

A final consideration is that in practice a lot of users are constrained by their compute.

Hmm, that's good to know. I don't see any way around that, though. If we are trying to capture a very rare event, I think you just have to try a lot of samples. We can't somehow "stack the deck" so that the event happens more frequently, because we can't correct the odds for "stacking the deck" without knowing the base probability, which is what we're trying to find! IDK, maybe I'm not thinking about that correctly.

I think the best we can do is reduce number of samples/compute time for the comparisons which don't need many samples (like month of birth, as you say). I think the additional code complexity would be worth it so we can find u values for each Comparison separately. Some comparisons, like levenshtein, will have a much larger performance penalty than == checking, so say someone is using levenshtein on month-of-birth (OK that's a bad example, but ...), then reducing the number of samples for that column will have a really big effect.

Convergence based on uncertainty

I'm thinking more on convergence criteria, and it might be principled to treat it like a counting experiment:

If you take S samples, and T of them are true, then the rate is of course r=T/S. But the interesting thing is that the uncertainty of T is sqrt(T), so the uncertainty in r is sqrt(T)/S (source: page 245 of this book)

For our scenario, for one comparison level: say we sample P pairs, and our level is true T times, then our estimate for u is u=T/P and the uncertainty in u is sigma_u=sqrt(T)/P. But, since we are using u as a multiplier, what we really care about is not the absolute uncertainty in u, but the fractional uncertainty, E=sigma_u/u. This is merely E=sqrt(T)/T=1/sqrt(T). Solving for T we get T=1/E^2. So for example if we wanted our fractional uncertainty in u to be less than .1 ("68% (one sigma) of the time, our prediction is within 10% of its true value"), then we would need T=1/.1^2=100 true samples.

So we could keep random sampling in batches of say 10k, with either of two stopping criteria: Either we get 100 True samples, or in the case of a really rare event, we may never see 100, so there is also a max_pairs parameter
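
As a sketch: `count_true_in_batch` here is a hypothetical callable that samples `batch_size` random pairs and returns how many of them fall in the comparison level; it's not an existing splink function.

```python
import math

def estimate_u_for_level(count_true_in_batch, batch_size=10_000,
                         target_true=100, max_pairs=int(1e8)):
    total_pairs = 0
    total_true = 0
    # Stop when we've seen enough true pairs for the desired precision,
    # or when we hit the max_pairs budget for very rare events.
    while total_true < target_true and total_pairs < max_pairs:
        total_true += count_true_in_batch(batch_size)
        total_pairs += batch_size

    u = total_true / total_pairs
    # Fractional uncertainty E = 1/sqrt(T); undefined if we never saw the event
    fractional_uncertainty = (
        1 / math.sqrt(total_true) if total_true else float("inf")
    )
    return u, fractional_uncertainty

# target_true=100 corresponds to E = 1/sqrt(100) = 0.1, i.e. ~10% fractional
# uncertainty, per the counting argument above.
```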

@RobinL
Member

RobinL commented Mar 1, 2023

The counting idea makes sense and I think has the potential to make the calculation quite a bit faster.

One challenge here is how the algorithm would work in Spark (or other distributed systems).

In Spark, if a single large query parallelises well, then this is often faster than an ostensibly more efficient algorithm that requires several queries. Often it just requires quite a bit of testing to work out what in practice works better.

Another aspect to this that's worth keeping in mind is that there's (arguably) less need to optimise the methods in Splink that are used for training. At least in our use, models are trained once and used many times, so the performance of predict() is more important than the performance of the various estimate...() methods.

One other idea we've had for improving speed is that the exact match u values can be calculated without taking the cartesian product. The idea is that if a column has a cardinality of 12 with an equal distribution, the u value is 1/12. If the distribution is unequal, the computation is a little more complex, but still doesn't require the cartesian product to be computed. So you could possibly get a speed up by calculating this directly. We've never done this because you still need the other u values (e.g. for a levenshtein match). And these are more tricky than they initially appear, because they are 'conditional' in the sense that you're looking for 'not an exact match but a levenshtein match <2', as opposed to just 'a levenshtein match <2'.
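
For the unequal-distribution case, the direct calculation would be something like the sketch below. It ignores the small correction for pairs that are genuine matches and any sampling-without-replacement effects.

```python
from collections import Counter

def exact_match_u(values):
    """Probability that two randomly chosen records agree exactly on this
    column: the sum of squared value proportions."""
    counts = Counter(values)
    n = sum(counts.values())
    return sum((c / n) ** 2 for c in counts.values())

# An equal distribution over 12 values recovers the 1/12 case:
# exact_match_u(list(range(12)) * 1000)  # -> ~0.0833
```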
