Treatment of weights in Records object constructor, Part 1 #1439
@hdoupe opened this issue on Tue, 20 Jun 2017:

I'm opening this issue in response to PR #1429.

In the case where a subsample of the PUF file is passed to a Records object, I'm a bit confused as to why we scale the weights by the full sample size divided by the subsample size. It would make more sense to me to scale the weights by the sum of the weights in the full sample divided by the sum of the weights in the subsample.

# we are doing this:
wt = wt * (N_fullsample / N_subsample)
# but why don't we do it this way?
wt = wt * (sum(fullsample_weights) / sum(subsample_weights))

I feel like I'm missing something. I'd appreciate any thoughts that you all may have on this.

cc @martinholmer

@feenberg replied:

Thanks for opening an issue on this matter. I'll be back from Canada next week and I'll study this.

Same in expected value terms, but more accurate.

dan
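A quick expectation check of that claim (an editor's sketch, assuming the subsample is a simple random, unweighted draw of $n$ of the $N$ records, with $w_i$ denoting a record weight): each record is included with probability $n/N$, so

$$
\mathbb{E}\Big[\sum_{i \in \text{sub}} w_i\Big] = \frac{n}{N} \sum_{i \in \text{full}} w_i
\qquad\Longrightarrow\qquad
\mathbb{E}\Big[\frac{N}{n} \sum_{i \in \text{sub}} w_i\Big] = \sum_{i \in \text{full}} w_i,
$$

while the weight-ratio factor reproduces the full-sample total identically, not just in expectation:

$$
\frac{\sum_{j \in \text{full}} w_j}{\sum_{j \in \text{sub}} w_j} \sum_{i \in \text{sub}} w_i = \sum_{i \in \text{full}} w_i .
$$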
@hdoupe replied:

@feenberg Thanks for the reply. That makes sense, but more accurate in what way?
@martinholmer replied:

I think the point @feenberg is making is that, when picking an unweighted sample, scaling up the sample weights in these two ways produces the same expected value for the sum of the scaled-up sample weights. But using the scale-up factor you suggest (the ratio of the weight sums) produces a sum of scaled-up weights with lower variance than using a scale-up factor that is simply the ratio of the unweighted counts (as in the current code). By lower variance, I mean that when the random-number seed changes, the sum of the scaled-up weights varies less when the ratio of the weight sums is used as the scale-up factor.

Independently, scaling up the sample weights using the ratio of the weight sums (instead of the ratio of the raw counts) is the correct approach when picking a weighted sample. Does that make sense?
@hdoupe replied:

@martinholmer Let me make sure that I have this right. We are trying to choose an estimator for the factor that will be used to scale up the weights. For an unweighted sample, we expect this factor to be N_fullsample / N_subsample. For a weighted sample, the ratio of the weight sums becomes a better estimator of the scaling factor. In my experience with PR #1429, the ratio of the weight sums was closer to 4% when …
@martinholmer replied:

This is the way I understand these matters. So, yes, I think you "have this right." The next question is whether you think pull request #1441 is sensible.
@hdoupe replied:

@martinholmer OK, thanks for helping me understand this.