Treatment of weights in Records object constructor, Part 1 #1439

Closed
hdoupe opened this issue Jun 20, 2017 · 7 comments

@hdoupe
Collaborator

hdoupe commented Jun 20, 2017

I'm opening this issue in response to PR #1429.

When a subsample of the PUF file is passed to a Records object, I'm a bit confused as to why we scale the weights by the full sample size divided by the subsample size. It would make more sense to me to scale the weights by the sum of the weights in the full sample divided by the sum of the weights in the subsample.

# we are doing this:
wt = wt * (N_fullsample / N_subsample)
# but why don't we do it this way
wt = wt * (sum(fullsample_weights) / sum(subsample_weights))

I feel like I'm missing something. I'd appreciate any thoughts that you all may have on this.

cc @martinholmer
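
To make the two candidate rules concrete, here is a minimal sketch in plain Python. It is not the Records constructor code: the function names, the made-up uniform weights, the 2% subsample size, and the seed are all assumptions chosen for illustration.

```python
import random


def scale_by_counts(sub_weights, n_full):
    """Scale subsample weights by N_fullsample / N_subsample."""
    factor = n_full / len(sub_weights)
    return [w * factor for w in sub_weights]


def scale_by_weight_sums(sub_weights, full_weight_sum):
    """Scale subsample weights by sum(fullsample_weights) / sum(subsample_weights)."""
    factor = full_weight_sum / sum(sub_weights)
    return [w * factor for w in sub_weights]


rng = random.Random(12345)  # arbitrary seed (assumption)
# Hypothetical full-sample weights; real PUF weights are not used here.
full_weights = [rng.uniform(50.0, 500.0) for _ in range(10_000)]
# Unweighted 2% subsample: every record is equally likely to be picked.
sub_weights = rng.sample(full_weights, 200)

by_counts = scale_by_counts(sub_weights, len(full_weights))
by_sums = scale_by_weight_sums(sub_weights, sum(full_weights))

print("full-sample weight total:      ", round(sum(full_weights)))
print("total after count scaling:     ", round(sum(by_counts)))
print("total after weight-sum scaling:", round(sum(by_sums)))
```

By construction, the weight-sum rule reproduces the full-sample weight total (up to floating-point rounding), while the count rule lands near it but generally not on it.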

@martinholmer
Collaborator

Thanks for opening an issue on this matter. I'll be back from Canada next week and I'll study this.

@feenberg
Contributor

feenberg commented Jun 20, 2017 via email

@hdoupe
Collaborator Author

hdoupe commented Jun 21, 2017

@feenberg Thanks for the reply.

> Same in expected value terms, but more accurate.

That makes sense, but more accurate in what way?

@martinholmer
Collaborator

@hdoupe said:

> @feenberg Thanks for the reply.
>
> > Same in expected value terms, but more accurate.
>
> That makes sense, but more accurate in what way?

I think the point @feenberg is making is that, when picking an unweighted sample, scaling up the sample weights in these two ways produces the same expected value of the sum of the scaled-up sample weights. However, using the scale-up factor you suggest (the ratio of the weights) produces a sum of the scaled-up weights that has a lower variance than using a scale-up factor that is simply the ratio of the unweighted counts (as in the current code). By lower variance, I mean that when the random-number seed changes, the sum of the scaled-up weights will vary less (when using the ratio of the weights for the scale-up factor).

Independently, scaling up the sample weights using the ratio of the weights (instead of the ratio of the raw counts) is the correct approach when picking a weighted sample.

Does that make sense?
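
The variance point can be checked with a small simulation. This is only a sketch under stated assumptions: the weights are made-up uniform draws, `scaled_totals` is a hypothetical helper, and the subsample is redrawn with 200 different seeds to see how the scaled-up weight total moves from seed to seed.

```python
import random
import statistics


def scaled_totals(full_weights, subfrac, seed):
    """Return the scaled-up weight total under both rules for one random subsample."""
    rng = random.Random(seed)
    n_sub = int(len(full_weights) * subfrac)
    sub = rng.sample(full_weights, n_sub)  # unweighted subsample
    count_factor = len(full_weights) / n_sub      # ratio of raw counts
    weight_factor = sum(full_weights) / sum(sub)  # ratio of weight sums
    return sum(sub) * count_factor, sum(sub) * weight_factor


data_rng = random.Random(0)  # arbitrary seed for the made-up data (assumption)
full_weights = [data_rng.uniform(50.0, 500.0) for _ in range(5_000)]
true_total = sum(full_weights)

# Redraw the 2% subsample under 200 different seeds.
results = [scaled_totals(full_weights, 0.02, seed) for seed in range(200)]
count_totals = [r[0] for r in results]
weight_totals = [r[1] for r in results]

print("true weight total:            ", round(true_total))
print("count rule:  mean of totals   ", round(statistics.mean(count_totals)))
print("count rule:  std dev of totals", round(statistics.pstdev(count_totals)))
print("weight rule: mean of totals   ", round(statistics.mean(weight_totals)))
print("weight rule: std dev of totals", round(statistics.pstdev(weight_totals), 6))
```

Under the count rule the total is right on average but changes with the seed; under the weight-sum rule the total is pinned to the full-sample total for every seed, so its seed-to-seed variation is zero apart from rounding.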

martinholmer changed the title from "Treatment of weights in Records object constructor" to "Treatment of weights in Records object constructor, Part 1" on Jun 27, 2017

@hdoupe
Collaborator Author

hdoupe commented Jun 27, 2017

@martinholmer Let me make sure that I have this right. We are trying to choose an estimator for the factor that will be used to scale up the weights. For an unweighted sample, we expect this factor to be 1/subfrac. While both the ratio of the weights and the ratio of the sample sizes are the same in expected value, the ratio of the sample sizes has a lower variance. Thus, it is a better choice. That makes sense to me.

For the weighted sample, the ratio of the weights becomes a better estimator for the scaling factor. In my experience with PR #1429, the ratio of the weights was closer to 4% when subfrac was set to 2%. So I think this makes sense to me, too.
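
For the weighted case, the sketch below shows why the weight share of a subsample can differ noticeably from `subfrac`. It assumes that "weighted sample" means records are drawn without replacement with probability proportional to their weight, which may not match what PR #1429 does. The skewed lognormal weights and the `weighted_subsample` helper are hypothetical, and the numbers it prints are not meant to reproduce the 4% figure above.

```python
import random


def weighted_subsample(weights, n_sub, rng):
    """Draw n_sub distinct records with probability proportional to weight.

    Uses Efraimidis-Spirakis keys: give each record the key u ** (1 / w),
    with u uniform on [0, 1), and keep the n_sub largest keys.
    """
    keyed = [(rng.random() ** (1.0 / w), w) for w in weights]
    keyed.sort(reverse=True)
    return [w for _, w in keyed[:n_sub]]


rng = random.Random(2017)  # arbitrary seed (assumption)
# Hypothetical skewed weights: a few records carry much more weight than most.
full_weights = [rng.lognormvariate(5.0, 1.0) for _ in range(10_000)]
subfrac = 0.02
n_sub = int(len(full_weights) * subfrac)

sub_weights = weighted_subsample(full_weights, n_sub, rng)
share = sum(sub_weights) / sum(full_weights)  # weight share of the subsample

count_factor = len(full_weights) / n_sub  # equals 1 / subfrac
weight_factor = 1.0 / share               # ratio of weight sums

print("subfrac:                    ", subfrac)
print("weight share of subsample:  ", round(share, 4))
print("count-based scale factor:   ", round(count_factor, 2))
print("weight-based scale factor:  ", round(weight_factor, 2))
```

Because heavy records are over-represented in such a draw, the subsample holds more than `subfrac` of the total weight, so scaling by `1/subfrac` would overshoot the full-sample weight total while the ratio of weight sums would not.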

@martinholmer
Collaborator

@hdoupe said in issue #1439:

> Let me make sure that I have this right. We are trying to choose an estimator for the factor that will be used to scale up the weights. For an unweighted sample, we expect this factor to be 1/subfrac. While both the ratio of the weights and the ratio of the sample sizes are the same in expected value, the ratio of the sample sizes has a lower variance. Thus, it is a better choice. That makes sense to me.
>
> For the weighted sample, the ratio of the weights becomes a better estimator for the scaling factor. In my experience with PR #1429, the ratio of the weights was closer to 4% when subfrac was set to 2%. So I think this makes sense to me, too.

This is the way I understand these matters. So, yes, I think you "have this right."

Next question is whether you think pull request #1441 is sensible.

@hdoupe
Collaborator Author

hdoupe commented Jun 27, 2017

@martinholmer Ok, thanks for helping me understand this.

hdoupe closed this as completed on Jun 27, 2017