Determine how to assign weights #9
The third of these approaches is quite interesting but I'm not sure how to operationalize it unless we can define what makes for a good set of weights, beyond satisfying targets, because there will be a huge number of sets of weights that can satisfy targets. If we can operationalize it, we can start to examine it empirically early on, using data files @MaxGhenis has already created - a problem that is large enough to be interesting, and small enough to keep us from worrying about computer resources. If we work out an analytic approach, we can easily apply it to more-sophisticated synthesized files, and compare to other approaches. But let's see if we can operationalize it. Let's assume we have 2 data files:
In round numbers, both files have 22k observations and 60 variables (including a weight). We want to compare file-quality measures, such as weighted wages or AGI by income range, against the values in the actual file (presumed to be correct), for 3 approaches to the weights:
We know from experience that approach 2 virtually always will be easily solvable, even with equality constraints for hundreds of targets. That is, there are many - perhaps thousands - of sets of weights that will satisfy the constraints exactly. (We are finding 22k unknown variables that satisfy a few hundred constraints; many different sets of values can do this.) Approach 2 chooses the set of weights that is closest to the synthesized weights. Now on to approach 3: any set of weights that hits all of the constraints exactly will minimize an objective function based solely on the distance between targets and weighted values on the file, so how do we know which one to choose? To make it more concrete, suppose we want to satisfy 9 targets constructed from the actual file:
In approach 2, we would choose 22k new weights that satisfy these 9 constraints and also minimize an objective function that penalizes change from the synthesized weights. One simple objective function we could use to decide which constraint-satisfying set of weights is best is the sum, over the 22k records, of the squared difference from 1 of the ratio of new weight to synthesized weight. Formally: objective = sum over i of (x[i] - 1)^2, where each x[i] is the ratio of the new weight to the synthesized weight on record i. There would be 9 constraints, as defined above. This is a nonlinear program rather than an LP, but the idea is essentially the same as what @MaxGhenis discussed. In practice we might use something slightly more complex.

In approach 3, what would we do? We might set the objective function up as the sum of squared differences between each constraint's calculated value and its target value -- something like: choose new weights w to minimize, over the 9 targets indexed by j, objective = sum over j of (sum over the 22k records i of w[i] * x[i, j] - target[j])^2, where x[i, j] is record i's value for target j. (There are obviously scale issues in defining this objective function. We might scale the calculations so that each constraint is in [0, 1], or in some other way, but that's a next step after we figure out how to set up the problem.) As I understand how @MaxGhenis put this forward, there wouldn't be any constraints - in approach 3, we would just minimize this objective function.

Obviously the objective function is minimized when all constraints are exactly satisfied, so the optimal solution to approach 2 (by its own definition of optimality) would also minimize the objective function in approach 3. But many other sets of new weights would minimize it, too - for example, any of the constraint-satisfying solutions that the NLP solver may have iterated through before it found the solution that minimized the approach-2 objective. If the goal is to select weights that are better than those in approach 2 in some way, then either we need to add some sort of constraints (I am not sure what) or somehow add a measure of "good" weights to the objective function. Is there some better definition of a good weight, other than one that is close to the synthesized weight? Would we rather have equal weights for all records, or something else? Even if we were to have hundreds of targets rather than 9 (and we probably would), we would almost certainly end up in this situation.

@MaxGhenis, can you elaborate on how we would implement this approach? I think we would need to somehow define what good weights would be, and incorporate that in the objective function. (And after doing that, it is not clear to me why it would be better to have the targets incorporated into the objective rather than set out separately as constraints - the former requires us to deal with scale issues and possibly assign relative importance to targets, and the latter does not.)
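Here is a minimal sketch, assuming scipy and toy stand-in data (the record count, target matrix, and weights are made up for illustration, not the project's actual files), of how approaches 2 and 3 could be set up side by side. Approach 2 keeps the new weights close to the synthesized weights subject to exact target constraints; approach 3 drops the constraints and folds the targets into the objective.

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

rng = np.random.default_rng(0)
n_records, n_targets = 200, 9            # stand-ins for 22k records and 9 targets

A = rng.lognormal(size=(n_targets, n_records))    # A[j, i]: record i's value for target j
w_syn = rng.uniform(0.5, 2.0, n_records)          # synthesized weights
targets = A @ (w_syn * rng.uniform(0.9, 1.1, n_records))  # pretend "actual" targets

# Approach 2: minimize sum over records of (w[i] / w_syn[i] - 1)^2
# subject to A @ w == targets (equality constraints on every target).
def obj2(w):
    x = w / w_syn                     # ratio of new weight to synthesized weight
    return np.sum((x - 1.0) ** 2)

res2 = minimize(obj2, w_syn, method="trust-constr",
                constraints=[LinearConstraint(A, targets, targets)])

# Approach 3: no constraints; minimize the sum over targets of squared
# (relative) differences between weighted values and targets.
def obj3(w):
    rel_dev = (A @ w - targets) / targets
    return np.sum(rel_dev ** 2)

res3 = minimize(obj3, w_syn, method="L-BFGS-B")

print("approach 2 max target deviation:", np.max(np.abs(A @ res2.x - targets)))
print("approach 3 max target deviation:", np.max(np.abs(A @ res3.x - targets)))
```

As the discussion above suggests, once many weight vectors hit every target exactly, approach 3's objective alone cannot distinguish among them; a term like obj2 (or some other definition of "good" weights) would have to be added to break the tie.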
Thanks for formalizing this @donboyd5; your example objective function is exactly what I had in mind. As you say, we should also rescale, and probably weight the targets subjectively if we care about something like AGI more than something like the number of people under age 13. There could also be loose constraints, like ensuring positive weights for all records and keeping each individual target from deviating too far (though including squared deviations in the objective function should get us far here). It'd be great to get to the point where we have to choose among multiple sets of weights that each satisfy the targets perfectly, but I don't think we'll get there unless we synthesize many more records than the original PUF or include fewer targets than we should. I'd expect the objective function to include hundreds if not thousands of targets - counts within crosstabs, averages, quantiles, etc. - so it will be hard to hit all of them well (just as the original PUF misses on some targets).
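As a follow-on, here is a sketch of the rescaled, subjectively weighted objective with a loose positivity bound on the weights. It again uses toy arrays; A, targets, w_syn, and the importance values are illustrative assumptions, not real targets.

```python
import numpy as np
from scipy.optimize import minimize, Bounds

rng = np.random.default_rng(1)
n_records, n_targets = 200, 9
A = rng.lognormal(size=(n_targets, n_records))        # record values per target (toy)
w_syn = rng.uniform(0.5, 2.0, n_records)              # synthesized weights (toy)
targets = (A @ w_syn) * rng.uniform(0.95, 1.05, n_targets)  # toy targets

importance = np.ones(n_targets)    # subjective target weights
importance[0] = 10.0               # hypothetical: care 10x more about, say, an AGI target

def weighted_obj(w):
    rel_dev = (A @ w - targets) / targets    # rescale each target to a relative deviation
    return np.sum(importance * rel_dev ** 2)

res = minimize(weighted_obj, w_syn, method="L-BFGS-B",
               bounds=Bounds(1e-9, np.inf))  # loose constraint: keep all weights positive
```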
BTW, ideally the targets in approach 3 are the same as those defined for other parts of the problem, like those from approach 2; we're really just reshaping the problem from constraints into an objective function. This ties to @feenberg's concern, raised yesterday, that weighting records to hit an objective function and then evaluating the records on the same objective function is unfair. This could justify a couple of modifications:
Here are some findings from @donboyd5 with respect to the initial test synthesis file, which synthesized s006 as a non-seed variable:

The first table below looks at the 3 10% sample files. The columns are the number of records, the sum of s006 (didn't bother to divide by 100), and the unweighted sum of wages. Obviously the sum of the weight comes in well below either the training or test amount; I checked in Excel to make sure I didn't have some odd error reading the file. The unweighted sum of wages also is quite far from test and train, but of course it's very early in the process. The second table repeats this for the full puf and synthesized version, to make sure it is not an artifact of the sample. The 3rd and 4th tables show quantiles of s006 in the 10% sample and full files, respectively. The extremes are not far off, but the middles are. I am going to guess this is related to the sequence of fitting and synthesis. In a future run, it might be worth forcing s006 to be an X variable (carried over to the synthesized file as is) or making it one of your randomly sampled seed variables.

@feenberg also found that "the mean e00200 was $186,979." I'll create another file with
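For reference, a sketch of the kind of comparison described above - unweighted record counts, sums of s006 and wages (e00200), and quantiles of s006 across the training, test, and synthesized files. The file names and paths are hypothetical placeholders.

```python
import pandas as pd

files = {"train": "train_10pct.csv", "test": "test_10pct.csv", "synth": "synth_10pct.csv"}
dfs = {name: pd.read_csv(path) for name, path in files.items()}

# Counts and unweighted sums, one column per file.
summary = pd.DataFrame({
    name: {"n_records": len(df),
           "sum_s006": df["s006"].sum(),
           "sum_e00200": df["e00200"].sum()}
    for name, df in dfs.items()
})
print(summary)

# Quantiles of the weight variable, to see where the distributions diverge.
quantiles = pd.DataFrame({
    name: df["s006"].quantile([0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0])
    for name, df in dfs.items()
})
print(quantiles)
```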
On Tue, 11 Dec 2018, Max Ghenis wrote:
Here are some findings from @donboyd5 with respect to the initial test synthesis file, which synthesized s006 as a non-seed variable:
I have a program that scores 35 or so plausible tax reforms with the PUF and another file. If the alternate file is just the PUF rounded to 2 digits, the scores are very close. I'd like to try the synth file again, but the first draft gave scores that were not good. I'll try again with the next version.

Dan
That's great, Dan. It will be great to see it after you've got it to your satisfaction.

Don
Moving this to #8.
All CDF comparisons we've looked at so far conflate two factors:
So far we've been synthesizing the weight like any other feature: @donboyd5 has been making it one of the first variables in the synthpuf sequence, while I've been making it the last in sequential random forests. As the most important feature, it may deserve special treatment.
Per the PUF handbook:
We have a few options:
There may be others. IMO we should consider separating this problem from the problem of record synthesis, which should be evaluated on record-level similarity.
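One way to make that separation concrete, as a sketch only (the DataFrame and column names are hypothetical): evaluate record synthesis with unweighted per-variable distribution distances, and evaluate the weights with weighted totals against targets.

```python
from scipy.stats import ks_2samp

def record_level_similarity(df_true, df_synth, variables):
    """Unweighted Kolmogorov-Smirnov distance per variable: evaluates synthesis quality only."""
    return {v: ks_2samp(df_true[v], df_synth[v]).statistic for v in variables}

def weighting_quality(df_synth, weight_col, target_totals):
    """Relative error of weighted totals versus targets: evaluates the weights only."""
    w = df_synth[weight_col]
    return {v: (df_synth[v] * w).sum() / total - 1.0 for v, total in target_totals.items()}
```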