What do we think about using calculated variables as X variables? #7
I suppose that any procedure that improves the quality of the synth file is a disclosure risk at some level. The advantage of using the calculated variables as a base is that the correlations with many of them are quite important in getting revenue scores right. For example, getting the correlation of dividends with interest correct is of minimal importance if the correlations with AGI are correct for both of them. By emphasizing the correlations that are important for revenue scoring, this will let us get by with worse correlations on less important pairs. And note that the correlations with AGI, AMTI, etc. are crucial; it isn't just the size of the income amount that determines the importance of getting it right.

I haven't said whether CART should be applied sequentially to the elemental values, or whether each elemental value should be synthesized only from the list of calculated values. The former might be much better - I don't know. It may not be necessary, and the latter would allow much smaller disclosure risk.

We are already discussing quality issues - all the graphs of CDFs of income are only interesting as one way to evaluate quality. I worry that will lead to an endless loop of improving one measure at the expense of others, with no obvious way to decide which is better. What I like about my suggested method is that it provides a univariate comparison of methods, so it is possible to choose an unambiguous winner.
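(Editorial sketch, not project code.) One simple way to operationalize the kind of check described above is to compare each variable's correlation with AGI in the original and synthetic files and rank variables by how far off they are. The sketch assumes pandas DataFrames and uses `c00100` (the Tax-Calculator name for AGI) as a placeholder column name.

```python
import pandas as pd

def corr_with_agi(df: pd.DataFrame, agi_col: str = "c00100") -> pd.Series:
    """Correlation of every other column with AGI."""
    return df.drop(columns=[agi_col]).corrwith(df[agi_col])

def compare_files(puf: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side correlations with AGI for the original and synthetic files,
    sorted so the worst-preserved correlations come first."""
    out = pd.DataFrame({"puf": corr_with_agi(puf), "synth": corr_with_agi(synth)})
    out["abs_diff"] = (out["puf"] - out["synth"]).abs()
    return out.sort_values("abs_diff", ascending=False)
```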
I agree completely that we need to discuss measuring file quality and that we have begun doing so. Just didn't want to muddy this issue (using calculated values as X variables) with that huge question. I've opened issue #8 for that.
Re:

> 1. Synthesize all elemental variables using the calculated variables as a base. Use a mechanical application of RF or CART.

Is this saying to use something like AGI as one of the first X variables, then calculate components of that using CART/RF synthesis? Is this to ensure we're stratifying by AGI correctly, because it's such an important feature?

Calculating calculated variables post-synthesis seems most efficient to me. If the relationship between dividends and AGI is defined as a formula (given covariates), what's the value of checking its correlation? If that correlation is off, it indicates that dividends aren't correlating appropriately to other determinant(s) of AGI, and we can evaluate that directly.
On Mon, 19 Nov 2018, Max Ghenis wrote:

> Re:
>
> 1. Synthesize all elemental variables using the calculated variables as a base. Use a mechanical application of RF or CART.
>
> Is this saying to use something like AGI as one of the first X variables, then calculate components of that using CART/RF synthesis? Is this to ensure we're stratifying by AGI correctly, because it's such an important feature?

Yes, the 27 calculated variables would be used to synthesize the others. This will give CART every chance to get the correlation with AGI right, and also (and just as important) the correlation with marginal tax rate and no-tax status right.

> Calculating calculated variables post-synthesis seems most efficient to me. If the relationship between dividends and AGI is defined as a formula (given covariates), what's the value of checking its correlation?

That doesn't sound right. CART isn't a regression of dividends on AGI plus some noise, is it? The "Classification" piece is critical.

> If that correlation is off, it indicates that dividends aren't correlating appropriately to other determinant(s) of AGI, and we can evaluate that directly.

I am confused by this. If we synthesize AGI and its components, they will not match and the return will not balance. Would we allow that? I assume we will use the tax calculator for calculated values, and I advocate using the taxpayer version of the calculated values in the synthesis.
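(Editorial sketch, not the project's actual code.) To make the proposal concrete, here is a minimal version of sequential CART synthesis with the calculated variables as the X base. It assumes pandas and scikit-learn, placeholder lists `calc_vars` and `elem_vars`, and a fixed synthesis order; a real implementation (e.g. synthpop) makes different choices about ordering, leaf sampling, and categorical handling.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def synthesize_from_calc_vars(puf, calc_vars, elem_vars, seed=0):
    """Sequentially synthesize elemental variables, conditioning each one on the
    calculated variables plus the elemental variables synthesized so far."""
    rng = np.random.default_rng(seed)
    synth = puf[calc_vars].copy()          # calculated variables taken as given (X)
    predictors = list(calc_vars)
    for y in elem_vars:
        tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=seed)
        tree.fit(puf[predictors], puf[y])
        # CART-style synthesis: find each synthetic record's leaf, then draw an
        # observed y value from the PUF donors in that leaf (not a point prediction).
        leaf_train = tree.apply(puf[predictors])
        leaf_synth = tree.apply(synth[predictors])
        donors = pd.Series(puf[y].values).groupby(leaf_train)
        synth[y] = [rng.choice(donors.get_group(leaf).values) for leaf in leaf_synth]
        predictors.append(y)
    return synth
```

The donor-draw step is where the "Classification" point shows up: the synthetic value is sampled from observed values in the matching leaf, not predicted as AGI plus noise.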
I think we might be in agreement but talking past each other (see #4 (comment)). Can we discuss at the end of today's 2pm call?
Recapping our chat:

* We agree that we need to evaluate the result on fidelity, privacy, and computation time.
* I think we agree that seeding the algorithm with the 27 calculated variables will improve fidelity at the expense of privacy, compared to seeding with elemental variables. I think it'll also require more compute, both because we're synthesizing more variables (seeding with 27 elemental variables instead of 27 calculated variables would be 27 fewer elemental variables to synthesize), and because the models will include 27 more X variables.
* We can try both approaches to see how the benefits compare to the costs. In particular, we may want to give extra weight to fidelity between elemental variables and key calculated variables like AGI, which the seeding approach might help with.

I'd also add that evaluating against a holdout will probably show that seeding with such rich data will overfit, relative to seeding with a smaller set of elemental variables.
On Tue, 20 Nov 2018, Max Ghenis wrote:

> Recapping our chat:
>
> * We agree that we need to evaluate the result on fidelity, privacy, and computation time.
> * I think we agree that seeding the algorithm with the 27 calculated variables will improve fidelity at the expense of privacy, compared to seeding with elemental variables. I think it'll also require more compute, both because we're synthesizing more variables (seeding with 27 elemental variables instead of 27 calculated variables would be 27 fewer elemental variables to synthesize), and because the models will include 27 more X variables.
> * We can try both approaches to see how the benefits compare to the costs. In particular, we may want to give extra weight to fidelity between elemental variables and key calculated variables like AGI, which the seeding approach might help with.

I still don't understand. There are 200 variables in the PUF. If we seed with 27 variables, that leaves 163 variables for CART or RF to synthesize, and 27 for TaxBrain to calculate. If we seed with an elemental variable, that leaves 162 to synthesize and 27 for TaxBrain to calculate. Is that the computational difference that worries you? It seems small to me. Or is the problem that synthesis takes longer with more seed variables? That also seems to make for a small difference. The last variable to be synthesized is based on 161 prior variables if we ignore the calculated variables, or 199 if we use them as seeds. Is that the worry?

I do understand that 2^27 is a large number, and may mean that we can't use 27 seed variables, depending on how the synthesis is done. 2^10 is small compared to the number of records though, so we ought to be able to seed with at least 10 variables. We still have to use TaxBrain for all 27 in our released file.

I don't think we can say much about intrusion until we have the distribution of the count of values that come from a common source record.

dan
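(Editorial sketch.) On the intrusion point, one way to compute "the distribution of the count of values that come from a common source record" is sketched below. It assumes, hypothetically, that the synthesis step is instrumented to record, for every synthetic record and every synthesized variable, the index of the PUF record that donated the value; nothing here claims the current code already does this.

```python
import pandas as pd

def common_donor_distribution(donor_ids: pd.DataFrame) -> pd.Series:
    """donor_ids has one row per synthetic record and one column per synthesized
    variable; each cell holds the index of the PUF record that donated the value.
    Returns the distribution of, per synthetic record, the largest number of
    values that came from a single source record."""
    max_from_one_source = donor_ids.apply(
        lambda row: row.value_counts().iloc[0], axis=1
    )
    return max_from_one_source.value_counts().sort_index()
```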
Here's how I'm thinking about it (we could decide to seed with 1 elemental, 27, or some number in between):

[table not preserved in this copy of the thread]

I don't know if there's a theoretical runtime function with CART/RF, but pretty sure it's worse than [...]. How does [...]?
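(Editorial aside on the runtime question, an assumption rather than a measured benchmark.) A standard rough complexity for sort-based tree induction is on the order of p · n · log n for p candidate predictors and n records, with a random forest multiplying that by the number of trees. The sketch turns that assumption into a crude relative cost comparison for sequential synthesis.

```python
import math

def relative_cost(n_records, n_seed_predictors, n_synthesized, n_trees=1):
    """Total cost (arbitrary units) of sequentially synthesizing n_synthesized
    variables when the first regression starts with n_seed_predictors X variables
    and each later regression adds one more predictor."""
    per_record = n_records * math.log(n_records)
    return sum(n_trees * (n_seed_predictors + i) * per_record
               for i in range(n_synthesized))

# Using Dan's counts (27 calculated seeds / 163 regressions vs. 1 elemental seed /
# 162 regressions); the record count is only illustrative and cancels in the ratio:
# relative_cost(160_000, 27, 163) / relative_cost(160_000, 1, 162)  # roughly 1.3
```

Under that rough model, seeding with the 27 calculated variables rather than a single elemental variable raises total training cost by roughly a third, which is consistent with the view above that the difference is modest.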
I am confused by this discussion. Let me elaborate and hopefully someone can straighten me out. First, I'd like to understand the terminology. In our example/assumption: [list not preserved in this copy]

Do we agree that: [statements not preserved in this copy]

Are any of these statements wrong? If so, I misunderstand the issue. If not, let me move to my confusion. My confusion is that I don't understand what we mean by "seed." It first appears in the discussion when @MaxGhenis says,

> I think we agree that seeding the algorithm with the 27 calculated variables will improve fidelity at the expense of privacy, compared to seeding with elemental variables.

It is the "compared to seeding with elemental variables" part that makes me think I don't understand. Let me start with the terminology of synthesis in general, and as it is implemented in synthpop specifically, and ask where "seed" fits in. Normally, in synthpop, we think of each Yi variable as a function of a vector of exogenous non-synthesized variables X, and previous Y variables already synthesized. Thus Y3 = f(Y2, Y1, X), and so on. (That's for estimation purposes. For prediction, we replace the RHS Y variables with Y-hat variables.) In synthpop, X can be null - we can choose to have no exogenous predictors. If so, how do we predict the first Y variable? Synthpop draws randomly from its distribution. Then it estimates Y2 = f(Y1), Y3 = f(Y2, Y1), and so on. That's all very clear.

So when I think about what to do about presyn-calcvars, the question to me is whether we use somewhere between 0 and 27 of these as part of X. They are never part of Y. Similarly, elemental variables are always part of Y - they are never part of X. We always must synthesize every one of them. If X is null (we use 0 presyn-calcvars), then we synthesize Y1 very simply (random draws), but we still synthesize every one of them.

Now, back to the seed discussion. The table above outlines 3 approaches. In the first, of the 173 elemental variables, we have 27 seeds and 146 regressions. I just don't get this. We can't be putting any of the elemental variables into X - they are never exogenous, by definition - so we don't mean they are X variables, so what are they? And how do we manage to need only 146 regressions - if we have 173 elemental variables, don't we need 173 regressions? If we say that some of them are somehow unimportant and can be constructed in simple ways, that seems unrelated to the question of calculated variables.

In the second row of the table, we have 1 elemental seed. That makes me think we're not talking about it as an X variable (which it cannot be), but as the first Y. It is synthesized by random draw, so I guess we could say it is not a regression, and we need only 172 regressions. It's just that I don't understand the terminology, or how, if seeding is a random draw of the first Y, we could have 27 first-Ys in the first row of the table.

Anyway, I hope this explains my confusion. I don't understand 2^seeds either, but that may be because I don't understand what a seed is. I'm sorry, this wouldn't be the first time I misunderstood something fundamental, but if someone could set me straight I'd appreciate it, ideally crosswalking the seed terminology to the synthpop (and synthesis more generally) terminology of X and Y variables.
I've been thinking of "seed" as any non-regression (/CART/RF) way of synthesizing a feature. As @donboyd5 said, [quote not preserved]

In the random forests model, I started with a similar approach, but thought it could benefit from more seeds, since random forests don't do great with just one X. So I sampled with replacement combinations of the features [list not preserved].

This is how I've been interpreting the proposal to seed with 27 calculated variables: slim down the training set to those 27 features, then sample rows with replacement. If the proposal is instead to take those rows as given, without sampling, it's the same idea of seeding the regressions with a minimum of 27 X variables (I'd prefer sampling with replacement as it seems more synthetic to me, but this might not matter much if we're using such a rich starting point).

So basically: [remainder not preserved]
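(Editorial sketch, pandas assumed; `calc_vars` is a placeholder for the 27 calculated variables.) The "sample rows with replacement" seeding described above can be stated compactly:

```python
import pandas as pd

def bootstrap_seed(puf: pd.DataFrame, calc_vars, n=None, seed=0) -> pd.DataFrame:
    """Start the synthetic file by resampling whole rows of the calculated
    variables with replacement, instead of taking them as given."""
    n = len(puf) if n is None else n
    return (puf[calc_vars]
            .sample(n=n, replace=True, random_state=seed)
            .reset_index(drop=True))
```

The resampled block could then serve as the X base for the sequential synthesis sketched earlier, in place of taking the calculated variables as given.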
Thanks, @MaxGhenis, that clears it up for me. I think the proposed idea for discussion probably was to take the 27 calculated variables (or a subset) as given rather than to sample with replacement, although I don't have a strong opinion on which is better conceptually or on how much difference it would make in practice.

I have long thought it makes sense to have MARS and s006 in X (I had been thinking we'd include them in X as given, but maybe sampled would be good), on the theory that each would give us a strong result for file quality (when judging weighted totals and distributions), and that neither should entail disclosure risk. age_head makes sense to me for the same reason. I think XTOT could, too, but @feenberg believes that SOI might think it could create disclosure risk.
In issue #4 @feenberg said: [quote not preserved in this copy]
I'd like to focus on the first two steps, which are about synthesis procedure, and let's not focus (in this issue) on steps 3 and 4, which are about file quality evaluation.
I had thought about steps 1 and 2 but had not done it that way. I was concerned, maybe erroneously, about including too much "actual" RHS information. But as I think about it now, maybe it makes a lot of sense. I'm curious to see what @MaxGhenis thinks.
I guess it shouldn't create any new disclosure risk (if we assume that PUF records are a disclosability concern)? Maybe it's just an empirical question: we should take a look and see how well it does.