Which file, exactly, should we synthesize, and what is the right order of operations? #11

donboyd5 · 2018-11-21T15:51:26Z

We closed issue #4, which was about what to do with calculated variables. We agreed to synthesize "elemental" variables (from which other variables can be calculated) and run the results through Tax-Calculator to get the calculated variables, giving us properly balanced tax returns. We did not resolve whether to use the calculated variables already on the pre-synthesis PUF as right-hand-side X variables in the synthesis and then discard them (because we want the final calculated variables to be computed from the synthesized elemental variables). We agreed there are probably conceptual pros and cons to this approach, and that we will be open-minded and empirical about it.

However, I think there is another issue that came up in issue #4. @andersonfrailey said:

I think we need to be careful about synthesizing only the enhanced PUF that we use in Tax-Calculator. Many of our enhancements come after we've augmented the PUF with the CPS file and I worry that trying to synthesize the PUF after we've augmented it will negatively affect our results.

I think he meant that we need to think about what file, exactly, we want to synthesize. In other words, what is the right order of operations?

Approach A
One possible ordering is:

1. Synthesize the raw PUF obtained from SOI.
2. Augment the synthesized PUF via statistical match in a fashion similar to current augmentation (add nonfilers from CPS, add selected CPS variables).
3. Enhance the augmented-synthesized file by imputing itemizers, pension contributions, and the prime-spouse wage split (among other enhancements), producing a final releasable file (based on current processes).

Approach B
Another possible ordering is:

1. Augment the PUF from SOI via statistical match in a fashion similar to current augmentation (add nonfilers from CPS, add selected CPS variables).
2. Enhance the augmented PUF by imputing itemizers, pension contributions, and the prime-spouse wage split (among other enhancements), producing a final NON-releasable file (based on current processes).
3. Synthesize the enhanced-augmented PUF to produce a releasable file.
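
The two orderings can be sketched as the same three steps composed in a different order. This is only an illustration: `synthesize`, `augment`, and `enhance` are hypothetical placeholders that record the order in which they were applied, not functions from Tax-Calculator or from any existing file-creation code.

```python
from typing import Callable, List

# For illustration, a "file" is just the list of steps applied to it so far.
File = List[str]
Step = Callable[[File], File]


def make_step(name: str) -> Step:
    """Build a placeholder step that appends its name to the file's history."""
    def step(history: File) -> File:
        return history + [name]
    return step


# Hypothetical stand-ins for the three stages discussed above.
synthesize = make_step("synthesize")
augment = make_step("augment")
enhance = make_step("enhance")


def run(steps: List[Step], start: File) -> File:
    """Apply the steps in order and return the resulting history."""
    out = list(start)
    for step in steps:
        out = step(out)
    return out


APPROACH_A = [synthesize, augment, enhance]
APPROACH_B = [augment, enhance, synthesize]


def synthesis_position(history: File) -> int:
    """Position in the history at which synthesis happened."""
    return history.index("synthesize")


raw_puf = ["raw PUF"]
file_a = run(APPROACH_A, raw_puf)
file_b = run(APPROACH_B, raw_puf)

print(file_a)  # synthesis is the first step after the raw PUF
print(file_b)  # synthesis is the last step
```

Under Approach A every intermediate file after the first step is already synthetic, while under Approach B only the final output is.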

@andersonfrailey, is this the kind of question you were getting at? And if so, am I correct in interpreting your comment as saying that the first approach - Approach A, synthesize before we augment and enhance - makes more sense?

If so, can you (and all of us) elaborate on the pros and cons of the two approaches (or of alternatives)? The first approach does seem to me to have a lot of advantages:

- The synthesis task is smaller.
- We don't have to try to synthesize variables for which we may not need to worry about confidentiality (although there probably are ways around this).

I do have one question, probably for @andersonfrailey: If we do Approach A, will we have all of the needed variables on the file after stage 1 (synthesis of raw PUF) to run the synthesized raw PUF through Tax-Calculator to get calculated variables, so that we can examine file quality long before we start the statistical match process? (I believe so.)

One possible downside of the first approach is that we would start with a synthesized file early in the file creation process. Thus, we would not automatically create a "gold standard" fully merged file (actual PUF merged with CPS) unless we did another step.

Anyway, I think it would be good to discuss this.

To help me think about this, I finally did something I should have done a long time ago: I outlined the full PUF-based file creation process. The results are here, in case anyone else finds them useful (and @andersonfrailey, if you see anything you don't think is right, I would much appreciate a heads-up).

MaxGhenis · 2018-11-26T06:05:35Z

I'll cast a vote for Approach A for these reasons:

1. It would allow us to separate the enhancement logic from the PUF synthesis more cleanly. As long as all enhancement techniques can take any PUF-like file as input, this repo's scope can be limited to creating a synthetic version of the raw PUF. That would lend itself to cleaner project management and more modular code.
2. To the extent that any enhancements include logic to ensure correct totals, we might have to redo that logic if we synthesize those features. For example, synthesizing imputed benefits would be unlikely to reproduce the correct total participation counts that C-TAM yields now. As a result we'd probably have to add all imputations to the weight adjustment, either as pieces of the optimization function or as constraints.
3. Computational simplicity, as you note.
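
To make the point about correct totals concrete, here is a toy sketch of the simplest possible case: ratio-adjusting weights so that one weighted participation count hits a target. The records, weights, and target are made up for illustration, and the function names are hypothetical. The actual weight adjustment is an optimization over many targets, so this shows only the one-constraint idea.

```python
from typing import List


def weighted_count(weights: List[float], participates: List[bool]) -> float:
    """Sum of weights over records flagged as participants."""
    return sum(w for w, p in zip(weights, participates) if p)


def ratio_adjust(weights: List[float], participates: List[bool],
                 target: float) -> List[float]:
    """Scale participants' weights so their weighted count equals target.

    Non-participants' weights are left unchanged.
    """
    current = weighted_count(weights, participates)
    if current == 0:
        raise ValueError("no participants to scale")
    factor = target / current
    return [w * factor if p else w for w, p in zip(weights, participates)]


# Made-up example: four records, two of them benefit participants.
weights = [100.0, 200.0, 300.0, 400.0]
participates = [True, False, True, False]
target = 500.0  # hypothetical administrative participation total

new_weights = ratio_adjust(weights, participates, target)
print(weighted_count(weights, participates))      # weighted count before
print(weighted_count(new_weights, participates))  # weighted count after
```

Hitting this one target changes the total weight on the file, which is one reason each additional synthesized imputation would need to enter the adjustment jointly with the other targets rather than being fixed one at a time.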

MaxGhenis mentioned this issue on Nov 26, 2018, in "Rename repo and project" #10 (open).
