Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure e00600 >= e00650 and e01500 >= e01700 #17

Open
MaxGhenis opened this issue Dec 11, 2018 · 6 comments
Open

Ensure e00600 >= e00650 and e01500 >= e01700 #17

MaxGhenis opened this issue Dec 11, 2018 · 6 comments
Assignees

Comments

@MaxGhenis
Copy link
Collaborator

This is a condition to run Tax-Calculator, and as reported by @donboyd5, the current synthesis throws the error:

ValueError: expression "e00600 >= e00650" is not true for every record

Don proposed two approaches:

  1. Compute non-qualified dividends as nqd=e00600 - e00650,
    then synthesize nqd and e00650 independently,
    then compute the total as the sum of the two.
    This is the way the synthpop authors tend to do this, per my email conversation with them.
  1. compute the ratio of qualified to total -- ratio=e00650 / e00600
    synthesize e00600 and ratio independently
    compute e00650 from the two
    I found this to yield higher quality results, but I did not examine it rigorously.

I think we'll need more logic for each of these methods, since the model could still produce cases where nqd < 0 or ratio < 1. In particular, nqd should produce identical outcomes to the current approach of synthesizing e00600 and e00650 separately on average. So we'd probably want to amend these approaches to nqd=max(nqd, 0) (equivalent to e00600=max(e00650, e00600)) or ratio=max(ratio, 1).

That said, ideally the prediction models would figure out this relationship, so it'll be worth re-evaluating after increasing the number of trees (currently 20).

I'd suggest starting with e00600=max(e00650, e00600) as an interim approach.

@donboyd5
Copy link
Owner

That sounds like a fine interim approach to me. We can do that with the synthesized file until you implement something else in the synthesis phase.

Here is a table with the distribution of the 4 variables (nqd and ratio calculated as per above) in the training (full puf) set and the first-cut synthesis. It's not a terribly common problem:

image

One simple way to impose the logic is to use the ratio approach (i.e., predict e00650 and also the ratio e00650 / e00600), fitting and predicting the ratio with CART. Because all of the ratio values computed from the training data will be in [0, 1], the fitted values also will be in [0, 1], and we won't have a problem (that's one reason I used the ratio approach).

@MaxGhenis
Copy link
Collaborator Author

Ah you're right, I suppose tree-based methods should make your approaches work, even if the nqd approach would be equivalent to the current behavior in linear models.

@donboyd5
Copy link
Owner

donboyd5 commented Dec 11, 2018

I've been going through the command-line-interface documentation (https://pslmodels.github.io/Tax-Calculator/) to make sure I prepare the synthesized file properly to run through Tax-Calculator.

Per the documentation, one other variable set raises the same kind of issue as dividends. Tax-Calculator:

also expects that the value of total pension and annuity income (e01500) will be no less than the value of taxable pension and annuity income (e01700) for each filing unit

We can use the same sort of interim approach for this for now, too, adjusting the total (e01500) after synthesis, as needed. I don't think we need a separate issue for this.

The documentation also mentions splitting wages and 2 other variables between prime and spouse. I do that after synthesis, and I think we always will want that to occur in a post synthesis stage (as part of taxdata, normally).

@donboyd5 donboyd5 changed the title Ensure e00600 >= e00650 Ensure e00600 >= e00650 and e01500 >= e01700 Dec 13, 2018
@MaxGhenis MaxGhenis self-assigned this Dec 15, 2018
@MaxGhenis
Copy link
Collaborator Author

MaxGhenis commented Dec 15, 2018

synpuf5 and 6 fix this:

(synth.E00600 < synth.E00650).sum()  # 0
(synth.E01500 < synth.E01700).sum()  # 0

@donboyd5 could you test it out with Tax-Calculator?

FYI these files also use several more seeds for #21 and use 50 rather than 20 trees (runtime near 3 hours on the full set).

@MaxGhenis MaxGhenis assigned donboyd5 and unassigned MaxGhenis Dec 15, 2018
@donboyd5
Copy link
Owner

donboyd5 commented Dec 15, 2018 via email

@donboyd5
Copy link
Owner

synpuf in its entirety runs through Tax-Calculator without a hitch, without needing any adjustments to the variables above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants