
Which variables should we seed on? #37

Open · MaxGhenis opened this issue Mar 7, 2019 · 26 comments

Comments

@MaxGhenis (Collaborator)

We've seen that, in general, the more seeds used in the synthesis, the higher-fidelity the result, at the expense of privacy. More precisely, the relationship probably comes down to how uniquely identifiable records are when restricted to the seed variables.

For example, the only difference between the green and red bars here is that the green adds several more seeds:
[Figure: comparison of syntheses with (green) and without (red) the additional seeds]

Furthermore, even calculated seeds (which are dropped after the synthesis to be recalculated with Tax-Calculator) produce this relationship. The green bar above used calculated seeds.

Another data point supporting this is synthpop8, which used 9 calculated seeds ('E00100', 'E04600', 'P04470', 'E04800', 'E62100', 'E05800', 'E08800', 'E59560', 'E26190') that together uniquely identified over 80% of records. Each row in this synthesis exactly matched a training record, indicating we need to use far fewer seeds.

While we shouldn't use too many, we may also care especially about these calculated features, which could justify seeding on them rather than on some other raw feature. Whether this approach improves the validity of calculated features like AGI is an empirical question we haven't tested, but it seems like a reasonable hypothesis.

Selecting the seeds is therefore one of the most important decisions in the synthesis process. I'd suggest a few factors to consider in this decision:

  1. Prioritizing categorical features. This leaves the synthesis process to deal only with continuous measures. So, for example, we'd want to prioritize MARS.
  2. Prioritizing logically "initial" features. For example, XTOT, nu18, MARS, etc. are features of the household that logically precede income and deduction measures. This feeds into the question of visit sequence.
  3. Prioritizing the most important features. This could be critical calculated features like AGI, or the most important features in determining those critical calculated features.

Regarding (3): I ran a random forest model to determine the importance of each "raw" feature in predicting the 9 calculated features in synthpop8. Here are the top 5, according to their average rank in predicting those 9 (a sketch of the ranking approach follows the figure below):

  1. E00200 (salaries and wages): most important for predicting E26190 (non-passive income) and E59560 (earned income for EIC).
  2. E18400 (SALT): most important for E05800 (income tax before credit), E08800 (income tax after credits), and P04470 (total deductions).
  3. S006 (weight): most important for E04800 (taxable income), E05800 (taxbc), and E08800 (taxac).
  4. E02000 (Schedule E): most important for E26190 (non-passive income).
  5. P23250 (long-term gains less losses): most important for E00100 (AGI), E04800 (taxable income), and E62100 (alternative minimum taxable income).

[Figure: random forest importances of raw features for each of the 9 calculated features]
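
For reference, here's a minimal sketch of that ranking approach, assuming scikit-learn; the column handling is a simplification (the real raw list covers all the raw features):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

CALCULATED = ['E00100', 'E04600', 'P04470', 'E04800', 'E62100',
              'E05800', 'E08800', 'E59560', 'E26190']
puf = pd.read_csv('~/puf2011.csv')
RAW = [c for c in puf.columns if c not in CALCULATED]  # simplification

# Rank every raw feature by RF importance for each calculated target,
# then average the ranks across the nine targets.
ranks = {}
for target in CALCULATED:
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(puf[RAW], puf[target])
    ranks[target] = pd.Series(rf.feature_importances_,
                              index=RAW).rank(ascending=False)
avg_rank = pd.DataFrame(ranks).mean(axis=1).sort_values()
print(avg_rank.head())  # top raw features by average rank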

Together these 5 features uniquely identify 61% of PUF records, so we'd probably still want only a subset, especially if we add something like MARS and XTOT. But I suspect these will be valuable and avoid the extra complexity of seeding on calculated features (it also makes for a simpler story to SOI that we're only using 65 features).

import pandas as pd

# Share of PUF records that are unique on the five candidate seed features.
FEATURES = ['E00200', 'E18400', 'S006', 'E02000', 'P23250']
(~pd.read_csv('~/puf2011.csv', usecols=FEATURES).duplicated(keep=False)).mean()
# 0.6131326698821662

@feenberg (Collaborator) commented Mar 7, 2019 via email

@feenberg (Collaborator) commented Mar 7, 2019

What does it mean that "5 features uniquely identify 61% of PUF records"? Does it mean an exact match on 5 continuous variables, or something less?

@MaxGhenis (Collaborator, Author)

> if the synthesis process amounts to "find a record with the same values as the seeds, and call that the synthetic record" then it isn't synthesizing at all.

This isn't how the synthesis works in general, but it is how it works when there's no conditional variance in the synthesized features. If you have a tree-based model trained on data where all records with x=2 and y=3 also have z=1, and you pass it a record where x=2 and y=3, that model may assign 100% probability to the z=1 scenario. Depending on how strong the pattern is, even models that do more to fight overfitting, like random forests, can still assign that 100% probability. That seems to be what's happening here, and it indicates we need to increase the conditional variance by reducing the conditions (seeds).
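
A toy illustration of that failure mode (scikit-learn here, not our actual synthesis code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: every record with x=2 and y=3 also has z=1.
X = np.array([[2, 3], [2, 3], [2, 3], [1, 1], [1, 2], [3, 1]])
z = np.array([1, 1, 1, 0, 0, 0])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, z)
# Nearly every tree puts x=2, y=3 in a pure z=1 leaf, so the predicted
# probability of z=1 approaches 1: no conditional variance to sample from.
print(rf.predict_proba([[2, 3]]))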

> What does it mean that "5 features uniquely identify 61% of PUF records"? Does it mean an exact match on 5 continuous variables, or something less?

Right: restricting the PUF to ['E00200', 'E18400', 'S006', 'E02000', 'P23250'] produces a dataset in which 61% of records are unique (this calculation doesn't involve the synthetic data at all).

@feenberg (Collaborator) commented Mar 7, 2019 via email

@MaxGhenis (Collaborator, Author)

I'm not really surprised, but it depends on the variable; some are finer-grained than others, and some correlate more strongly with others.

> There is also the oddity that the revenue scores are so poor if all the matches are exact. Is it only the weights that are off?

It's probably mostly the weights. Are you using @donboyd5's revised linear-programmed weights? It could also be that it's not the same records with the same representation; all synthetic records are exactly present in the true PUF, but I haven't checked whether the reverse is true.

@feenberg (Collaborator) commented Mar 7, 2019 via email

@donboyd5 (Owner) commented Mar 7, 2019

Thanks, Max, this is great - a lot of impressive detective work. It gives us lots to talk about tomorrow.

I created a Google doc named selected_MARS3group_puf_synthpop8_matches in our Google drive synpuf folder that explains some of my reasons for what I say below, and I also sent a link to each of you. To be on the safe side I am not putting the doc link here but if you have access to the folder you can get it.

I have four main comments:

  1. As noted in the doc, one of the reasons we get so many exact matches is that the puf has been so blurred/modified - for example, by rounding. That puf-creation blurring increases the chance that we will reproduce values that have already been changed from the true values.

  2. Not all exact matches present what seems like meaningful disclosure risk. Many involve mostly zero variables, variables that do not include much information, and records that are common, representing many people. The Google doc mentioned above gets into this in some detail.

That doesn't mean we shouldn't be on the lookout for them, but it does mean we have to interpret them carefully and think carefully about what to address and how.

  3. Where we do have exact-match disclosure risk, it does not necessarily mean we need to make changes that significantly degrade the quality of the file, such as eliminating powerful seeds. We have multiple options beyond changing seed variables, including (a sketch of the first option appears after this list):
  • adding small amounts of random noise to variables on the front end, before synthesizing, so that synthesized values are never exactly the same as puf values;
  • using methods during synthesis that make us less likely to put exact puf values on the synthesized files, including (a) larger buckets for CART methods, (b) possibly econometric methods (although I have my doubts about this), and (c) density approaches for choosing values at terminal nodes.
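
For instance, a minimal sketch of the front-end noise idea (the multiplicative jitter, scale, and distribution here are placeholders, not a recommendation):

import numpy as np
import pandas as pd

def jitter(df, cols, scale=0.01, seed=0):
    # Multiply each value by a factor near 1 so that synthesized values
    # drawn from these records can't exactly reproduce puf values.
    rng = np.random.default_rng(seed)
    out = df.copy()
    for c in cols:
        out[c] = out[c] * rng.normal(1, scale, len(df))
    return out

# puf_blurred = jitter(puf, continuous_cols)  # then synthesize from puf_blurred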

Where we do have to reduce seeds, and we may need to, Max's detective work will prove really valuable.

  4. Assuming we get rid of all important exact matches (and not all are important), that doesn't mean we don't have disclosure risk. Distance measures will remain important. If a file is close but not perfect, it may still have too much disclosure risk.

@MaxGhenis (Collaborator, Author)

@donboyd5 your doc says:

> The total number of puf records involved in exact matches (npufrecs=419) out of the 3,144 puf records with MARS=3, and the number of syn records involved in exact matches (nsynrecs=1,057) out of the 15,720 syn records in the group.

Could you share some records in synthpop8 that you found don't exactly match a training record? I just triple-checked that all records in synthpop8 exactly match training records on all features in this notebook. Note I'm dropping S006 because that will be reconstructed and isn't relevant to privacy concerns.
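
For anyone replicating the check, a sketch of one way to do it (file paths are hypothetical):

import pandas as pd

train = pd.read_csv('puf2011.csv')
syn = pd.read_csv('synthpop8_stack.csv')

# Drop the weight, which is reconstructed later and not privacy-relevant.
cols = [c for c in syn.columns if c != 'S006']

# A left merge on all synthesized columns flags synthetic rows that
# exactly reproduce some training record.
merged = syn[cols].merge(train[cols].drop_duplicates(),
                         how='left', indicator=True)
print((merged['_merge'] == 'both').mean())  # 1.0 means every syn row matches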

We should decide whether we're treating PUF data as real data, as we've discussed in the past. We know that SOI blurs and rounds data, that lots of fields are zero, and that some records are duplicated when limited to the 65 features we're synthesizing. But absent the real data, or details on exactly how SOI blurs it and how many real records each PUF record represents, I think we need to treat it as real data. That means avoiding synthesizing exact matches of records that appear only once in the PUF.

@feenberg asked:

> Isn't there a way to loosen the restriction for a match from exact match to "in the same bin"? In the examples I recall reading about, the bins were above or below the median.

Right now we're seeing true exact matches, and we're also looking at distance measures. I think below/above median would be too blunt an instrument to evaluate privacy concerns.

> Isn't there a way to specify a minimum number of leaves before additional subdivision takes place? I recall reading examples where a minimum of 5 leaves was required.

Yes, I think synthpop's CART does this, but I'm not sure it guarantees conditional variance.

> Am I mistaken in my belief that the synthesis process maintains covariances only to the extent that they are mediated by seed variables? For example, what about the synthesis process encourages property tax and mortgage interest to be correlated?

No: the synthesis maintains covariances by including them in each prediction model. Suppose we only seed on MARS, and the first two non-seed synthesized features are property tax and mortgage interest. Property tax will essentially be synthesized from the distribution of property tax conditional on each MARS value. Mortgage interest will then be synthesized from the distribution of mortgage interest conditional on each record's MARS value and its synthesized property tax. Each covariance is maintained this way: one of each pair of features is synthesized from a distribution conditioned (at least in part) on the other. A sketch of this sequential process follows.
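
Schematically, each step looks something like this - a hypothetical CART-style sampler, not the synthpop internals, with made-up column names:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def cart_draw(train_X, train_y, syn_X, min_leaf=5):
    # Fit a CART, then for each synthetic record sample an observed value
    # from the training records in its terminal leaf, so draws come from
    # the conditional distribution rather than the conditional mean.
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(train_X, train_y)
    by_leaf = pd.Series(np.asarray(train_y)).groupby(tree.apply(train_X))
    return np.array([rng.choice(by_leaf.get_group(leaf).values)
                     for leaf in tree.apply(syn_X)])

# Seed on MARS, then condition each new feature on everything before it:
# syn = puf[['MARS']].copy()
# syn['proptax'] = cart_draw(puf[['MARS']], puf['proptax'], syn[['MARS']])
# syn['mortint'] = cart_draw(puf[['MARS', 'proptax']], puf['mortint'],
#                            syn[['MARS', 'proptax']])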

@donboyd5 (Owner) commented Mar 7, 2019 via email

@donboyd5 (Owner) commented Mar 7, 2019 via email

@MaxGhenis (Collaborator, Author)

Would 1PM or 1:30PM be OK?

I think syn->puf is more relevant to privacy concerns, since we want to avoid releasing synthetic records that look too much like real records. The reverse is useful for comprehensiveness - to the extent that real records add value, ensuring they're not totally ignored by the model will probably produce a better synthesis - but that's outside this particular scope IMO.

@feenberg (Collaborator) commented Mar 7, 2019 via email

@donboyd5 (Owner) commented Mar 7, 2019 via email

@MaxGhenis (Collaborator, Author)

@feenberg We're basically using quantile regression, where the regression incorporates all the seeds and previously synthesized features. So we're predicting the 10th percentile, 20th, 30th, etc., and sampling a random quantile from there to capture the full conditional distribution.

In reality both CART and RF do this nonparametrically, so it's something in between the binning approach you describe and parametric regression models.
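
A rough sketch of the RF version (scikit-learn; the production code may differ):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def rf_quantile_draw(X_train, y_train, X_new):
    # Ordinary RF regression averages the per-tree predictions; for
    # synthesis we instead treat the per-tree predictions as an empirical
    # conditional distribution and sample one at random per record.
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)
    per_tree = np.stack([t.predict(X_new) for t in rf.estimators_])
    picks = rng.integers(0, per_tree.shape[0], size=per_tree.shape[1])
    return per_tree[picks, np.arange(per_tree.shape[1])]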

@donboyd5 (Owner) commented Mar 7, 2019 via email

@donboyd5 (Owner) commented Mar 7, 2019

Dan, do you have a preference for 1:00 pm or 1:30 pm (Eastern time) tomorrow (assuming you can make the call)?

@donboyd5 (Owner) commented Mar 7, 2019

I should add that I checked for exact matches in both directions within MARS=3: all puf records against all syn records, and all syn records against all puf records. (This is easy for exact-match checks; it's much more computing work for distance measures.)

@feenberg (Collaborator) commented Mar 7, 2019 via email

@donboyd5 (Owner) commented Mar 7, 2019 via email

@feenberg (Collaborator) commented Mar 7, 2019 via email

@feenberg (Collaborator) commented Mar 7, 2019 via email

@donboyd5 (Owner) commented Mar 7, 2019 via email

@donboyd5 (Owner) commented Mar 7, 2019 via email

@MaxGhenis (Collaborator, Author)

@donboyd5 no problem, thanks for checking. I re-ran the distance metrics on 1% of synthpop8_stack.csv and found that 25% of records exactly match a training record, about 3x the share from your earlier synthpop. The median distance is about 0.08, also about a third of that from a previous model. I'll also email the group a case of a pretty complicated record that matches exactly: synthpop8_stack row 866210 matching training record 162458 (or subtract 1 if not zero-indexing). This still suggests to me that we need to cut some seeds. (A sketch of one way to compute these distances follows.)
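
For reference, a sketch of a nearest-record distance computation; the scaling and metric here are assumptions, and the paths are hypothetical:

import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

train = pd.read_csv('puf2011.csv')
syn = pd.read_csv('synthpop8_stack.csv').sample(frac=0.01, random_state=0)

cols = [c for c in syn.columns if c != 'S006']
scaler = StandardScaler().fit(train[cols])

# Distance from each synthetic record to its nearest training record;
# zero means an exact match on all (scaled) features.
nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(train[cols]))
dist, _ = nn.kneighbors(scaler.transform(syn[cols]))
print((dist == 0).mean(), pd.Series(dist.ravel()).median())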

@feenberg Linear quantile regression would impose a linear structure on the relationships, but by using RF/CART we impose no structure, nor do we have to define a huge number of bins. These tree methods split at each node based on semi-random thresholds, and then either recursively improve those splits (CART) or build many trees (RF) to produce the predictions, which are then sampled to generate the conditional quantiles. Here's an explanation of the RF version: everything is the same in RF quantile regression as in ordinary RF regression, except for the final stage, where we use the distribution of predictions instead of their mean.

@donboyd5 (Owner) commented Mar 7, 2019

Re @MaxGhenis's earlier comment: I agree, we need to treat puf data as if they are true tax returns, and hold ourselves to that standard. Whether we think that is best doesn't matter; it is how SOI views it, so we need to view it that way too. While I have commented that certain kinds of exact matches shouldn't be worrisome, I think we need to worry about them nonetheless and find the best possible ways of eradicating any exact matches that involve non-zero continuous variables (in addition to categorical variables) - and perhaps even exact matches that involve only categoricals and zero-valued continuous variables. These are good topics for discussion.

That said, in some senses this may be a harder test than comparison to true returns, and in others it might be easier. I think exact matches are likely to be less of a concern versus true returns (since those will not have been blurred), but I am not sure whether distances will be a harder or easier test. I do believe that after we get fully comfortable with comparisons to the puf, we should seek a way to get low-stakes comparisons against true returns before we face a high-stakes, do-or-die test (via SOI) by that approach.

@donboyd5 (Owner) commented Mar 7, 2019

I kept promising to pull together some notes on distance measures. I have been failing, but I have made some progress. You can find what I've done here. I'll try to update it.
