Comparing synpuf2, synpuf4, and puf.full #23
Comments
This table shows correlations among variables in the 3 files puf.full, synpuf2, and synpuf4 -- the 3 all-records files we have at present. From discussion elsewhere we know that synpuf4 will have better weighted values than synpuf2, so I'll pay more attention to synpuf4.
The table in this post and in the next post differ only in how they are sorted. This table is sorted by diff4. It shows only the top 50 values by this sort. Here is how to read the first row:
We can see that the 4th-worst correlation in synpuf4 is between wages (e00200) and interest income. This is concerning (to me) because wages are so large. I'll come back to this, I think.
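A minimal sketch of how a table like this could be built, in Python with pandas. It is not the code behind the table above. The function name `correlation_diffs`, the column names other than e00200, and the random toy data are all hypothetical, and it assumes that the diff column is the synthetic-file correlation minus the puf.full correlation and that "worst" means largest absolute difference.

```python
import numpy as np
import pandas as pd


def correlation_diffs(base, synth, top=50):
    """Compare unweighted pairwise Pearson correlations in two files.

    Assumption: diff = synth correlation minus base correlation, and rows
    are ranked by the absolute value of that difference.
    """
    cols = [c for c in base.columns if c in synth.columns]
    base_corr = base[cols].corr()
    synth_corr = synth[cols].corr()
    rows = []
    # Visit each unordered pair of variables exactly once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            rows.append({
                "var1": cols[i],
                "var2": cols[j],
                "base": base_corr.iloc[i, j],
                "synth": synth_corr.iloc[i, j],
            })
    out = pd.DataFrame(rows)
    out["diff"] = out["synth"] - out["base"]
    out = out.sort_values("diff", key=lambda s: s.abs(), ascending=False)
    return out.head(top).reset_index(drop=True)


# Toy data only: random numbers, not records from any real file.
rng = np.random.default_rng(0)
n = 2000
base = pd.DataFrame({
    "e00200": rng.normal(size=n),
    "interest": rng.normal(size=n),
    "dividends": rng.normal(size=n),
})
wages = rng.normal(size=n)
synth = pd.DataFrame({
    "e00200": wages,
    "interest": wages + 0.1 * rng.normal(size=n),  # deliberately too correlated
    "dividends": rng.normal(size=n),
})
table = correlation_diffs(base, synth)
print(table)
```

In the toy data the synthetic wages and interest columns are built to be strongly correlated, so that pair sorts to the top, which is the kind of row the table is meant to surface.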
Are correlations here weighted by |
No. They are unweighted, so they should be the same as any you may have done. (Pls let me know if you have different results.) The only value added here is the weighted sums info in the 2 rightmost columns. I'm not trying to replicate what you may be doing. I want to focus on the weighted file, but as I was moving in that direction I naturally looked at some correlations, and I wanted to make sure you have the info since I have it. I'll post some weighted values info soon.
If it is easier for you to have a whole bunch of info at once, rather than a bit here and a bit there, pls let me know.
I haven't done correlations yet, I was just checking. Bit-by-bit is good for me. This is useful, and it's interesting that synpuf4's correlations are of uniformly higher magnitude than those in the original PUF, and that this happened when adding weights to the seeds. Maybe the small number of trees (20) is causing insufficient dispersion. I'll start a run with 40 trees.
Some new summary results and where to find them

I have put an html file named "eval_2018-12-13.html" with some new summary results in the Google Drive synpuf directory. As we get further along, there will be new versions with new dates. It is all still very early, and very rough.

I think the section called "How can synpuf2 weight be so far off and yet weighted wages are so close?" is very interesting. It shows that there is value in going beyond the bottom line and looking at the distribution of weighted values. I know we are moving beyond synpuf2, but there is a lot to be seen by looking at synpuf4, diff4 (synpuf4 minus puf.full), and pdiff4 (diff4 as % of puf.full) by wage range. If something is not clear, please let me know.

I will start putting some routinized results in the html output after I run the files through Tax-Calculator and get calculated AGI and taxes. Probably tomorrow for that. Shortly after that, I'll start producing some CDFs of weighted variables by AGI, similar to the graph in #16 but with a line for puf and for each synthesis in the analysis. I will work toward some summary measures after that, but I think it is more important to have diagnostic information at this point. Once we have data files that are worth choosing among, we'll need measures that help us do that, but we're not there yet.

I will upload the R project to GitHub once I relearn how to do that (it has been a while, and I've since reinstalled Windows, so I have to get my machine set up properly), probably this evening. @andersonfrailey will then start looking at it, and we'll work together to make it better. @andersonfrailey, I had hoped to have the project cleaner than this, but I'll shoot you a note explaining what makes sense to look at.

After all of that I'll put in some reweighting routines to hit many targets, but it will be a while before I can get there.
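For readers who want to reproduce the by-wage-range view outside the html file, here is a rough sketch in Python with pandas. It is not the R code behind the report. The function names, the `weight` column name, the bin edges, and the three-record toy files are all hypothetical; the numbers are made up to show the mechanics and are not results from puf.full or any synthesis.

```python
import numpy as np
import pandas as pd


def weighted_sums_by_range(df, value, weight, bins):
    """Weighted sum of `value` within ranges of `value` (e.g. wage ranges)."""
    groups = pd.cut(df[value], bins=bins, right=False)
    return (df[value] * df[weight]).groupby(groups, observed=False).sum()


def compare_by_range(base, synth, value="e00200", weight="weight",
                     bins=(0, 25_000, 100_000, np.inf)):
    """diff = synth minus base; pdiff = diff as % of base, per range."""
    out = pd.DataFrame({
        "base": weighted_sums_by_range(base, value, weight, list(bins)),
        "synth": weighted_sums_by_range(synth, value, weight, list(bins)),
    })
    out["diff"] = out["synth"] - out["base"]
    # Note: a range with a zero base total would give inf or NaN here.
    out["pdiff"] = 100 * out["diff"] / out["base"]
    return out


# Made-up toy files: one record per wage range, hypothetical weights.
base = pd.DataFrame({"e00200": [10_000, 40_000, 200_000],
                     "weight": [100, 50, 10]})
synth = pd.DataFrame({"e00200": [12_000, 45_000, 250_000],
                      "weight": [20, 40, 12]})

by_range = compare_by_range(base, synth)
total_base = by_range["base"].sum()
total_synth = by_range["synth"].sum()
total_pdiff = 100 * (total_synth - total_base) / total_base
print(by_range)
print(f"total pdiff: {total_pdiff:.1f}%")
```

The toy numbers are chosen so that large differences of opposite sign in the individual ranges nearly cancel in the total, which is why the by-range table is more informative than the bottom line alone.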
Sounds good. Including these results in GH is OK since there's no record data, right? To answer the question in the section title, "How can synpuf2 weight be so far off and yet weighted wages are so close?": synpuf2's total return count was 78% too low, but it greatly overestimated high-wage returns, and these happened to cancel out to yield the 0.4% difference in total wages. A remarkable coincidence, but it shows that the file has some problems.
Let me know if there is a better way to provide snippets of feedback. I figure that until I have a regular set of output, I can open an issue, @MaxGhenis and others can look, and then we can close the issue as we move on to later iterations of a file. The goal here is to provide useful feedback to @MaxGhenis as he develops early runs. I'll have some standardized html output that I can put in the Google Drive folder fairly soon.
The next few tables show correlations between full synthesized files and the original puf (with aggregate records removed). They follow the same format, so I'll explain that with the first table only.