Improve sample weight adjustment #1441

martinholmer · 2017-06-27T12:12:11Z

This pull request attempts to improve the logic of scaling up sub-sample weights so that they produces results that are closer to those produced using a full sample. The proposed changes attempt to resolve the issues discussed in #1439 and #1440.

@MattHJensen @feenberg @hdoupe

codecov-io · 2017-06-27T12:18:15Z

Codecov Report

Merging #1441 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1441      +/-   ##
==========================================
+ Coverage   99.71%   99.71%   +<.01%     
==========================================
  Files          40       40              
  Lines        2782     2785       +3     
==========================================
+ Hits         2774     2777       +3     
  Misses          8        8

Impacted Files	Coverage Δ
taxcalc/records.py	`98.79% <100%> (+0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f46a04...52242f5. Read the comment docs.

martinholmer · 2017-06-27T12:18:46Z

@hdoupe, Could you review the changes in pull request #1441 to make sure they are sensible given your recent work on this topic? Thanks.

hdoupe · 2017-06-27T14:32:15Z

@martinholmer this looks sensible. What do you think about doing it this way?

WT_full_sum = self.WT.sum()
self.WT = self.WT.iloc[self.index]
WT_sub_sum = self.WT.sum()
WT_ratio = WT_full_sum / WT_sub_sum

self.WT = self.WT * WT_ratio

These should be equivalent, but this may be a bit faster. However, I'm not sure if it would matter too much given that we are only dealing with 10 years of data.

martinholmer · 2017-06-27T17:16:25Z

@hdoupe said:

this [#1441] looks sensible. What do you think about doing it this way?
WT_full_sum = self.WT.sum()
self.WT = self.WT.iloc[self.index]
WT_sub_sum = self.WT.sum()
WT_ratio = WT_full_sum / WT_sub_sum
self.WT = self.WT * WT_ratio
These should be equivalent, but this way may be a bit faster. However, I'm not sure if it would matter too much given that we are only dealing with 10 years of data.

PR#1441 and what you suggest above would be equivalent if the weights were the same in each year. But the weights are different in each year, so I don't think they are equivalent. In addition to resolving your issue #1439, pull request #1441 is trying to resolve my issue #1440.

Does this make sense?

hdoupe · 2017-06-27T19:38:46Z

@martinholmer , I think they should be the same. WT_full_sum and WT_sub_sum are pandas Series objects of length 18 with each entry being the total sum of weights for 2009, 2010, ..., 2026. Then, WT_ratio is a pandas Series object where the i-th item is the i-th item of WT_full_sum divided by the i-th item of WT_sub_sum.

Finally, self.WT * WT_ratio multiplies the i-th column of self.WT by the i-th item of WT_ratio.

If my understanding is correct, this is just another way to solve the problem but without using loops. Whichever implementation you choose is probably just a matter of preference in this situation.

martinholmer · 2017-06-27T20:07:33Z

@hdoupe,
Thanks for the tutorial on the magic of Pandas, which is something I've had trouble getting.

I tried to use your code fragment in place of what is in #1441, but did not get the same result. In fact, the test failed even though it passes with the current code.

The most likely answer is that I made a mistake in applying your code. Maybe you can take a look.
Here is what I tried:

        # weights must be same size as tax record data
        if not self.WT.empty and self.dim != len(self.WT):
            WT_full_sum = self.WT.sum()
            self.WT = self.WT.iloc[self.index]
            WT_sub_sum = self.WT.sum()
            WT_ratio = WT_full_sum / WT_sub_sum
            self.WT = self.WT * WT_ratio

But instead of passing the test_agg test in the test_pufcsv.py file, I get this failure message:

E           ValueError: PUFCSV AGG RESULTS DIFFER IN SUB-SAMPLE AND FULL-SAMPLE
E           WHEN subfrac=0.030, rtol=0.0100, seed=180
E            year,sub,full,reldiff= 2013	2199.38	2176.94	0.0103
E            year,sub,full,reldiff= 2017	2821.50	2790.64	0.0111
E            year,sub,full,reldiff= 2018	2931.17	2899.85	0.0108
E            year,sub,full,reldiff= 2020	3136.53	3105.29	0.0101
E            year,sub,full,reldiff= 2021	3266.19	3230.47	0.0111
E            year,sub,full,reldiff= 2022	3416.16	3369.21	0.0139
tests/test_pufcsv.py:111: ValueError

Any suggestions would be welcome as I am far from being a Pandas expert.

hdoupe · 2017-06-27T21:21:51Z

@martinholmer No problem, glad I can help. I ran the test with the first implementation and the second implementation and got the same results for each implementation:

E           ValueError: PUFCSV AGG RESULTS DIFFER IN SUB-SAMPLE AND FULL-SAMPLE
E           WHEN subfrac=0.030, rtol=0.0100, seed=180
E            year,sub,full,reldiff= 2013	2199.38	2176.94	0.0103
E            year,sub,full,reldiff= 2017	2821.50	2790.64	0.0111
E            year,sub,full,reldiff= 2018	2931.17	2899.85	0.0108
E            year,sub,full,reldiff= 2020	3136.53	3105.29	0.0101
E            year,sub,full,reldiff= 2021	3266.19	3230.47	0.0111
E            year,sub,full,reldiff= 2022	3416.16	3369.21	0.0139

tests/test_pufcsv.py:111: ValueError

The adjusted weights should sum to the same amount as the full sample weights. I tested this here:

from taxcalc.records import Records
import pandas as pd

puf = pd.read_csv("taxcalc/puf.csv")
puf_samp = puf.sample(frac=0.03, random_state=180)

rec = Records(data = puf_samp)
wt = pd.read_csv("taxcalc/puf_weights.csv")

print(wt.sum() - rec.WT.sum())

Both implementations were off by the same amount:

WT2009    0.000109
WT2010   -0.000067
WT2011   -0.000021
WT2012   -0.000074
WT2013    0.000086
WT2014    0.000078
WT2015    0.000147
WT2016   -0.000093
WT2017   -0.000097
WT2018    0.000175
WT2019   -0.000130
WT2020   -0.000175
WT2021    0.000084
WT2022   -0.000065
WT2023    0.000156
WT2024    0.000072
WT2025    0.000092
WT2026   -0.000191
dtype: float64

Are you sure that it passed the test on the first implementation? My tests indicate that the two implementations are producing the same result.

martinholmer · 2017-06-27T23:23:59Z

@hdoupe, Thanks again for all the help on Pandas usage. You are absolutely correct. I had no idea that Pandas would do sums for all the DataFrame columns when you do dframe.sum(). I discovered we were getting different results with my column-loop code (which should have been logically equivalent but slower) because of a Python 2.7 versus Python 3.x difference: 5/2=2 in Python 2.7, but 5/2=2.5 in Python 3.x. So my now-removed column-loop code was not accurate because the weight scale-up factor was an interger (since the weights in the puf_weights.csv file are integers).

Again, thanks for all the help. I'm going to merge this very soon.

@MattHJensen @feenberg @Amy-Xu @andersonfrailey

martinholmer added 2 commits June 26, 2017 21:23

Fix Records ctor sampling logic

7bc6404

Fix docstring in test_pufcsv.py

26095b4

talumbau added the in progress label Jun 27, 2017

Update RELEASES.md for PR#1441

e3d790a

martinholmer mentioned this pull request Jun 27, 2017

Treatment of weights in Records object constructor, Part 1 #1439

Closed

Make sub-sample weight scaling more accurate and efficient

52242f5

martinholmer merged commit cf86e07 into PSLmodels:master Jun 27, 2017

talumbau removed the in progress label Jun 27, 2017

This was referenced Jun 27, 2017

Treatment of weights in Records object constructor, Part 2 #1440

Closed

Better quick sample settings ospc-org/ospc.org#574

Closed

martinholmer deleted the fix-sampling branch June 28, 2017 19:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve sample weight adjustment #1441

Improve sample weight adjustment #1441

martinholmer commented Jun 27, 2017 •

edited

Loading

codecov-io commented Jun 27, 2017 •

edited

Loading

martinholmer commented Jun 27, 2017

hdoupe commented Jun 27, 2017 •

edited

Loading

martinholmer commented Jun 27, 2017

hdoupe commented Jun 27, 2017

martinholmer commented Jun 27, 2017

hdoupe commented Jun 27, 2017

martinholmer commented Jun 27, 2017

Improve sample weight adjustment #1441

Improve sample weight adjustment #1441

Conversation

martinholmer commented Jun 27, 2017 • edited Loading

codecov-io commented Jun 27, 2017 • edited Loading

Codecov Report

martinholmer commented Jun 27, 2017

hdoupe commented Jun 27, 2017 • edited Loading

martinholmer commented Jun 27, 2017

hdoupe commented Jun 27, 2017

martinholmer commented Jun 27, 2017

hdoupe commented Jun 27, 2017

martinholmer commented Jun 27, 2017

martinholmer commented Jun 27, 2017 •

edited

Loading

codecov-io commented Jun 27, 2017 •

edited

Loading

hdoupe commented Jun 27, 2017 •

edited

Loading