-
-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve sample weight adjustment #1441
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1441 +/- ##
==========================================
+ Coverage 99.71% 99.71% +<.01%
==========================================
Files 40 40
Lines 2782 2785 +3
==========================================
+ Hits 2774 2777 +3
Misses 8 8
Continue to review full report at Codecov.
|
@martinholmer this looks sensible. What do you think about doing it this way?
These should be equivalent, but this may be a bit faster. However, I'm not sure if it would matter too much given that we are only dealing with 10 years of data. |
@hdoupe said:
PR#1441 and what you suggest above would be equivalent if the weights were the same in each year. But the weights are different in each year, so I don't think they are equivalent. In addition to resolving your issue #1439, pull request #1441 is trying to resolve my issue #1440. Does this make sense? |
@martinholmer , I think they should be the same. Finally, If my understanding is correct, this is just another way to solve the problem but without using loops. Whichever implementation you choose is probably just a matter of preference in this situation. |
@hdoupe, I tried to use your code fragment in place of what is in #1441, but did not get the same result. In fact, the test failed even though it passes with the current code. The most likely answer is that I made a mistake in applying your code. Maybe you can take a look.
But instead of passing the
Any suggestions would be welcome as I am far from being a Pandas expert. |
@martinholmer No problem, glad I can help. I ran the test with the first implementation and the second implementation and got the same results for each implementation:
The adjusted weights should sum to the same amount as the full sample weights. I tested this here:
Both implementations were off by the same amount:
Are you sure that it passed the test on the first implementation? My tests indicate that the two implementations are producing the same result. |
@hdoupe, Thanks again for all the help on Pandas usage. You are absolutely correct. I had no idea that Pandas would do sums for all the DataFrame columns when you do dframe.sum(). I discovered we were getting different results with my column-loop code (which should have been logically equivalent but slower) because of a Python 2.7 versus Python 3.x difference: 5/2=2 in Python 2.7, but 5/2=2.5 in Python 3.x. So my now-removed column-loop code was not accurate because the weight scale-up factor was an interger (since the weights in the Again, thanks for all the help. I'm going to merge this very soon. |
This pull request attempts to improve the logic of scaling up sub-sample weights so that they produces results that are closer to those produced using a full sample. The proposed changes attempt to resolve the issues discussed in #1439 and #1440.
@MattHJensen @feenberg @hdoupe