
File Upload Fixes #593

Merged: 6 commits merged into master from 578_file_upload_fix on Aug 3, 2017

Conversation

brittainhard
Contributor

@brittainhard brittainhard commented Jul 27, 2017

@PeterDSteinberg @martinholmer @MattHJensen @hdoupe

I was unable to add a unit test that could help us solve the problem with this file, mainly because it would require saving puf.csv.gz. I was, however, able to make a small script that duplicates the behavior that is causing the problem.

import pandas as pd

from deploy.taxbrain_server.celery_tasks import dropq_task

# assumption parameters with no behavioral or growth responses
behavior_params = {u'growdiff_response': {}, u'consumption': {},
                   u'behavior': {}, u'growdiff_baseline': {}}

# reform parameters keyed by calendar year
user_mods = {u'2018': {u'_SS_Earnings_c': [400000], u'_II_em_cpi': False, u'_II_em': [8000]},
             u'2019': {u'_SS_Earnings_c': [500000], u'_ALD_InvInc_ec_rt': [0.2]},
             u'2020': {u'_SS_Earnings_c': [600000], u'_II_em_cpi': True}}

puf = pd.read_csv("puf.csv.gz", compression='gzip')

# the first argument is year_n; increase it to 9 or 10 to reproduce the error
results = dropq_task(0, user_mods, None, behavior_params, puf)

You have to run it from the repository root, and puf.csv.gz also has to be in the root. This way, you can test the function and make changes without having to use celery.

Let me know if it helps.

EDIT: You can trigger the error by increasing the year_n argument (the first positional argument) to something like 9 or 10.

@brittainhard
Contributor Author

Also, I think it's a good idea to move the discussion here from now on.

@brittainhard changed the title from "File Upload Fies" to "File Upload Fixes" on Jul 27, 2017
@MattHJensen
Contributor

@hdoupe, could you explore whether the cps.csv.gz file in tax-calculator could be used to write failing tests for the bugs described in #578?

@brittainhard
Contributor Author

@martinholmer I think maybe I am missing some understanding here.

So when I use the regular taxbrain form (not file input) to do calculations, and I set my start year, it calculates 9 years from that start date. So, if my start date is 2015, it calculates up to 2024.

When parsing the r1.json file, the earliest parameter year is 2018. That means that when the data is sent out, it does not contain parameters for any year before 2018. As is, the application assumes the start year is 2018 and throws an error. With file uploads, the drop-down menu that allows you to select the year has no effect on the data you send, because it relies only on the data in the file. In this case, the year-selection drop-down is useless.

So what does the ideal outcome look like for you? If we use that sample file, do you want calculations from 2018 until 2026? That is a straightforward fix.

Please let me know.

@martinholmer
Contributor

@brittainhard said:

So when I use the regular taxbrain form (not file input) to do calculations, and I set my start year, it calculates 9 years from that start date. So, if my start date is 2015, it calculates up to 2024.

Yes, exactly.

When parsing the r1.json file, the earliest parameter year is 2018. That means that when the data is sent out, it does not contain parameters for any year before 2018. As is, the application assumes the start year is 2018 and throws an error. With file uploads, the drop-down menu that allows you to select the year has no effect on the data you send, because it relies only on the data in the file. In this case, the year-selection drop-down is useless.

That's one of the problems. The drop-down menu of start_year values should work exactly the same on the file-upload page as it does on the regular TaxBrain form. The start_year is the first of the ten budget years for which we show results. There is no problem if the tax reform is not implemented until several years into the ten-budget-year window; the pre-reform years will just show no change in tax revenue. Perhaps some of the confusion is caused by the fact that the way to specify, on the regular TaxBrain form, a reform that is first implemented after the start_year is not very well documented. If start_year is 2017 (the default) and you want to make all earnings taxable under the payroll tax beginning in 2020, you would enter the following in the appropriate box:

*,*,*,9e99

The asterisk symbol means "use the value under current-law policy", and 9e99 is a very large value for the maximum Social Security taxable earnings parameter. I recommend that you try this from the regular TaxBrain form to see how it works.
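For illustration only, here is a minimal sketch of how that asterisk notation could be expanded into a per-year reform dictionary; parse_gui_field is a hypothetical helper, not TaxBrain's actual parsing code:

def parse_gui_field(value_string, start_year):
    # '*' means "keep the current-law value for that year"; only explicit
    # numbers end up in the reform, keyed by the year they apply to
    reform = {}
    for offset, token in enumerate(value_string.split(',')):
        token = token.strip()
        if token != '*':
            reform[start_year + offset] = [float(token)]
    return reform

# parse_gui_field('*,*,*,9e99', 2017) returns {2020: [9e+99]}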

So what does the ideal outcome look like for you? If we use that sample file, do you want calculations from 2018 until 2026?

No, we always want ten budget years. What we need is exactly the same as on the regular TaxBrain form: start_year specified by the pull-down menu (which has 2017 as the maximum allowable value) and nine more years after the start_year.

Does this make sense? If not, ask another question before doing any more coding.

@hdoupe
Collaborator

hdoupe commented Jul 28, 2017

@MattHJensen
could you explore whether the cps.csv.gz file in tax-calculator could be used to write failing tests for the bugs described in #578?

Sure thing. I'll look into it.

@brittainhard
Contributor Author

@martinholmer

That's one of the problems. The drop-down menu of start_year values should work exactly the same on the file-upload page as it does on the regular TaxBrain form. The start_year is the first of the ten budget years for which we show results. There is no problem if the tax reform is not implemented until several years into the ten-budget-year window; the pre-reform years will just show no change in tax revenue.

I think this highlights a fundamental problem with the start-year dropdown.

First, if you want to display an entire 10 year window and your parameters start at a year later than the year specified in the dropdown, you're going to end up with columns in the table that contain no useful information.

Second, suppose you specified your start year as 2013 and your first parameters in the reform file started at 2023. You would get a table containing only columns without data, and your parameters would be completely ignored -- the table would only show data up to the year 2022. Moreover, in this case, if we were to make this change to show all 10 years even if there is no data, that can be considered a regression in itself.

From a user experience standpoint, this is a problem. Having a table that contains useless information or no information at all does not help the user understand the effects of their reform, and it can confuse the user about how these file-based reforms work. If they are receiving useless information, they might think there is something wrong with their parameters file, when that is not at all the case.

The solution I propose is this: include some code that will limit the number of years to be calculated if the start date is after 2017 (this will handle the error), and remove the start year drop-down entirely. This will mean the start year will be the earliest year in the reform file.
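As a minimal sketch of the year-limiting part of that proposal, assuming ten budget years and assuming taxcalc's Policy.LAST_BUDGET_YEAR marks the last year the calculator supports:

from taxcalc import Policy

def num_years_to_calculate(first_reform_year, num_budget_years=10):
    # never ask the calculator to go past the last budget year it supports
    return min(num_budget_years,
               Policy.LAST_BUDGET_YEAR - first_reform_year + 1)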

With that in mind, I have reopened issue #557. I think this is the best way to handle this problem.

@MattHJensen @PeterDSteinberg @hdoupe let me know what you think.

@MattHJensen
Contributor

MattHJensen commented Jul 28, 2017

@brittainhard

The start_year drop down should specify the beginning of the budget window.

If the user makes no parameter changes within the ten-year budget window, then all revenue changes should be zero (not missing), and the current_law and reform revenues should be the same (but not missing).

If the user's first parameter change is for the 6th year of the budget window (as defined by the start-year dropdown), then the revenue changes for the first five years should be zero and the change in year 6 should be nonzero.

All of this is the same as what @martinholmer describes in this comment and what I described in this comment.

Second, suppose you specified your start year as 2013 and your first parameters in the reform file started at 2023. You would get a table containing only columns without data, and your parameters would be completely ignored -- the table would only show data up to the year 2022. Moreover, in this case, if we were to make this change to show all 10 years even if there is no data, that can be considered a regression in itself.

We could add a warning that "your first parameter change happens outside of the budget window," but this should probably be discussed and dealt with in a separate issue as an enhancement.

@MattHJensen
Contributor

MattHJensen commented Jul 28, 2017

Here is a test that should pass:

  • Run a reform file-based simulation where the start year is 2017 and the only reform provision is to set the Social Security payroll tax rate to 10% in 2019.

  • Run a GUI-based simulation where the start year is 2017 and the only parameter reform is to set the Social Security payroll tax rate to: *, *, 0.10

  • Assert that the two outputs are the same.
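A rough sketch of what such a test might look like; simulate_file_reform and simulate_gui_reform are hypothetical helpers standing in for the webapp's two input paths, not existing functions:

def test_file_and_gui_reforms_match():
    # both input paths should reduce to the same reform and produce the same results
    file_result = simulate_file_reform(   # hypothetical helper
        start_year=2017,
        reform_json={"policy": {"_FICA_ss_trt": {"2019": [0.10]}}})
    gui_result = simulate_gui_reform(      # hypothetical helper
        start_year=2017,
        fields={"FICA_ss_trt": "*, *, 0.10"})
    assert file_result == gui_result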

@brittainhard
Contributor Author

@MattHJensen I can make that fix. All I need to do is supply dummy information prepended to the reform data depending on what start year is selected. After that, we will need some tests to handle the case I mentioned before, with the start year of 2013. This will require some change in the file_input view, and another small change in celery_tasks.py.
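Roughly along these lines (pad_user_mods is an illustrative name, and the user_mods layout is the year-keyed dictionary from the script above):

def pad_user_mods(user_mods, start_year):
    # make sure every budget year from start_year up to the first year that
    # actually appears in the reform is present, even if it carries no changes
    first_reform_year = min(int(year) for year in user_mods)
    padded = {str(year): {} for year in range(start_year, first_reform_year)}
    padded.update(user_mods)
    return padded

# pad_user_mods({'2019': {'_SS_Earnings_c': [500000]}}, 2017)
# -> {'2017': {}, '2018': {}, '2019': {'_SS_Earnings_c': [500000]}}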

@MattHJensen
Contributor

MattHJensen commented Jul 28, 2017

@MattHJensen I can make that fix.

Thanks.

All I need to do is supply dummy information prepended to the reform data depending on what start year is selected.

Conforming to the goal of a unified API for reform files and GUI based inputs, I'd suggest following the same technique that is used for the asterisk (*) operator.

After that, we will need some tests to handle the case I mentioned before, with the start year of 2013. This will require some change in the file_input view, and another small change in celery_tasks.py.

I'd suggest handling this as a separate issue and supplying a warning rather than an error. There are legitimate reasons why a user would set the start year at 2013 and use a reform file with provisions starting in 2025. For example, s/he might be running two simulations with the same reform file, one with the start year at 2013 and one with the start year at 2017, in order to cobble together revenue projections from 2013 through 2026.

@martinholmer
Contributor

@MattHJensen said:

I'd suggest handling this as a separate issue and supplying a warning rather than an error. There are legitimate reasons why a user would set the start year at 2013 and use a reform file with provisions starting in 2025. For example, s/he might be running two simulations with the same reform file, one with the start year at 2013 and one with the start year at 2017, in order to cobble together revenue projections from 2013 through 2026.

Yes, I agree, this is not an error.
My view is that it doesn't even merit a warning.
Let's focus on making TaxBrain work correctly and leave user warnings aside.

@brittainhard @hdoupe

@brittainhard
Contributor Author

@MattHJensen

Conforming to the goal of a unified API for reform files and GUI based inputs, I'd suggest following the same technique that is used for the asterisk (*) operator.

Not quite sure I understand this part; can you clarify a bit? Where do you see this asterisk coming into the JSON pushed forward to the server? Is it something like {2017: "*"}?

@MattHJensen
Contributor

MattHJensen commented Jul 28, 2017

Not quite sure I understand this part; can you clarify a bit? Where do you see this asterisk coming into the JSON pushed forward to the server? Is it something like {2017: "*"}?

The asterisk is not used in the JSON files. My point is just that these two are equivalent:

  1. REFORM FILE: start year = 2017 & {"policy": {"_FICA_ss_trt": {"2019": [0.1]}}}
  2. TaxBrain GUI: start year = 2017 & Social Security payroll tax rate: *, *, 0.1

So ideally their API representation is the same.

@brittainhard
Contributor Author

This fix is pretty simple. I'm going to add some regression tests to ensure that first_budget_year is being sent forward (failing to send it was the source of the problem).

@brittainhard
Contributor Author

@hdoupe did you have any luck using cps.csv.gz for testing?

@hdoupe
Collaborator

hdoupe commented Jul 31, 2017

@brittainhard Yes, cps.csv.gz could be used here, but it cannot be used with run_nth_year_tax_calc_model. The use of puf.csv is pretty much hard-coded into the dropq functions in taxcalc. The code below uses cps.csv.gz to expose the bug from #578. Once the webapp starts using a version of taxcalc that is compatible with the cps file, this would work from the webapp repo.

import pandas as pd
from taxcalc import Records, Policy, Calculator
# from taxbrain_server.celery_tasks import dropq_task

# not needed for this reproduction; kept for parity with the dropq_task script
behavior_params = {u'growdiff_response': {}, u'consumption': {},
                   u'behavior': {}, u'growdiff_baseline': {}}

user_mods = {2018: {u'_SS_Earnings_c': [400000], u'_II_em_cpi': False, u'_II_em': [8000]},
             2019: {u'_SS_Earnings_c': [500000], u'_ALD_InvInc_ec_rt': [0.2]},
             2020: {u'_SS_Earnings_c': [600000], u'_II_em_cpi': True}}
first_year = 2019
year_n = 9

# puf = pd.read_csv("taxbrain_server/puf.csv.gz", compression='gzip')
# create the post-reform Calculator instance using the public CPS data;
# calc1 (the baseline calculator) is omitted since only calc2 triggers the error
start_year = first_year
recs2 = Records.cps_constructor()  # instead of Records(data=puf.copy(deep=True))
policy2 = Policy()
policy2.implement_reform(user_mods)
calc2 = Calculator(policy=policy2, records=recs2)
while calc2.current_year < start_year:
    calc2.increment_year()
calc2.calc_all()
assert calc2.current_year == start_year

# increment the Calculator for year_n more years and recalculate;
# with year_n = 9 this reproduces the error described in #578
for _ in range(0, year_n):
    calc2.increment_year()
calc2.calc_all()

In the future, it would be nice to have access to the raw cps file. I'm not sure how to import that from taxcalc now. Perhaps there should be a static function in taxcalc/records.py like this:

    @staticmethod
    def cps_data(exact_calculations=False,
                 growfactors=Growfactors()):
        return dict(data=pd.read_csv(os.path.join(Records.CUR_PATH, 'cps.csv.gz'),
                                     compression='gzip'),
                    exact_calculations=exact_calculations,
                    gfactors=growfactors,
                    weights=Records.CPS_WEIGHTS_FILENAME,
                    adjust_ratios=Records.CPS_RATIOS_FILENAME,
                    start_year=Records.CPSCSV_YEAR)
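For illustration, assuming the proposed helper were added, it might be used roughly like this (names other than the proposal itself are illustrative):

cps_kwargs = Records.cps_data()
raw_cps_df = cps_kwargs['data']   # the raw CPS data as a pandas DataFrame
recs = Records(**cps_kwargs)      # or build a Records object from the same kwargs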

@martinholmer I would be interested in hearing your thoughts on this.

@martinholmer
Contributor

@hdoupe said in #593:

In the future, it would be nice to have access to the raw cps file. I'm not sure how to import that from taxcalc now. Perhaps, there should be a static function in taxcalc/records.py like this:

    @staticmethod
    def cps_data(exact_calculations=False,
                 growfactors=Growfactors()):
        return dict(data=pd.read_csv(os.path.join(Records.CUR_PATH, 'cps.csv.gz'),
                                     compression='gzip'),
                    exact_calculations=exact_calculations,
                    gfactors=growfactors,
                    weights=Records.CPS_WEIGHTS_FILENAME,
                    adjust_ratios=Records.CPS_RATIOS_FILENAME,
                    start_year=Records.CPSCSV_YEAR)

@martinholmer I would be interested in hearing your thoughts on this.

Something like the above might be useful. But exactly when would it be useful to have CPS data in a pandas DataFrame rather than in a Records object?

@hdoupe
Collaborator

hdoupe commented Aug 2, 2017

@martinholmer said:

Something like the above might be useful. But exactly when would it be useful to have CPS data in a pandas DataFrame rather than in a Records object?

After giving this some more thought, I can't think of a good reason why we would want to use the raw data. At first, I thought that we would need it since we need the puf data in the webapp, but I realized that we have it there because it can't be in the taxcalc repo. cps.csv.gz can be accessed through the Records.cps_constructor() function. So, unless @brittainhard can think of a reason why the csv file would be needed instead of the Records object, we can drop this suggestion for now.

However, for testing the dropq function in the webapp, it may be beneficial to make the dropq functions in taxcalc compatible with the CPS data. This could be as easy as adding a use_cps flag argument to run_nth_year_tax_calc_model and doing something like the following code snippet when a Records object is created:

if use_cps:
    recs1 = Records.cps_constructor(growfactors=growfactors_pre)
else:
    recs1 = Records(data=taxrec_df.copy(deep=True),
                    gfactors=growfactors_pre)

Or, there could be some other complications there that I don't immediately see.

@martinholmer
Contributor

@hdoupe said:

However, for testing the dropq functions [called] in the webapp, it may be beneficial to make the dropq functions in taxcalc compatible with the CPS data.

Why? What advantage would this provide?
Doesn't everybody testing TaxBrain have access to the puf.csv file?

@brittainhard
Contributor Author

@martinholmer It would be useful for unit tests. We can set up some regression tests using the full pipeline from webapp to taxcalc. Right now I can only test the API functions in webapp.

That's mainly why I was asking the question.

@hdoupe
Collaborator

hdoupe commented Aug 3, 2017

@martinholmer I definitely do not have a complete understanding of how the Travis testing on GitHub works, but the way I understand it is that it does not use the puf.csv file. Thus, if we want to test the dropq functionality using Travis, then the dropq functions in taxcalc would have to be compatible with the cps.csv file.

However, I could be completely off base here. If so, I am pair programming with @brittainhard tomorrow, and maybe he can help clear up some of my confusion.

@martinholmer
Contributor

@hdoupe said:

I definitely do not have a complete understanding of how the Travis testing on GitHub works, but the way I understand it is that it does not use the puf.csv file.

That is correct. The puf.csv file is not for public use, so it can never be used in GitHub tests.

Thus, if we want to test the dropq functionality using Travis, then the dropq functions in taxcalc would have to be compatible with the cps.csv file.

Why not just use the puf.csv file for local testing and skip the tests that use the puf.csv file on GitHub? That is the approach we've been taking with Tax-Calculator for years. This is done by inserting a @pytest.mark.requires_pufcsv decorator on the line above the def test_...(...): line for each test that uses the puf.csv file. Then on GitHub the pytest suite is executed using:

py.test -m "not requires_pufcsv"

and all the tests that need puf.csv are skipped. On your local computer, you do have puf.csv and so you execute:

py.test

in order to run all the tests.
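As a minimal sketch of that pattern (the test body here is only illustrative, not an existing TaxBrain test):

import pandas as pd
import pytest

@pytest.mark.requires_pufcsv
def test_dropq_with_puf():
    # illustrative body: anything that needs the private puf.csv file
    puf = pd.read_csv('puf.csv.gz', compression='gzip')
    assert not puf.empty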

Does this procedure make sense? Can't you use this approach for TaxBrain testing?

I don't see an easy way to adapt dropq to work with the cps.csv file.

@brittainhard

@brittainhard
Contributor Author

@martinholmer I like this idea. It would be a huge benefit for making the app more stable.

@PeterDSteinberg and I have talked before about setting up a more extensive testing suite that will use the services to run actual calculations. Some of these tests could be used in travis.yml, but it might be better to design them with local unit and integration tests in mind.

Issue #543 mentions a larger mocked system for testing interactions, which is good, but we could also benefit from a much larger services-based test structure.

I think that should be tabled for now, and we should go ahead with the tests we have. My test covers the regression (failing to include the budget-year value), and we should add an issue for that pytest decorator and local tests. I'll also open a larger discussion issue about ways to enhance testing beyond simple unit testing.

@brittainhard brittainhard merged commit 9e2df96 into master Aug 3, 2017
@brittainhard brittainhard deleted the 578_file_upload_fix branch August 3, 2017 19:59