Save model parameters as JSON objects #811

Closed
hdoupe opened this issue Jan 24, 2018 · 6 comments
@hdoupe
Collaborator

hdoupe commented Jan 24, 2018

In PR #808, we discovered a bug involving how model parameters are saved. That sparked the following conversation:

@hdoupe said

Fixing this under the current approach, in which each parameter is stored in its own database column, would involve extensive changes to the database. Given the magnitude of these changes, I think this is a good time to make the swap to storing the run parameters primarily as JSON files. This idea was first raised in #435, @GoFroggyRun has mentioned this, and @codekansas has also suggested that we move away from our current approach.

This would have no effect on the file-upload runs, and the raw input from the GUI is parsed as a dictionary built from the attributes of the model object (i.e. `taxbrain_data = dict(taxbrain_model.__dict__)`). These attributes are populated by the raw input data. Thus, it would be easy to make the change going forward, but it may be more difficult to preserve backwards compatibility.

One solution is to create a decorator like we did for the new results table names. If the version of PB is less than 1.4.0 (this change would warrant a minor bump to 1.4.0), then we form the JSON object from the raw fields and serve that to the parsing functions. This would require keeping all of the no-longer-needed fields, though.
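To make that concrete, here is a minimal sketch of what such a version gate might look like. The names here (`json_fields`, `webapp_version`, `raw_fields`, `LEGACY_PARAM_COLUMNS`) are hypothetical and not PolicyBrain's actual API:

```python
from distutils.version import LooseVersion

# Hypothetical column list; the real model has one column per parameter.
LEGACY_PARAM_COLUMNS = ["II_em", "STD_0", "FICA_ss_trt"]

def json_fields(run):
    """Return a run's GUI parameters as one JSON-style dict."""
    if LooseVersion(run.webapp_version) >= LooseVersion("1.4.0"):
        # New-style runs store the raw GUI fields as a single JSON object.
        return run.raw_fields
    # Pre-1.4.0 runs: rebuild the object from the legacy per-parameter columns.
    return {name: getattr(run, name)
            for name in LEGACY_PARAM_COLUMNS
            if getattr(run, name) is not None}
```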

@GoFroggyRun @codekansas what do you think?

@GoFroggyRun said:

@hdoupe said:

I think this is a good time to make the swap to storing the run parameters primarily as JSON files.

Definitely agree.

Regarding backward compatibility, I'm not sure whether we want to store "all of the no-longer-needed fields." Would it be easier to preserve backward compatibility if we just provided the reform file that was run (not necessarily via the GUI) and its results (which I think is the point of preserving compatibility)? We wouldn't have to worry about when fields were added or removed, while users would still be able to archive what they ran along with the results and cite whatever they need.

@hdoupe said:

Right, they could pull up the results just fine no matter what we do with the parameters. However, we need the fields data to provide the edit parameters page for previous runs. I think there are a couple of solutions:

  1. JSON file to fields converter: as long as we have the JSON reform (which we do from PB 1.0.2 onward), we can convert that back to the fields data and display it. I've been interested in having something like this for a while, but it may take some time to get right, and there may be some edge cases that are difficult to catch.
  2. One-time database update that converts all stored run parameters into JSON fields objects and stores those. Then delete the old, no-longer-needed columns.
  3. Keep the old columns and use some decorator attribute to dynamically create the JSON fields object (a sketch of such an object follows below) and parse that.

(3) seems like the easiest but laziest option. (2) is risky since we are doing a lot of work on the database, but I guess we could back up the database. (1) probably consumes the most time.

  4. On backwards compatibility: if someone checks out the edit parameters page, the parameters may not mean the same thing as before. The fields that were not specified could be populated by the new default values for 2017 and 2018, and if the parameters are submitted again, the difference results will change because the baseline is now the TCJA. So, one solution is to drop backwards compatibility on the edit parameters side and keep it on the results side.

The major downside of (4) is that the user won't have a way to review the parameters used to get an archived result.
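For concreteness, the "JSON fields object" in options (2) and (3) would just be the raw GUI inputs keyed by field name, something along these lines (the keys follow Tax-Calculator naming conventions, but the exact fields and values here are made up for illustration):

```python
# Illustrative only: the shape of a JSON fields object built from raw GUI inputs.
raw_gui_fields = {
    "first_year": 2018,
    "II_em": "5000",         # personal exemption amount, as entered in the GUI
    "FICA_ss_trt": "0.124",  # Social Security payroll tax rate
    "ID_Medical_frt": "0.1", # medical expense deduction floor
}
```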

@hdoupe responded:

I plan to begin working on solution (3) in a separate PR. This will allow us to get a fix out faster, and it allows us to go back and implement a more permanent solution in the future. If anyone has any objections to this plan then please let me know and we can discuss it further.

@GoFroggyRun said:

@hdoupe what I have in mind is more like your (4).

By providing the reform files that were run, we don't necessarily have to allow users to edit their previous runs via the GUI indefinitely as PB releases go by. A reform/assumption file and the Tax-Calculator version (and the TaxData version, once available) should be sufficient for a user to (theoretically) recreate his/her results and provide a citation, which, in my opinion, are the primary reasons we care about backward compatibility. For example, the file-upload page doesn't support editing, which makes it easier to maintain than the GUI page in terms of backward compatibility. Maybe we can do something similar: once GUI runs go stale, we make GUI re-editing unavailable but still store the reform/assumption information the same way the file-upload page does.

@hdoupe
Collaborator Author

hdoupe commented Jan 24, 2018

@GoFroggyRun said

By providing the reform files that were run, we don't necessarily have to allow users to edit their previous runs via the GUI indefinitely as PB releases go by. A reform/assumption file and the Tax-Calculator version (and the TaxData version, once available) should be sufficient for a user to (theoretically) recreate his/her results and provide a citation, which, in my opinion, are the primary reasons we care about backward compatibility.

Ok, that makes sense. We can display the reform and assumptions files going back to PB 1.0.2, which was released in August 2017. That's when we started converting GUI parameters into reform files and storing them.

For example, the file-upload page doesn't support editing, which makes it easier to maintain than the GUI page in terms of backward compatibility. Maybe we can do something similar: once GUI runs go stale, we make GUI re-editing unavailable but still store the reform/assumption information the same way the file-upload page does.

We could only guarantee that GUI parameters can be edited for six months or so after the run. This time period could be revisited once we have a more stable GUI interface.
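A minimal sketch of how such a cutoff might be enforced, assuming a hypothetical `creation_date` attribute on the run model:

```python
from datetime import datetime, timedelta

# Hypothetical grace period: GUI runs older than this become view-only.
EDIT_WINDOW = timedelta(days=180)  # roughly six months

def can_edit(run):
    """True if a run is recent enough to re-open on the edit parameters page."""
    return datetime.utcnow() - run.creation_date <= EDIT_WINDOW
```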

@hdoupe
Collaborator Author

hdoupe commented Jan 24, 2018

It would be nice to know how important archiving all of this information is to users. My questions are:

  1. How often do you revisit old runs and view the GUI parameters via the edit parameters page?
  2. Would it be OK if, for older runs (something like six months old), we only displayed the reform and assumption files that were created for the run?
  3. Would it be OK if, for older runs (something like six months old), we did not allow users to edit the parameters and resubmit the model?

@MattHJensen @MaxGhenis if you could weigh in on this, then that would be very helpful.

@GoFroggyRun
Contributor

@hdoupe I agree that we definitely want to have a "grace period" that allows users to re-edit their runs. I'm not sure, however, whether it needs to be 6 months; that seems a bit too long to me. On the other hand, I'm also not sure how hard it would be to guarantee backward compatibility for a 6-month period. If a shorter grace period would make it easier to maintain backward compatibility, I would personally prefer the shorter one. Of course, it really comes down to how frequently users re-edit their previous runs.

@MattHJensen
Contributor

The primary goal should be to ensure that results are archived and that users can see exactly how to replicate any given set of results.

Actually allowing the user to edit and rerun parameters on an old reform is very nice, but not strictly necessary. Note that "edit parameters and rerun an old reform" can mean two different things: (a) being able to rerun an old reform in the same environment and get exactly the same results, or (b) being able to run the same parameters in an updated model. These are two separate things; both are good, and both are difficult.

(a) means needing to be able to install any version of the models on a node at any moment.
(b) would require dramatically limiting parameter removals, structural redefinitions, and renamings, or, alternatively, automatically generating documentation that informs the user of those changes.
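To illustrate what (b) entails, here is a sketch of the kind of translation shim it would require before rerunning a saved parameter set; the rename and removal tables are invented for the example:

```python
# Hypothetical shim for approach (b): translate a saved parameter dict
# into the current model's vocabulary before re-running it.
RENAMED = {"_STD_single": "_STD_0"}   # old name -> new name (illustrative)
REMOVED = {"_some_dropped_param"}     # parameters the model no longer accepts

def upgrade_params(old_params):
    """Map old parameter names to current ones, flagging removals."""
    upgraded, warnings = {}, []
    for name, value in old_params.items():
        if name in REMOVED:
            warnings.append(name + " has been removed from the model.")
        else:
            upgraded[RENAMED.get(name, name)] = value
    return upgraded, warnings
```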

My sense is that (a) is probably not worth the effort if we provide sufficient information on how to replicate the results using the models locally. (It is worth being aware that this won't entirely satisfy users who don't have access to the necessary private data.)

@hdoupe
Collaborator Author

hdoupe commented Jan 26, 2018

Thanks for your input @MattHJensen.

It seems like (a) is a reach goal. Maybe we will be there someday, but with our current setup, it seems pretty difficult.

Approach (b) is what we have now: we run the same parameters on updated versions of the models. This is not a problem when models are updated with minor internal patches. However, for larger changes, the backwards compatibility required by this approach becomes pretty cumbersome. To accomplish it, you have to deal with a few challenges regarding the parameters:

Technical Challenges

  1. some parameter names change
  2. some parameters are deleted

Qualitative Challenges

  1. some parameters keep the same name but their definitions are tweaked internally
  2. parameters become different relative to the baseline (e.g. what do pre-TCJA reforms mean when run with the post-TCJA baseline)

For now, I think our best approach is to go with (b), but I don't think we should work as hard to keep long-term (> 4-6 months) backwards compatibility with respect to viewing and editing the GUI parameters. I do think that we should give the user more documentation on the results page so that they have enough information to replicate the results. Now that all GUI input reforms are parsed into JSON files, we should display those on the results page. For results from before this functionality was added (before 8/2017), we should provide a similar view of the parameters used to run the model. This last part is the most challenging.

One solution is to keep all of the current parameter fields and parse them dynamically as the results page is loaded. However, it doesn't make much sense to keep 200+ columns in a database for what is probably low usage relative to running new reforms. What could work is the following two-step solution (a sketch of step 2 follows the list):

  1. Create the new raw JSON fields object and a wrapper for it that dynamically builds this object from the fields stored before 8/2017. At this point, we are able to produce a JSON object detailing the parameter specifications for all of the results, but our database is unnecessarily large. This could complicate things in the future.
  2. Apply this wrapper to all results prior to 8/2017, creating a JSON object for every previous run. Then delete all the old parameter fields that we don't need anymore. Now we can produce a JSON object detailing the parameter specifications for all of the results, and our database is in a much leaner state.
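A rough sketch of step 2 as a Django data migration; the app label, model name, column list, and migration dependency are all hypothetical, and the legacy columns would be dropped in a follow-up schema migration:

```python
from django.db import migrations

LEGACY_PARAM_COLUMNS = ["II_em", "STD_0"]  # illustrative; really 200+ columns

def columns_to_json(apps, schema_editor):
    # Copy each run's legacy per-parameter columns into one JSON field.
    TaxSaveInputs = apps.get_model("taxbrain", "TaxSaveInputs")
    for run in TaxSaveInputs.objects.all():
        run.raw_fields = {name: getattr(run, name)
                          for name in LEGACY_PARAM_COLUMNS
                          if getattr(run, name) is not None}
        run.save()

class Migration(migrations.Migration):
    dependencies = [("taxbrain", "0042_add_raw_fields")]  # hypothetical
    operations = [
        migrations.RunPython(columns_to_json, migrations.RunPython.noop),
    ]
```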

After working on issue #812, I feel more comfortable working on the database directly. I don't want to do this often, because Django discourages modifying the database directly and it just seems risky. However, for cases like this where we don't have many other choices, I'm willing to work on it directly.

I'm open to other solutions and critiques on this plan. @GoFroggyRun @codekansas what are your thoughts?

@MattHJensen Does this provide enough replicability?

@hdoupe
Collaborator Author

hdoupe commented Jan 31, 2018

@talumbau do you have any thoughts regarding #811? In particular, I'm interested in your feedback on the approach set out in the above comment.
