Save model parameters as JSON objects #811

Closed
hdoupe opened this issue Jan 24, 2018 · 6 comments
@hdoupe
Collaborator

hdoupe commented Jan 24, 2018

In PR #808, we discovered a bug involving how model parameters are saved. That sparked the following conversation:

@hdoupe said

Fixing this under the current approach, in which each parameter is stored in its own database column, would involve extensive changes to the database. Given the magnitude of these changes, I think this is a good time to make the swap to storing the run parameters primarily as JSON files. This idea was first raised in #435, @GoFroggyRun has mentioned this, and @codekansas has also suggested that we move away from our current approach.

This would have no effect on the file-upload runs, and the raw input from the GUI is parsed as a dictionary built from the attributes of the model object (i.e. `taxbrain_data = dict(taxbrain_model.__dict__)`). These attributes are populated by the raw input data. Thus, it would be easy to make the change going forward, but it may be more difficult to preserve backwards compatibility.

One solution is to create a decorator like we did for the new results table names. If the version of PB is less than 1.4.0 (this change would warrant a minor bump to 1.4.0), then we form the JSON object from the raw fields and serve that to the parsing functions. This would require keeping all of the no-longer-needed fields, though.
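To make that concrete, here is a minimal sketch of what such a version gate might look like. The names here (`json_fields`, `webapp_version`, `raw_fields`, `LEGACY_PARAM_COLUMNS`) are hypothetical and not PolicyBrain's actual API:

```python
from distutils.version import LooseVersion

# Hypothetical column list; the real model has one column per parameter.
LEGACY_PARAM_COLUMNS = ["II_em", "STD_0", "FICA_ss_trt"]

def json_fields(run):
    """Return a run's GUI parameters as one JSON-style dict."""
    if LooseVersion(run.webapp_version) >= LooseVersion("1.4.0"):
        # New-style runs store the raw GUI fields as a single JSON object.
        return run.raw_fields
    # Pre-1.4.0 runs: rebuild the object from the legacy per-parameter columns.
    return {name: getattr(run, name)
            for name in LEGACY_PARAM_COLUMNS
            if getattr(run, name) is not None}
```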

@GoFroggyRun @codekansas what do you think?

@GoFroggyRun said:

@hdoupe said:

I think this is a good time to make the swap to storing the run parameters primarily as JSON files.

Definitely agree.

Regarding backward compatibility, I'm not sure whether we want to store "all of the no-longer-needed fields." Would it be easier to preserve backward compatibility if we just provided the reform file that was run (not necessarily via the GUI) and its results (which I think is the point of preserving compatibility)? We wouldn't have to worry about when fields were added or removed, while users would still be able to archive what they ran along with the results and cite whatever they need.

@hdoupe said:

Right, they could pull up the results just fine no matter what we do with the parameters. However, we need the fields data to provide the edit parameters page for previous runs. I think there are a couple of solutions:

  1. JSON file to fields converter: as long as we have the JSON reform (which we do from PB 1.0.2 onward), we can convert that back to the fields data and display it. I've been interested in having something like this for a while, but it may take some time to get right, and there may be some edge cases that are difficult to catch.
  2. One-time database update that converts all stored run parameters into JSON fields objects and stores those. Then delete the old, no-longer-needed columns.
  3. Keep the old columns and use some decorator attribute to dynamically create the JSON fields object (a sketch of such an object follows below) and parse that.

(3) seems like the easiest but laziest option. (2) is risky since we are doing a lot of work on the database, but I guess we could back up the database. (1) probably consumes the most time.

  4. On backwards compatibility: if someone checks out the edit parameters page, the parameters may not mean the same thing as before. The fields that were not specified could be populated by the new default values for 2017 and 2018, and if the parameters are submitted again, the difference results will change because the baseline is now the TCJA. So, one solution is to drop backwards compatibility on the edit parameters side and keep it on the results side.

The major downside of (4) is that the user won't have a way to review the parameters used to get an archived result.
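For concreteness, the "JSON fields object" in options (2) and (3) would just be the raw GUI inputs keyed by field name, something along these lines (the keys follow Tax-Calculator naming conventions, but the exact fields and values here are made up for illustration):

```python
# Illustrative only: the shape of a JSON fields object built from raw GUI inputs.
raw_gui_fields = {
    "first_year": 2018,
    "II_em": "5000",         # personal exemption amount, as entered in the GUI
    "FICA_ss_trt": "0.124",  # Social Security payroll tax rate
    "ID_Medical_frt": "0.1", # medical expense deduction floor
}
```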

@hdoupe responded:

I plan to begin working on solution (3) in a separate PR. This will allow us to get a fix out faster, and it allows us to go back and implement a more permanent solution in the future. If anyone has any objections to this plan then please let me know and we can discuss it further.

@GoFroggyRun said:

@hdoupe what I have in mind is more like your (4).

By providing the reform files that were run, we don't necessarily have to allow users to edit their previous runs via the GUI indefinitely as PB releases go by. A reform/assumption file and the Tax-Calculator version (and the TaxData version, once available) should be sufficient for a user to (theoretically) recreate his/her results and provide a citation, which, in my opinion, are the primary reasons we care about backward compatibility. For example, the file-upload page doesn't support editing, which makes it easier to maintain than the GUI page in terms of backward compatibility. Maybe we can do something similar: once GUI runs go stale, we make GUI re-editing unavailable but still store the reform/assumption information the same way the file-upload page does.

@hdoupe
Collaborator Author

hdoupe commented Jan 24, 2018

@GoFroggyRun said

By providing the reform files that were run, we don't necessarily have to allow users to edit their previous runs via the GUI indefinitely as PB releases go by. A reform/assumption file and the Tax-Calculator version (and the TaxData version, once available) should be sufficient for a user to (theoretically) recreate his/her results and provide a citation, which, in my opinion, are the primary reasons we care about backward compatibility.

Ok, that makes sense. We can display the reform and assumptions files going back to PB 1.0.2, which was released in August 2017. That's when we started converting GUI parameters into reform files and storing them.

For example, the file-upload page doesn't support editing, which makes it easier to maintain than the GUI page in terms of backward compatibility. Maybe we can do something similar: once GUI runs go stale, we make GUI re-editing unavailable but still store the reform/assumption information the same way the file-upload page does.

We could only guarantee that GUI parameters can be edited for six months or so after the run. This time period could be revisited once we have a more stable GUI interface.
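A minimal sketch of how such a cutoff might be enforced, assuming a hypothetical `creation_date` attribute on the run model:

```python
from datetime import datetime, timedelta

# Hypothetical grace period: GUI runs older than this become view-only.
EDIT_WINDOW = timedelta(days=180)  # roughly six months

def can_edit(run):
    """True if a run is recent enough to re-open on the edit parameters page."""
    return datetime.utcnow() - run.creation_date <= EDIT_WINDOW
```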

@hdoupe
Collaborator Author

hdoupe commented Jan 24, 2018

It would be nice to know how important archiving all of this information is to users. My questions are:

  1. How often do you revisit old runs and view the GUI parameters via the edit parameters page?
  2. Would it be OK if, for older runs (something like six months old), we only displayed the reform and assumption files that were created for the run?
  3. Would it be OK if, for older runs (something like six months old), we did not allow users to edit the parameters and resubmit the model?

@MattHJensen @MaxGhenis if you could weigh in on this, then that would be very helpful.

@GoFroggyRun
Contributor

@hdoupe I agree that we definitely want to have a "grace period" that allows users to re-edit their runs. I'm not sure, however, whether it needs to be 6 months; that seems a bit too long to me. On the other hand, I'm also not sure how hard it would be to guarantee backward compatibility for a 6-month period. If a shorter grace period would make it easier to maintain backward compatibility, I would personally prefer the shorter one. Of course, it really comes down to how frequently users re-edit their previous runs.

@MattHJensen
Contributor

The primary goal should be to ensure that results are archived and that users can see exactly how to replicate any given set of results.

Actually allowing the user to edit and rerun parameters on an old reform is very nice, but not strictly necessary. Note that "edit parameters and rerun an old reform" can mean two different things: (a) being able to rerun an old reform in the same environment and get exactly the same results, or (b) being able to run the same parameters in an updated model. These are two separate things; both are good, and both are difficult.

(a) means needing to be able to install any version of the models on a node at any moment.
(b) would require dramatically limiting parameter removals, structural redefinitions, and renamings, or, alternatively, automatically generating documentation that informs the user of those changes.
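To illustrate what (b) entails, here is a sketch of the kind of translation shim it would require before rerunning a saved parameter set; the rename and removal tables are invented for the example:

```python
# Hypothetical shim for approach (b): translate a saved parameter dict
# into the current model's vocabulary before re-running it.
RENAMED = {"_STD_single": "_STD_0"}   # old name -> new name (illustrative)
REMOVED = {"_some_dropped_param"}     # parameters the model no longer accepts

def upgrade_params(old_params):
    """Map old parameter names to current ones, flagging removals."""
    upgraded, warnings = {}, []
    for name, value in old_params.items():
        if name in REMOVED:
            warnings.append(name + " has been removed from the model.")
        else:
            upgraded[RENAMED.get(name, name)] = value
    return upgraded, warnings
```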

My sense is that (a) is probably not worth the effort if we provide sufficient information on how to replicate the results using the models locally. (It is worth being aware that this won't entirely satisfy users who don't have access to the necessary private data.)

@hdoupe
Collaborator Author

hdoupe commented Jan 26, 2018

Thanks for your input @MattHJensen.

It seems like (a) is a reach goal. Maybe we will be there someday, but with our current setup, it seems pretty difficult.

Approach (b) is what we have now: we run the same parameters on updated versions of the models. This is not a problem when models are updated with minor internal patches. However, for larger changes, the backwards compatibility required by this approach becomes pretty cumbersome. To accomplish it, you have to deal with a few challenges regarding the parameters:

Technical Challenges

  1. some parameter names change
  2. some parameters are deleted

Qualitative Challenges

  1. some parameters keep the same name but their definitions are tweaked internally
  2. parameters become different relative to the baseline (e.g. what do pre-TCJA reforms mean when run with the post-TCJA baseline)

For now, I think our best approach is to go with (b), but I don't think we should work as hard to keep long-term (> 4-6 months) backwards compatibility with respect to viewing and editing the GUI parameters. I do think that we should give the user more documentation on the results page so that they have enough information to replicate the results. Now that all GUI input reforms are parsed into JSON files, we should display those on the results page. For results from before this functionality was added (before 8/2017), we should provide a similar view of the parameters used to run the model. This last part is the most challenging.

One solution is to keep all of the current parameter fields and parse them dynamically as the results page is loaded. However, it doesn't make much sense to keep 200+ columns in a database for what is probably low usage relative to running new reforms. What could work is the following two-step solution (a sketch of step 2 follows the list):

  1. Create the new raw JSON fields object and a wrapper for it that dynamically builds this object from the fields stored before 8/2017. At this point, we are able to produce a JSON object detailing the parameter specifications for all of the results, but our database is unnecessarily large. This could complicate things in the future.
  2. Apply this wrapper to all results prior to 8/2017, creating a JSON object for every previous run. Then delete all the old parameter fields that we don't need anymore. Now we can produce a JSON object detailing the parameter specifications for all of the results, and our database is in a much leaner state.
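A rough sketch of step 2 as a Django data migration; the app label, model name, column list, and migration dependency are all hypothetical, and the legacy columns would be dropped in a follow-up schema migration:

```python
from django.db import migrations

LEGACY_PARAM_COLUMNS = ["II_em", "STD_0"]  # illustrative; really 200+ columns

def columns_to_json(apps, schema_editor):
    # Copy each run's legacy per-parameter columns into one JSON field.
    TaxSaveInputs = apps.get_model("taxbrain", "TaxSaveInputs")
    for run in TaxSaveInputs.objects.all():
        run.raw_fields = {name: getattr(run, name)
                          for name in LEGACY_PARAM_COLUMNS
                          if getattr(run, name) is not None}
        run.save()

class Migration(migrations.Migration):
    dependencies = [("taxbrain", "0042_add_raw_fields")]  # hypothetical
    operations = [
        migrations.RunPython(columns_to_json, migrations.RunPython.noop),
    ]
```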

After working on issue #812, I feel more comfortable working on the database directly. I don't want to do this often, because Django discourages modifying the database directly and it just seems risky. However, for cases like this where we don't have many other choices, I'm willing to work on it directly.

I'm open to other solutions and critiques on this plan. @GoFroggyRun @codekansas what are your thoughts?

@MattHJensen Does this provide enough replicability?

@hdoupe
Collaborator Author

hdoupe commented Jan 31, 2018

@talumbau do you have any thoughts regarding #811? In particular, I'm interested in your feedback on the approach set out in the above comment.
