Design the new pipeline #262
Design goals:
Both 1. and 2. from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314966304 imply that we should save the model and re-load it. But how should we store these models?
Once we move to user-specific models, we can store them in the database. In fact, if we generate models periodically, we might want to keep their history in the database as well. But right now, since we don't allow users to edit or confirm modes, we will start with the seed model built from old-style (Moves) data in the Backup and Section databases. Since we no longer generate Moves-style data, this model will never change. So should it be stored on disk?
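To make the "history of models in the database" idea concrete, here is a minimal sketch that uses an in-memory list as a stand-in for a database collection. Everything in it is an assumption made for illustration: the `ModelHistory` class, the `user_id`/`write_ts`/`model` field names, and the idea of picking the latest entry by timestamp are not taken from the e-mission codebase.

```python
import time


class ModelHistory:
    """Hypothetical in-memory stand-in for a database collection of models."""

    def __init__(self):
        self._entries = []

    def save(self, user_id, model, write_ts=None):
        # Each periodic run appends a new entry instead of overwriting the old one,
        # so the full history stays available.
        entry = {
            "user_id": user_id,
            "write_ts": time.time() if write_ts is None else write_ts,
            "model": model,
        }
        self._entries.append(entry)
        return entry

    def history(self, user_id):
        # All entries for one user, oldest first.
        matching = [e for e in self._entries if e["user_id"] == user_id]
        return sorted(matching, key=lambda e: e["write_ts"])

    def latest(self, user_id):
        # The most recently written entry for this user, or None if there is none.
        entries = self.history(user_id)
        return entries[-1] if entries else None


store = ModelHistory()
store.save("user_a", {"version": 1}, write_ts=100.0)
store.save("user_a", {"version": 2}, write_ts=200.0)
store.save("user_b", {"version": 1}, write_ts=150.0)
current_for_a = store.latest("user_a")
```

A static seed model does not need any of this, which is part of why the disk question comes up at all.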
In order to answer that question, we need to think about how the seed database will be Another thing to note is that we may want to continue to use the old-style (Moves) data until we build up a critical mass of new data. I just checked, and it is So the options for seeding are:
Both of those seem reasonable, but 1. seems slightly better because then we can export the model as a seed for other server instances without sharing the raw data.
ok, so given that we are going with 1. from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314968818, we will really have a static seed model and can store it however we want. I'm tempted to store it in a file in the current directory, just to keep things simple.
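As a rough illustration of the "static seed model in a file" approach, the sketch below writes a model object to disk once and reloads it later. The `save_model`/`load_model` helpers, the dict that stands in for a trained model, and the file name `example_seed_model.pkl` are placeholders invented for this example (the actual file name and format used in the project are not assumed here), and `pickle` is just one possible serialization choice.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    # Serialize the model object to a file on disk.
    with open(path, "wb") as fp:
        pickle.dump(model, fp)


def load_model(path):
    # Read the serialized model back; only do this with files you trust.
    with open(path, "rb") as fp:
        return pickle.load(fp)


# Hypothetical stand-in for a trained seed model: a plain dict of parameters.
seed_model = {"algorithm": "random_forest", "feature_names": ["distance", "duration"]}

# A temporary directory is used here so the example does not touch the working directory.
model_dir = tempfile.mkdtemp()
model_path = os.path.join(model_dir, "example_seed_model.pkl")

save_model(seed_model, model_path)
reloaded_model = load_model(model_path)
```

Because the seed model never changes, the save step would only need to run once; every later pipeline run could call the load step alone.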
Saved it as |
Note also that the current pipeline only loads the backup data if there is insufficient "new" data. But the backup data is ~ 1/3rd of the total ( Again, there are two ways of doing this:
Option 2. appears to be the least work at this time, although we may want to revisit this once we start experiments on the analysis.
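The difference between "load backup data only when new data is insufficient" and "always include backup data" can be sketched as follows. This is an interpretation for illustration only: the function names, the `min_new` threshold, and the use of plain lists for sections are assumptions, and the sketch does not claim to match either numbered option above.

```python
def training_set_with_fallback(new_sections, backup_sections, min_new):
    # Fallback behaviour: backup data is added only when there is too little new data.
    if len(new_sections) < min_new:
        return list(backup_sections) + list(new_sections)
    return list(new_sections)


def training_set_always_combined(new_sections, backup_sections):
    # Combined behaviour: backup data is always part of the training set.
    return list(backup_sections) + list(new_sections)


# Toy data: backup is about one third of the total, as in the discussion above.
backup = ["b1", "b2"]
new = ["n1", "n2", "n3", "n4"]

fallback_set = training_set_with_fallback(new, backup, min_new=3)
combined_set = training_set_always_combined(new, backup)
```

With enough new data, the fallback version silently drops the backup third of the data, which is the gap the combined version closes.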
to make the true training set This is a fix for https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314987718
Now we move on to the real pipeline.
We can split these into three files corresponding to those stages, but really, they are specific to this algorithm (aggregate random forest with a specific set of features). Other algorithms will potentially have different implementations of each of them. Let's just do this for one algorithm first and then extend to multiple algorithms in a later step.
ok, so now that we are building the real pipeline, we need to figure out how to store the newly created models. The plan from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314966462 was to store periodically generated models in the database. It is fairly clear how to do this for user-specific models, but not for generic models that are built on aggregate data. Our timeseries is all focused on users, and our aggregate is simply a query across users.
Basically, we need to create a new non-user-specific timeseries.
The easiest option is probably (2). The most principled option is probably (3).
The last thing we need to figure out to finish the new model building part is how we will store the confirmed values. That will allow us to query sections that have been confirmed.
We haven't yet figured out what kinds of edits we want to do, so this is a bit tricky. But in order to move past this really complicated issue, I am going to assume that we only support mode edits/confirmations. We will represent this using a |
This issue documents the design discussion and choices for updating the mode inference pipeline to the new data model.