
Design the new pipeline #262

Open
shankari opened this issue Jul 13, 2017 · 11 comments

Comments

@shankari (Contributor)

This issue documents the design discussion and choices for updating the mode inference pipeline to the new data model.

@shankari (Contributor Author)

Design goals:

  1. Separate model building and model application steps
  2. Seed model building from old-style (Moves) data
  3. Support multiple models in parallel

@shankari (Contributor Author)

Both 1. and 2. from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314966304 imply that we should save the model and re-load it.
We can use either standard pickle, or jsonpickle to convert the model to JSON (with the caveat that the serialized model is tied to a particular version of scikit-learn).
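The save/re-load step can be sketched with the standard pickle module. To keep the sketch self-contained, a plain dict stands in for the fitted scikit-learn estimator:

```python
import pickle

# Stand-in for the trained model; in the pipeline this would be a fitted
# scikit-learn estimator (e.g. RandomForestClassifier). Pickled estimators
# are only guaranteed to load under the scikit-learn version that wrote them.
model = {"algorithm": "random_forest", "n_estimators": 50}

blob = pickle.dumps(model)      # model building step: serialize
restored = pickle.loads(blob)   # model application step: reload

assert restored == model
```

jsonpickle works the same way (`jsonpickle.encode` / `jsonpickle.decode`) but produces JSON text instead of a binary blob, which is easier to inspect and check into a repository.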

But how should we store these models?

  1. Store to disk
  2. Store to database

Once we move to user-specific models, we can store them to the database. In fact, if we generate models periodically, we might want to store the history of them in the database. But right now, since we don't allow users to edit/confirm modes, we will start with the seed model built from old-style (Moves) data in the Backup and Section databases. Since we don't generate any more Moves-style data, this model will never change.

So should it be stored on disk?

@shankari (Contributor Author)

In order to answer that question, we need to think about how the seed database will be used.

Another thing to note is that we may want to continue to use the old-style (Moves) data until we build up a critical mass of new data. I just checked, and there are 14104 + 7439 = 21543 entries, which is pretty good. It looks like random forest models can be combined:
https://stackoverflow.com/questions/28489667/combining-random-forest-models-in-scikit-learn
I'm not sure about other scikit-learn models.
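The trick from that Stack Overflow answer is to concatenate the fitted trees of two forests. A minimal sketch with toy data (the training sets and parameters here are purely illustrative):

```python
import copy
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the old-style (Moves) and new confirmed sections
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 1, 0]

rf_seed = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
rf_new = RandomForestClassifier(n_estimators=5, random_state=1).fit(X, y)

# Combine by concatenating the fitted trees of both forests
rf_combined = copy.deepcopy(rf_seed)
rf_combined.estimators_ += rf_new.estimators_
rf_combined.n_estimators = len(rf_combined.estimators_)

assert rf_combined.n_estimators == 10
rf_combined.predict([[0, 0]])  # the merged forest votes across all 10 trees
```

This only works when both forests were trained on the same feature set and label space, which is why a shared feature extraction step matters.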

So the options for seeding are:

  1. Read saved model + combine with newly created model = combined model
  2. Read old data + combine with new data = combined model

Both of those seem reasonable, but 1. seems slightly better because then we can export the model as a seed for other server instances without sharing the raw data.

@shankari (Contributor Author)

ok, so given that we are going with 1. from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314968818, we will really have a static seed model and can store it however we want. I'm tempted to store it in a file in the current directory, just to keep things simple.

@shankari (Contributor Author)

Saved it as 'seed_model.json' (e-mission/e-mission-server@1149631)

@shankari (Contributor Author)

Note also that the current pipeline only loads the backup data if there is insufficient "new" data. But the backup data is about a third of the total (7439 / 21543 = 0.3453), and it seems sad to lose it. It would be good to support both.

Again, there are two ways of doing this:

  1. Copy all confirmed sections into one "training set" database
  2. Support loading from two databases

Option 2. appears to be the least work at this time, although we may want to revisit this once we start experiments on the analysis.
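Option 2 might look roughly like this; the `confirmed_mode` field name and the two-argument shape are hypothetical stand-ins for the real Stage/Backup schemas:

```python
def load_training_sections(stage_sections, backup_sections):
    """Pull confirmed sections from both the current database and the
    old-style (Moves) backup database. The arguments are any iterables
    of section dicts; in the server they would be pymongo query results."""
    confirmed = [s for s in stage_sections if s.get("confirmed_mode") is not None]
    confirmed += [s for s in backup_sections if s.get("confirmed_mode") is not None]
    return confirmed

# Usage with toy data: one unconfirmed section is filtered out
stage = [{"confirmed_mode": 1}, {"confirmed_mode": None}]
backup = [{"confirmed_mode": 4}]
assert len(load_training_sections(stage, backup)) == 2
```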

@shankari (Contributor Author)

Now we move on to the real pipeline.
The current code can be broadly divided into:

  1. extract features
  2. model
  3. infer

We can split these into three files corresponding to those stages, but really, they are specific to this algorithm (aggregate random forest with a specific set of features). Other algorithms will potentially have different implementations of each of them.

Let's just do this for one algorithm first and then extend to multiple algorithms in a later step.
Note that we will need to generate seed models for each of the other algorithms as well, so it's good to do that in a separate step.
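The three-stage split per algorithm could be sketched as one function (or file) per stage. All names and the toy logic here are illustrative, not the actual e-mission module layout:

```python
from collections import Counter

def extract_features(sections):
    # e.g. average speed per section; the real code computes many more features
    return [[s["distance"] / max(s["duration"], 1)] for s in sections]

def build_model(features, labels):
    # stand-in for fitting the aggregate random forest on confirmed sections
    return {"majority": Counter(labels).most_common(1)[0][0]}

def infer(model, features):
    # apply the stored model to unlabeled sections
    return [model["majority"] for _ in features]

sections = [{"distance": 100, "duration": 50}, {"distance": 5000, "duration": 300}]
model = build_model(extract_features(sections), ["walk", "bus"])
assert infer(model, extract_features(sections)) == ["walk", "walk"]
```

Keeping the stage boundaries identical across algorithms is what lets the model building and model application steps be scheduled separately.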

@shankari (Contributor Author)

ok so now that we are building the real pipeline, we need to figure out how to store the newly created models. The plan from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314966462 was to store periodically generated models to the database.

It is fairly clear how to do this for user-specific models, but not for generic models that are built on aggregate data. Our timeseries is all focused on users, and our aggregate is simply a query across users.

@shankari (Contributor Author)

Basically, we need to create a new non-user-specific timeseries.
We can do this in at least three ways:

  1. use the None user_id
  2. create a specific tag that represents no user
  3. create a separate database for non-user-specific entries

The easiest option is probably (2). The most principled option is probably (3).
As long as we can get the interface right, there isn't much to choose between the options.
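Option 2 amounts to reserving a sentinel tag for "no user". A sketch of what the interface could hide behind a single query builder (the sentinel value and the metadata key are assumptions, not the actual e-mission names):

```python
AGGREGATE_TAG = "__no_user__"  # hypothetical reserved tag for aggregate entries

def model_query(user_id=None):
    """Build a timeseries query that works for both user-specific and
    aggregate models by substituting the sentinel when there is no user."""
    return {
        "user_id": user_id if user_id is not None else AGGREGATE_TAG,
        "metadata.key": "inference/model",  # hypothetical key
    }

assert model_query()["user_id"] == AGGREGATE_TAG
assert model_query("user-123")["user_id"] == "user-123"
```

If the sentinel is hidden inside the query builder like this, switching to option 3 (a separate database) later only changes one function.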

@shankari (Contributor Author)

The last thing we need to figure out to finish the new model building part is how we will store the confirmed values. That will allow us to query sections that have been confirmed.

@shankari (Contributor Author)

We haven't yet figured out what kinds of edits we want to do, so this is a bit tricky. But in order to move past this really complicated issue, I am going to assume that we only support mode edits/confirmations. We will represent this using a manual/confirm_mode type entry.
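A manual/confirm_mode entry might look roughly like this; the field names beyond the general metadata/data split are assumptions:

```python
import time

def make_confirm_mode_entry(user_id, section_id, mode):
    # Hypothetical entry shape, following the metadata/data split
    # used by timeseries entries; only the key is taken from the text above
    return {
        "user_id": user_id,
        "metadata": {"key": "manual/confirm_mode", "write_ts": time.time()},
        "data": {"section": section_id, "label": mode},
    }

entry = make_confirm_mode_entry("user-123", "section-456", "bicycling")
assert entry["metadata"]["key"] == "manual/confirm_mode"
```

With this shape, "query sections that have been confirmed" becomes a lookup on `metadata.key == "manual/confirm_mode"` joined back to the referenced sections.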

@shankari shankari transferred this issue from e-mission/e-mission-server Feb 11, 2019