
Design the new pipeline #262

Open
shankari opened this issue Jul 13, 2017 · 11 comments

Comments

@shankari (Contributor)

This issue documents the design discussion and choices for updating the mode inference pipeline to the new data model.

@shankari (Contributor Author)

Design goals:

  1. Separate model building and model application steps
  2. Seed model building from old-style (Moves) data
  3. Support multiple models in parallel

@shankari (Contributor Author)

Both 1. and 2. from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314966304 imply that we should save the model and re-load it.
We can use either standard pickle, or jsonpickle to convert the model to JSON (with the caveat that the serialized model is tied to a particular version of scikit-learn).
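The save/re-load step can be sketched with the standard pickle module. To keep the sketch self-contained, a plain dict stands in for the fitted scikit-learn estimator:

```python
import pickle

# Stand-in for the trained model; in the pipeline this would be a fitted
# scikit-learn estimator (e.g. RandomForestClassifier). Pickled estimators
# are only guaranteed to load under the scikit-learn version that wrote them.
model = {"algorithm": "random_forest", "n_estimators": 50}

blob = pickle.dumps(model)      # model building step: serialize
restored = pickle.loads(blob)   # model application step: reload

assert restored == model
```

jsonpickle works the same way (`jsonpickle.encode` / `jsonpickle.decode`) but produces JSON text instead of a binary blob, which is easier to inspect and check into a repository.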

But how should we store these models?

  1. Store to disk
  2. Store to database

Once we move to user-specific models, we can store them to the database. In fact, if we generate models periodically, we might want to store the history of them in the database. But right now, since we don't allow users to edit/confirm modes, we will start with the seed model built from old-style (Moves) data in the Backup and Section databases. Since we don't generate any more Moves-style data, this model will never change.

So should it be stored on disk?

@shankari (Contributor Author)

In order to answer that question, we need to think about how the seed database will be used.

Another thing to note is that we may want to continue to use the old-style (Moves) data until we build up a critical mass of new data. I just checked, and there are 14104 + 7439 = 21543 entries, which is pretty good. It looks like random forest models can be combined:
https://stackoverflow.com/questions/28489667/combining-random-forest-models-in-scikit-learn
I'm not sure about other scikit-learn models.
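The trick from that Stack Overflow answer is to concatenate the fitted trees of two forests. A minimal sketch with toy data (the training sets and parameters here are purely illustrative):

```python
import copy
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the old-style (Moves) and new confirmed sections
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 1, 0]

rf_seed = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
rf_new = RandomForestClassifier(n_estimators=5, random_state=1).fit(X, y)

# Combine by concatenating the fitted trees of both forests
rf_combined = copy.deepcopy(rf_seed)
rf_combined.estimators_ += rf_new.estimators_
rf_combined.n_estimators = len(rf_combined.estimators_)

assert rf_combined.n_estimators == 10
rf_combined.predict([[0, 0]])  # the merged forest votes across all 10 trees
```

This only works when both forests were trained on the same feature set and label space, which is why a shared feature extraction step matters.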

So the options for seeding are:

  1. Read saved model + combine with newly created model = combined model
  2. Read old data + combine with new data = combined model

Both of those seem reasonable, but 1. seems slightly better because then we can export the model as a seed for other server instances without sharing the raw data.

@shankari (Contributor Author)

ok, so given that we are going with 1. from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314968818, we will really have a static seed model and can store it however we want. I'm tempted to store it in a file in the current directory, just to keep things simple.

@shankari (Contributor Author)

Saved it as 'seed_model.json' (e-mission/e-mission-server@1149631)

@shankari (Contributor Author)

Note also that the current pipeline only loads the backup data if there is insufficient "new" data. But the backup data is about a third of the total (7439 / 21543 = 0.3453), and it seems sad to lose it. It would be good to support both.

Again, there are two ways of doing this:

  1. Copy all confirmed sections into one "training set" database
  2. Support loading from two databases

Option 2. appears to be the least work at this time, although we may want to revisit this once we start experiments on the analysis.
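Option 2 might look roughly like this; the `confirmed_mode` field name and the two-argument shape are hypothetical stand-ins for the real Stage/Backup schemas:

```python
def load_training_sections(stage_sections, backup_sections):
    """Pull confirmed sections from both the current database and the
    old-style (Moves) backup database. The arguments are any iterables
    of section dicts; in the server they would be pymongo query results."""
    confirmed = [s for s in stage_sections if s.get("confirmed_mode") is not None]
    confirmed += [s for s in backup_sections if s.get("confirmed_mode") is not None]
    return confirmed

# Usage with toy data: one unconfirmed section is filtered out
stage = [{"confirmed_mode": 1}, {"confirmed_mode": None}]
backup = [{"confirmed_mode": 4}]
assert len(load_training_sections(stage, backup)) == 2
```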

@shankari (Contributor Author)

Now we move on to the real pipeline.
The current code can be broadly divided into:

  1. extract features
  2. model
  3. infer

We can split these into three files corresponding to those stages, but really, they are specific to this algorithm (aggregate random forest with a specific set of features). Other algorithms will potentially have different implementations of each of them.

Let's just do this for one algorithm first and then extend to multiple algorithms in a later step.
Note that we will need to generate seed models for each of the other algorithms as well, so it's good to do that in a separate step.
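The three-stage split per algorithm could be sketched as one function (or file) per stage. All names and the toy logic here are illustrative, not the actual e-mission module layout:

```python
from collections import Counter

def extract_features(sections):
    # e.g. average speed per section; the real code computes many more features
    return [[s["distance"] / max(s["duration"], 1)] for s in sections]

def build_model(features, labels):
    # stand-in for fitting the aggregate random forest on confirmed sections
    return {"majority": Counter(labels).most_common(1)[0][0]}

def infer(model, features):
    # apply the stored model to unlabeled sections
    return [model["majority"] for _ in features]

sections = [{"distance": 100, "duration": 50}, {"distance": 5000, "duration": 300}]
model = build_model(extract_features(sections), ["walk", "bus"])
assert infer(model, extract_features(sections)) == ["walk", "walk"]
```

Keeping the stage boundaries identical across algorithms is what lets the model building and model application steps be scheduled separately.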

@shankari (Contributor Author)

ok so now that we are building the real pipeline, we need to figure out how to store the newly created models. The plan from https://github.com/e-mission/e-mission-server/issues/508#issuecomment-314966462 was to store periodically generated models to the database.

It is fairly clear how to do this for user-specific models, but not for generic models that are built on aggregate data. Our timeseries is all focused on users, and our aggregate is simply a query across users.

@shankari (Contributor Author)

Basically, we need to create a new non-user-specific timeseries.
We can do this in at least three ways:

  1. use the None user_id
  2. create a specific tag that represents no user
  3. create a separate database for non-user-specific entries

The easiest option is probably (2). The most principled option is probably (3).
As long as we can get the interface right, there isn't much to choose between the options.
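Option 2 amounts to reserving a sentinel tag for "no user". A sketch of what the interface could hide behind a single query builder (the sentinel value and the metadata key are assumptions, not the actual e-mission names):

```python
AGGREGATE_TAG = "__no_user__"  # hypothetical reserved tag for aggregate entries

def model_query(user_id=None):
    """Build a timeseries query that works for both user-specific and
    aggregate models by substituting the sentinel when there is no user."""
    return {
        "user_id": user_id if user_id is not None else AGGREGATE_TAG,
        "metadata.key": "inference/model",  # hypothetical key
    }

assert model_query()["user_id"] == AGGREGATE_TAG
assert model_query("user-123")["user_id"] == "user-123"
```

If the sentinel is hidden inside the query builder like this, switching to option 3 (a separate database) later only changes one function.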

@shankari (Contributor Author)

The last thing we need to figure out to finish the new model building part is how we will store the confirmed values. That will allow us to query sections that have been confirmed.

@shankari (Contributor Author)

We haven't yet figured out what kinds of edits we want to do, so this is a bit tricky. But in order to move past this really complicated issue, I am going to assume that we only support mode edits/confirmations. We will represent this using a manual/confirm_mode type entry.
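A manual/confirm_mode entry might look roughly like this; the field names beyond the general metadata/data split are assumptions:

```python
import time

def make_confirm_mode_entry(user_id, section_id, mode):
    # Hypothetical entry shape, following the metadata/data split
    # used by timeseries entries; only the key is taken from the text above
    return {
        "user_id": user_id,
        "metadata": {"key": "manual/confirm_mode", "write_ts": time.time()},
        "data": {"section": section_id, "label": mode},
    }

entry = make_confirm_mode_entry("user-123", "section-456", "bicycling")
assert entry["metadata"]["key"] == "manual/confirm_mode"
```

With this shape, "query sections that have been confirmed" becomes a lookup on `metadata.key == "manual/confirm_mode"` joined back to the referenced sections.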

@shankari shankari transferred this issue from e-mission/e-mission-server Feb 11, 2019