Integrating online and active learning models #60
Sorry, but is this a comment or feature request? |
I believe it's both? Generally, on-line learning is quite a relevant and important area. For a package to support on-line learning properly, it needs to support sequential data streams and incremental model updating. Parallelization and distributed computation are separate features that are nice on their own, but are quite synergistic with on-line learning. As far as I can see, OnlineStats supports (i) sequential data streams, and (ii) updating through its `fit` method, as well as some simple parallelism through its interface design, which is very nice. I see two main blockers for interfacing: |
Point 1 is straightforward to solve, though obviously it's work (and maybe best done by the OnlineStats folks?). Regarding point 2, this is more subtle: for interface hygiene, I don't like the design decision of OnlineStats that fitting is always updating. I'd rather separate "fit" and "update", clearly distinguishing "first-time fitting" from "updating". This would, in my opinion, also make a lot of sense for Bayesian models, via the Bayesian update - Bayesian models are often automatically on-line (though not necessarily on sequential data streams, as in the stylized on-line ML setting). Any thoughts? Though generally, I wouldn't consider supporting the on-line modelling task a priority above "getting MLJ core working", obviously. |
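To make the suggested "fit" vs. "update" separation concrete, here is a minimal sketch using a conjugate Bayesian mean estimate (known observation precision) as the model. All names (`BayesMean`, `fit`, `update!`) are hypothetical and illustrative only, not part of OnlineStats, MLJ, or any existing package API.

```julia
mutable struct BayesMean
    μ::Float64   # posterior mean
    τ::Float64   # posterior precision
end

# first-time fitting: start from a prior, then absorb the first batch
function fit(::Type{BayesMean}, x::Vector{Float64}; μ0=0.0, τ0=1e-6, τx=1.0)
    model = BayesMean(μ0, τ0)
    update!(model, x; τx=τx)
end

# updating: a Bayesian posterior update using the new batch only
function update!(model::BayesMean, x::Vector{Float64}; τx=1.0)
    n = length(x)
    τnew = model.τ + n * τx
    model.μ = (model.τ * model.μ + τx * sum(x)) / τnew
    model.τ = τnew
    model
end

m = fit(BayesMean, [1.0, 2.0, 3.0])
update!(m, [4.0, 5.0])   # posterior as if fitted on all five points at once
```

The point of the split is that `fit` owns the prior/initial state, while `update!` never needs the previously seen data, only the current posterior - which is exactly the on-line property discussed above.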
As mentioned in #71, I'm interested in adding support for active learning to MLJ. My use case is training models on real-time data, like microphone or camera input (and outputting the model's reactions to actuators/devices in real-time). @fkiraly can you give a concrete example of how you would split OnlineStats' |
Bump. Can someone provide me an example of what they'd like the online learning API to look like, so that I can build out the needed code/interfaces to support this feature? |
Thanks @jpsamaroo for re-pinging this discussion and for the offer to help.

For clarity, here's my understanding of basic online learning: a model trained on some data can subsequently be updated with new data such that (i) the learned state is as if the training data were the old and new data combined, and (ii) the update completes in a time approximating the time required to train on the new data alone. In some cases the learned state based on "train, then update" only approximates the state obtained by training on all the data at once.

Not all machine learning algorithms directly support online learning.

**Basic work-flow**

Here's how I see the basic work-flow for training and updating an MLJ machine:

```julia
X = MLJ.table(rand(1000, 17))

# initialize and train on first batch:
model = @load PCA
mach = machine(model, X)
fit!(mach)

# fit on second batch of data:
Xnew = MLJ.table(rand(10, 17))
inject!(mach, Xnew)
fit!(mach)
```

When new data is injected into a machine, the machine updates an …

**Composing online learners**

If a learner does not support online learning, then I suggest the …

An alternative is that updating a non-online learner with new data, …

We will need syntax for the learning networks. It would look like this:

```julia
Xs = source(X)
mach = machine(model, Xs)
Xout = transform(mach, Xs)

# fit on first batch of data:
fit!(Xout)

# add data and update:
inject!(Xs, Xnew)
fit!(Xout)
```

**Implementation**

In brief, to implement the above just requires: …

The more difficult design decisions revolve around deployment, tuning …

That said, the framework should be similar to that suggested in …

The pragmatic way to move forward which I would advocate, given …

Thoughts anyone?

In terms of implementing the basics, I expect it is best that I take … |
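The `inject!`/`fit!` workflow described above can be mocked up in a few lines. This is a toy sketch only - `ToyMachine`, `inject!` and `mean_` are invented here to illustrate the intended semantics, and bear no relation to MLJ's actual `Machine` internals.

```julia
mutable struct ToyMachine
    sum::Float64              # learned state: running sum
    n::Int                    # number of rows seen so far
    pending::Vector{Float64}  # injected, not-yet-fitted data
end

ToyMachine(x::Vector{Float64}) = ToyMachine(0.0, 0, copy(x))

# injection only buffers data; no training happens here
inject!(mach::ToyMachine, xnew::Vector{Float64}) = append!(mach.pending, xnew)

# fit! consumes only the pending rows, so each call costs time
# proportional to the new data, not to everything seen so far
function fit!(mach::ToyMachine)
    mach.sum += sum(mach.pending)
    mach.n += length(mach.pending)
    empty!(mach.pending)
    mach
end

mean_(mach::ToyMachine) = mach.sum / mach.n

mach = ToyMachine([1.0, 2.0, 3.0])
fit!(mach)                 # first batch
inject!(mach, [4.0, 5.0])
fit!(mach)                 # update, touching only the second batch
mean_(mach)                # 3.0, as if trained on all five points
```

The key property being illustrated is the two-part contract from the definition above: the final state matches batch training, but the second `fit!` never revisits the first batch.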
I'm developing an online unsupervised learning model for time series, which can do prediction / anomaly detection when coupled with a supervised model. As I'm looking for a standardized interface, I'm thinking of experimenting with MLJ. This could be a use case coupling this issue with #303 and #51. |
Thanks for that. It might be a challenge to introduce time series and online learning to MLJ simultaneously, but all help and input are welcome. On the time series front, see also #303 (continuing time-series related discussion there) and JuliaAI/ScientificTypes.jl#14 . |
To generalize this a bit from a discussion with @ablaom on Slack, it seems like there are at least four different cases to consider:
For (4), lots of statistical models can be fit in terms of sufficient statistics. If we add or remove features, there are often ways to efficiently update those sufficient statistics without starting from scratch. For example, say we have a linear model with squared loss (and maybe some arbitrary regularization). This can be fit using a Cholesky decomposition of …

In addition, in this situation we'd want to be able to use a previous model fit as a starting point, maybe just starting the weight for the new feature at zero. |
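A minimal sketch of the sufficient-statistics idea for a (ridge-regularized) linear model with squared loss: accumulate XᵀX and Xᵀy batch-by-batch, and solve for the weights on demand. The type and function names here are hypothetical, invented for illustration.

```julia
using LinearAlgebra

mutable struct SuffStats
    XtX::Matrix{Float64}
    Xty::Vector{Float64}
end

SuffStats(p::Int) = SuffStats(zeros(p, p), zeros(p))

# absorbing a batch costs O(n p^2) in the batch size n only;
# previously seen rows are never revisited
function update!(s::SuffStats, X::Matrix{Float64}, y::Vector{Float64})
    s.XtX .+= X' * X
    s.Xty .+= X' * y
    s
end

# solve for the weights from the accumulated statistics
coef(s::SuffStats; λ=1e-8) = (s.XtX + λ * I) \ s.Xty

p = 3
w_true = [1.0, -2.0, 0.5]
s = SuffStats(p)
X1 = randn(100, p); update!(s, X1, X1 * w_true)
X2 = randn(50, p);  update!(s, X2, X2 * w_true)
coef(s)   # ≈ w_true, the same fit as training on all 150 rows at once
```

Adding a feature would mean growing `XtX` by one row and column (and `Xty` by one entry), filled in from one pass over the data for the new column only - the existing entries stay valid, which is the efficiency claim made above.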
I've recently come up with a workaround for this feature, in which I update an xgboost model by defining `MLJBase.fit!(m::Machine, X, y)`, and I've spent a bit of time considering whether this can be generalized. For the cases that @cscherrer laid out above, I think 1, 2 and 3 should be relatively easy (for models where they are possible at all), while 4 is likely to be very hard. I'll summarize some of the thoughts I've had about a …
Something like this seems like it would be easier than @ablaom's … Thoughts? |
The syntax |
That's why I think the …

```julia
fit!(mach::Machine, X, y) = fit!(mach.fitresult, X, y)
```

I'm not entirely sure what you mean, but I think your concern is that the existing definition of … I don't really see any way around this: it's not realistic to always require that all the data is kept. If you have an entire network you can have …

So, TL;DR, my suggestion was that models would be required to implement something like |
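The delegation pattern suggested in the one-liner above can be sketched end-to-end as follows. Everything here is hypothetical - a stand-in `Machine` type and an invented `RunningSums` fitresult - meant only to show how a machine could forward incremental fitting to a fitresult that opts in by implementing `fit!`.

```julia
mutable struct Machine{F}
    fitresult::F
end

# the machine forwards incremental fitting to its fitresult;
# only fitresult types defining fit! support this
fit!(mach::Machine, X, y) = (fit!(mach.fitresult, X, y); mach)

# a fitresult type that supports incremental updates
mutable struct RunningSums
    sx::Float64
    sy::Float64
end

function fit!(fr::RunningSums, X::Vector{Float64}, y::Vector{Float64})
    fr.sx += sum(X)
    fr.sy += sum(y)
    fr
end

mach = Machine(RunningSums(0.0, 0.0))
fit!(mach, [1.0, 2.0], [3.0, 4.0])
fit!(mach, [5.0], [6.0])
mach.fitresult.sx   # 8.0 - earlier batches did not need to be kept
```

Note the trade-off being discussed: the fitresult carries all the state needed for future updates, so the machine never has to retain the raw training data.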
Yeah, we already have the stub (see above comment):
We just don't have any models that implement it. (And I don't like the name anymore - I'm using …) We could additionally: …
How's that sound? One question is whether this could play nicely with model composition. That might be quite tricky, and I will have to think about it some more. |
Integrating OnlineStats (its online learning algorithms) and giving it an easy-to-use hyperparameter-tuning context would make Julia even more useful for quick ML on really big data.