Problem:
We're currently using a custom NamedMatrix protocol and implementation to house the raw votes in the math worker. This structure is more or less a dataframe/dataset structure. At the time it was developed, there wasn't really a good ready-made option for this, so we built our own. This has served us decently over the years, but as the scale of the conversations we run has grown, this setup has proven limiting:
core.matrix implementations don't generally offer a lot of control over the element data type, meaning the vote matrix takes up more memory than it necessarily needs to
the more performant core.matrix implementations also don't support proper missing values (as tech.ml does), meaning that we have to use regular ol Clojure vectors together with nil, which has big implications for performance and memory consumption
Remove columns from matrix corresponding to moderated out comments #1061 could be made easier. While I believe core.matrix does have efficient slice routines for the fancier implementations, I don't believe these are particularly performant for the Clojure vector implementation, and the tech.ml.dataset structure lends itself perfectly to this use case.
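To make the points above a bit more concrete, here is a rough sketch of what the raw votes could look like as a dataset. It assumes tech.ml.dataset is on the classpath and uses the `tech.v3.dataset` namespace. The toy data, the column names (`:pid`, `:c1`, ...) and the var names are made up for illustration and are not taken from the math worker.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical toy data: one row per participant, one column per comment.
;; A nil vote is intended to end up in the column's missing set.
(def raw-votes
  (ds/->dataset [{:pid 0 :c1 1   :c2 -1  :c3 nil}
                 {:pid 1 :c1 nil :c2 1   :c3 0}
                 {:pid 2 :c1 -1  :c2 nil :c3 1}]))

;; Missing entries are tracked per column (as a bitmap of row indexes)
;; rather than as nils inside a vector of vectors.
(def c1-missing (ds/missing (ds/column raw-votes :c1)))

;; Dropping a moderated-out comment is a column operation on the dataset.
;; Here :c2 plays the role of the moderated-out comment.
(def moderated (ds/remove-columns raw-votes [:c2]))
```

The column drop is the part that maps most directly onto #1061. Whether it is actually faster than the current approach at our scale would need measuring.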
Additional context:
Redoing all of the math is a potentially huge task fraught with peril
Suggested solution:
It's possible that we can start small by replacing the "raw" vote NamedMatrix (prior to zeroing out moderated out columns and imputing means for missing entries) with tech.ml, and working iteratively from there. Could possibly even reimplement the PNamedMatrix protocol against the tech.ml.dataset, but this may be easier said than done given the expected output from some of these routines. Still, maybe easier than going all-in on tech.ml right away.
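As a rough illustration of the "reimplement the protocol against a different backing store" idea, here is a dependency-free sketch. The protocol below is a trimmed-down stand-in, and its method names (`rownames`, `colnames`, `get-matrix`) are assumptions for illustration rather than the actual PNamedMatrix definition. A plain map of columns stands in for the tech.ml.dataset.

```clojure
;; Hypothetical stand-in for PNamedMatrix; method names are assumptions.
(defprotocol PNamedMatrixSketch
  (rownames [this] "Participant ids, in row order.")
  (colnames [this] "Comment ids, in column order.")
  (get-matrix [this] "Row-major vector of vectors, nil for a missing vote."))

;; Column-oriented backing store: `columns` maps comment id -> vector of votes.
;; A dataset could play this role; a map keeps the sketch self-contained.
(defrecord ColumnBackedVotes [row-ids col-ids columns]
  PNamedMatrixSketch
  (rownames [_] row-ids)
  (colnames [_] col-ids)
  (get-matrix [_]
    ;; Rebuild the row-major shape that downstream code expects.
    (mapv (fn [i] (mapv #(get-in columns [% i]) col-ids))
          (range (count row-ids)))))

(def sketch-votes
  (->ColumnBackedVotes [:p0 :p1]
                       [:c1 :c2]
                       {:c1 [1 nil] :c2 [-1 1]}))
```

Note that `get-matrix` has to rebuild a row-major structure from column-oriented storage. That conversion is the kind of "expected output" mismatch that could make this harder than it looks.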
Alternative suggestions:
Go all in right away
Possibly implement some of the core.matrix protocols against tech.ml data structures, so we don't have a full rewrite? Not sure of the feasibility of this or whether it's worth the time, but could be potentially useful for other projects looking to transition.
Moar context:
Testing will be important in making sure we get this right as we move towards complete transition.
Update: I've realized that while the core.matrix implementation doesn't support proper missing values, clatrix does support using ##NaN in matrices. I wish I had thought of this way back when, as I think using a proper matrix type over a (Clojure) vector of vectors will probably make a pretty big difference. Nevertheless, a lot of the reasons mentioned above for switching to the tech.ml/dtype-next stack remain valid.
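For anyone curious what the ##NaN approach looks like in practice, here is a minimal plain-Clojure sketch. Primitive double arrays stand in for a proper matrix type, and no clatrix calls are used. The data and the function names (`column-mean`, `impute-column-means`) are hypothetical, and the imputation only mirrors the "imputing means for missing entries" step in spirit.

```clojure
;; One primitive double array per participant row; ##NaN marks "no vote".
(def nan-votes
  [(double-array [1.0 ##NaN -1.0])
   (double-array [##NaN 1.0 0.0])
   (double-array [-1.0 1.0 ##NaN])])

(defn column-mean
  "Mean of column j over the rows that have a vote, or nil if none do."
  [rows j]
  (let [present (->> rows
                     (map #(aget ^doubles % j))
                     (remove #(Double/isNaN %)))]
    (when (seq present)
      (/ (reduce + present) (count present)))))

(defn impute-column-means
  "Returns new rows with each NaN replaced by its column mean
  (0.0 when a column has no votes at all)."
  [rows]
  (let [n-cols (alength ^doubles (first rows))
        means  (mapv #(or (column-mean rows %) 0.0) (range n-cols))]
    (mapv (fn [^doubles row]
            (double-array
             (map-indexed (fn [j v] (if (Double/isNaN v) (means j) v)) row)))
          rows)))
```

The values stay unboxed doubles throughout, which is where the memory and performance difference over a vector of vectors with nil would come from.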