Switch from boutique NamedMatrix impl and core.matrix to tech.ml stack for math #1062

metasoarous · 2021-07-11T00:28:05Z

Problem:
We're currently using a custom NamedMatrix protocol and implementation to house the raw votes in the math worker. This structure is more or less a dataframe/dataset structure. At the time it was developed, there wasn't really a good ready-made option for this, and so we built our own. This has served us decently over the years, but as the scale of the conversations we've run has grown, this setup has proven limited:

for large conversations, the NamedMatrix bootstrap routine is terribly inefficient, and I think it is behind the HeapOverflow errors (when large conversations are running) which have brought the math worker down and led to reports not loading for some time (see Create endpoint with math worker status #957, Report stuck on "loading" screen #803, Report fails loading #1051, Polis report not generating on free account #956, and the foreboding Report page fails to load #666)
core.matrix implementations don't generally offer a lot of control over the element data type, meaning the vote matrix takes up more memory than it necessarily needs to
the more performant core.matrix implementations also don't support proper missing values (as tech.ml does), meaning that we have to use regular ol Clojure vectors together with nil, which has big implications for performance and memory consumption
Remove columns from matrix corresponding to moderated out comments #1061 could be made easier. While I believe core.matrix does have efficient slice routines for the fancier implementations, I don't believe these are particularly performant for the Clojure vector implementation, and the tech.ml.dataset structure lends itself perfectly to this use case.
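
To make the memory and column-removal points above concrete, here is a minimal, language-agnostic sketch in Python (standard library only). It is not tech.ml.dataset code and does not use its API. The vote encoding (-1, 0, 1), the `MISSING` sentinel, and the helper names `make_column`, `drop_columns`, and `column_mean` are all assumptions made for illustration; a real dataset library would more likely track missing entries in a separate index than with a sentinel.

```python
from array import array

# Assumption: votes are encoded as -1, 0, or 1, so a value outside that
# range can stand in for "no vote" inside a signed 8-bit column.
MISSING = -128


def make_column(votes):
    """Pack one comment's votes into a signed 8-bit array (one byte per entry)."""
    return array("b", [MISSING if v is None else v for v in votes])


def drop_columns(dataset, moderated_out):
    """Return a new column map without the given comment ids.

    The remaining columns are shared with the input, not copied.
    """
    return {cid: col for cid, col in dataset.items() if cid not in moderated_out}


def column_mean(col):
    """Mean of the present votes in a column, or None if every entry is missing."""
    present = [v for v in col if v != MISSING]
    return sum(present) / len(present) if present else None


# Hypothetical raw votes: one list per comment, None where a participant did not vote.
raw = {
    "c1": [1, None, -1, 1],
    "c2": [None, 0, 1, None],
    "c3": [-1, -1, None, 0],
}
dataset = {cid: make_column(votes) for cid, votes in raw.items()}

# Removing a moderated-out comment is a cheap operation on the column map.
kept = drop_columns(dataset, {"c2"})
```

Because the data is stored column by column, dropping a moderated-out comment never touches the other columns, which is the property that makes a dataset structure a natural fit for #1061.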

Additional context:
Redoing all of the math is a potentially huge task fraught with peril

Suggested solution:
It's possible that we can start small by replacing the "raw" vote NamedMatrix (prior to zeroing out moderated out columns and imputing means for missing entries) with tech.ml, and working iteratively from there. Could possibly even reimplement the PNamedMatrix protocol against the tech.ml.dataset, but this may be easier said than done given the expected output from some of these routines. Still, maybe easier than going all-in on tech.ml right away.

Alternative suggestions:

Go all in right away
Possibly implement some of the core.matrix protocols against tech.ml data structures, so we don't have a full rewrite? Not sure of the feasibility of this or whether it's worth the time, but could be potentially useful for other projects looking to transition.

Moar context:
Testing will be important in making sure we get this right as we move towards complete transition.

metasoarous · 2021-07-24T17:07:49Z

Update: I've realized that while core.matrix implementations generally don't support proper missing values, clatrix does support using ##NaN in matrices. I wish I had thought of this way back when, as I think using a proper matrix type over a (Clojure) vector of vectors will probably make a pretty big difference. Nevertheless, a lot of the reasons mentioned above for switching to the tech.ml/dtype-next stack remain valid.
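
As a rough illustration of the NaN-as-missing idea, the following Python sketch (standard library only, not clatrix or core.matrix code) stores missing votes as NaN in a plain float matrix and imputes column means, the step mentioned earlier in this issue. The function names `column_means` and `impute_means` and the sample data are hypothetical.

```python
import math

NAN = math.nan


def column_means(matrix):
    """Mean of each column, skipping NaN entries (None if a column is all NaN)."""
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in matrix if not math.isnan(row[j])]
        means.append(sum(present) / len(present) if present else None)
    return means


def impute_means(matrix):
    """Return a copy with each NaN replaced by its column mean.

    Entries in all-NaN columns are left as NaN.
    """
    means = column_means(matrix)
    return [
        [means[j] if math.isnan(x) and means[j] is not None else x
         for j, x in enumerate(row)]
        for row in matrix
    ]


# Hypothetical vote matrix: rows are participants, columns are comments.
votes = [
    [1.0, NAN, -1.0],
    [NAN, 0.0, 1.0],
    [1.0, 1.0, NAN],
]
imputed = impute_means(votes)
```

The trade-off is that every routine touching the matrix has to test for NaN explicitly, since NaN propagates silently through ordinary arithmetic.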

metasoarous added the 🔩 p:math label Jul 11, 2021

metasoarous mentioned this issue Dec 20, 2022

Use more performant core.matrix implementation in math worker #1579

Open

metasoarous added the 🚀 scale-and-perf Performance and scale issues label Dec 20, 2022

jucor mentioned this issue Jan 30, 2025

[Work in Progress DO NOT MERGE] Port math library from clojure to python #1893

Open
