Fix: improve speed of trees with MAE criterion from O(n^2) to O(n log n) #32100
base: main
Conversation
…dded print everywhere to debug; fixed some bugs
…al PR but not all
```cython
# MAE split precomputations algorithm
# =============================================================================

def _any_isnan_axis0(const float32_t[:, :] X):
```
I moved this one up, in the helpers section.
@adam2392 could you please have a look here?
adam2392 left a comment:
First of all, thanks @cakedev0 for taking a look at this challenging but impactful issue and proposing a fix.
I took an initial glance. Overall this looks like the right direction to me, so I want to make sure others take a look before we dive into the nitty-gritty of making the PR mergeable and maintainable.
I have an open question: for decision trees, we can imagine imposing a quantile-criterion split (e.g. the pinball loss). Naively, I think we can make the WeightedHeaps work to maintain any quantile, right?
Perhaps @thomasjpfan wants to take a look as well before we dive deeper into the code.
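For reference, the pinball loss mentioned above can be sketched in a few lines of NumPy (illustrative only, not code from this PR; at q=0.5 it reduces to half the absolute error):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    # Pinball (quantile) loss at level q; q=0.5 gives 0.5 * mean absolute error.
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))
```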
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
I've never used this kind of tool. I might give it a try if I'm in the mood for learning new things ^^
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
ogrisel left a comment:
I started to review the code but won't have time to complete my review today, so here are a first few comments.
Something that seems to be missing from the tests is checks that the code works as expected for multi-output y with non-default criteria.
True! It should be easy enough to add such tests with some …
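For illustration, such a test could look roughly like this (a hypothetical sketch, not the test actually added to the PR; it only uses the public estimator API):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def test_absolute_error_multioutput_smoke():
    # Fit a tree on a 2-output target with the MAE criterion.
    rng = np.random.RandomState(0)
    X = rng.rand(50, 3)
    Y = rng.rand(50, 2)  # multi-output y
    tree = DecisionTreeRegressor(criterion="absolute_error", random_state=0)
    tree.fit(X, Y)
    assert tree.predict(X).shape == (50, 2)
```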
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
```cython
# WeightedFenwickTree data structure
# =============================================================================

cdef class WeightedFenwickTree:
```
Catching up on the comments.
Am I understanding correctly that this class is slower when using np.empty + memoryviews? Is it the class itself, or something to do with training a lot of trees?
What I measured to be slower was just one long run of _py_precompute_absolute_errors, so it's not related to training a lot of trees. It was ~20% slower.
I think it's because Cython memoryviews are backed by C structs: when you write mem_v[i] in Cython, it compiles to something like mem_v->data[i] in C.
But I think this was known by the people who wrote sklearn/tree initially. Typically in sort, which is what usually dominates the execution time, raw pointers are used rather than memoryviews.
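To illustrate the two access patterns being compared (a hedged sketch, not code from this PR; even with bounds checking disabled, the generated C can differ because a memoryview carries a struct with data and stride fields):

```cython
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double sum_memoryview(const double[:] values) noexcept nogil:
    # Every values[i] goes through the memoryview struct (data + strides).
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(values.shape[0]):
        total += values[i]
    return total

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double sum_pointer(const double* values, Py_ssize_t n) noexcept nogil:
    # Raw pointer indexing, as the existing sort() code in sklearn/tree does.
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(n):
        total += values[i]
    return total
```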
+1 for a concurrent PR then.
ogrisel left a comment:
I am slowly wrapping my head around the code of this PR, but unfortunately won't have time to finalize my review today. Still, here is my current feedback, inline below.
Also, a note for later: instead of computing the loss for the median or a specific quantile q, we could leverage this code to efficiently compute the aggregate loss for a uniform grid of quantile values: a lot of the computation (e.g. compute_ranks, progressively adding points to the Fenwick tree) would be shared.
Integrating the pinball loss over q in [0, 1] is a way to estimate the CRPS, which is a strictly proper scoring rule for a probabilistic estimate, so in effect we would get a distributional tree estimator quite cheaply.
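For reference, the identity alluded to above is a standard result (notation mine, not the PR's):

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \bigl(F(x) - \mathbf{1}\{y \le x\}\bigr)^2 \, dx = 2 \int_0^1 \rho_q\bigl(y - F^{-1}(q)\bigr) \, dq$$

where $\rho_q(u) = u\,(q - \mathbf{1}\{u < 0\})$ is the pinball loss at level $q$. So averaging the pinball loss over a uniform grid of $q$ values approximates the CRPS up to a constant factor.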
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…to mae-split-optim
LGTM (besides the nitpicks below)! Thanks for the great PR.
I'll let @adam2392 do the merge if he is still +1 after the latest changes.
Possible follow-ups:
- generalize to regression for an arbitrary quantile;
- add support for missing values (if not overly complex).
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Actually, this is very simple (it even simplifies the current code base), and has nothing to do with criteria. Criteria don't interact with feature values, just with the target values and their ordering via …

I missed that PR despite the notification...
This PR re-implements how `DecisionTreeRegressor(criterion='absolute_error')` works under the hood, for performance. The current algorithm for computing the AE of a split incurs an O(n^2) overall complexity for building a tree, which quickly becomes impractical. My implementation makes it O(n log n), which is tremendously faster. For instance, with d=2, n=100_000 and max_depth=1 (just one split), the execution time went from ~30s to ~100ms on my machine.
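A minimal way to reproduce this kind of measurement (a hedged sketch; exact timings are machine-dependent):

```python
import time

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100_000, 2)  # n=100_000, d=2
y = rng.rand(100_000)

tree = DecisionTreeRegressor(criterion="absolute_error", max_depth=1)
tic = time.perf_counter()
tree.fit(X, y)  # a single split
print(f"fit time: {time.perf_counter() - tic:.3f}s")
```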
Referenced Issues
Fixes #9626 by reducing the complexity from O(n^2) to O(n log n).
Also fixes #32099 & #10725 (which are probably duplicates), but that's more of a side effect of completely re-implementing the criterion logic for MAE.
Supersedes #11649 (which was opened to fix #10725 7 years ago but never merged).
Explanation of my changes
The changes focus solely on the class `MAE(RegressionCriterion)`. The previous implementation had an O(n^2) overall complexity emerging from several methods in this class:

- `update`: O(n) cost due to updating a data structure that keeps the data sorted (`WeightedMedianCalculator`/`WeightedPQueue`). Called O(n) times to find the best split => O(n^2) overall.
- `children_impurity`: O(n) cost due to looping over all the data points. Called O(n) times to find the best split => O(n^2) overall.

Those can't really be fixed by small local changes, as the overall algorithm is O(n^2) independently of how you implement it. Hence a complete rewrite was needed. As discussed in this technical report I wrote, there are several efficient algorithms to solve the problem (computing the absolute errors for all the possible splits along one feature).
The one I chose initially was an intuitive adaptation of the well-known two-heap solution to the "find the median from a data stream" problem. But even though it has an O(n log n) expected complexity, it can degrade to O(n^2 log n) in some pathological cases. So after some discussion, it was decided to implement another solution: the "Fenwick tree option". This solution is based on a Fenwick tree, a data structure specialized in efficient prefix-sum computations and updates; a minimal sketch follows.
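Here is a minimal Python sketch of the core Fenwick-tree operations, assuming target values have been mapped to ranks 1..n (illustrative only; the PR's Cython `WeightedFenwickTree` is the actual implementation and is more elaborate):

```python
class WeightedFenwickTree:
    """Minimal sketch: prefix sums of weights over ranks 1..n in O(log n)."""

    def __init__(self, n):
        self.n = n
        self.tree = [0.0] * (n + 1)  # 1-based indexing

    def add(self, rank, weight):
        # Add `weight` at position `rank` (the 1-based rank of a target value).
        while rank <= self.n:
            self.tree[rank] += weight
            rank += rank & (-rank)

    def prefix_sum(self, rank):
        # Total weight of positions 1..rank.
        total = 0.0
        while rank > 0:
            total += self.tree[rank]
            rank -= rank & (-rank)
        return total
```

A Fenwick tree also supports finding the smallest rank whose prefix sum exceeds half the total weight (i.e. the weighted median) in O(log n) via a top-down binary descent; presumably the PR's version additionally accumulates weighted target values alongside weights so that the sums needed below come out of the same traversal.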
See the technical report for a detailed explanation of the algorithm, but in short, the main steps are:

- The AE of a child decomposes into 4 prefix/suffix sums around its weighted median; their values can be found while searching for the median in the tree, and once you have those, the computation becomes O(1) (see the identity below).
- Iterate over the data from left to right to compute the AE for every possible left child, and iterate from right to left to compute the AE for every possible right child.
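For concreteness, the O(1) step can be written with the following identity (the notation for the four sums is mine, not necessarily the PR's). With $m$ the weighted median of a child node, and $W_{\le}, S_{\le}$ (resp. $W_{>}, S_{>}$) the total weight and weighted sum of the target values $y_i \le m$ (resp. $y_i > m$):

$$\mathrm{AE} = \sum_i w_i \,\lvert y_i - m \rvert = \bigl(m\,W_{\le} - S_{\le}\bigr) + \bigl(S_{>} - m\,W_{>}\bigr)$$

All four quantities are prefix/suffix sums over ranks, which is exactly what the Fenwick tree provides in O(log n).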
This logic is implemented in `tree/_criterion.pyx::precompute_absolute_errors`, as I wanted to be able to unit test it. After some research, I found a paper about the same problem. Their approach uses the two-heaps idea and generalizes to arbitrary quantiles (as done in my follow-up PR), but it does not handle weighted samples. Also, the paper uses a more elaborate formula for the absolute error/loss computation than mine; TBH it looks unnecessarily complex.