
DIS: Cython 3 perf regressions #27086

Closed
lithomas1 opened this issue Aug 17, 2023 · 35 comments
Labels: cython, Needs Benchmarks, Performance


@lithomas1 (Contributor)

It looks like scikit-learn hasn't updated to Cython 3 yet, but when you do, can you keep an eye out for performance regressions?

On pandas, we updated to Cython 3 but then reverted, since we had issues with some of our benchmarks regressing, and it would be nice to know if this is a bigger problem in Cython.

github-actions bot added the Needs Triage label on Aug 17, 2023
@jjerphan added the Performance, Needs Benchmarks and cython labels and removed the Needs Triage label on Aug 18, 2023
@glemaitre (Member)

Thanks. We should be able to observe the regression in the continuous benchmark: https://scikit-learn.org/scikit-learn-benchmarks/

@glemaitre (Member)

Actually, no results have been uploaded since July 12. Maybe @jeremiedbb can help :)

@ogrisel (Member)

ogrisel commented Aug 22, 2023

Also we don't use Cython 3 in the main branch yet as far as I know. We only test Cython 3 in the scipy-dev build which is not benchmarked.

@glemaitre (Member)

glemaitre commented Aug 22, 2023

We recently updated the lock file (during the EuroSciPy sprint): https://github.com/scikit-learn/scikit-learn/pull/27024/files and thus we are now using Cython 3.0.

We could do a rough check to see whether there is an overall slowdown across all tests on the CI.

I checked and there is too much variation to rely on the test suite.

@ogrisel (Member)

ogrisel commented Aug 22, 2023

I missed that. Thanks for the heads up.

@jjerphan (Member)

jjerphan commented Aug 25, 2023

Actually since there's no upper version pin for Cython, we might have been using Cython 3 since its release.

@lithomas1 (Contributor, Author)

Hm, I just looked at your ASV benchmarks and you don't seem to distinguish between Cython 0.29 and Cython 3.0.

See ours for reference: https://asv-runner.github.io/asv-collection/pandas/#frame_ctor.FromDicts.time_nested_dict
(In the left sidebar, you can choose between Cython 0.29 and Cython 3.0, while your benchmarks only seem to filter by Python and pandas version.)

@jjerphan (Member)

I think we need to pin Cython>=0.29.33,<3 (except for nightly builds) and regenerate the lock files, as before, until we make sure there's no performance regression with Cython 3.

What do you think? I can't look into that now, but could someone else? 🙏

@glemaitre (Member)

@jeremiedbb is actually looking at the ASV issue and will run the benchmarks before and after #27024, which changed the lock files and introduced Cython 3.0.

@jeremiedbb (Member)

Actually, the use of Cython 3 in the benchmarks is not linked to the update of the lock files. The lock files are only for the CI, while the asv benchmarks run against the latest version available.
All runs before mid-July, when the benchmarks stopped being uploaded, were run with Cython 0.29. The last one, which I manually ran yesterday, uses Cython 3.

From a quick look I can see some severe regressions.

We need more runs to check that this does not come from a one-time issue with the machine running the benchmarks. Additional checks are also needed to make sure Cython 3 is the reason for these regressions, since the benchmarks were not uploaded for a long time.

@jeremiedbb (Member)

jeremiedbb commented Aug 30, 2023

I added additional runs on asv and the slowdown looks persistent: https://scikit-learn.org/scikit-learn-benchmarks/#/.

I could reproduce it for the RandomForest locally on the last commit of main, just by changing the version of Cython, although the slowdown is smaller on my laptop (still x2).
I couldn't reproduce it for HGBT. It might be because my laptop has only 2 cores while the machine running the benchmarks has 8.

That makes me quite confident that Cython 3 is the reason for these slowdowns.

I'm also convinced that we should pin the versions of the dependencies in the asv config to be able to distinguish between regressions coming from our code and regressions coming from the dependencies. I'm going to open a PR for that.
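For context, asv lets you pin dependency versions through the "matrix" section of asv.conf.json. A purely hypothetical sketch (the package list, version numbers and exact schema depend on the asv version and on scikit-learn's actual configuration):

{
    "matrix": {
        "cython": ["0.29.36"],
        "numpy": ["1.25.2"],
        "scipy": ["1.11.2"]
    }
}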

@Micky774 (Contributor)

I'll take a look at profiling a couple of the slowdowns. Hopefully I can get a sense of what actual changes may be causing this.

@Micky774 (Contributor)

@da-woods have you seen other reports of these slowdowns? I'm wondering if there are any hints as to what may be happening

@da-woods

General comments - I haven't looked into your specific regressions:

The most common issue I've seen is due to the switch from functions being implicitly noexcept to being able to propagate exceptions by default. Specifically, for functions returning void this means that there will always be an exception check. In a nogil (often parallel) block, this means they always have to acquire the GIL.

This also applies to functions returning a fused type, for slightly more complicated reasons (but the same basic idea).

I'm working on a) a fix for the fused return type and b) better diagnostics for this issue (i.e. to point out places where it might be a problem). Hopefully those will make it into the next release.

Hopefully that gives you a clue where to start looking. If you want, I can try to find time to investigate myself (but you'll probably need to give me idiot-proof instructions on how to run the benchmark).
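To make that concrete, here is a minimal, hypothetical Cython sketch of the pattern described above (illustrative only, not scikit-learn code): under Cython 3 defaults the void-returning helper is assumed to be able to raise, so every call gets an exception check, and inside prange that check has to take the GIL; declaring it noexcept restores the old behaviour.

from cython.parallel import prange

cdef void scale_element(double* data, Py_ssize_t i, double factor) nogil:
    # Cython 3 default: this function may propagate exceptions, so each call
    # is followed by an exception check; in a nogil/prange region that check
    # requires acquiring the GIL.
    data[i] *= factor

cdef void scale_element_noexcept(double* data, Py_ssize_t i, double factor) noexcept nogil:
    # Declared noexcept: no exception check and no GIL acquisition at call
    # sites, matching the implicit Cython 0.29 behaviour.
    data[i] *= factor

def scale_all(double[::1] data, double factor):
    cdef Py_ssize_t i
    cdef double* ptr = &data[0]
    for i in prange(data.shape[0], nogil=True):
        scale_element_noexcept(ptr, i, factor)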

@da-woods

I think I forgot to say how to fix it (if this is actually the issue). The choices are:

  • Declare the functions causing the issue as noexcept (see the sketch below),
  • Use the legacy_implicit_noexcept compiler directive - I don't really recommend it because it's a catch-all, but it'll give you a quick diagnostic of whether this is the issue,
  • Use a dummy int return type (and just return 0 on success); this lets Cython use -1 as an error return value, so you get the benefit of the improved exception propagation with much less performance cost.
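For illustration, hypothetical sketches of the first and third options (the names are made up, not scikit-learn code); the project-wide legacy_implicit_noexcept directive is left out:

# Option 1: declare the helper noexcept so calls from nogil code need no
# exception check (the implicit behaviour of Cython 0.29).
cdef void fill_zeros(double* data, Py_ssize_t n) noexcept nogil:
    cdef Py_ssize_t i
    for i in range(n):
        data[i] = 0.0

# Option 3: return a dummy int declared `except -1`. Cython then uses the
# return value itself to signal errors, so on the success path the caller
# only does an integer comparison and never needs to acquire the GIL.
cdef int fill_zeros_checked(double* data, Py_ssize_t n) except -1 nogil:
    cdef Py_ssize_t i
    if data == NULL:
        with gil:
            raise MemoryError("buffer not allocated")
    for i in range(n):
        data[i] = 0.0
    return 0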

@jeremiedbb (Member)

jeremiedbb commented Sep 1, 2023

I profiled the RandomForestClassifier with py-spy and it shows that, with Cython 3, there's a lot of overhead coming from pthread locks that was not there before.

[screenshot: py-spy profile of RandomForestClassifier with Cython 3 (Screenshot_cy3)]

Everything circled in blue is related to pthread locks. I can't figure out what causes that from looking at the functions where it happens, e.g.:

cdef int update(self, SIZE_t new_pos) except -1 nogil:
    """Updated statistics by moving sample_indices[pos:new_pos] to the left child.

    Returns -1 in case of failure to allocate memory (and raise MemoryError)
    or 0 otherwise.

    Parameters
    ----------
    new_pos : SIZE_t
        The new ending position for which to move sample_indices from the right
        child to the left child.
    """
    cdef SIZE_t pos = self.pos
    # The missing samples are assumed to be in
    # self.sample_indices[-self.n_missing:] that is
    # self.sample_indices[end_non_missing:self.end].
    cdef SIZE_t end_non_missing = self.end - self.n_missing

    cdef const SIZE_t[:] sample_indices = self.sample_indices
    cdef const DOUBLE_t[:] sample_weight = self.sample_weight

    cdef SIZE_t i
    cdef SIZE_t p
    cdef SIZE_t k
    cdef SIZE_t c
    cdef DOUBLE_t w = 1.0

    # Update statistics up to new_pos
    #
    # Given that
    #     sum_left[x] + sum_right[x] = sum_total[x]
    # and that sum_total is known, we are going to update
    # sum_left from the direction that require the least amount
    # of computations, i.e. from pos to new_pos or from end to new_pos.
    if (new_pos - pos) <= (end_non_missing - new_pos):
        for p in range(pos, new_pos):
            i = sample_indices[p]

            if sample_weight is not None:
                w = sample_weight[i]

            for k in range(self.n_outputs):
                self.sum_left[k, <SIZE_t> self.y[i, k]] += w

            self.weighted_n_left += w
    else:
        self.reverse_reset()

        for p in range(end_non_missing - 1, new_pos - 1, -1):
            i = sample_indices[p]

            if sample_weight is not None:
                w = sample_weight[i]

            for k in range(self.n_outputs):
                self.sum_left[k, <SIZE_t> self.y[i, k]] -= w

            self.weighted_n_left -= w

    # Update right part statistics
    self.weighted_n_right = self.weighted_n_node_samples - self.weighted_n_left
    for k in range(self.n_outputs):
        for c in range(self.n_classes[k]):
            self.sum_right[k, c] = self.sum_total[k, c] - self.sum_left[k, c]

    self.pos = new_pos

    return 0
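For reference, a profile like the one above can be captured with a native-mode py-spy run along these lines (a sketch; the benchmark script name is hypothetical):

py-spy record --native -o rf_profile.svg -- python bench_random_forest.py

The --native flag is what makes the frames from the compiled extension (including the pthread/GIL activity) visible in the flame graph.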

@da-woods

da-woods commented Sep 1, 2023

From the graph you show it definitely looks like a "waiting on the GIL" issue. Although it isn't obvious quite what's causing it in that function. There's also an issue that's been reported about acquiring the GIL for memoryview reference counting at the end of the function, so it's possibly that. That issue would need a fix on our side. It might go away if you used sample_indices and sample_weights from the class rather than assigning them to local variables. But I imagine that's done for good reason.
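As a rough illustration of that last suggestion (a hypothetical sketch, not the scikit-learn classes): the difference is between binding the attribute to a local memoryview, which creates an extra reference that must be released on function exit (and releasing it may need the GIL), and indexing the attribute directly.

cdef class Accumulator:
    cdef double[:] weights

    def __init__(self, double[:] weights):
        self.weights = weights

    cdef double total_with_local_view(self) noexcept nogil:
        # Local memoryview binding: an extra reference is acquired here and
        # released when the function returns.
        cdef double[:] weights = self.weights
        cdef double s = 0.0
        cdef Py_ssize_t i
        for i in range(weights.shape[0]):
            s += weights[i]
        return s

    cdef double total_with_attribute_access(self) noexcept nogil:
        # Indexing the attribute directly avoids the local reference, at the
        # cost of an attribute lookup on each access.
        cdef double s = 0.0
        cdef Py_ssize_t i
        for i in range(self.weights.shape[0]):
            s += self.weights[i]
        return s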

@jjerphan (Member)

jjerphan commented Sep 2, 2023

#27266 has been opened to have scikit-learn depend on Cython < 3.

@da-woods

da-woods commented Sep 2, 2023

Would it be possible to remeasure RandomForestClassifier with cython/cython#5678? My semi-informed suspicion is that this might well fix it. (Or let me know how to run the benchmark and I can measure it.)

@lesteve (Member)

lesteve commented Sep 5, 2023

FYI, it seems like the CircleCI timeout issues are due (for the most part at least) to Cython 3. In particular, examples/ensemble/plot_forest_hist_grad_boosting_comparison.py, which uses a random forest, now takes ~8 minutes rather than the ~1 minute it took before we updated the lock files to Cython 3.

@jeremiedbb (Member)

jeremiedbb commented Sep 6, 2023

@da-woods it looks like cython/cython#5678 improved things a bit, but not entirely. I ran the same profile with the master branch of Cython:

[screenshot: py-spy profile with Cython master (Screenshot_cy_master)]

The GIL issue at the end of _splitter_node_split_best has gone away (circled in blue on the right of the previous profile), but it's still there for ClassificationCriterion_update (circled in blue on the left of the previous profile).

> It might go away if you used sample_indices and sample_weights from the class rather than assigning them to local variables.

I'm going to try that, thanks for looking into it.
EDIT: I tried and it didn't change anything :/

@jeremiedbb (Member)

jeremiedbb commented Sep 6, 2023

Here's the annotated version of this function, if it can help (with Cython master):

[screenshot: annotated Cython output of the update() function (Screenshot_annotate)]

@da-woods

da-woods commented Sep 6, 2023

That does look helpful, thanks. There are two related problems:

  1. Refnanny only applies to Cython's internal testing, so getting the GIL to set it up should be conditionally compiled.
  2. I don't think we need refnanny at all here.

I'll try to fix it in the next couple of days.

da-woods added a commit to da-woods/cython that referenced this issue Sep 6, 2023
See scikit-learn/scikit-learn#27086

Essentially:
* I'd made functions raising an exception require refnanny ever
  since prange/parallel block exception handling required it
* I actually don't think this is sufficient and that there are other
  ways of raising an exception within a parallel block.
* The upshot is that *any* function with a call to a function with
  a checked exception now requires refnanny.
* For functions with object arguments (i.e. any cdef class method)
  that leads to the GIL always being acquired around the refnanny
  setup (even when refnanny is disabled).

I've fixed it by avoiding using refnanny in parallel blocks
unless we know it's available. I think it's too hard to tell
reliably ahead of time (at least for me).

I think the `have_object_args and has_with_gil_block` criteria
for acquiring the GIL around function definitions might be
excessive, and possibly unnecessary some of the time. It's possibly
mainly reassigned object args. This might be worth refactoring
but I don't think it's necessary to fix this particular
performance regression.
@jeremiedbb reopened this Sep 7, 2023
scoder pushed a commit to cython/cython that referenced this issue Sep 29, 2023
Originally reported in scikit-learn/scikit-learn#27086

Essentially:
* I'd made functions raising an exception require refnanny ever
  since prange/parallel block exception handling required it
* I actually don't think this is sufficient and that there are other
  ways of raising an exception within a parallel block.
* The upshot is that *any* function with a call to a function with
  a checked exception now requires refnanny.
* For functions with object arguments (i.e. any cdef class method)
  that leads to the GIL always being acquired around the refnanny
  setup (even when refnanny is disabled).

I've fixed it by avoiding using refnanny in parallel blocks unless we
know it's available and has actually been used in the generated code.
I think it's too hard to tell reliably ahead of time.
@scoder

scoder commented Sep 29, 2023

Fix committed, will be resolved in Cython 3.0.3.

@jeremiedbb (Member)

Good news: I ran some benchmarks for the estimators where we identified slowdowns coming from Cython 3 (RandomForest*, (Hist)GradientBoosting*, KNeighbors*) against the latest Cython master, and they're all as fast as they were with Cython 0.29.36 (sometimes even a little bit faster)!

This should be confirmed in the asv benchmarks when we update them to run against Cython 3.0.3 (once it's released).

@glemaitre (Member)

Thanks a bunch @scoder and @da-woods for taking care of this issue.

@lorentzenchr (Member)

Can we close?
Does someone have the link to our benchmark results?

@glemaitre (Member)

@jeremiedbb merged the change adding Cython 3.0.3 to the ASV benchmarks. We should be able to check that we don't have any remaining regressions.

@ogrisel (Member)

ogrisel commented Oct 19, 2023

I had a look at https://scikit-learn.org/scikit-learn-benchmarks/#regressions?sort=3&dir=desc and the last commits are still from early September, so the upload of the results does not seem to work and does not allow us to conclude whether the Cython regressions have been fixed in 3.0.3.

@ogrisel (Member)

ogrisel commented Oct 19, 2023

Actually the results are uploaded, see: https://github.com/scikit-learn/scikit-learn-benchmarks/commits/master . However the plots do not show them...

@ogrisel (Member)

ogrisel commented Oct 19, 2023

Actually, they show up when you navigate to the results from the homepage grid, for instance:

https://scikit-learn.org/scikit-learn-benchmarks/#ensemble.HistGradientBoostingClassifierBenchmark.time_fit

which does indeed show that the regression was fixed for that model. However, when navigating from the regressions list, they do not show up, because the environment has changed: we only see the points from the environment that experienced the regression, not the new points that were subsequently collected with the new environment.

This is quite unfortunate. It's a UX problem in the ASV report. I will manually check the other estimators that experienced a regression to see if they are all fast again.

@ogrisel (Member)

ogrisel commented Oct 19, 2023

Actually, it's simpler: when you come from the regressions list view, you can click on Cython 3.0.3 in the left side panel to add the new point and see that the problem was fixed.

@ogrisel (Member)

ogrisel commented Oct 19, 2023

OK, I checked and apparently all the big regressions that appeared in July and August (after 1.3.0) when switching to Cython 3.0.0 have been fixed by the update to Cython 3.0.3.

So let's close!

Note: there are still some smaller and much older regressions for other benchmarks, but they are unrelated (regressions that happened around 0.24.x).

@ogrisel closed this as completed Oct 19, 2023
@ogrisel (Member)

ogrisel commented Oct 19, 2023

I will open a PR to unpin cython in our builds.

@ogrisel (Member)

ogrisel commented Oct 19, 2023

> I will open a PR to unpin cython in our builds.

This is taking more time than expected because I hit a mamba bug with virtual packages and conda-lock that was quite counterintuitive to debug, since the error was hidden in a nested subprocess, which makes it challenging to launch a debugger. Anyway, here is the PR to fix it in mamba:

I will proceed with the unpinning tomorrow.
