-
Notifications
You must be signed in to change notification settings - Fork 538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update scikit-learn to 1.4 #5851
Update scikit-learn to 1.4 #5851
Conversation
cuML closely mirrors scikit-learn, just as cuDF closely mirrors pandas. This alignment led us to a one-release deprecation policy in cuDF. I think it is totally fine to favor alignment with the target library over the stability of slower deprecation/removal. I would vote to deprecate anything you can in 24.06 with removal in 24.08. For anything you can’t safely deprecate, I would consider just making breaking changes to align with the latest scikit-learn. |
If there is prior art within RAPIDS then it makes some amount of sense to follow that (aka copy cudf). For me the important factor in deciding is how much time (in months) do we need to/want to give users to deal with deprecations. scikit-learn is quite conservative and gives you ~1 year to deal with things (1.2 deprecates something with a warning, six months later 1.3 continues to warn, another six months later 1.4 removes the warning and old behaviour). This means you have to deal with deprecations for quite a while (as a scikit-learn developer). How did cudf decide on the mark as deprecated and warn in 22.08, remove in 22.10 cycle? That seems much quicker than pandas. From the doc you linked it sounds like they deprecate something with a warning in v1.x and will continue to warn until they release 2.0. My assumption is that a lot more than a few months pass between v1.x and v2.0. Potentially even longer than scikit-learn's ~1 year policy. Overall I'd vote for using the same deprecation cycle and policy for cudf and cuml (and other RAPIDS packages). What ever the pros and cons of each policy might be, they are all outweighed by having something consistent across RAPIDS IMHO. |
tl;dr I think a 1-release deprecation cycle for cuml-specific behavior that is not matched by scikit-learn is fine, but if you have a deprecated code path that provides support for behavior that is deprecated in scikit-learn too, it's probably worth the effort of maintaining it as long as they do unless it starts posing a heavy undue burden. We should differentiate between cudf's own deprecation policy and the policy cudf follows for matching pandas deprecations. If pandas deprecates something, we deprecate it as well in cudf, but we won't remove it until pandas does. That can be a very long time, e.g. While I do think we should try and align cudf and cuml (and other RAPIDS libraries) as much as possible in this regard, it is worth keeping in mind that the situations are a bit different. In general the scikit-learn API is far less expansive and has fewer sharp edges than the pandas API. Many pandas functions have very large API surfaces, both in terms of the number of parameters and the degree of flexibility in the ways that the corresponding arguments may be specified. Furthermore, there are a lot more behavioral differences that often cause headaches, usually around cases like null handling, nulls vs nans, empty columns, all-null columns, etc. We have historically had many cases where either specific combinations of parameters were intentionally doing something that we considered more sensible than the default pandas choice (possibly because we supported truly nullable data while pandas historically conflated nulls and nans), or some combination of arguments hits a data-specific edge case that we simply hadn't considered before. Especially now with |
Thinking out loud: In This means work on cuml now, should attempt to match the deprecations of scikit-learn v1.5 which will probably be released in May, before cuml 24.06 comes out. During the transition phase from the current state of cuml to this "in sync" state there will be a few breaking changes. This is because cuml currently depends on scikit-learn 1.2, so in order to switch to v1.5 we will have to make breaking changes. I think even changing to v1.4 requires a breaking change, but it is in I think matching the deprecations makes sense if you are in sync with scikit-learn. What do people think of this? I think it is probably desirable to be in sync. A practical snag in the above, but solvable somehow, is that it is unknowable when a version of scikit-learn will be released. The target is to release roughly every six months, but this isn't a firm commitment. For example if it looks like a release would come out just before the end of December, it is likely to be moved to January. This means when we write the deprecation warning in cuml, we need to make an educated guess as to the version of cuml in which the deprecation expires. |
I think that strategy makes sense. One good way that you can verify this is by running your tests with warnings as errors. cudf already does this. That forces us to catch warnings from APIs that we expect to throw them, and when we write such tests we also catch the warnings from the corresponding cudf API. That allows us to verify that we're throwing the same warning as pandas. cuml might need to do more work to get its tests running without warnings before that is viable, though. You're right that not knowing the exact schedule is a bit tricky. At that point I think it's mostly a social question of what you want to promise. You could avoid putting a specific removal timeline in the deprecation message, you could include a best-effort estimate, or you could say something like "will be removed by X if not sooner" if you can confidently upper bound. I think that's largely up to you. |
I've updated the "will be removed by" version to 25.08. My best guess for when scikit-learn 1.7 will be released. This PR updates cuml to scikit-learn 1.4, but 1.5 will probably be out in a week or two. Likely before 24.06 will be released, so maybe we merge this PR now and update cuml to scikit-learn 1.5 if it has been released in time. |
@betatim merged branch-24.06 into the PR to trigger CI |
/ok to test |
Mostly deals with parameters that were deprecated in scikit-learn 1.2 and are now no longer available. Update end version of deprecation Update version in which deprecation expires Change argument name to deal with deprecation
6db6835
to
5deed40
Compare
I started working on updating to 1.5 which was released yesterday |
I'll take a look at the failing metrics comparison tests on Monday if no one else has an idea. A little unclear why suddenly the metrics should be different/need less precision to get the comparison to be equal. |
The distance metrics seem to only differ very slightly on ARM, I wonder if it's a change in sklearn or a dependency of it, but from a first looks seems too small to warrant worries |
The arguments to r2_score were the wrong way around, this might explain the sudden decrease in the score which leads to failing tests.
I reduced the precision for the metrics tests. Looks like they pass now, except for the two ARM jobs that failed to start (is there a way to restart/re-run them?). Also fixed a few uses of deprecated constructor arguments and the usage of |
/merge |
@betatim You may want to update the title of this PR to say 1.5. I didn't realize you went to the latest release until reading through the discussion. |
This is an attempt to update the scikit-learn dependency from 1.2 to 1.4. Most changes are related to constructor arguments that were deprecated in 1.2 and in 1.4 have changed/been removed.
A question I have is what cuml's deprecation policy is? I've gone with "two releases" for parameters where we can easily do so (deprecated in 24.06 and then remove them in 24.10). However that is only about 4 months of deprecation which could be a bit short.
Some of the changes would be hard to do as a deprecation (with 1.4 there is no way to provide the "old way"), we'd have to stick with 1.3 for now. I think this is a bit of a bummer but maybe the price to pay for not keeping on top of deprecations. And it seems like there is no deprecation policy in the docs/towards users? So maybe we can play this card once now, to catch up and at the same time introduce a deprecation policy.
The SHAP test needed its reference updating. I am not sure why, at least I couldn't quickly find a reason for why you'd have to do this.
I am not sure how possible it would be to support a range of scikit-learn versions (say 1.2 - 1.4). Would be cool but maybe not worth the added complexity?
Todo:
LARS- LARS is experimental, so no need for deprecationxref #5799