
fix: FLAML CatBoost metrics aren't reproducible #1364

Open
wants to merge 2 commits into base: main

Conversation

@dannycg1996 (Collaborator) commented Oct 15, 2024

Note: Please don't merge without a discussion first

Why are these changes needed?

Currently it is impossible to reproduce the best_loss reported by FLAML using the best model that FLAML returns, when that best model is a CatBoost model.
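
For context, this is roughly how the mismatch shows up (a minimal sketch assuming a synthetic dataset; the metric, time budget and seed here are arbitrary illustrations, not the exact settings from #1317):

from flaml import AutoML
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=4000, random_state=0)

automl = AutoML()
automl.fit(X_train=X, y_train=y, task="classification", metric="log_loss",
           estimator_list=["catboost"], time_budget=30, seed=1)

print(automl.best_loss)                    # the loss FLAML reports for the winning config
print(automl.best_config)                  # includes the n_estimators value FLAML recorded
print(automl.model.estimator.tree_count_)  # trees in the CatBoost model FLAML hands back

# Refitting a CatBoost model with best_config on the same splits does not give
# back best_loss, which is the behaviour this PR is trying to fix.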

There are two important changes made in this PR:

  1. Removal of the self.params[self.ITER_HP] = self._model.tree_count_ line in model.py
  2. Removal of this if statement from model.py: if current_time + self._time_per_iter > self.deadline: return False

The line highlighted in 1) is the cause of issue #1317. As I understand it, CatBoost estimators can be thought of as performing their own AutoML/optimisation process internally: they have early stopping, and when use_best_model = True they internally optimise the objective metric. The issue with the current code is that it overwrites the n_estimators value with the actual number of trees used, but doesn't switch off this internal optimisation. Rather than handing the user back the exact model that was evaluated, we simply change the inputs to CatBoost's internal optimisation, so we get different results when we retrain and test that model on the same folds.
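
To make the mechanism in 1) concrete, here is a minimal CatBoost-only sketch; the eval_set and early_stopping_rounds values are illustrative assumptions rather than FLAML's exact settings:

from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# First fit: early stopping + use_best_model trim the 1000 requested trees
# down to first.tree_count_.
first = CatBoostClassifier(n_estimators=1000, random_seed=0, verbose=False)
first.fit(X_train, y_train, eval_set=(X_val, y_val),
          early_stopping_rounds=10, use_best_model=True)

# The removed line then records n_estimators = tree_count_. Refitting with that
# value while the internal optimisation is still switched on does not recreate
# `first`: CatBoost re-runs its eval-set optimisation over a shorter schedule
# (and may re-derive auto-tuned parameters such as the learning rate), so the
# trees, and therefore the metrics, can come out differently.
second = CatBoostClassifier(n_estimators=first.tree_count_, random_seed=0, verbose=False)
second.fit(X_train, y_train, eval_set=(X_val, y_val),
           early_stopping_rounds=10, use_best_model=True)

print(first.tree_count_, second.tree_count_)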

Change 2) is a little trickier and needs some discussion. Essentially, the callbacks change the behaviour of the models when the remaining time_budget is less than the time the model will take to train. This means that when we take the model FLAML provides and refit it on the same folds, but with no time budget involved, we get different behaviour and different results. I imagine this callback was implemented as a means of ensuring that FLAML respects the time budget, but I personally think that results which aren't reproducible are not particularly useful.
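
For reviewers, this is roughly the behaviour the removed check introduces. The DeadlineCallback below is a made-up stand-in that mirrors the removed if statement, not FLAML's actual class:

import time

from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

class DeadlineCallback:
    """Hypothetical stand-in for the removed resource-limit check."""

    def __init__(self, time_budget):
        self.deadline = time.time() + time_budget
        self._last = time.time()
        self._time_per_iter = None

    def after_iteration(self, info):
        now = time.time()
        if self._time_per_iter is None:
            self._time_per_iter = now - self._last
        # CatBoost stops training once the callback returns False, i.e. as soon
        # as the next iteration would be expected to overrun the deadline.
        return now + self._time_per_iter <= self.deadline

X, y = make_classification(n_samples=2000, random_state=0)

budgeted = CatBoostClassifier(n_estimators=500, random_seed=0, verbose=False)
budgeted.fit(X, y, callbacks=[DeadlineCallback(time_budget=1)])

refit = CatBoostClassifier(n_estimators=500, random_seed=0, verbose=False)
refit.fit(X, y)

# With a tight budget the first model is cut short, so refitting the "same"
# configuration without a deadline produces a different model and different metrics.
print(budgeted.tree_count_, refit.tree_count_)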

My main issue with 2) is that the XGBoostEstimator also utilises this same check_resource_limits(), so this change could have unintended consequences. On a similar note, I've realised that XGBoostEstimator and CatBoostEstimator call check_resource_limits differently, i.e.:

class XGBoostResourceLimit(BaseResourceLimit, TrainingCallback):
    # xgboost's TrainingCallback.after_iteration: returning True stops training.
    def after_iteration(self, model, epoch, evals_log) -> bool:
        now = time.time()
        return not self.check_resource_limits(now, epoch, "xgb")


class CatBoostResourceLimit(BaseResourceLimit):
    # catboost's callback protocol: returning False stops training.
    def after_iteration(self, info) -> bool:
        now = time.time()
        return self.check_resource_limits(now, info.iteration, "cat")

Could you please explain why XGBoostResourceLimit returns not self.check_resource_limits(now, epoch, "xgb") whilst CatBoostResourceLimit returns self.check_resource_limits(now, info.iteration, "cat")?

It's the not part which I'm interested in.

Also worth mentioning is that we still assign self._time_per_iter in check_resource_limits. Can this code also be deleted, or is it still of use somewhere?

I'm fairly confident that CatBoost results are now reproducible, but please let me know if you have any issues with my code.

Related issue number

Closes #1317

Successfully merging this pull request may close these issues.

n_estimators value on automl.model differs from value in logs (for CatBoost models)