Training continuation with multiple DMatrix objects #6148

ldesreumaux · 2020-09-22T15:24:50Z

Issue

If training is started with one DMatrix object and continued with another DMatrix object that is identical (same dataset...), I should get the same model as if the whole training had been done with a single DMatrix object.

In the following C++ code, two models are trained. In the first training run, UpdateOneIter is called twice with the same DMatrix. In the second, the two calls to UpdateOneIter are made with two different DMatrix objects that contain the same dataset. I expect the two model dumps (out_models1 and out_models2) to be identical, but they differ.

#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

#include <xgboost/learner.h>

using namespace xgboost;

void ConfigureBooster(std::shared_ptr<Learner>& booster) {
    std::vector<std::pair<std::string, std::string>> cfg;

    cfg.emplace_back("tree_method", "hist");
    cfg.emplace_back("objective", "binary:logistic");
    // subsample < 1 makes each iteration draw a random subset of rows.
    cfg.emplace_back("subsample", "0.9");
    cfg.emplace_back("seed", "42");
    // Seed the PRNG from the iteration number so sampling is reproducible.
    cfg.emplace_back("seed_per_iteration", "1");

    booster->SetParams(cfg);
    booster->Configure();
}

int main(int argc, char** argv) {
    const std::string dataset_path = "../data/census.bin";
    FeatureMap fmap;

    /* Training with 1 DMatrix object */

    std::shared_ptr<DMatrix> dtrain1(DMatrix::Load(dataset_path, true, false));
    std::shared_ptr<Learner> booster1(Learner::Create({dtrain1}));
    ConfigureBooster(booster1);
    booster1->UpdateOneIter(0, dtrain1);
    booster1->UpdateOneIter(1, dtrain1);
    std::vector<std::string> out_models1 = booster1->DumpModel(fmap, true, "text");
    for (const std::string& out_model : out_models1)
        std::cout << out_model << std::endl;

    /* Training with 2 DMatrix objects (but same dataset!) */

    std::shared_ptr<DMatrix> dtrain2(DMatrix::Load(dataset_path, true, false));
    std::shared_ptr<Learner> booster2(Learner::Create({dtrain2}));
    ConfigureBooster(booster2);
    booster2->UpdateOneIter(0, dtrain2);
    dtrain2.reset(DMatrix::Load(dataset_path, true, false));
    booster2->UpdateOneIter(1, dtrain2);
    std::vector<std::string> out_models2 = booster2->DumpModel(fmap, true, "text");
    for (const std::string& out_model : out_models2)
        std::cout << out_model << std::endl;

    return 0;
}
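
Comparing the two dumps by eye is error-prone, so a small helper can report where they first diverge. The sketch below is not part of the original reproduction; FirstDifferingTree and CheckSameModel are hypothetical names, and it assumes only that each dump is a std::vector<std::string> with one entry per tree, as returned by DumpModel above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the first tree whose text dump differs between the
// two models, or -1 if both dumps are identical. If one dump is a prefix of
// the other, the first index past the shorter dump is returned.
long FirstDifferingTree(const std::vector<std::string>& a,
                        const std::vector<std::string>& b) {
  const std::size_t n = a.size() < b.size() ? a.size() : b.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (a[i] != b[i]) return static_cast<long>(i);
  }
  return a.size() == b.size() ? -1 : static_cast<long>(n);
}

// Fails an assertion when the dumps differ (no-op if NDEBUG is defined).
void CheckSameModel(const std::vector<std::string>& a,
                    const std::vector<std::string>& b) {
  assert(FirstDifferingTree(a, b) == -1 && "model dumps differ");
}
```

In the reproduction, calling CheckSameModel(out_models1, out_models2) just before the return statement would turn the expectation into an explicit check.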

XGBoost version: 1.2.0

Fix

I investigated and found that adding the following lines to GHistIndexMatrix::Init, which clear the existing cut and hit-count state, solves the issue:

void GHistIndexMatrix::Init(DMatrix* p_fmat, int max_bins) {
  cut.cut_ptrs_.HostVector().clear();
  cut.cut_values_.HostVector().clear();
  cut.min_vals_.HostVector().clear();
  cut.cut_ptrs_.HostVector().emplace_back(0);

  hit_count.clear();
  hit_count_tloc_.clear();

  ...
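
To illustrate the kind of problem such a reset addresses, here is a minimal standalone sketch. It assumes that Init can be called more than once on the same object and that its members accumulate across calls; that is an interpretation of the fix, not something confirmed in this thread. ToyIndex and its reset_first flag are hypothetical and are not XGBoost code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an index whose Init appends to its members.
struct ToyIndex {
  std::vector<std::size_t> cut_ptrs;  // per-feature offsets into cut_values
  std::vector<float> cut_values;      // all cut points, feature after feature

  // With reset_first == true, leftover state is cleared before rebuilding,
  // mirroring the clear() calls in the proposed fix.
  void Init(const std::vector<std::vector<float>>& feature_cuts,
            bool reset_first) {
    if (reset_first || cut_ptrs.empty()) {
      cut_ptrs.clear();
      cut_values.clear();
      cut_ptrs.push_back(0);
    }
    for (const auto& cuts : feature_cuts) {
      cut_values.insert(cut_values.end(), cuts.begin(), cuts.end());
      cut_ptrs.push_back(cut_values.size());
    }
    // The last offset always points one past the final cut value.
    assert(cut_ptrs.back() == cut_values.size());
  }
};
```

Without the reset, a second Init on identical input leaves the object holding two copies of the cuts, so its state no longer matches that of a freshly built index. With the reset, repeated calls are idempotent.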


hcho3 · 2020-09-24T05:14:29Z

@ldesreumaux Now that you have a fix, would you like to submit a pull request?

trivialfis · 2020-09-24T05:25:50Z

I prefer moving it into DMatrix. Let's hold on this a little bit.

trivialfis · 2021-07-29T17:21:37Z

Actually, this has already been fixed in #7064.

trivialfis added the type: bug label Sep 22, 2020

This was referenced Jun 25, 2021:

Move GHistIndex into DMatrix. #7064 (Merged)

[WIP] Get column matrix from GHistIndex. #7072 (Closed)

trivialfis closed this as completed Jul 29, 2021
