Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean codes for python-package; dump model to JSON #97

Merged
merged 2 commits into from
Dec 2, 2016
Merged

Clean codes for python-package; dump model to JSON #97

merged 2 commits into from
Dec 2, 2016

Conversation

wxchan
Copy link
Contributor

@wxchan wxchan commented Nov 25, 2016

Used for python or other package to make calculations on model.

I created another branch after Microsoft/python-package, which will be merged into this branch after python-package finished. So you can review this PR after that.

The tree is dumped as tree structure, for example:

{
    "tree_index": 0,
    "tree_structure": {
        "split_feature": 25,
        "right_child": {
            "leaf_parent": 0,
            "leaf_value": 0.0177513,
            "leaf_index": 1
        },
        "left_child": {
            "split_feature": 25,
            "right_child": {
                "split_feature": 26,
                "right_child": {
                    "leaf_parent": 2,
                    "leaf_value": 0.0350313,
                    "leaf_index": 3
                },
                "left_child": {
                    "leaf_parent": 2,
                    "leaf_value": 0.0213992,
                    "leaf_index": 2
                },
                "split_index": 2,
                "internal_value": 0.657011,
                "threshold": 0.7745,
                "split_gain": 30.3472
            },
            "left_child": {
                "leaf_parent": 1,
                "leaf_value": 0.0206436,
                "leaf_index": 0
            },
            "split_index": 1,
            "internal_value": 0.596097,
            "threshold": 0.5925,
            "split_gain": 45.18
        },
        "split_index": 0,
        "internal_value": 0,
        "threshold": 1.0895,
        "split_gain": 65.198
    },
    "num_leaves": 4
}

@msftclas
Copy link

Hi @wxchan, I'm your friendly neighborhood Microsoft Pull Request Bot (You can call me MSBOT). Thanks for your contribution!
You've already signed the contribution license agreement. Thanks!

The agreement was validated by Microsoft and real humans are currently evaluating your PR.

TTYL, MSBOT;

@wxchan
Copy link
Contributor Author

wxchan commented Nov 25, 2016

I came across a problem.

I add functions as follow:

src/c_api.cpp:
LGBM_BoosterDumpModel:

DllExport int LGBM_BoosterDumpModel(BoosterHandle handle,
  const char** out_str) {
  API_BEGIN();
  Booster* ref_booster = reinterpret_cast<Booster*>(handle);
  *out_str = ref_booster->DumpModel().c_str();
  API_END();
}

Booster::DumpModel:

std::string DumpModel() {
  return boosting_->DumpModel();
}

python-package/lightgbm/basic.py:

def dump_model(self):
  tmp_model = ctypes.c_char_p()
  _safe_call(_LIB.LGBM_BoosterDumpModel(self.handle, ctypes.byref(tmp_model)))
  print(tmp_model.value[:100]) # check the output

And the output in last line is:

b'\x01\x00\x00\x00\x00\x00\x00\x00\xd0\xa5\x1b\x00\x01\x00\x00\x00\x95s\x00
\x00\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xfflabel_index":0,\n"max_f
eature_idx":99,\n"objective":"binary",\n"sigmoid'

Why this happens?

@guolinke
Copy link
Collaborator

@wxchan , the reason is you returning pointer of temporary memory(ref_booster->DumpModel().c_str()).
The solution:

  1. allocate a str_buf in python, and copy from c_str. https://github.com/Microsoft/LightGBM/blob/python-package/python-package/lightgbm/basic.py#L1070-L1075
  2. use a template class to manage these temporary resource, like xgboost do:
    https://github.com/dmlc/xgboost/blob/master/src/common/thread_local.h
    https://github.com/dmlc/xgboost/blob/master/src/c_api/c_api.cc#L186-L200
    https://github.com/dmlc/xgboost/blob/master/src/c_api/c_api.cc#L668-L678
    However, these resources only will be freed when application quit.

I think pre-allocate the buffer in the python is a better solution. But it is hard to decide the buffer len. May be you can pass the buffer len of python as well, and return actual len.
If buffer len is smaller than actual len, you do nothing in c_api and re-allocate the buffer in python then call again.

@wxchan
Copy link
Contributor Author

wxchan commented Nov 26, 2016

Fixed with 1st solution. Buffer len is set to 2^20 now, for reference, with num_leaf=15 and num_tree=100, the actual len is around 2^18.

@wxchan
Copy link
Contributor Author

wxchan commented Dec 2, 2016

@guolinke model loaded from file cannot dump to json now(cause segmentation fault), any clue why?

@guolinke
Copy link
Collaborator

guolinke commented Dec 2, 2016

i think some fields are not loaded from file. e.g. threshold_in_bin_

1. merge python-package
2. add dump model to json
3. fix bugs
4. clean code with pylint
5. update python examples
@wxchan
Copy link
Contributor Author

wxchan commented Dec 2, 2016

@guolinke I find the reason, to calculate feature importance needs train_data_->feature_names(). I am going to remove feature importance in dumped string for now.

@wxchan wxchan changed the title Dump model to JSON Clean codes for python-package; dump model to JSON Dec 2, 2016
@guolinke
Copy link
Collaborator

guolinke commented Dec 2, 2016

@wxchan I see. It seems we should load feature_names_ as well when load model from file.
(So we should save feature_names_ to file as well)
I will add this into categorical-feature-support branch ,due to i need to change model to support the prediction of categorical features.
And we can continue to provide interface of set feature name and get feature importance.

@guolinke
Copy link
Collaborator

guolinke commented Dec 2, 2016

it seems travis is down... I will merge this after some reviews.

@wxchan
Copy link
Contributor Author

wxchan commented Dec 2, 2016

@guolinke Feature importance is not needed in dumped string. To use another interface will be fine.

I was doing some small updates with commit --amend, but I finished now. It should also solves #102 .

You may add python installation guide into wiki.

@guolinke guolinke merged commit 9f4849b into microsoft:master Dec 2, 2016
@wxchan wxchan deleted the dev branch December 7, 2016 07:39
@lock lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants