Eliminate text parsing from feature importances and evaluation metrics #6091

Open
hcho3 opened this issue Sep 7, 2020 · 1 comment
hcho3 commented Sep 7, 2020

Currently, important functions such as feature importances and evaluation metrics rely on parsing of text strings, specifically the text output from the model dump function. For example:

trees = self.get_dump(fmap, with_stats=True)
importance_type += '='
fmap = {}
gmap = {}
for tree in trees:
    for line in tree.split('\n'):
        # look for the opening square bracket
        arr = line.split('[')
        # if no opening bracket (leaf node), ignore this line
        if len(arr) == 1:
            continue
        # look for the closing bracket, extract only info within that bracket
        fid = arr[1].split(']')
        # extract gain or cover from string after closing bracket
        g = float(fid[1].split(importance_type)[1].split(',')[0])
        # extract feature name from string before closing bracket
        fid = fid[0].split('<')[0]
        if fid not in fmap:
            # if the feature hasn't been seen yet
            fmap[fid] = 1
            gmap[fid] = g
        else:
            fmap[fid] += 1
            gmap[fid] += g
# calculate average value (gain/cover) for each feature
if average_over_splits:
    for fid in gmap:
        gmap[fid] = gmap[fid] / fmap[fid]
return gmap

private Map<String, Integer> getFeatureWeightsFromModel(String[] modelInfos) throws XGBoostError {
  Map<String, Integer> featureScore = new HashMap<>();
  for (String tree : modelInfos) {
    for (String node : tree.split("\n")) {
      String[] array = node.split("\\[");
      if (array.length == 1) {
        continue;
      }
      String fid = array[1].split("\\]")[0];
      fid = fid.split("<")[0];
      if (featureScore.containsKey(fid)) {
        featureScore.put(fid, 1 + featureScore.get(fid));
      } else {
        featureScore.put(fid, 1);
      }
    }
  }
  return featureScore;
}

/**
 * Get the feature importances for gain or cover (average or total)
 *
 * @return featureImportanceMap key: feature index,
 *         values: feature importance score based on gain or cover
 * @throws XGBoostError native error
 */
public Map<String, Double> getScore(
    String[] featureNames, String importanceType) throws XGBoostError {
  String[] modelInfos = getModelDump(featureNames, true);
  return getFeatureImportanceFromModel(modelInfos, importanceType);
}

bst_eval_set = bst.eval_set(evals, i, feval)
if isinstance(bst_eval_set, STRING_TYPES):
    msg = bst_eval_set
else:
    msg = bst_eval_set.decode()
res = [x.split(':') for x in msg.split()]
evaluation_result_list = [(k, float(v)) for k, v in res[1:]]

public String evalSet(DMatrix[] evalMatrixs, String[] evalNames, int iter, float[] metricsOut)
    throws XGBoostError {
  String stringFormat = evalSet(evalMatrixs, evalNames, iter);
  String[] metricPairs = stringFormat.split("\t");
  for (int i = 1; i < metricPairs.length; i++) {
    String value = metricPairs[i].split(":")[1];
    if (value.equalsIgnoreCase("nan")) {
      metricsOut[i - 1] = Float.NaN;
    } else if (value.equalsIgnoreCase("-nan")) {
      metricsOut[i - 1] = -Float.NaN;
    } else {
      metricsOut[i - 1] = Float.valueOf(value);
    }
  }
  return stringFormat;
}

Also see #4665 (comment) #4665 (comment)

We should aim to eliminate all such uses of text parsing, since a slight change in the text dump will cause all these functions to break.
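
To make the fragility concrete, here is a minimal sketch that reuses the split-based logic from the first Python snippet on a few hand-written lines. The dump lines, the feature names `x[0]` and `a<b`, and the helper name `parse_split_line` are made up for illustration; they are not output from a real model or part of the XGBoost API.

```python
def parse_split_line(line, importance_type="gain"):
    """Parse one dump line with the same split() chain as the snippet above.

    Returns (feature_name, value), or None when the line has no '[' (leaf node).
    """
    arr = line.split('[')
    if len(arr) == 1:
        return None
    fid = arr[1].split(']')
    value = float(fid[1].split(importance_type + '=')[1].split(',')[0])
    return fid[0].split('<')[0], value


# Hand-written lines in the style of the text dump (illustrative only).
stats = " yes=1,no=2,missing=1,gain=34.5,cover=100"
ok_line = "0:[f2<2.45]" + stats
bracket_line = "0:[x[0]<2.45]" + stats   # hypothetical feature named "x[0]"
angle_line = "0:[a<b<2.45]" + stats      # hypothetical feature named "a<b"

print(parse_split_line(ok_line))     # feature name and gain are recovered

try:
    parse_split_line(bracket_line)
except IndexError as err:
    # The extra '[' in the feature name shifts every later split.
    print("parsing failed:", type(err).__name__)

# No exception here, but the score is attributed to "a" instead of "a<b".
print(parse_split_line(angle_line))
```

The second case fails loudly, while the third silently credits the wrong feature, which is arguably the more dangerous outcome.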

Proposed replacement:

  • Feature importances: Implement a new C++ function that returns a JSON string representing features and their importances.
  • Evaluation metrics: Implement a new C++ function that returns a JSON string representing eval set names and their eval metrics.

Now that we have a functioning JSON library as well as a numeric printing function (charconv) in XGBoost, this should be doable.
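
As a rough sketch of what the language bindings could look like once such JSON-returning functions exist: the JSON layouts below are assumptions made purely for illustration (this issue does not define a schema), and `load_feature_importances` and `load_eval_results` are hypothetical helper names, not existing XGBoost functions.

```python
import json

# ASSUMED payload shapes; the real C++ functions may choose a different schema.
importance_json = '{"f0": 12.5, "x[0]": 3.0, "a<b": 7.25}'
eval_json = json.dumps({
    "iteration": 0,
    "results": [
        {"data": "train", "metric": "rmse", "value": 0.5},
        {"data": "my eval", "metric": "rmse", "value": 0.6},
    ],
})


def load_feature_importances(payload):
    """Decode a {feature name: score} JSON object into a dict of floats."""
    return {str(name): float(score) for name, score in json.loads(payload).items()}


def load_eval_results(payload):
    """Decode eval results into (name, value) pairs like the current bindings build."""
    doc = json.loads(payload)
    return [(f"{r['data']}-{r['metric']}", float(r["value"])) for r in doc["results"]]


print(load_feature_importances(importance_json))
print(load_eval_results(eval_json))
```

Because names and values travel as separate JSON fields, feature names containing `[` or `<` and eval set names containing whitespace need no special handling on the binding side.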

@trivialfis

I looked into this a little. The implementation isn't difficult, but it depends on #6605 because it uses feature names/types. I will try to figure out a better way to store that information.
