Eliminate text parsing from feature importances and evaluation metrics #6091

hcho3 · 2020-09-07T06:01:14Z

Currently, important functions such as feature importances and evaluation metrics rely on parsing of text strings, specifically the text output from the model dump function. For example:

xgboost/python-package/xgboost/core.py

Lines 1797 to 1832 in 68c55a3

    
    trees = self.get_dump(fmap, with_stats=True)
    importance_type += '='
    fmap = {}
    gmap = {}
    for tree in trees:
        for line in tree.split('\n'):
            # look for the opening square bracket
            arr = line.split('[')
            # if no opening bracket (leaf node), ignore this line
            if len(arr) == 1:
                continue
            # look for the closing bracket, extract only info within that bracket
            fid = arr[1].split(']')
            # extract gain or cover from string after closing bracket
            g = float(fid[1].split(importance_type)[1].split(',')[0])
            # extract feature name from string before closing bracket
            fid = fid[0].split('<')[0]
            if fid not in fmap:
                # if the feature hasn't been seen yet
                fmap[fid] = 1
                gmap[fid] = g
            else:
                fmap[fid] += 1
                gmap[fid] += g
    # calculate average value (gain/cover) for each feature
    if average_over_splits:
        for fid in gmap:
            gmap[fid] = gmap[fid] / fmap[fid]
    return gmap

xgboost/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/Booster.java

Lines 509 to 540 in 68c55a3

    
    private Map<String, Integer> getFeatureWeightsFromModel(String[] modelInfos) throws XGBoostError {
      Map<String, Integer> featureScore = new HashMap<>();
      for (String tree : modelInfos) {
        for (String node : tree.split("\n")) {
          String[] array = node.split("\\[");
          if (array.length == 1) {
            continue;
          }
          String fid = array[1].split("\\]")[0];
          fid = fid.split("<")[0];
          if (featureScore.containsKey(fid)) {
            featureScore.put(fid, 1 + featureScore.get(fid));
          } else {
            featureScore.put(fid, 1);
          }
        }
      }
      return featureScore;
    }
    /**
     * Get the feature importances for gain or cover (average or total)
     *
     * @return featureImportanceMap key: feature index,
     * values: feature importance score based on gain or cover
     * @throws XGBoostError native error
     */
    public Map<String, Double> getScore(
            String[] featureNames, String importanceType) throws XGBoostError {
      String[] modelInfos = getModelDump(featureNames, true);
      return getFeatureImportanceFromModel(modelInfos, importanceType);
    }

xgboost/python-package/xgboost/training.py

Lines 85 to 91 in 68c55a3

    
    bst_eval_set = bst.eval_set(evals, i, feval)
    if isinstance(bst_eval_set, STRING_TYPES):
        msg = bst_eval_set
    else:
        msg = bst_eval_set.decode()
    res = [x.split(':') for x in msg.split()]
    evaluation_result_list = [(k, float(v)) for k, v in res[1:]]

xgboost/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/Booster.java

Lines 240 to 255 in 68c55a3

    
    public String evalSet(DMatrix[] evalMatrixs, String[] evalNames, int iter, float[] metricsOut)
            throws XGBoostError {
      String stringFormat = evalSet(evalMatrixs, evalNames, iter);
      String[] metricPairs = stringFormat.split("\t");
      for (int i = 1; i < metricPairs.length; i++) {
        String value = metricPairs[i].split(":")[1];
        if (value.equalsIgnoreCase("nan")) {
          metricsOut[i - 1] = Float.NaN;
        } else if (value.equalsIgnoreCase("-nan")) {
          metricsOut[i - 1] = -Float.NaN;
        } else {
          metricsOut[i - 1] = Float.valueOf(value);
        }
      }
      return stringFormat;
    }

Also see #4665 (comment) and #4665 (comment).

We should aim to eliminate all such uses of text parsing, since even a slight change in the text dump format can break all of these functions.
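
To make the fragility concrete, here is a minimal sketch that reproduces the Python parsing logic quoted above on hand-written dump lines. The dump strings are illustrative: they are written in the style of `get_dump(with_stats=True)` output, not taken from a real model, and `parse_gain` is a hypothetical helper name. A feature name that contains `<` is enough to yield a wrong feature id, with no error raised:

```python
def parse_gain(dump: str, importance_type: str = "gain") -> dict:
    """Mimic the quoted core.py logic: total gain per feature, parsed from text."""
    key = importance_type + "="
    totals = {}
    for line in dump.split("\n"):
        arr = line.split("[")
        if len(arr) == 1:  # no opening bracket: leaf node, skip
            continue
        fid = arr[1].split("]")
        # Value after "gain=" (or "cover="), up to the next comma.
        g = float(fid[1].split(key)[1].split(",")[0])
        # Everything before the first "<" is taken to be the feature name.
        name = fid[0].split("<")[0]
        totals[name] = totals.get(name, 0.0) + g
    return totals


# Illustrative dump lines (assumed format, not real model output).
ok_dump = "0:[age<30] yes=1,no=2,missing=1,gain=12.5,cover=100\n\t1:leaf=0.1,cover=40"
odd_dump = "0:[a<b<30] yes=1,no=2,missing=1,gain=12.5,cover=100\n\t1:leaf=0.1,cover=40"

good = parse_gain(ok_dump)  # keyed by "age"
bad = parse_gain(odd_dump)  # feature "a<b" is silently truncated to "a"
```

The same kind of problem would apply to the evaluation string if an eval set name contained whitespace or `:`.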

Proposed replacement:

Feature importances: Implement a new C++ function that returns a JSON string representing features and their importances.
Evaluation metrics: Implement a new C++ function that returns a JSON string representing eval set names and their eval metrics.
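
As a rough sketch of what the language bindings could look like after such a change, the snippet below reads two JSON payloads with Python's standard `json` module. The payload layouts (parallel `features`/`scores` arrays, and a mapping from eval set name to metric values) and the helper names are assumptions made for illustration; this issue does not specify a schema.

```python
import json


def importance_from_json(payload: str) -> dict:
    """Build {feature name: score} from a hypothetical importance payload."""
    doc = json.loads(payload)
    return dict(zip(doc["features"], doc["scores"]))


def metrics_from_json(payload: str) -> list:
    """Build [("<eval set>-<metric>", value), ...] from a hypothetical eval payload."""
    doc = json.loads(payload)
    return [
        (f"{data_name}-{metric_name}", float(value))
        for data_name, metrics in doc.items()
        for metric_name, value in metrics.items()
    ]


# Hypothetical payloads; names containing "<" or ":" are plain JSON strings here.
importance_payload = '{"features": ["a<b", "age"], "scores": [12.5, 3.0]}'
eval_payload = '{"train": {"rmse": 0.25}, "valid:2020": {"rmse": 0.5}}'

importance = importance_from_json(importance_payload)
evaluation_result_list = metrics_from_json(eval_payload)
```

Because names and values travel as typed JSON fields rather than being recovered by splitting on `[`, `<`, or `:`, characters inside a feature or eval set name no longer affect parsing.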

Now that we have a functioning JSON library as well as a numeric printing function (charconv) in XGBoost, it should be doable.


trivialfis · 2021-01-25T11:24:34Z

Looked into this a little bit. The implementation isn't difficult, but it depends on #6605 due to the use of feature names/types. I will try to figure out a better way to store that information.

This was referenced Sep 7, 2020

Feature importance using C api #6079

Closed

trees_to_dataframe causing index out of bounds exception #5409

Closed

trivialfis added the feature-request label Oct 28, 2020

hcho3 mentioned this issue Jan 23, 2021

[RFC] Unifying prediction API. #6632

Open

trivialfis mentioned this issue Jun 15, 2021

Implement feature score in GBTree. #7041

Merged
