[RFC] Unifying prediction API. #6632

Open

trivialfis opened this issue Jan 22, 2021 · 10 comments

Comments

@trivialfis
Member

trivialfis commented Jan 22, 2021

Background

XGBoost exposes a number of prediction functions on the C API and the various language bindings, including prediction on DMatrix and inplace prediction. Within these prediction functions we also have a number of prediction types, including value, margin, leaf, contribs and interaction. Their outputs have different meanings and shapes. Right now the language bindings are responsible for figuring that out, which has become a burden since we introduced the dask interface on top of Python (#6614). Also, the output shape is quite complicated; I had a difficult time figuring out how to slice up the output array from pred_leaf. Aside from these, there are also different prediction parameters, including ntree_limit, n_layers, is_training, and a never used parameter tree_begin. Lastly, if the prediction is carried out inplace, additional information like missing and base_margin needs to be passed into the implementation.

Requirements

We want to unify the prediction functions of the C API in a consistent manner. The new prediction functions should be able to figure out the output shape for the language bindings, and should be extensible to future feature additions. At the same time, we need to look into which parameters we want to drop, like ntree_limit. Since this is a design at the C API level, we should try to comply with common C practices on API design.

At the same time, we are not near the next major release (2.0), so the old API should be kept for compatibility.

Proposal

  1. Define a public prediction parameter struct:
// Should not use an enum in a public C header; shown here just for demo.
typedef enum {
  kValue,
  kMargin,
  kContribution,
  kInteraction,
  kLeaf
} PredictType;

typedef struct {
  bool is_training;         // requires <stdbool.h>
  int32_t begin_iteration;  // iteration range used for prediction
  int32_t end_iteration;
  PredictType type;

  // Unused if input is DMatrix.
  void* base_margin;
  int base_margin_shape;
  float missing;
} PredictParam;
  2. Define a set of public functions with consistent semantics:
int XGBoosterPredictFromDMatrix(BoosterHandle booster, DMatrixHandle dmatrix, PredictParam param, bst_ulong **out_shape, bst_float **out_result);

int XGBoosterPredictFromDense(BoosterHandle booster, void* data, DataType type, int* shape, PredictParam param, bst_ulong **out_shape, bst_float **out_result);

...

The functions should output the correct shape through the out_shape parameter, and PredictParam will be responsible for future extensibility. Additionally, we can incorporate more information into the input and output, like the device ordinal, data slicing, etc. This RFC is for deciding whether we should carry out this refactor.
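
For illustration, here is a rough sketch of how a language binding might call the proposed XGBoosterPredictFromDMatrix. The error handling, the meaning of end_iteration = 0 ("use all trees"), the two-dimensional shape for value prediction, and library ownership of the output buffers are assumptions made for the sake of the example, not decisions of this RFC; PredictParam, kValue and the function itself are the proposed (not yet existing) declarations above.

#include <stdbool.h>
#include <stddef.h>
#include <xgboost/c_api.h>  /* BoosterHandle, DMatrixHandle, bst_ulong */

/* Hypothetical caller; booster and dmatrix are assumed to be created and
 * loaded elsewhere. bst_float is assumed to alias float. */
int predict_values(BoosterHandle booster, DMatrixHandle dmatrix) {
  PredictParam param = {0};
  param.is_training = false;
  param.begin_iteration = 0;
  param.end_iteration = 0;  /* assume 0 means "use all trees" */
  param.type = kValue;

  bst_ulong *out_shape = NULL;  /* assume the library owns both buffers */
  float *out_result = NULL;
  if (XGBoosterPredictFromDMatrix(booster, dmatrix, param, &out_shape,
                                  &out_result) != 0) {
    return -1;  /* XGBGetLastError() carries the error message */
  }
  /* Assuming value prediction reports a 2-dim shape [n_samples, n_groups],
   * the score for sample i, group j is out_result[i * out_shape[1] + j]. */
  return 0;
}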

Brief notes

Some more notes on the prediction function:

  • The output shape for contribution and interaction is not unified; it depends on whether multi-class is used.
  • Predict leaf always returns a 2-dim array; the number of classes and the forest size are not considered.
  • When output_margin is set, the output array is still a 1-dim vector, but it should be a 2-dim matrix for multi-class.
  • The output shapes for softmax and softprob are both 1-dim vectors; should we make softprob output a 2-dim matrix? (See the sketch after this list.)
  • Maybe we should add a parameter to let the user choose a stricter output shape?
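
To make the softprob point concrete, here is a minimal standalone sketch of how the same flat buffer would be read once a strict 2-dim shape [n_samples, n_classes] is reported, assuming the row-major layout the Python binding uses today when reshaping softprob output; the helper name and the toy numbers are made up for illustration.

#include <stdio.h>

/* Flat softprob output holds n_samples * n_classes probabilities in
 * row-major order; with a strict reported shape a binding no longer
 * has to guess how to slice it. */
static float class_probability(const float *out_result, unsigned long n_classes,
                               unsigned long sample, unsigned long klass) {
  return out_result[sample * n_classes + klass];
}

int main(void) {
  const float out_result[] = {0.1f, 0.7f, 0.2f,   /* sample 0 */
                              0.5f, 0.3f, 0.2f};  /* sample 1 */
  const unsigned long out_shape[] = {2, 3};       /* [n_samples, n_classes] */
  printf("P(class 1 | sample 0) = %f\n",
         class_probability(out_result, out_shape[1], 0, 1));
  return 0;
}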

@hcho3 @RAMitchell

@hcho3
Collaborator

hcho3 commented Jan 23, 2021

@trivialfis

I had a difficult time figuring out how to slice up the output array from pred_leaf.

Have you managed to figure this out?

@trivialfis
Member Author

Yup, I wrote a test for predict leaf

@trivialfis
Member Author

So what do you think of the RFC?

@hcho3
Collaborator

hcho3 commented Jan 23, 2021

@trivialfis Is this RFC connected to #6091?

@trivialfis
Member Author

No, this is just for the predict function; it does not affect feature importance.

@trivialfis
Member Author

trivialfis commented Jan 25, 2021

Some more notes on the prediction function:

  • The output shape for contribution and interaction is not unified; it depends on whether multi-class is used.
  • Predict leaf always returns a 2-dim array; the number of classes and the forest size are not considered.
  • When output_margin is set, the output array is still a 1-dim vector, but it should be a 2-dim matrix for multi-class.
  • The output shapes for softmax and softprob are both 1-dim vectors; should we make softprob output a 2-dim matrix?
  • Maybe we should add a parameter to let the user choose a stricter output shape?

@Craigacp
Contributor

Would these proposed functions be thread safe, or is that argument dependent?

@trivialfis
Member Author

The inplace prediction function is thread safe and lock free. Normal prediction is currently argument dependent, but it should not be too difficult to make it fully thread safe.
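
Purely as an illustration of what fully thread-safe normal prediction would allow a binding to do, here is a pthread sketch in which two threads call the proposed XGBoosterPredictFromDMatrix on the same BoosterHandle with no external locking. The threading scaffolding, the task struct, and the use of the proposed (not yet existing) PredictParam/XGBoosterPredictFromDMatrix declarations are all illustrative assumptions; bst_float is assumed to alias float.

#include <pthread.h>
#include <stddef.h>
#include <xgboost/c_api.h>  /* BoosterHandle, DMatrixHandle, bst_ulong */

/* Illustrative: both threads share one BoosterHandle but own their DMatrix. */
struct PredictTask {
  BoosterHandle booster;
  DMatrixHandle dmatrix;
};

static void *run_predict(void *arg) {
  struct PredictTask *task = (struct PredictTask *)arg;
  PredictParam param = {0};  /* struct from the proposal above */
  param.type = kValue;
  bst_ulong *out_shape = NULL;
  float *out_result = NULL;
  /* With a fully thread-safe booster, no mutex / synchronized block is
   * needed around this call. */
  XGBoosterPredictFromDMatrix(task->booster, task->dmatrix, param,
                              &out_shape, &out_result);
  return NULL;
}

/* Usage (handles assumed initialised elsewhere): two concurrent
 * predictions against the same booster. */
void predict_concurrently(struct PredictTask *a, struct PredictTask *b) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, run_predict, a);
  pthread_create(&t2, NULL, run_predict, b);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
}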

@Craigacp
Contributor

Craigacp commented Jan 25, 2021

Ok. I'm in favour of making it completely thread safe, as that's one of the things that's frustrating about the current XGBoost4J API. It would be nice to get rid of synchronized on the predict methods.

If it's possible to help with that without having to understand all the internals of the library, then I'm happy to help; the last time I looked, figuring out the thread safety of a particular bit of code required understanding a lot of the internals.

@trivialfis
Member Author

@Craigacp This PR should help gbtree: #6648
