[RFC] Loss Functions - Survival Model and Binomial Loss with count upper bound #4491
Comments
For those unfamiliar with survival analysis: this survey paper provides an excellent overview of survival analysis. Found this paper thanks to @avinashbarnwal
You wrote that the AFT model requires "extra sigma for normal distribution as a parameter to be estimated and beta for logistic distribution", but it would be better to use the standard parameterization, which has sigma for both distributions and allows easy extension to other distributions as well. This is described in section 6.9 of the survival package manual http://members.cbio.mines-paristech.fr/~thocking/survival.pdf
also see avinashbarnwal#1, which is a summary of calls with @avinashbarnwal, our GSOC'19 student
I will change it to the standard parameterization, with sigma for both distributions. I was referring to your document https://github.com/tdhock/aft-poster/blob/master/HOCKING-AFT.pdf for log-logistic, where you mention beta as the parameter.
AFT is now part of XGBoost: #4763
@hcho3 Can we close this?
Motivation
XGBoost supports a range of loss functions, from least squares to the Cox proportional hazards model. It has rich support for linear regression, classification, count and survival loss models. Adding further survival and classification loss functions would improve the features and flexibility of the package.
Goals
Support a survival model, Accelerated Failure Time, for left-, right- and interval-censored data.
Support count data with an upper bound through a binomial loss.
Non-Goals
This proposal is for XGBoost only. The interface proposed here is specific to R but can be generalized to other language bindings later.
Assumptions
Adding new attributes to the data matrix is allowed. This might change the properties of the Cox proportional hazards model, given that we might add support for interval-censored regression.
Risks
A new hyperparameter, sigma, needs tuning, which might slow down cross-validation.
Design
Existing Code
Phase 1: Accelerated Failure Time (AFT)
Here we aim to support left, right and interval censored datasets.
Add new loss functions for Accelerated Failure Time with normal and logistic distributions.
Interval regression requires two labels as input. Left- and right-censored data can be represented as interval-censored data, where the lower limit is -inf for left-censored observations and the upper limit is inf for right-censored observations. We add a new attribute to the DMatrix object to store the extra label.
For a point event (uncensored label), the lower limit equals the upper limit.
A new parameter is created to pass the distribution type, either Normal or Logistic.
The gradient, hessian and an evaluation metric will be added for the AFT loss functions.
The loss functions require an extra parameter, sigma, which will be treated as a hyperparameter.
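As a rough illustration of the label encoding and loss described above, here is a minimal sketch of the per-row AFT negative log-likelihood in plain Python. It is not the XGBoost implementation: the function names (`aft_nll` and the helpers) are hypothetical, and it assumes that the prediction and both label bounds are on the log-time scale, so that -inf and inf mark left and right censoring as in the list above.

```python
import math

# Illustrative sketch only; names are hypothetical, not XGBoost's API.
# Assumption: pred, lower and upper are all on the log-time scale.

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_pdf(z):
    e = math.exp(-abs(z))  # density is symmetric; this form avoids overflow
    return e / (1.0 + e) ** 2

def logistic_cdf(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

DISTRIBUTIONS = {
    "normal": (normal_pdf, normal_cdf),
    "logistic": (logistic_pdf, logistic_cdf),
}

def aft_nll(pred, lower, upper, sigma=1.0, dist="normal"):
    """Negative log-likelihood of one row with label interval [lower, upper]."""
    pdf, cdf = DISTRIBUTIONS[dist]
    if lower == upper:
        # Uncensored point event: use the density of pred + sigma * Z.
        z = (lower - pred) / sigma
        likelihood = pdf(z) / sigma
    else:
        # Censored: probability mass between the two bounds.
        # lower = -inf gives cdf 0 (left-censored); upper = inf gives cdf 1 (right-censored).
        likelihood = cdf((upper - pred) / sigma) - cdf((lower - pred) / sigma)
    return -math.log(max(likelihood, 1e-12))  # clamp to avoid log(0)

# One row of each kind, all with prediction 0 and sigma 1.
for lo, hi in [(0.0, 0.0), (-math.inf, 0.0), (0.0, math.inf), (-1.0, 1.0)]:
    print((lo, hi), aft_nll(0.0, lo, hi), aft_nll(0.0, lo, hi, dist="logistic"))
```

The gradient and hessian with respect to `pred` would be derived from this expression for each distribution; sigma enters only through the standardized residual, which is why it can be treated as a fixed hyperparameter during boosting.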
Phase 2: Binomial Loss for count data with an upper bound
Here we aim to add a new loss function for binomial loss.
The input label is a count, which differs from binary classification.
Each row has its own upper bound on the response count. A new attribute is added to the DMatrix object to store this upper bound.
The gradient, hessian and an evaluation metric will be added for this loss function.
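To make the binomial loss concrete, the following is a minimal sketch of the per-row loss, gradient and hessian, assuming the model outputs a raw score on the logit scale (an assumption; the proposal does not fix the link function). The function names are hypothetical, and the constant binomial coefficient is dropped because it does not depend on the score.

```python
import math

# Illustrative sketch only; names are hypothetical, not XGBoost's API.
# Assumption: `score` is the raw model output on the logit scale.

def sigmoid(score):
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def binomial_nll(score, count, upper):
    """Negative log-likelihood of `count` successes out of `upper` trials."""
    # log(1 + exp(score)), written to avoid overflow for large |score|.
    softplus = max(score, 0.0) + math.log1p(math.exp(-abs(score)))
    return upper * softplus - count * score

def binomial_grad_hess(score, count, upper):
    """First and second derivatives of binomial_nll with respect to score."""
    p = sigmoid(score)
    grad = upper * p - count       # expected count minus observed count
    hess = upper * p * (1.0 - p)   # binomial variance
    return grad, hess

# A row with 3 successes and an upper bound of 10, at score 0.
print(binomial_nll(0.0, 3, 10), binomial_grad_hess(0.0, 3, 10))
```

With an upper bound of 1 for every row, these expressions reduce to the usual logistic loss for binary classification, which is one way to sanity-check an implementation.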
cc @hcho3 @tdhock
Note. This work is being carried out under the Google Summer of Code 2019. See discussion at #4242.