[RFC] Loss Functions - Survival Model and Binomial Loss with count upper bound #4491

Closed
avinashbarnwal opened this issue May 23, 2019 · 6 comments
@avinashbarnwal
Contributor

avinashbarnwal commented May 23, 2019

Motivation

XGBoost supports a variety of loss functions, ranging from least squares to the Cox proportional hazards model. It already has rich support for regression, classification, count, and survival losses. Adding more survival and classification loss functions would further improve the features and flexibility of the package.

Goals

  • Support the Accelerated Failure Time (AFT) survival model for left-, right-, and interval-censored data.

  • Support a binomial loss for count data with an upper bound.

Non-Goals

This proposal is for XGBoost only. The interface proposed here is specific to R but can be generalized to other language bindings later.

Assumptions

Adding new attributes to the data matrix is allowed. This might change the properties of the Cox proportional hazards model, given that we might add support for interval-censored regression.

Risks
A new hyperparameter, sigma, has to be tuned, which might slow down cross-validation.

Design

Existing Code

  • Survival model - The Cox proportional hazards model is currently supported. The sign of the label encodes the event status: the label is positive if the event happened and negative if it did not. This covers right-censored data only.
  • Logistic regression - Logistic regression and logistic regression for binary classification are supported, with labels taking values in {0,1}.
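
The sign convention for Cox labels described above can be sketched as follows. This is a minimal illustration, not XGBoost code: the helper name `cox_label` and its arguments are hypothetical, and it assumes strictly positive survival times.

```python
def cox_label(time, event_observed):
    """Encode one observation as a signed label (hypothetical helper).

    Positive label: the event happened at `time`.
    Negative label: the observation was right-censored at `time`.
    """
    if time <= 0:
        # Assumption: times are strictly positive, so the sign is unambiguous.
        raise ValueError("time must be strictly positive")
    return time if event_observed else -time


# Three observations: event at t=5, censored at t=3, event at t=8.
times = [5.0, 3.0, 8.0]
events = [True, False, True]
labels = [cox_label(t, e) for t, e in zip(times, events)]
```

Because a single signed number can only say "the event happened at t" or "the event happened after t", this encoding has no way to express left or interval censoring, which is what motivates the two-label design in Phase 1.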

Phase 1: Accelerated Failure Time (AFT)
Here we aim to support left, right and interval censored datasets.

  • New loss functions for Accelerated Failure Time with normal and logistic distributions.

  • Interval regression requires two labels as input. Left- and right-censored data can be represented as interval-censored data: the lower limit is -inf for left-censored observations and the upper limit is inf for right-censored observations. We add a new attribute to the DMatrix object to store the extra label.

  • For a point event (uncensored label), the lower limit equals the upper limit.

  • A new parameter is created to pass the distribution type, either Normal or Logistic.

  • The gradient, Hessian, and an evaluation metric will be added for the AFT loss functions.

  • The loss functions require an extra parameter, sigma, which will be treated as a hyperparameter.
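
The interval representation and the per-row loss listed above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the proposed implementation: the function names (`pdf`, `cdf`, `aft_nll`, `numeric_grad_hess`) are hypothetical, the two labels are assumed to be on the log-time scale so that -inf and inf are valid limits, and the derivatives are approximated by finite differences as a stand-in for the analytic gradient and Hessian.

```python
import math


def pdf(z, dist):
    """Density of the standardized noise term."""
    if dist == "normal":
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    if dist == "logistic":
        e = math.exp(-abs(z))  # density is symmetric; this form avoids overflow
        return e / (1.0 + e) ** 2
    raise ValueError(f"unknown distribution: {dist}")


def cdf(z, dist):
    """CDF of the standardized noise term, with explicit infinite limits."""
    if z == math.inf:
        return 1.0
    if z == -math.inf:
        return 0.0
    if dist == "normal":
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    if dist == "logistic":
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)
    raise ValueError(f"unknown distribution: {dist}")


def aft_nll(lower, upper, pred, sigma, dist="normal"):
    """Negative log-likelihood of one row with label interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower limit must not exceed upper limit")
    if lower == upper:
        # Point event (uncensored): use the density.
        z = (lower - pred) / sigma
        return -math.log(pdf(z, dist) / sigma)
    # Censored: probability mass of the interval.
    z_lo = (lower - pred) / sigma
    z_hi = (upper - pred) / sigma
    return -math.log(cdf(z_hi, dist) - cdf(z_lo, dist))


def numeric_grad_hess(lower, upper, pred, sigma, dist="normal", eps=1e-4):
    """Central finite differences with respect to the prediction."""
    f0 = aft_nll(lower, upper, pred, sigma, dist)
    fp = aft_nll(lower, upper, pred + eps, sigma, dist)
    fm = aft_nll(lower, upper, pred - eps, sigma, dist)
    return (fp - fm) / (2.0 * eps), (fp - 2.0 * f0 + fm) / (eps * eps)


# The four censoring types as (lower, upper) pairs on the log-time scale.
rows = {
    "uncensored": (1.0, 1.0),
    "left": (-math.inf, 1.0),
    "right": (1.0, math.inf),
    "interval": (0.5, 1.5),
}
losses = {name: aft_nll(lo, hi, pred=0.8, sigma=1.0) for name, (lo, hi) in rows.items()}
```

Keeping sigma as an explicit argument mirrors its role as a hyperparameter: the same row yields a different loss surface for each sigma, so sigma would be tuned by cross-validation rather than learned by the booster.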

Phase 2: Binomial Loss for count data with an upper bound
Here we aim to add a new binomial loss function.

  • The input label is a count, which differs from the {0,1} label used in binary classification.

  • Each row has its own upper bound on the response count. A new attribute is added to the DMatrix object to store this upper bound.

  • The gradient, Hessian, and an evaluation metric will be added for this loss function.
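
A binomial loss with a per-row upper bound can be sketched as below. This is a sketch under assumptions the RFC does not fix: it uses a logit link between the raw score and the success probability, drops the binomial coefficient (which does not depend on the score), and the function names and arguments are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binomial_nll(raw_score, count, upper_bound):
    """-[y*log(p) + (n - y)*log(1 - p)] with p = sigmoid(raw_score).

    Rewritten as n*softplus(f) - y*f for numerical stability.
    """
    softplus = max(raw_score, 0.0) + math.log1p(math.exp(-abs(raw_score)))
    return upper_bound * softplus - count * raw_score


def binomial_grad_hess(raw_score, count, upper_bound):
    """First and second derivatives of binomial_nll w.r.t. raw_score."""
    if not 0 <= count <= upper_bound:
        raise ValueError("count must lie in [0, upper_bound]")
    p = sigmoid(raw_score)
    grad = upper_bound * p - count          # n*p - y
    hess = upper_bound * p * (1.0 - p)      # n*p*(1 - p)
    return grad, hess


# One row: 3 successes out of an upper bound of 10, raw score 0 (p = 0.5).
grad, hess = binomial_grad_hess(0.0, count=3, upper_bound=10)
```

With an upper bound of 1 for every row, these expressions reduce to the usual binary logistic gradient and Hessian, which is why the upper bound has to be stored per row rather than as a global parameter.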

cc @hcho3 @tdhock

Note: This work is being carried out as part of Google Summer of Code 2019. See the discussion at #4242.

@hcho3 hcho3 changed the title Loss Functions - Survival Model and Binomial Loss with count upper bound [RFC] Loss Functions - Survival Model and Binomial Loss with count upper bound May 23, 2019
@hcho3 hcho3 self-assigned this May 23, 2019
@hcho3
Collaborator

hcho3 commented May 23, 2019

For those unfamiliar with survival analysis: this survey paper provides an excellent overview of survival analysis. Found this paper thanks to @avinashbarnwal

@tdhock

tdhock commented May 23, 2019

you wrote that AFT model requires "extra sigma for normal distribution as a parameter to be estimated and beta for logistic distribution" but it would be better if you use the standard parameterization, which has sigma for both distributions, and allows easy extension to other distributions as well. this is described in section 6.9 of the survival package manual http://members.cbio.mines-paristech.fr/~thocking/survival.pdf
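
For reference, one common way to write an AFT model with a shared scale parameter is sketched below. It is not quoted from the manual; the symbols T (survival time), f(x) (model prediction), and epsilon (noise term) are notation introduced here for illustration.

```latex
\log T = f(x) + \sigma \, \varepsilon,
\qquad
\varepsilon \sim
\begin{cases}
\mathcal{N}(0, 1) & \text{(normal AFT)} \\
\mathrm{Logistic}(0, 1) & \text{(logistic AFT)}
\end{cases}
```

Under this form, changing the distribution only changes the law of the noise term, while sigma keeps the same role as a scale parameter, so adding another distribution does not require a new parameter name.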

@tdhock

tdhock commented May 23, 2019

also see avinashbarnwal#1 which is a summary of calls with @avinashbarnwal our GSOC'19 student

@avinashbarnwal
Contributor Author

I will change it to the standard parameterization, with sigma for both distributions. I was referring to your document https://github.com/tdhock/aft-poster/blob/master/HOCKING-AFT.pdf for log-logistic, where you mention beta as the parameter.

@hcho3
Collaborator

hcho3 commented Apr 2, 2020

AFT is now part of XGBoost: #4763

@trivialfis
Member

@hcho3 Can we close this?

@hcho3 hcho3 closed this as completed Jun 18, 2020