Zero-inflated count data are common in many fields, including health care research and actuarial science. Zero-inflated Poisson (ZIP) and Zero-inflated Negative Binomial (ZINB) regression are commonly used to model such outcomes. These mixture models typically include a logistic component, which models the excess zeros beyond those generated by the count component, and a Poisson or Negative Binomial component, which models the counts. Several methods have been proposed for variable selection in ZIP and ZINB regression models. However, when the predictors have an inherent grouping structure, these individual variable selection approaches are suboptimal. To perform group variable selection in ZIP/ZINB regression models, we extend several commonly used group regularization methods from the linear regression literature. With this formulation, we are able to achieve bi-level variable selection in both the zero and count submodels of the corresponding zero-inflated models. The tuning parameter(s) of the final model can be chosen by minimizing AIC or BIC (following Huang et al., 2009).
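As a rough illustration of the mixture structure described above, the following sketch writes out the zero-inflated Poisson probability mass function in base R. The function name dzip and its arguments lambda (Poisson mean) and p0 (probability of an excess zero) are names chosen here for illustration. They are not part of the Gooogle package, and the package's internal parameterization may differ.

```r
# Hypothetical helper (not part of Gooogle): ZIP probability mass function.
# With probability p0 the outcome is an excess zero; otherwise it is Poisson(lambda).
dzip <- function(y, lambda, p0) {
  ifelse(y == 0,
         p0 + (1 - p0) * dpois(0, lambda),
         (1 - p0) * dpois(y, lambda))
}

# Zeros are more likely than under a plain Poisson with the same lambda
dzip(0, lambda = 2, p0 = 0.3) > dpois(0, lambda = 2)

# The probabilities still sum to one (up to truncation of the support)
sum(dzip(0:100, lambda = 2, p0 = 0.3))
```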
You can install Gooogle directly from GitHub as follows:
install.packages("devtools")
devtools::install_github("himelmallick/Gooogle")
library(Gooogle)
gooogle(data=data,yvar=yvar,xvars=xvars,zvars=xvars,group=rep(1,14),dist="poisson",penalty="gBridge")
- data: the dataset (a data frame) to be used for the analysis.
- yvar: the outcome variable name.
- xvars: the vector of variable names to be included in count model.
- zvars: the vector of variable names to be included in zero model.
- group: the vector of integers indicating the grouping structure among predictors.
- dist: the distribution of count model ('poisson' or 'negbin').
- penalty: the penalty to be applied for regularization. For group selection, it is one of 'grLasso', 'grMCP', or 'grSCAD' while for bi-level selection it is 'gBridge'.
For efficiency, if any coefficients are to be included in the model without being penalized, their grouping index should be zero. group is expected to be a vector of consecutive integers.
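The sketch below builds a group vector that follows these rules and checks it before fitting. The predictor names are made up for illustration, and the checks are a suggested sanity test, not something the package is known to perform.

```r
# Hypothetical predictor names, for illustration only
xvars <- c("sex", "age1", "age2", "age3", "inc1", "inc2", "inc3", "married")

# 0 = unpenalized, 1 and 2 = spline triplets, 3 = singleton group
group <- c(0, rep(1:2, each = 3), 3)

# One group index per predictor
stopifnot(length(group) == length(xvars))

# Penalized group indices should be the consecutive integers 1, 2, ...
penalized <- sort(unique(group[group != 0]))
stopifnot(all(penalized == seq_along(penalized)))

# Number of predictors in each group
table(group)
```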
The gooogle function will return a list containing the following objects:
- coefficients: a list containing the estimates for the count and logistic submodels.
- aic: AIC for the fitted model.
- bic: BIC for the fitted model.
- loglik: Log-likelihood for the fitted model.
Let's try an example on real data, using the docvisits dataset from the R package zic. Following previous studies (Jochmann, 2013), we express each continuous predictor as a group of three cubic spline variables, resulting in 24 candidate predictors in 5 triplets and 9 singleton groups.
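To make the grouping concrete, the sketch below shows that a cubic B-spline basis with three degrees of freedom turns a single continuous predictor into three columns, which is why each continuous variable forms a triplet. The simulated vector x is a stand-in and is not taken from the docvisits data.

```r
library(splines)

set.seed(1)
x <- runif(50, min = 20, max = 60)  # simulated stand-in for a continuous predictor

# bs(x, 3) sets df = 3; with the default cubic degree this gives three basis columns
basis <- bs(x, 3)
dim(basis)

# Predictor count implied by 5 triplets plus 9 singleton groups
5 * 3 + 9
```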
####################
# Load the dataset and prepare the variables
library(zic)
library(splines)
data("docvisits")
n<-nrow(docvisits)
age<-bs(docvisits$age,3)[1:n,]
hlth<-bs(docvisits$health,3)[1:n,]
hdeg<-bs(docvisits$hdegree,3)[1:n,]
schl<-bs(docvisits$schooling,3)[1:n,]
hhin<-bs(docvisits$hhincome,3)[1:n,]
attach(docvisits)
doc.spline<-cbind.data.frame(docvisits$docvisits,age,hlth,hdeg,schl,hhin,handicap,married,children,self,civil,bluec,employed,public,addon)
names(doc.spline)[1:16]<-c("docvisits",paste("age",1:3,sep=""),paste("health",1:3,sep=""),paste("hdegree",1:3,sep=""),paste("schooling",1:3,sep=""),paste("hhincome",1:3,sep=""))
data<-doc.spline
#####################################################################
Considering the grouping structure among the variables age, health, hdegree, schooling, and hhincome, we can use our algorithm to perform group-level or bi-level variable selection. Below is an example implementation of the gooogle function using the gBridge penalty.
# Fit the Gooogle method using group bridge penalty
group=c(rep(1:5,each=3),(6:14))
yvar<-names(data)[1]
xvars<-names(data)[-1]
zvars<-xvars
fit.gooogle <- gooogle(data=data,yvar=yvar,xvars=xvars,zvars=zvars,group=group,dist="negbin",penalty="gBridge")
fit.gooogle
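Since each fit reports its AIC and BIC, one possible follow-up is to fit the same model under each supported penalty and keep the one with the smallest BIC. The sketch below is only a suggestion: best_by is a hypothetical helper, not part of Gooogle, and it assumes that the aic and bic elements are single numeric values. The fits run only if Gooogle is installed and the objects from the example above exist, and they may take a while.

```r
# Hypothetical helper (not part of Gooogle): name of the fit with the smallest criterion
best_by <- function(fits, criterion = "bic") {
  scores <- vapply(fits, function(f) as.numeric(f[[criterion]]), numeric(1))
  names(scores)[which.min(scores)]
}

penalties <- c("grLasso", "grMCP", "grSCAD", "gBridge")

# Runs only if Gooogle is installed and the objects from the example above exist
if (requireNamespace("Gooogle", quietly = TRUE) && exists("doc.spline")) {
  fits <- lapply(penalties, function(p) {
    Gooogle::gooogle(data = data, yvar = yvar, xvars = xvars, zvars = zvars,
                     group = group, dist = "negbin", penalty = p)
  })
  names(fits) <- penalties
  best_by(fits, "bic")
}
```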
Huang, J., Ma, S., Xie, H., and Zhang, C. (2009). A group bridge approach for variable selection. Biometrika 96(2):339–355.
Jochmann, M. (2013). What belongs where? Variable selection for zero-inflated count models with an application to the demand for health care. Computational Statistics 28:1947–1964.
If you use Gooogle in your work, please cite the following papers:
Chatterjee, S., Chowdhury, S., Mallick, H., Banerjee, P., and Garai, B. (2018). Group Regularization for Zero-inflated Negative Binomial Regression Models with An Application to Healthcare Demand in Germany. Statistics in Medicine 37(20):3012-3026.
Chowdhury, S., Chatterjee, S., Mallick, H., Banerjee, P., and Garai, B. (2019). Group Regularization for Zero-inflated Poisson Regression Models with An Application to Insurance Ratemaking. Journal of Applied Statistics 46(9):1567-1581.
Feel free to contact us at schatterjee@niu.edu, gg0658@wayne.edu, and/or hmallick@hsph.harvard.edu.