Determine the optimal cutpoint for continuous variables #41

kassambara · 2016-05-24T15:20:20Z

As you know, In the field of cancer genomics, biologists are sometimes interested to know whether the expression profile of their favorite gene, say FGFR3, is associated with patients' prognostic. To answer to this question, one can use cox univariate analysis. They want also sometimes to split patients in FGFR3high and FGFR3low. One can arbitrary split patients into 2-3-4 groups using quartiles (q1, q2, q3).

An interesting alternative is provided by maxstat(maximally selected rank statistics) which has several advantages. First, there is no need to transform the time-dependent end point. Second, the test calculates an exact cut-off point, which can be estimated using several methods and approximations, and the discrimination power is also evaluated and estimated with a P value (type I error) 1.

See some examples in this pdf (maxstat is used): http://genomicscape.com/temp/survival_TIE3ByWMo9_plot.pdf

What do you think about adding elegant maxtat visualization support in survminer?
(I have R base scripts for that)

The philosophy of the function would be as follow:

############
# Function computing maximally selected rank test for a list of variables (genes)
#############
# data: data frame containing survival info and gene expression data
# event,time: column names containing event and time data
# variables: characther vector containing gene names: c("FGFR3", "DEPDC1", "IGF1R")
# p: pvalue threshold. only gene with pvalue <= 0.05 will be kept in the final output
# p.adjust: if TRUE, pvalues will be adjusted for multiple testing correction using Benjamini-Hochberg method 
xxx_maxstat <- function(data, event, time, variables, p = 0.05,
    p.adjust = TRUE, adjust.method = "BH")
{
 - Compute max stat for each gene and estimate the optimal cut-point
 - Dichotomize each gene based on its optimal cutpoint
 - return a mxst object data frame containing gene names, optimal cutpoint and corresponding dichotomized value
}

# S3 method
ggsurvplot.mxst <- function(mxt){
 - Visualize mxstat objects
 - generate a pdf if multiple genes
}

Articles using maxstat:

MarcinKosinski · 2016-05-24T19:33:47Z

Well formated issue! I have to assume that during my cooperation with biotechnologist I have met the idea of examining expression of one gene in HIGH and LOW groups :) That's so true. I didn't know about maxstat method, so I always used median, as a natural approach.

The plot of maxstat looks very simple, but is very useful. Maybe we could somehow extend ggsurvplot to provide cutoffs for user while he/she only specifies names of the dependent continuous variables?

PS. I remember about vignette I was about to post here :P but just got invited on Bioc2016 and I am planning my journey and VISa.

MarcinKosinski · 2016-05-24T19:34:39Z

This is extra idea to arm survminer with another tools!

kassambara · 2016-05-25T11:45:24Z

no emrgency for the vignette:-)!

kassambara · 2016-10-22T18:20:41Z

New functions added for determining and visualizing the optimal cutpoint of continuous variables for survival analyses:
- surv_cutpoint(): Determine the optimal cutpoint for each variable using 'maxstat'. Methods defined for surv_cutpoint object are summary(), print() and plot().
- surv_categorize(): Divide each variable values based on the cutpoint returned by surv_cutpoint()
0. Load some data

data(myeloma)
head(myeloma)

         molecular_group chr1q21_status treatment event  time   CCND1 CRIM1 DEPDC1    IRF4   TP53   WHSC1
GSM50986      Cyclin D-1       3 copies       TT2     0 69.24  9908.4 420.9  523.5 16156.5   10.0   261.9
GSM50988      Cyclin D-2       2 copies       TT2     0 66.43 16698.8  52.0   21.1 16946.2 1056.9   363.8
GSM50989           MMSET       2 copies       TT2     0 66.50   294.5 617.9  192.9  8903.9 1762.8 10042.9
GSM50990           MMSET       3 copies       TT2     1 42.67   241.9  11.9  184.7 11894.7  946.8  4931.0
GSM50991             MAF                  TT2     0 65.00   472.6  38.8  212.0  7563.1  361.4   165.0
GSM50992    Hyperdiploid       2 copies       TT2     0 65.20   664.1  16.9  341.6 16023.4 2096.3   569.2

1. Determine the optimal cutpoint of variables

# 1. Determine the optimal cutpoint of variables
res.cut <- surv_cutpoint(myeloma, time = "time", event = "event",
   variables = c("DEPDC1", "WHSC1", "CRIM1"))

summary(res.cut)

       cutpoint statistic
DEPDC1    279.8  4.275452
WHSC1    3205.6  3.361330
CRIM1      82.3  1.968317

2. Plot cutpoint for DEPDC1

# 2. Plot cutpoint for DEPDC1
# palette = "npg" (nature publishing group), see ?ggpubr::ggpar
plot(res.cut, "DEPDC1", palette = "npg")

3. Categorize variables

# 3. Categorize variables
res.cat <- surv_categorize(res.cut)
head(res.cat)

          time event DEPDC1 WHSC1 CRIM1
GSM50986 69.24     0   high   low  high
GSM50988 66.43     0    low   low   low
GSM50989 66.50     0    low  high  high
GSM50990 42.67     1    low  high   low
GSM50991 65.00     0    low   low   low
GSM50992 65.20     0   high   low   low

4. Fit survival curves and visualize

# 3. Fit survival curves and visualize
library("survival")
fit <- survfit(Surv(time, event) ~DEPDC1, data = res.cat)
ggsurvplot(fit, risk.table = TRUE, conf.int = TRUE)

MarcinKosinski · 2016-10-25T19:01:36Z

Wonderful and very helpful :)! good job

2016-10-22 20:20 GMT+02:00 Alboukadel KASSAMBARA notifications@github.com:

New functions added for determining and visualizing the optimal
cutpoint of continuous variables for survival analyses:

surv_cutpoint(): Determine the optimal cutpoint for each variable
using 'maxstat'. Methods defined for surv_cutpoint object are summary(),
print() and plot().

surv_categorize(): Divide each variable values based on the
cutpoint returned by surv_cutpoint()

0. Load some data

data(myeloma)
head(myeloma)
     molecular_group chr1q21_status treatment event  time   CCND1 CRIM1 DEPDC1    IRF4   TP53   WHSC1
GSM50986 Cyclin D-1 3 copies TT2 0 69.24 9908.4 420.9 523.5 16156.5 10.0 261.9
GSM50988 Cyclin D-2 2 copies TT2 0 66.43 16698.8 52.0 21.1 16946.2 1056.9 363.8
GSM50989 MMSET 2 copies TT2 0 66.50 294.5 617.9 192.9 8903.9 1762.8 10042.9
GSM50990 MMSET 3 copies TT2 1 42.67 241.9 11.9 184.7 11894.7 946.8 4931.0
GSM50991 MAF TT2 0 65.00 472.6 38.8 212.0 7563.1 361.4 165.0
GSM50992 Hyperdiploid 2 copies TT2 0 65.20 664.1 16.9 341.6 16023.4 2096.3 569.2

1. Determine the optimal cutpoint of variables

1. Determine the optimal cutpoint of variablesres.cut <- surv_cutpoint(myeloma, time = "time", event = "event",

variables = c("DEPDC1", "WHSC1", "CRIM1"))

summary(res.cut)
   cutpoint statistic
DEPDC1 279.8 4.275452
WHSC1 3205.6 3.361330
CRIM1 82.3 1.968317

* 2. Plot cutpoint for DEPDC1*

2. Plot cutpoint for DEPDC1# palette = "npg" (nature publishing group), see ?ggpubr::ggpar

plot(res.cut, "DEPDC1", palette = "npg")

[image: rplot16]
https://cloud.githubusercontent.com/assets/3313355/19621578/f63b9938-9894-11e6-930c-1f0efca81919.png

3. Categorize variables

3. Categorize variablesres.cat <- surv_categorize(res.cut)

head(res.cat)
      time event DEPDC1 WHSC1 CRIM1
GSM50986 69.24 0 high low high
GSM50988 66.43 0 low low low
GSM50989 66.50 0 low high high
GSM50990 42.67 1 low high low
GSM50991 65.00 0 low low low
GSM50992 65.20 0 high low low

4. Fit survival curves and visualize

3. Fit survival curves and visualize

library("survival")fit <- survfit(Surv(time, event) ~DEPDC1, data = res.cat)
ggsurvplot(fit, risk.table = TRUE, conf.int = TRUE)

[image: rplot15]
https://cloud.githubusercontent.com/assets/3313355/19621540/59b447f4-9894-11e6-8b5b-6a5add1769a7.png

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#41 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGdazn6nuZkhLokYp7bBAdljp7BNr1Ewks5q2lP5gaJpZM4Ilmxn
.

kassambara · 2016-10-26T04:50:47Z

Thanks:-)!

Jusinnnqu · 2023-03-30T19:07:47Z

@kassambara Great work! Just wondering if this applies to the case when I try to find multiple cutpoints in one group, ie. 2 points to break into low/medium/high?

kassambara changed the title ~~support for visualizing maximally selected rank statistics~~ Support for visualizing maximally selected rank statistics May 24, 2016

kassambara added the enhancement label Oct 18, 2016

kassambara changed the title ~~Support for visualizing maximally selected rank statistics~~ Determine the optimal cutpoint for continuous variables Oct 22, 2016

kassambara added a commit that referenced this issue Oct 22, 2016

new functions for determining and visualizing optimal cutpoint #41

1f11f79

kassambara added a commit that referenced this issue Oct 22, 2016

News updated #41

e296cb9

kassambara closed this as completed Oct 26, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Determine the optimal cutpoint for continuous variables #41

Determine the optimal cutpoint for continuous variables #41

kassambara commented May 24, 2016

MarcinKosinski commented May 24, 2016

MarcinKosinski commented May 24, 2016

kassambara commented May 25, 2016

kassambara commented Oct 22, 2016 •

edited

Loading

MarcinKosinski commented Oct 25, 2016

surv_categorize(): Divide each variable values based on the
cutpoint returned by surv_cutpoint()

1. Determine the optimal cutpoint of variablesres.cut <- surv_cutpoint(myeloma, time = "time", event = "event",

2. Plot cutpoint for DEPDC1# palette = "npg" (nature publishing group), see ?ggpubr::ggpar

3. Categorize variablesres.cat <- surv_categorize(res.cut)

3. Fit survival curves and visualize

kassambara commented Oct 26, 2016

Jusinnnqu commented Mar 30, 2023

Determine the optimal cutpoint for continuous variables #41

Determine the optimal cutpoint for continuous variables #41

Comments

kassambara commented May 24, 2016

MarcinKosinski commented May 24, 2016

MarcinKosinski commented May 24, 2016

kassambara commented May 25, 2016

kassambara commented Oct 22, 2016 • edited Loading

MarcinKosinski commented Oct 25, 2016

surv_categorize(): Divide each variable values based on the cutpoint returned by surv_cutpoint()

1. Determine the optimal cutpoint of variablesres.cut <- surv_cutpoint(myeloma, time = "time", event = "event",

2. Plot cutpoint for DEPDC1# palette = "npg" (nature publishing group), see ?ggpubr::ggpar

3. Categorize variablesres.cat <- surv_categorize(res.cut)

3. Fit survival curves and visualize

kassambara commented Oct 26, 2016

Jusinnnqu commented Mar 30, 2023

kassambara commented Oct 22, 2016 •

edited

Loading

surv_categorize(): Divide each variable values based on the
cutpoint returned by surv_cutpoint()