Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determine the optimal cutpoint for continuous variables #41

Closed
kassambara opened this issue May 24, 2016 · 7 comments
Closed

Determine the optimal cutpoint for continuous variables #41

kassambara opened this issue May 24, 2016 · 7 comments

Comments

@kassambara
Copy link
Owner

Hi @MarcinKosinski ,

As you know, In the field of cancer genomics, biologists are sometimes interested to know whether the expression profile of their favorite gene, say FGFR3, is associated with patients' prognostic. To answer to this question, one can use cox univariate analysis. They want also sometimes to split patients in FGFR3high and FGFR3low. One can arbitrary split patients into 2-3-4 groups using quartiles (q1, q2, q3).

An interesting alternative is provided by maxstat(maximally selected rank statistics) which has several advantages. First, there is no need to transform the time-dependent end point. Second, the test calculates an exact cut-off point, which can be estimated using several methods and approximations, and the discrimination power is also evaluated and estimated with a P value (type I error) 1.

See some examples in this pdf (maxstat is used): http://genomicscape.com/temp/survival_TIE3ByWMo9_plot.pdf

What do you think about adding elegant maxtat visualization support in survminer?
(I have R base scripts for that)

The philosophy of the function would be as follow:

############
# Function computing maximally selected rank test for a list of variables (genes)
#############
# data: data frame containing survival info and gene expression data
# event,time: column names containing event and time data
# variables: characther vector containing gene names: c("FGFR3", "DEPDC1", "IGF1R")
# p: pvalue threshold. only gene with pvalue <= 0.05 will be kept in the final output
# p.adjust: if TRUE, pvalues will be adjusted for multiple testing correction using Benjamini-Hochberg method 
xxx_maxstat <- function(data, event, time, variables, p = 0.05,
    p.adjust = TRUE, adjust.method = "BH")
{
 - Compute max stat for each gene and estimate the optimal cut-point
 - Dichotomize each gene based on its optimal cutpoint
 - return a mxst object data frame containing gene names, optimal cutpoint and corresponding dichotomized value
}

# S3 method
ggsurvplot.mxst <- function(mxt){
 - Visualize mxstat objects
 - generate a pdf if multiple genes
}

Articles using maxstat:

  1. http://www.haematologica.org/content/99/9/1410
  2. http://www.bloodjournal.org/content/bloodjournal/120/5/1087.full.pdf
  3. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4058021/pdf/oncotarget-05-2487.pdf
  4. http://www.impactjournals.com/oncotarget/index.php?journal=oncotarget&page=article&op=view&path[]=3237&path[]=6227
@kassambara kassambara changed the title support for visualizing maximally selected rank statistics Support for visualizing maximally selected rank statistics May 24, 2016
@MarcinKosinski
Copy link
Contributor

Well formated issue! I have to assume that during my cooperation with biotechnologist I have met the idea of examining expression of one gene in HIGH and LOW groups :) That's so true. I didn't know about maxstat method, so I always used median, as a natural approach.

The plot of maxstat looks very simple, but is very useful. Maybe we could somehow extend ggsurvplot to provide cutoffs for user while he/she only specifies names of the dependent continuous variables?

PS. I remember about vignette I was about to post here :P but just got invited on Bioc2016 and I am planning my journey and VISa.

@MarcinKosinski
Copy link
Contributor

  • This is extra idea to arm survminer with another tools!

@kassambara
Copy link
Owner Author

no emrgency for the vignette:-)!

@kassambara kassambara changed the title Support for visualizing maximally selected rank statistics Determine the optimal cutpoint for continuous variables Oct 22, 2016
kassambara added a commit that referenced this issue Oct 22, 2016
@kassambara
Copy link
Owner Author

kassambara commented Oct 22, 2016

  • New functions added for determining and visualizing the optimal cutpoint of continuous variables for survival analyses:
    • surv_cutpoint(): Determine the optimal cutpoint for each variable using 'maxstat'. Methods defined for surv_cutpoint object are summary(), print() and plot().
    • surv_categorize(): Divide each variable values based on the cutpoint returned by surv_cutpoint()
  • 0. Load some data
data(myeloma)
head(myeloma)
         molecular_group chr1q21_status treatment event  time   CCND1 CRIM1 DEPDC1    IRF4   TP53   WHSC1
GSM50986      Cyclin D-1       3 copies       TT2     0 69.24  9908.4 420.9  523.5 16156.5   10.0   261.9
GSM50988      Cyclin D-2       2 copies       TT2     0 66.43 16698.8  52.0   21.1 16946.2 1056.9   363.8
GSM50989           MMSET       2 copies       TT2     0 66.50   294.5 617.9  192.9  8903.9 1762.8 10042.9
GSM50990           MMSET       3 copies       TT2     1 42.67   241.9  11.9  184.7 11894.7  946.8  4931.0
GSM50991             MAF                  TT2     0 65.00   472.6  38.8  212.0  7563.1  361.4   165.0
GSM50992    Hyperdiploid       2 copies       TT2     0 65.20   664.1  16.9  341.6 16023.4 2096.3   569.2
  • 1. Determine the optimal cutpoint of variables
# 1. Determine the optimal cutpoint of variables
res.cut <- surv_cutpoint(myeloma, time = "time", event = "event",
   variables = c("DEPDC1", "WHSC1", "CRIM1"))

summary(res.cut)
       cutpoint statistic
DEPDC1    279.8  4.275452
WHSC1    3205.6  3.361330
CRIM1      82.3  1.968317
  • 2. Plot cutpoint for DEPDC1
# 2. Plot cutpoint for DEPDC1
# palette = "npg" (nature publishing group), see ?ggpubr::ggpar
plot(res.cut, "DEPDC1", palette = "npg")

rplot16

  • 3. Categorize variables
# 3. Categorize variables
res.cat <- surv_categorize(res.cut)
head(res.cat)
          time event DEPDC1 WHSC1 CRIM1
GSM50986 69.24     0   high   low  high
GSM50988 66.43     0    low   low   low
GSM50989 66.50     0    low  high  high
GSM50990 42.67     1    low  high   low
GSM50991 65.00     0    low   low   low
GSM50992 65.20     0   high   low   low
  • 4. Fit survival curves and visualize
# 3. Fit survival curves and visualize
library("survival")
fit <- survfit(Surv(time, event) ~DEPDC1, data = res.cat)
ggsurvplot(fit, risk.table = TRUE, conf.int = TRUE)

rplot15

@MarcinKosinski
Copy link
Contributor

Wonderful and very helpful :)! good job

2016-10-22 20:20 GMT+02:00 Alboukadel KASSAMBARA notifications@github.com:

New functions added for determining and visualizing the optimal
cutpoint of continuous variables for survival analyses:

  • surv_cutpoint(): Determine the optimal cutpoint for each variable
    using 'maxstat'. Methods defined for surv_cutpoint object are summary(),
    print() and plot().

    • surv_categorize(): Divide each variable values based on the
      cutpoint returned by surv_cutpoint()

    0. Load some data

data(myeloma)
head(myeloma)

     molecular_group chr1q21_status treatment event  time   CCND1 CRIM1 DEPDC1    IRF4   TP53   WHSC1

GSM50986 Cyclin D-1 3 copies TT2 0 69.24 9908.4 420.9 523.5 16156.5 10.0 261.9
GSM50988 Cyclin D-2 2 copies TT2 0 66.43 16698.8 52.0 21.1 16946.2 1056.9 363.8
GSM50989 MMSET 2 copies TT2 0 66.50 294.5 617.9 192.9 8903.9 1762.8 10042.9
GSM50990 MMSET 3 copies TT2 1 42.67 241.9 11.9 184.7 11894.7 946.8 4931.0
GSM50991 MAF TT2 0 65.00 472.6 38.8 212.0 7563.1 361.4 165.0
GSM50992 Hyperdiploid 2 copies TT2 0 65.20 664.1 16.9 341.6 16023.4 2096.3 569.2

  • 1. Determine the optimal cutpoint of variables

1. Determine the optimal cutpoint of variablesres.cut <- surv_cutpoint(myeloma, time = "time", event = "event",

variables = c("DEPDC1", "WHSC1", "CRIM1"))

summary(res.cut)

   cutpoint statistic

DEPDC1 279.8 4.275452
WHSC1 3205.6 3.361330
CRIM1 82.3 1.968317

  • * 2. Plot cutpoint for DEPDC1*

2. Plot cutpoint for DEPDC1# palette = "npg" (nature publishing group), see ?ggpubr::ggpar

plot(res.cut, "DEPDC1", palette = "npg")

[image: rplot16]
https://cloud.githubusercontent.com/assets/3313355/19621578/f63b9938-9894-11e6-930c-1f0efca81919.png

  • 3. Categorize variables

3. Categorize variablesres.cat <- surv_categorize(res.cut)

head(res.cat)

      time event DEPDC1 WHSC1 CRIM1

GSM50986 69.24 0 high low high
GSM50988 66.43 0 low low low
GSM50989 66.50 0 low high high
GSM50990 42.67 1 low high low
GSM50991 65.00 0 low low low
GSM50992 65.20 0 high low low

  • 4. Fit survival curves and visualize

3. Fit survival curves and visualize

library("survival")fit <- survfit(Surv(time, event) ~DEPDC1, data = res.cat)
ggsurvplot(fit, risk.table = TRUE, conf.int = TRUE)

[image: rplot15]
https://cloud.githubusercontent.com/assets/3313355/19621540/59b447f4-9894-11e6-8b5b-6a5add1769a7.png


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#41 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGdazn6nuZkhLokYp7bBAdljp7BNr1Ewks5q2lP5gaJpZM4Ilmxn
.

@kassambara
Copy link
Owner Author

Thanks:-)!

@Jusinnnqu
Copy link

@kassambara Great work! Just wondering if this applies to the case when I try to find multiple cutpoints in one group, ie. 2 points to break into low/medium/high?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants