Add RTX prediction notebook #18

Merged · 4 commits merged into greenelab:master on Jun 24, 2018
Conversation

@jaclyn-taroni (Collaborator) commented on Jun 21, 2018

In this PR, I am adding a notebook that performs a supervised ML analysis using the RTX data. I attempt to predict response labels (nonresponder and two types of responders: tolerant and nontolerant) from the baseline samples' expression data using multinomial LASSO. Unfortunately, there are only 36 baseline samples with response labels. With this sample size, I use leave-one-out cross-validation (LOOCV) and do not have a hold-out set. I am comparing three data entities, if you will (a rough sketch of the modelling setup follows the list below):

  • Expression data -- gene-level measurements filtered to include only the genes that are in the recount2 model
  • recount2 LVs -- the RTX data projected into the recount2 PLIER model latent space
  • RTX PLIER LVs -- the RTX data in the latent space of its own dataset-specific PLIER model
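
For concreteness, here is a minimal sketch of that setup using glmnet: a multinomial LASSO fit with cv.glmnet, where setting nfolds to the number of samples gives LOOCV. The object names and the synthetic data are placeholders for illustration, not the notebook's actual variables.

```{r}
library(glmnet)

# Synthetic stand-ins for the notebook's objects (illustration only):
# 36 baseline samples x 200 features, three response classes
set.seed(123)
expr.mat <- matrix(rnorm(36 * 200), nrow = 36)
response.labels <- factor(sample(c("nonresponder", "tolerant", "nontolerant"),
                                 size = 36, replace = TRUE))

# Multinomial LASSO; nfolds equal to the number of samples -> LOOCV
cv.fit <- cv.glmnet(x = expr.mat,
                    y = response.labels,
                    family = "multinomial",
                    type.measure = "class",
                    nfolds = nrow(expr.mat),
                    grouped = FALSE)  # single-sample folds require ungrouped CV

# Misclassification error at the CV-selected lambda
min(cv.fit$cvm)
```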

My hypothesis was that the multiPLIER LVs (recount2) would outperform the expression data in this prediction task. The results in this notebook refute that hypothesis. A few notes & observations, although I have no intention of further exploring this avenue given the limitations:

  • The range of the recount2 LV values is smaller / "flatter" than that of the other entities (this can be seen at the bottom of the notebook). It's possible that scaling the features would provide a fairer comparison between the entities (see the sketch after this list). In general, there's probably room for improvement in training/tuning, etc.
  • My original hypothesis w.r.t. the multiPLIER LVs was that this approach would generalize better than models trained on gene-level expression. I can't test that with this dataset + sample size, so I note it here as a future direction.
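
As a rough sketch of the kind of feature scaling meant above: z-scoring each feature so the three entities are on a comparable scale before fitting. Whether this actually changes the comparison is untested here, and the object name is again a placeholder.

```{r}
# Placeholder matrix standing in for any one of the three data entities (illustration only)
set.seed(123)
expr.mat <- matrix(rnorm(36 * 200), nrow = 36)

# Drop zero-variance features first (scale() would turn them into NaN),
# then z-score each feature (column) to zero mean and unit variance
keep <- apply(expr.mat, 2, sd) > 0
scaled.mat <- scale(expr.mat[, keep, drop = FALSE], center = TRUE, scale = TRUE)
```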

Notebook html: 25-predict_response.nb.zip

@jaclyn-taroni changed the title from "[WIP] Add RTX prediction notebook" to "Add RTX prediction notebook" on Jun 22, 2018
@huqiwen0313 left a comment

Looks good to me. Some minor comments.

```{r}
y = baseline.covariate.df$mainclass,
type.measure = "class",
family = "multinomial",
nfolds = nrow(baseline.exprs)) # LOOCV
```

Is it worth tuning the lasso parameters to achieve the best performance using caret::train?

@jaclyn-taroni (Collaborator, Author) replied:

Totally agree that this would be my next move if I were going to work on this a bit more!
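
For reference, a rough sketch of what that tuning could look like via caret's glmnet interface with leave-one-out resampling; the tuning grid and object names are illustrative assumptions, not something the notebook currently does.

```{r}
library(caret)
library(glmnet)

# Synthetic stand-ins, for illustration only
set.seed(123)
expr.mat <- matrix(rnorm(36 * 200), nrow = 36)
colnames(expr.mat) <- paste0("feature", seq_len(ncol(expr.mat)))
response.labels <- factor(sample(c("nonresponder", "tolerant", "nontolerant"),
                                 size = 36, replace = TRUE))

# Tune the elastic net mixing parameter (alpha) and penalty (lambda) with LOOCV;
# caret selects the multinomial family automatically for a 3-level factor outcome
tune.fit <- caret::train(x = expr.mat,
                         y = response.labels,
                         method = "glmnet",
                         trControl = caret::trainControl(method = "LOOCV"),
                         tuneGrid = expand.grid(alpha = c(0.5, 1),
                                                lambda = 10^seq(-3, 0, length.out = 20)))
tune.fit$bestTune
```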

```{r}
ggplot2::ggsave(file.path(plot.dir, "total_accuracy_CI.pdf"),
plot = ggplot2::last_plot())
```

The RTX LVs have a prediction accuracy equal to 1... seems like overfitting?

@jaclyn-taroni (Collaborator, Author) replied:

Yes, absolutely


```{r}
summary(as.vector(rtx.baseline.b))
```

I agree, the LV values may affect prediction. Do you think scaling them to the same level would make the accuracies look more similar?

@jaclyn-taroni (Collaborator, Author) replied:

It's possible, but I think there is not much to be done with this sample size (especially given the overfitting, as you point out above!)

@jaclyn-taroni (Collaborator, Author) commented:

Thanks for the comments, @huqiwen0313. I agree with them and am glad they will be recorded here! I am going to merge this the way it is because I will not investigate this particular avenue further until I have a larger or more appropriate dataset.

@jaclyn-taroni merged commit 2e5cd97 into greenelab:master on Jun 24, 2018