
SCPCP000006 - Add 01-clustering template analysis #704

Closed
wants to merge 16 commits

Conversation

@maud-p (Contributor) commented Aug 8, 2024

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

#635

What is the goal of this pull request?

As discussed in the previous PR (#699) I wanted to have one notebook that is rendered over the entire Wilms tumor dataset.

Briefly describe the general approach you took to achieve this goal.

  1. I basically took the previous .Rmd file,
  2. removed all the comments specific to one sample,
  3. removed the label transfer using the fetal kidney atlas (for the next PR), and
  4. wrote an R script to render the notebook on all samples.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes! More than one. At least:

  • A PR to build the fetal kidney reference for Azimuth,
  • A PR to include the label transfer from the fetal kidney reference to the clustering template.
  • A PR to include inferCNV and/or copyKAT in the template.
  • A PR to subdivide and re-cluster stroma and fetal nephron cells, which contain normal + cancer cells.

Results

What types of results does your code produce (e.g., table, figure)?

HTML reports

What is your summary of the results?

Provide directions for reviewers

In this section, tell reviewers what kind of feedback you are looking for.
This information will help guide their review.

What are the software and computational requirements needed to be able to run the code in this PR?

This information will help reviewers run the code during review, if applicable.
For software, how should reviewers set up their environment (e.g., renv or conda) to run this code?
For compute, can reviewers run this code on their laptop, or do they need additional computational resources such as RAM or storage?
Please make sure this information, if applicable, is documented in the README.md file.

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

If you have any comments or ideas, they are very welcome! Either on the content of the template (as I said, I think some sections will not be so useful) or on the future plans:

  • build the fetal kidney reference for Azimuth,
  • include the label transfer from the fetal kidney reference to the clustering template.
  • include inferCNV and/or copyKAT in the template.
  • subdivide and re-cluster stroma and fetal nephron cells, which contain normal + cancer cells

Author checklists

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@maud-p requested a review from jaclyn-taroni as a code owner August 8, 2024 14:26
@jaclyn-taroni requested review from sjspielman and removed the request for jaclyn-taroni August 8, 2024 14:30
@sjspielman (Member)

Hi @maud-p, thanks for filing this updated PR! I wanted to let you know that I'm going over your code now, and you can expect a review from me tomorrow.

@sjspielman (Member) left a comment

Thanks very much for filing this PR with the template notebook - we're definitely headed in the right direction with this update!! For the first round of review, let's focus on the code itself and then see where we land, and soon we can start thinking more about the specific science/analyses.

That said, I do want to ask before we get too far here:

  • Do you have a sense of how sensitive results from DElegate are likely to be to the specific clustering? In other words, if the clustering were different, would the results substantially change?
  • Do you have a sense of how much these clustering results might be used going forward? If these are not likely to be used much further than this initial exploration, then my first question may not be so important.

Now, there are two pieces of high-level feedback I'll provide, as well as other feedback. Note that throughout the code I've also left some inline comments and/or suggestions, some of which relate back to this overall comment. Since this is your first time responding to a larger review in OpenScPCA, I recommend looking through some of our instructions about responding to review. In addition to offering specific strategies for how to respond to comments (e.g., accepting suggestions or dealing with inline comments), it may also clarify the basis for some of my feedback.

As always, please don't hesitate to ask me any follow-ups, especially since there's a lot here to think about! You can also always DM me on the Cancer Data Science Slack for more direct help too!!

High-level feedback

These two main pieces of high-level feedback are based around the concept of "modularization" I mentioned before.

Re-organize notebook and runner script

Again it's great to see you now have both a template and script to run the template across samples! But, we actually want to rearrange this a bit further so that the only thing the script does is run the template over samples. It should not have any other R code besides what is strictly needed for the for-loop. In addition, the notebook itself should be "runnable" both on its own (e.g., I should be able to click Knit and it should knit) and from the script that calls it to run over samples. Here's what I recommend to achieve this:

  • You will need to add some additional code chunks to the top of the notebook which load libraries, set options, define paths, etc. Some of this code is already in the notebook, but some is in the script and needs to be ported over. I definitely recommend our documentation on organizing notebooks too since I think this might help clarify (but again don't hesitate to ask me to clarify too :) )
    • Note also that after carefully going through your code, I am pretty sure that you only need to load these libraries: Seurat, SingleCellExperiment, tidyverse, msigdbr, and patchwork.
  • You will need to set some default R Markdown param values that the notebook can use if it's run directly with Knit. For example, in the template notebook I previously sent, you can see some default parameter values for sample_id and library_id, as well as certain metadata files which are used. For your notebook, you'll likely just need parameters for sample_id and library_id, and the code can be updated from there to read in the given SCE file. Have a look in this notebook as well for how the parameters are used to actually read in the SCE file in the first place - I think that will be helpful!
  • You will need to pare down the R script to only contain a for-loop, and any variables needed for that for-loop. Anything directly used in the notebook should be defined in the notebook.
    • Another note about this script: I recommend moving it into the scripts directory. Similarly, in your module README.md file, please update the instructions for performing this analysis to include code that states exactly how to run that script from the command line to recreate all notebooks.
  • Related to this, you do not need to be loading the _pathways.csv files (and you can also remove the Pathway_metadata.csv from the repo), since this information should be obtained directly from the msigdbr package as part of the notebook code directly.
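Pulling these points together, the pared-down runner script could be sketched roughly as follows (a sketch only: the file names, directory variables, and parameter names here are assumptions for illustration, not the module's actual code):

```r
# Hypothetical sketch of the pared-down runner script, e.g. scripts/render_01-clustering.R.
# Its only job is to loop over samples and render the template once per sample;
# everything the notebook itself needs lives in the notebook's own setup chunks.
library(rmarkdown)

module_base <- file.path("analyses", "cell-type-wilms-tumor-06")  # assumed module root
metadata <- read.table(
  file.path("data", "current", "SCPCP000006", "single_cell_metadata.tsv"),
  sep = "\t", header = TRUE
)

for (sample_id in metadata$scpca_sample_id) {
  rmarkdown::render(
    input = file.path(module_base, "template", "01-clustering.Rmd"),
    params = list(sample_id = sample_id),  # notebook param with a default in its YAML header
    output_file = paste0("01-clustering_", sample_id, ".html"),
    output_dir = file.path(module_base, "notebook")
  )
}
```

Because the template defines default param values, the same .Rmd also knits standalone with the Knit button, which is the behavior described above.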

Use functions for repeated code

Second, there are a couple places where you have a lot of repeated code within the notebook. I'm looking in particular at four code patterns that I see coming up:

  • In the Malignant markers section, each chunk looking at a gene uses the same code, where the distinguishing aspect is the gene name
    • I'll also note here: For each chunk, you're re-plotting the UMAP colored by cluster, and I don't think that's needed in this section since you already plotted it after performing the clustering.
  • In the Look at specific pathways section, each set of two chunks per pathway uses the same code, where the distinguishing aspect is the pathway name
  • In the Enrichment analysis of marker genes section, each chunk looking at a hallmark gene set uses the same code, where the distinguishing aspect is the hallmark name
  • The code chunks in the singler and cellassign section use the same code, where the distinguishing aspect is which set of annotations are being viewed

To increase code modularity (aka, avoid bugs in the future and make the code shorter + easier to navigate), it would be ideal for these to be functions which can each take an argument for the specific "distinguishing aspect." I'd recommend adding some chunks at the top of the notebook to define four functions, one for each of the cases I listed above. Then, you can use those functions throughout the notebook. In this way, each piece of code will only be written once, so it will be easier to maintain/modify over time!
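To make the pattern concrete, here is a minimal sketch of what one such function might look like for the Malignant markers case (the object and column names are assumptions, since the chunk code itself isn't shown here):

```r
# Hypothetical helper: the gene name is the only distinguishing aspect across
# the "Malignant markers" chunks, so it becomes the function argument.
plot_marker_gene <- function(srat, gene) {
  umap_plot <- Seurat::FeaturePlot(srat, features = gene)
  violin_plot <- Seurat::VlnPlot(srat, features = gene, group.by = "seurat_clusters")
  umap_plot + violin_plot  # plots combined side by side with patchwork
}
```

Each gene chunk then reduces to a single call such as `plot_marker_gene(srat, "PTPRC")`, and any fix to the plotting code only ever has to be made in one place.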

That said, the plots in the Enrichment analysis of marker genes section are tricky to read, and I'm not sure they are the right way to communicate these results. I recommend looking into the enrichplot package and using either a bar plot or dot plot, as described here: https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html. You might then use patchwork to combine one plot per cluster you're comparing. Or, you might also want to use an upset plot via the upsetR package (see same link, section 15.8) to more directly compare clusters.
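As a sketch of what that could look like if the enrichment results come from clusterProfiler (here `enrich_result` is a placeholder name, standing in for whatever enrichResult object the notebook produces):

```r
library(enrichplot)

# `enrich_result` stands in for an enrichResult object, e.g. the output of
# clusterProfiler::enricher() run on one cluster's marker genes.
barplot(enrich_result, showCategory = 10)   # top enriched gene sets as bars
dotplot(enrich_result, showCategory = 10)   # dot size = gene count, color = adjusted p-value
```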

Other comments

  • Generally speaking, we don't want to be printing too much metadata in notebooks unless it is crucial, and I don't think it is here. This means you can actually get rid of all code reading in the single_cell_metadata.tsv file.
  • For each overall section (e.g. the "malignant genes" would be a section, not the header for each gene), it would be really helpful to have a few short explanatory sentences or even bullet points about what the plots are and what kinds of information we can glean from them (e.g., "the umap tells us this thing, and the violin plot tells us this other thing").
  • I don't think you need to be printing the marker gene tables, but if you do, you can feel free to print the literal variable you pull out from msigdbr.
  • At the bottom of the notebook, you have some future "next steps" listed out. It's great to keep these in the back (or front!) of your mind, but the notebook might not be the right place for them. How about adding these to your README.md file in a "next steps" section?
  • It would be good to have a very quick README.md file in the template-notebook directory with a brief sentence saying what this notebook is for - it's for performing and exploring clustering.
  • A general note: when you are writing code-y things (like the name of a function) in markdown, can you pop them in backticks? I've left a suggestion for one spot where this should happen as an example.


Open the render_01-clustering.R file and render it to all samples in the dataset using [Source].

html report will be created from the template notebook 01-clustering.Rmd.
@sjspielman (Member)

As I wrote in my overall comment, can you update this to specifically provide the code to run the notebook from the R script?

analyses/cell-type-wilms-tumor-06/render_01-clustering.R (outdated review thread, resolved)
Comment on lines 12 to 15
params:
sample_name: ~
sample_path: ~
out_rds_path: ~
@sjspielman (Member)

As written in my overall comment, you'll want to use some actual defaults there that will allow the notebook to run.
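For example, the YAML header might set defaults along these lines (the values are purely illustrative; the real defaults should point at files that actually exist in the module's data layout):

```yaml
params:
  sample_name: "SCPCS000001"                                  # placeholder sample ID
  sample_path: "../../data/current/SCPCP000006/SCPCS000001"   # placeholder input path
  out_rds_path: "../results/SCPCS000001_clustered.rds"        # placeholder output path
```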

@maud-p (Contributor Author)

thanks, I didn't know about the params, I'll try to use it starting from now!

@maud-p (Contributor Author)

Thank you!

Comment on lines 45 to 53
```{r paths2}
## to be modified when rendered
filedir <- filelist[i]
sample <- gsub(paste0(data_dir, "/"), "", filedir)

metadata <- read.table(file.path(data_dir, "/single_cell_metadata.tsv"), sep = "\t", header=TRUE)

metadata[metadata$scpca_sample_id == sample,]
```
@sjspielman (Member)

This entire chunk can be deleted, since you will be using a different strategy to read in files, and we don't need this metadata data frame.

@sjspielman (Member)

I want to also note that this comment from @jaclyn-taroni's review in your other PR (#706 (comment)) could very much also apply to this notebook. Analysis parameter choices, which you may decide to vary at some future point, are a good type of variable to make into an R Markdown parameter.

maud-p and others added 6 commits August 9, 2024 (co-authored by Stephanie Spielman)
@maud-p (Contributor Author) commented Aug 9, 2024

Thank you very much @sjspielman for taking the time to give such a great feedback!

I'll take some more time to read all your documentation point by point, but globally, from a first read, it is all clear and makes a lot of sense!

To answer your question about the sensitivity of DElegate to the number of clusters, I think it is sensitive!
Naively, I used it here to get a rapid idea of the quality of the clustering: if it finds specific genes for each cluster, I expect each of them to have at least a different phenotype. If it struggles to find specific genes, we might have over-clustered.
It would also allow a quick check if our "favorite" known marker genes pop up in the top 5, like CD45.

As you wrote, I would not use it for annotation. The two uses of it I can see would be (i) at the very beginning to get an idea of the dataset, and (ii) at the very end of the analysis module to find marker genes of the annotated cell types. Would that make sense?

For planning, I'd like to first fix PR #706 regarding the Azimuth/Seurat reference for label transfer and then focus on this PR.

I'll try to provide you feedback hopefully by the end of next week!

Thank you again!

@sjspielman (Member)

As you wrote, I would not use it for annotation. The two uses of it I can see would be (i) at the very beginning to get an idea of the dataset, and (ii) at the very end of the analysis module to find marker genes of the annotated cell types. Would that make sense?

Yes, this definitely makes sense! I would actually recommend adding a short "preamble" like this to the introduction section of the notebook too - even just having these as bullet points to state that these are the main goals of the notebook would be a great addition to contextualize the full analysis.

For planning, I'd like to first fix PR #706 regarding the Azimuth/Seurat reference for label transfer and then focus on this PR.

Sounds good! Just an FYI - I will be slow to respond next week (8/19-8/23) since the Data Lab will be traveling to teach a workshop.

@maud-p (Contributor Author) commented Aug 14, 2024

Thank you very much again @sjspielman for your detailed review and all comments!

I tried to improve the script and RMarkdown following your advice. Main points:

  • I shortened the R script to the for-loop and moved the rest to the RMarkdown, following the template.
  • I used the params in the YAML section of the RMarkdown.
  • I defined functions for repetitive code.
  • I placed the R script in the /script folder, the RMarkdown template in the template folder, and the notebooks in the notebook folder. I added a short README.md file in the `template` folder.

FYI, a few more changes are planned with PR #706. Not sure what is best to do first, so I just decided to inform you about the few changes I have made to this PR!

Thank you again,

@maud-p requested a review from sjspielman August 14, 2024 13:32
@sjspielman (Member)

FYI, a few more changes are planned with PR #706. Not sure what is best to do first, so I just decided to inform you about the few changes I have made to this PR!

I was just checking in with some of the changes and @jaclyn-taroni's feedback in #706 when you posted this comment, so good timing 😄 ! I'll have a look at the changes in this PR today, but I do have a sense that some of the functions you are going to write as part of #706 will impact this code, too.

I think a first step towards holistically approaching both PRs would be to first "inventory" the code and figure out which functions you are going to write (e.g., function(s) for common Seurat steps). This will help us understand which functions being written in #706 should also be used here. This leads into some logistics we'll want to keep track of to make sure you don't double your work, and help you avoid conflicts in your code! Specifically, if there are functions which are going to be used in both PRs, then you'll want to tackle one PR fully at a time.

Because I anticipate you'll be writing functions as part of #706 that will also be used in this notebook, I'd suggest first working to wrap up #706. Once that is merged into main, we can update this branch (01-clustering-template) to sync changes, allowing us to use code in #706 in this PR directly (see our documentation on syncing - this can be tricky so please let me know how I can help with this more directly when we get there!). In the meantime, I'll leave feedback today on the code changes you've made so far, but I would probably recommend focusing on #706 before coming back to this PR.

@maud-p (Contributor Author) commented Aug 14, 2024

Great, thank you! I agree to first focus on PR #706 and then come back to this one 😄

```r
sample_metadata_file <- file.path(root_dir, "data", "current", "SCPCP000006", "single_cell_metadata.tsv")
metadata <- read.table(sample_metadata_file, sep = "\t", header = TRUE)

for (i in metadata$scpca_sample_id[9:11]) {
```
@sjspielman (Member)

I see you currently have 4 rendered notebooks in your branch, but this only encompasses three samples (indices 9, 10, 11). Was this [9:11] just for testing purposes?

@maud-p (Contributor Author)

Yes, it was for testing! I was running the loop from the first sample and realized I could test on small samples to check that it runs (9 to 11 are small).

This is just to get a first idea.

```{r marker_genes, fig.width=12, fig.height=6, out.width='100%'}
CellType_metadata <- read_csv(file.path(module_base, "marker-sets", "CellType_metadata.csv"))
```
@sjspielman (Member)

We should read this in at the top of the file where you define your path variables like data_dir.

```r
visualize_metadata(srat, meta = "TP53_score1", group.by = "seurat_clusters")
```

"Normal" cells (immune, endothelial) have a slightly higher TP53 score.
@sjspielman (Member)

This sentence seems possibly left over from the original notebook before it was a template, since I'm not sure we will always be able to generally say this for any sample!

Comment on lines +569 to +576
```r
tryCatch(
  expr = {
    Enrichment_plot(category = "C3", signatures = signatures, background = background)
  },
  error = function(e) {
    print("No enrichment")
  }
)
```
@sjspielman (Member)

For this notebook I think the repeated tryCatch blocks are fine, but I just wanted to point out that if you wanted, you could write a "wrapper" function to handle this, and it doesn't have to return anything! For example:

```r
# Wrapper taking the same arguments as the Enrichment_plot function itself
wrap_enrichment_plot <- function(category, signatures, background) {
  tryCatch(
    expr = {
      Enrichment_plot(category = category, signatures = signatures, background = background)
    },
    error = function(e) {
      print("No enrichment")
    }
  )
}
```

maud-p and others added 6 commits August 14, 2024 (co-authored by Stephanie Spielman)
@maud-p mentioned this pull request Aug 29, 2024
@maud-p (Contributor Author) commented Aug 30, 2024

Hi @sjspielman ,
I just wanted to let you know that I will continue working on the cluster exploration now 😄
I'll work on a branch from #737 once it is merged, and open a new PR once I have the script updated! Hopefully early next week 😄

@maud-p (Contributor Author) commented Aug 30, 2024

I think we can close this PR then.

@sjspielman (Member)

Hi @maud-p, if you think a new PR makes more sense given the other module changes you've been making, all good with me! Let's leave this PR open until a new one is filed, just in case.

Either way, I'm not sure if you've had a chance to encounter this yet, but you might find cherry-picking git commits useful as you handle logistics here. Cherry-picking allows you to literally duplicate a commit between branches. For example, if there are commits in this branch you want to include in a different branch to ultimately file a new PR, you can use cherry-picking to copy them over. Here's a tutorial on using it with GitKraken & via command line! https://www.gitkraken.com/learn/git/cherry-pick
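In its simplest command-line form, the workflow looks like this. This is a self-contained toy demo in a throwaway repository; every branch, file, and commit name below is made up for illustration:

```shell
# Toy demo of git cherry-pick in a disposable repository; all names are illustrative.
cd "$(mktemp -d)"
git init -q -b main
git config user.email "you@example.com" && git config user.name "demo"
echo base > base.txt && git add base.txt && git commit -qm "base commit"

git checkout -qb old-branch                  # branch holding the commit we want to copy
echo work > work.txt && git add work.txt && git commit -qm "useful commit"
wanted=$(git rev-parse HEAD)                 # hash of the commit to duplicate

git checkout -q main && git checkout -qb new-branch   # branch that should receive it
git cherry-pick "$wanted"                    # replay just that one commit here
ls work.txt                                  # the cherry-picked file now exists on new-branch
```

The key point is that cherry-picking copies a single commit by hash, so the original branch is left untouched.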

Feel free to let me know (on GitHub or Slack!) if I can help with any git things along the way here :)

@maud-p (Contributor Author) commented Aug 30, 2024

Thank you very much @sjspielman, that sounds very useful and I didn't know about it. Actually, you can consider that I don't know anything about git; I'm learning right now :)

@sjspielman (Member)

Actually, you can consider that I don't know anything about git; I'm learning right now :)

You're doing great!! It's a lot to learn... always a journey!

@maud-p closed this Sep 3, 2024
@maud-p deleted the 01-clustering-template branch September 3, 2024 09:13
@maud-p restored the 01-clustering-template branch September 3, 2024 09:14
@maud-p deleted the 01-clustering-template branch September 3, 2024 09:14
@maud-p mentioned this pull request Sep 4, 2024