-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Included initial exploratory analysis notebook #794
Included initial exploratory analysis notebook #794
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @danhtruong, and thank you for your submission!
I am going to split my comments into two parts: first there are a few technical issues that I want to address, and then I will talk a bit more about some of the more scientific questions.
Technical points
When you set up this module, it was quite minimal which was fine, but now that there is some code in here, we want to start adding a bit more of the reproducibility structure that we will need.
-
First, we will want to include the necessary
renv
setup. If you have not already done this, you can runrenv::init()
in the module directory and it should discover all the necessary packages, add an.Rprofile
file, and create the local library. You will then need to add.Rprofile
,renv.lock
, and therenv
directory to the branch withgit add
(most of therenv
directory will be ignored). -
The
.Rprofile
file thatrenv
creates is not exactly what we will want to have in the repo when we create a docker image, so after runningrenv::init()
you will want to manually update that file just a bit to have the following content:
# Don't activate renv in an OpenScPCA docker image
if(Sys.getenv('OPENSCPCA_DOCKER') != 'TRUE'){
source('renv/activate.R')
}
- It is also usually helpful to include a
.Rproj
file in the module directory which will make it easier to automatically use therenv
library in subdirectories (if you use RStudio). In this case the file would becell-type-dsrct.Rproj
and should have the following content (other options may be fine too):
Version: 1.0
RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
If you have any questions about these steps, don't hesitate to ask. I can also make these changes and push them to the branch if that would be easier for you.
- Please include the output of rendering your notebook: that can make review much smoother, as it may save reviewers from having to rerun the analysis. In this case, the html file may be quite large due to the number of plots. Were you maybe running into problems with
pre-commit
not allowing the files to be included? If so, we might want to save out the plots separately.
Scientific content
-
Can you provide more detail about the markers that you are using? You have listed the source for many, and the ones you find the most reliable, as "internal", but it would be very helpful if there were a somewhat more traceable source there. I recognize that this may be unpublished results, so we may have to discuss what the best way to describe that is, but we would at least want to include a name/lab so someone coming in would know where to go for more information.
-
Please use the library ids as the primary identifier as you work through the project. While in this case the sample and library have a 1:1 correspondence, that is not always the case, so we are trying to be consistent with the primary id.
-
Removing the library with low counts seems just fine to me for the final analysis, but I might have liked to see the distributions for that library as well in this exploratory notebook, just for reference
-
For your plots: I wonder if it might be worth having all libraries in a single plot rather than looping through and plotting them separately? You could combine all the data frames with something like the following:
combined_umap_df <- a <- umap_df_list |> dplyr::bind_rows(.id = "library_id")
then plot with
ggplot(combined_umap_df, aes(x = UMAP1, y = UMAP2, color = gene_expression)) +
facet_grid(rows = vars(library_id), cols = vars(gene_symbol)) +
...
It may be that this ends up looking bad, but worth a try? I think it may actually be most useful for the expression distribution plots.
It also can simplify a lot of your lapply
expressions later into single operations, with an extra group_by
term for the library, etc.
-
Speaking of the expression distributions: It looks to me like you do have pretty good and comparable distributions without using z-scaling. In almost all cases you have a large number of zero or near-zero values (a histogram might make this more clear than a density plot) with a separate group that has positive values. For now, you could find the minimum of the distribution between the peaks and use that, and we could probably find something more sophisticated later.
-
The analysis of clusters does look potentially promising, but I think you will definitely want to explore the effects of clustering on that analysis. We are in the process of making some functions to make that easier, but you might start just by trying some different clustering algorithms/parameters: while we have a default clustering that we provided, we do not think these are the best clusters possible; just a starting point. I might put this part of the exploration into a second notebook, as I expect it to take a bit more space.
Other notes
I also made a number of line-level comments throughout. Some of these may be redundant with the above, but most are more fine-grained code suggestions.
Thanks again for the contribution! If you have any questions, please don't hesitate to reply to my comments here, or send a note in Slack. I look forward to seeing the next steps in your analysis!
repository_base <- rprojroot::find_root(rprojroot::is_git_root) | ||
|
||
# The current data directory, found within the repository base directory | ||
data_dir <- file.path(repository_base, "data", "2024-07-08") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to use the current
directory where possible (a link to the latest data), so that the most current data is always used
data_dir <- file.path(repository_base, "data", "2024-07-08") | |
data_dir <- file.path(repository_base, "data", "current") |
analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd
Outdated
Show resolved
Hide resolved
|
||
```{r} | ||
# source in helper functions: make_jaccard_matrix() and jaccard() | ||
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R") | ||
source(jaccard_functions) | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you are not using these at the moment, and the file is not present in this PR, so I am removing this for the moment.
```{r} | |
# source in helper functions: make_jaccard_matrix() and jaccard() | |
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R") | |
source(jaccard_functions) | |
``` | |
```{r} | |
# source in helper functions: make_jaccard_matrix() and jaccard() | |
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R") | |
source(jaccard_functions) | |
``` |
tumor ST6GALNAC5 ENSG00000117069 internal | ||
tumor CACNA2D2 ENSG00000007402 internal | ||
tumor IQCJ-SCHIP1 ENSG00000283154 internal | ||
tumor PTPRQ ENSG00000139304 Internal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reference that you can point to that describes why these genes were chosen? If not, we would probably want to have something referring to the lab/PI so this can be traced past this repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two articles describe them but its buried in the supplementary. Other than that, we are seeing it clearly in our datasets. We have one publicly available DSRCT PDX single-cell data set that I can compare against control samples where we'd see these same genes too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is probably worth including the references even if they are buried in the supplement. You can also include your lab as an unreferenced source, if you think that would be more useful... I think we just want to avoid "internal" as a reference as that makes it harder to trace.
# get an RDS of the processed data | ||
saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to just be the file list? Did you want to also save the extracted data tables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we save the extracted data not jus the file names?
# get an RDS of the processed data | |
saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds') | |
# save an RDS of the processed data | |
saveRDS(umap_df_list, file.path(module_base, 'results/SCPCP000013_umap_df_list.rds') |
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
updated the notebook to include plot outputs updated the reference to include papers and internal reference
tiff(filename = paste0('../plots/marker_expression_', names(umap_df_list)[i], '.tiff'), width = 6, height = 6, units = 'in', res = 150) | ||
print(p) | ||
dev.off() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest using ggsave
to allow one-command saving, and png
should be smaller than tiff (though this may be something to test).
tiff(filename = paste0('../plots/marker_expression_', names(umap_df_list)[i], '.tiff'), width = 6, height = 6, units = 'in', res = 150) | |
print(p) | |
dev.off() | |
ggsave( | |
filename = paste0('../plots/marker_expression_', names(umap_df_list)[i], '.png'), | |
plot = p, | |
width = 6, | |
height = 6, | |
units = 'in', | |
dpi = 150 | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also include the plot files (and the rendered html if possible) in the submission.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Will add.
updated the notebook to include plot outputs updated the reference to include papers and internal reference added plot outputs
updated the notebook to include plot outputs updated the reference to include papers and internal reference added plot outputs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Danh,
Thank you for the updates; having the plots included is quite helpful.
As this is an exploratory notebook, I don't want to dwell on most of my suggestions below; they are designed mostly for future work, and I would not expect the ones about scaling to be incorporated.
The small changes of paths etc. should be easy to include though, and if there are any additional comments you would like to include, that would be good to do now as well. Then I would appreciate if you could run the notebook one more time to be sure that the html file is also up to date, and then we can get this PR merged.
Now we will transform each of the gene expression vectors by generating z-scores. | ||
Then we might be able to find a cut off we can use across samples for if marker genes are present in a cell or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine for a rough look, but it is quite brittle. As you note later, some samples may contain only tumor cells. In that case, using the zscore may not be a good measure; what we really want is to find the division between the bimodal peaks. This is beyond the scope of this PR, but for your final classifications I would not expect you to want to use z-scoring.
|
||
We could either count any cell that expresses at least one marker gene > 0 as a tumor cell, or look at the combined expression. | ||
Let's start with classifying tumor cells as tumor if any marker gene is present (z-score > 0). | ||
This also means the sum of all z-scores > 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused by this statement. Why would having one marker gene > 0 mean the sum of all is > 0?
Again, I'd probably not want to use the z scores for this, but maybe instead scaling just by range (maybe the 90% quartile or something like that?), but leaving zero as the lower bound? The cutoff may still be hard to draw, but z scoring often breaks when there are not two relatively equal classes. (I should say, one of my current projects is to find a robust way to automatically segment these kinds of distributions.)
analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd
Outdated
Show resolved
Hide resolved
# get an RDS of the processed data | ||
saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we save the extracted data not jus the file names?
# get an RDS of the processed data | |
saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds') | |
# save an RDS of the processed data | |
saveRDS(umap_df_list, file.path(module_base, 'results/SCPCP000013_umap_df_list.rds') |
modified the saves Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
modified the saves Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
modified the saves Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
<<<<<<< HEAD | ||
'/plots/UMAP_classification_cluster_', | ||
======= | ||
'../plots/UMAP_classification_cluster_', | ||
>>>>>>> danhtruong/danhtruong/cell-type-dsrct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Danh: there are a number of places where it seems like your latest changes have introduced merge artifacts like this one. Can you please fix these and rerender the notebook? I would expect to see a newer version of the html that matches the latest updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies... on closer inspection it looks like you did rerun the notebook, but then a later merge brought in the above error (which should have been caught by our pre-commit tests, but wasn't for some reason!). I have reverted the offending commit and merged in the changes from the main
repository, so this should be ready to go, but you will need to pull down the changes from this branch (or start a new branch from main
after this is merged in) before proceeding with your next steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am approving this PR to get this exploratory notebook in. I do think there may be more work to do about finding the cutoff points for marker gene expression, but it is useful to have these explorations recorded in the repository in their current state.
Hi there, thanks for your contribution!
Please fill out this template to help us review your code.
Remember, the more context you provide, the faster the review will go!
For more information about filling out this template, please see the OpenScPCA documentation on filing pull requests:
https://openscpca.readthedocs.io/en/latest/contributing-to-analyses/creating-pull-requests/
Before filing the pull request, you can also feel free to delete the italicized instruction lines in this template.
Purpose/implementation Section
In this section, tell reviewers about the scope and purpose of the code in the pull request.
Please link to the GitHub issue that this pull request addresses.
#701
If applicable, you can also link to the associated GitHub Discussion.
What is the goal of this pull request?
Assess the quality of the data set
Curate a list of marker genes associated with DSRCT cells
Detect the expression levels of DSRCT marker genes in the data set
Identify clusters in the DSRCT samples
Briefly describe the general approach you took to achieve this goal.
I analyzed the DSRCT marker expression in the dataset. I attempted to define a delineation between tumor and normal cells within the data.
If known, do you anticipate filing additional pull requests to complete this analysis module?
Yes. The initial method has some flaws. Some of the data only contains DSRCT cells so there are no normal cells to compare to. The data quality is not as high as other data sets that I have analyzed. We will need to re-think how to use this dataset but I suspect we have enough DSRCT markers to identify dsrct cells. Unfortunately, the approach may be dataset dependent due to the lower quality of this dataset.
Results
Delete this section if no results are associated with your pull request.
In this section, tell reviewers about what kinds of results (if any) your code produces.
The results include plots for analyzing the data set as well as a processed RDS file containing a list of the
SingleCellExperiment
objects.What is the name of your results bucket on S3?
researcher-905418339657-us-east-2
What types of results does your code produce (e.g., table, figure)?
RDS of processed data
What is your summary of the results?
Provide directions for reviewers
In this section, tell reviewers what kind of feedback you are looking for.
This information will help guide their review.
What are the software and computational requirements needed to be able to run the code in this PR?
R
ggplot2
SingleCellExperiment
This information will help reviewers run the code during review, if applicable.
For software, how should reviewers set up their environment (e.g.,
renv
orconda
) to run this code?For compute, can reviewers run this code on their laptop, or do they need additional computational resources such as RAM or storage?
Please make sure this information, if applicable, is documented in the README.md file.
Are there particularly areas you'd like reviewers to have a close look at?
Is there anything that you want to discuss further?
Author checklists
Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.