Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Included initial exploratory analysis notebook #794

Merged
merged 11 commits into from
Oct 24, 2024

Conversation

danhtruong
Copy link
Contributor

Hi there, thanks for your contribution!
Please fill out this template to help us review your code.

Remember, the more context you provide, the faster the review will go!
For more information about filling out this template, please see the OpenScPCA documentation on filing pull requests:
https://openscpca.readthedocs.io/en/latest/contributing-to-analyses/creating-pull-requests/

Before filing the pull request, you can also feel free to delete the italicized instruction lines in this template.

Purpose/implementation Section

In this section, tell reviewers about the scope and purpose of the code in the pull request.

Please link to the GitHub issue that this pull request addresses.

#701
If applicable, you can also link to the associated GitHub Discussion.

What is the goal of this pull request?

Assess the quality of the data set
Curate a list of marker genes associated with DSRCT cells
Detect the expression levels of DSRCT marker genes in the data set
Identify clusters in the DSRCT samples

Briefly describe the general approach you took to achieve this goal.

I analyzed the DSRCT marker expression in the dataset. I attempted to define a delineation between tumor and normal cells within the data.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes. The initial method has some flaws. Some of the data only contains DSRCT cells so there are no normal cells to compare to. The data quality is not as high as other data sets that I have analyzed. We will need to re-think how to use this dataset but I suspect we have enough DSRCT markers to identify dsrct cells. Unfortunately, the approach may be dataset dependent due to the lower quality of this dataset.

Results

Delete this section if no results are associated with your pull request.

In this section, tell reviewers about what kinds of results (if any) your code produces.

The results include plots for analyzing the data set as well as a processed RDS file containing a list of the SingleCellExperiment objects.

What is the name of your results bucket on S3?

researcher-905418339657-us-east-2

What types of results does your code produce (e.g., table, figure)?

RDS of processed data

What is your summary of the results?

Provide directions for reviewers

In this section, tell reviewers what kind of feedback you are looking for.
This information will help guide their review.

What are the software and computational requirements needed to be able to run the code in this PR?

R
ggplot2
SingleCellExperiment

This information will help reviewers run the code during review, if applicable.
For software, how should reviewers set up their environment (e.g., renv or conda) to run this code?
For compute, can reviewers run this code on their laptop, or do they need additional computational resources such as RAM or storage?
Please make sure this information, if applicable, is documented in the README.md file.

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

  • [x ] This analysis module uses the analysis template and has the expected directory structure.
  • The analysis module README.md has been updated to reflect code changes in this pull request.
  • [ x] The analytical code is documented and contains comments.
  • Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@jaclyn-taroni jaclyn-taroni requested review from jashapiro and removed request for jaclyn-taroni October 7, 2024 18:59
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @danhtruong, and thank you for your submission!

I am going to split my comments into two parts: first there are a few technical issues that I want to address, and then I will talk a bit more about some of the more scientific questions.

Technical points

When you set up this module, it was quite minimal which was fine, but now that there is some code in here, we want to start adding a bit more of the reproducibility structure that we will need.

  • First, we will want to include the necessary renv setup. If you have not already done this, you can run renv::init() in the module directory and it should discover all the necessary packages, add an .Rprofile file, and create the local library. You will then need to add .Rprofile, renv.lock, and the renv directory to the branch with git add (most of the renv directory will be ignored).

  • The .Rprofile file that renv creates is not exactly what we will want to have in the repo when we create a docker image, so after running renv::init() you will want to manually update that file just a bit to have the following content:

# Don't activate renv in an OpenScPCA docker image
if(Sys.getenv('OPENSCPCA_DOCKER') != 'TRUE'){
  source('renv/activate.R')
}

  • It is also usually helpful to include a .Rproj file in the module directory which will make it easier to automatically use the renv library in subdirectories (if you use RStudio). In this case the file would be cell-type-dsrct.Rproj and should have the following content (other options may be fine too):
Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX

If you have any questions about these steps, don't hesitate to ask. I can also make these changes and push them to the branch if that would be easier for you.

  • Please include the output of rendering your notebook: that can make review much smoother, as it may save reviewers from having to rerun the analysis. In this case, the html file may be quite large due to the number of plots. Were you maybe running into problems with pre-commit not allowing the files to be included? If so, we might want to save out the plots separately.

Scientific content

  • Can you provide more detail about the markers that you are using? You have listed the source for many, and the ones you find the most reliable, as "internal", but it would be very helpful if there were a somewhat more traceable source there. I recognize that this may be unpublished results, so we may have to discuss what the best way to describe that is, but we would at least want to include a name/lab so someone coming in would know where to go for more information.

  • Please use the library ids as the primary identifier as you work through the project. While in this case the sample and library have a 1:1 correspondence, that is not always the case, so we are trying to be consistent with the primary id.

  • Removing the library with low counts seems just fine to me for the final analysis, but I might have liked to see the distributions for that library as well in this exploratory notebook, just for reference

  • For your plots: I wonder if it might be worth having all libraries in a single plot rather than looping through and plotting them separately? You could combine all the data frames with something like the following:

combined_umap_df <- a <- umap_df_list |> dplyr::bind_rows(.id = "library_id")

then plot with

ggplot(combined_umap_df, aes(x = UMAP1, y = UMAP2, color = gene_expression)) +
  facet_grid(rows = vars(library_id), cols = vars(gene_symbol)) +
  ...

It may be that this ends up looking bad, but worth a try? I think it may actually be most useful for the expression distribution plots.

It also can simplify a lot of your lapply expressions later into single operations, with an extra group_by term for the library, etc.

  • Speaking of the expression distributions: It looks to me like you do have pretty good and comparable distributions without using z-scaling. In almost all cases you have a large number of zero or near-zero values (a histogram might make this more clear than a density plot) with a separate group that has positive values. For now, you could find the minimum of the distribution between the peaks and use that, and we could probably find something more sophisticated later.

  • The analysis of clusters does look potentially promising, but I think you will definitely want to explore the effects of clustering on that analysis. We are in the process of making some functions to make that easier, but you might start just by trying some different clustering algorithms/parameters: while we have a default clustering that we provided, we do not think these are the best clusters possible; just a starting point. I might put this part of the exploration into a second notebook, as I expect it to take a bit more space.

Other notes

I also made a number of line-level comments throughout. Some of these may be redundant with the above, but most are more fine-grained code suggestions.

Thanks again for the contribution! If you have any questions, please don't hesitate to reply to my comments here, or send a note in Slack. I look forward to seeing the next steps in your analysis!

repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "2024-07-08")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to use the current directory where possible (a link to the latest data), so that the most current data is always used

Suggested change
data_dir <- file.path(repository_base, "data", "2024-07-08")
data_dir <- file.path(repository_base, "data", "current")

Comment on lines 50 to 55

```{r}
# source in helper functions: make_jaccard_matrix() and jaccard()
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")
source(jaccard_functions)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you are not using these at the moment, and the file is not present in this PR, so I am removing this for the moment.

Suggested change
```{r}
# source in helper functions: make_jaccard_matrix() and jaccard()
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")
source(jaccard_functions)
```
```{r}
# source in helper functions: make_jaccard_matrix() and jaccard()
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")
source(jaccard_functions)
```

Comment on lines 2 to 5
tumor ST6GALNAC5 ENSG00000117069 internal
tumor CACNA2D2 ENSG00000007402 internal
tumor IQCJ-SCHIP1 ENSG00000283154 internal
tumor PTPRQ ENSG00000139304 Internal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reference that you can point to that describes why these genes were chosen? If not, we would probably want to have something referring to the lab/PI so this can be traced past this repository.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two articles describe them but its buried in the supplementary. Other than that, we are seeing it clearly in our datasets. We have one publicly available DSRCT PDX single-cell data set that I can compare against control samples where we'd see these same genes too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably worth including the references even if they are buried in the supplement. You can also include your lab as an unreferenced source, if you think that would be more useful... I think we just want to avoid "internal" as a reference as that makes it harder to trace.

Comment on lines +346 to +347
# get an RDS of the processed data
saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to just be the file list? Did you want to also save the extracted data tables?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we save the extracted data not jus the file names?

Suggested change
# get an RDS of the processed data
saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds')
# save an RDS of the processed data
saveRDS(umap_df_list, file.path(module_base, 'results/SCPCP000013_umap_df_list.rds')

danhtruong and others added 3 commits October 15, 2024 10:55
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
updated the notebook to include plot outputs
updated the reference to include papers and internal reference
@danhtruong danhtruong requested a review from jashapiro October 22, 2024 15:59
Comment on lines 160 to 162
tiff(filename = paste0('../plots/marker_expression_', names(umap_df_list)[i], '.tiff'), width = 6, height = 6, units = 'in', res = 150)
print(p)
dev.off()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest using ggsave to allow one-command saving, and png should be smaller than tiff (though this may be something to test).

Suggested change
tiff(filename = paste0('../plots/marker_expression_', names(umap_df_list)[i], '.tiff'), width = 6, height = 6, units = 'in', res = 150)
print(p)
dev.off()
ggsave(
filename = paste0('../plots/marker_expression_', names(umap_df_list)[i], '.png'),
plot = p,
width = 6,
height = 6,
units = 'in',
dpi = 150
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also include the plot files (and the rendered html if possible) in the submission.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Will add.

updated the notebook to include plot outputs
updated the reference to include papers and internal reference
added plot outputs
updated the notebook to include plot outputs
updated the reference to include papers and internal reference
added plot outputs
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Danh,

Thank you for the updates; having the plots included is quite helpful.

As this is an exploratory notebook, I don't want to dwell on most of my suggestions below; they are designed mostly for future work, and I would not expect the ones about scaling to be incorporated.

The small changes of paths etc. should be easy to include though, and if there are any additional comments you would like to include, that would be good to do now as well. Then I would appreciate if you could run the notebook one more time to be sure that the html file is also up to date, and then we can get this PR merged.

Comment on lines +218 to +219
Now we will transform each of the gene expression vectors by generating z-scores.
Then we might be able to find a cut off we can use across samples for if marker genes are present in a cell or not.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine for a rough look, but it is quite brittle. As you note later, some samples may contain only tumor cells. In that case, using the zscore may not be a good measure; what we really want is to find the division between the bimodal peaks. This is beyond the scope of this PR, but for your final classifications I would not expect you to want to use z-scoring.


We could either count any cell that expresses at least one marker gene > 0 as a tumor cell, or look at the combined expression.
Let's start with classifying tumor cells as tumor if any marker gene is present (z-score > 0).
This also means the sum of all z-scores > 0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused by this statement. Why would having one marker gene > 0 mean the sum of all is > 0?

Again, I'd probably not want to use the z scores for this, but maybe instead scaling just by range (maybe the 90% quartile or something like that?), but leaving zero as the lower bound? The cutoff may still be hard to draw, but z scoring often breaks when there are not two relatively equal classes. (I should say, one of my current projects is to find a robust way to automatically segment these kinds of distributions.)

Comment on lines +346 to +347
# get an RDS of the processed data
saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we save the extracted data not jus the file names?

Suggested change
# get an RDS of the processed data
saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds')
# save an RDS of the processed data
saveRDS(umap_df_list, file.path(module_base, 'results/SCPCP000013_umap_df_list.rds')

danhtruong and others added 3 commits October 23, 2024 11:28
modified the saves

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
modified the saves

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
modified the saves

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@danhtruong danhtruong requested a review from jashapiro October 23, 2024 18:59
Comment on lines 400 to 404
<<<<<<< HEAD
'/plots/UMAP_classification_cluster_',
=======
'../plots/UMAP_classification_cluster_',
>>>>>>> danhtruong/danhtruong/cell-type-dsrct
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Danh: there are a number of places where it seems like your latest changes have introduced merge artifacts like this one. Can you please fix these and rerender the notebook? I would expect to see a newer version of the html that matches the latest updates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies... on closer inspection it looks like you did rerun the notebook, but then a later merge brought in the above error (which should have been caught by our pre-commit tests, but wasn't for some reason!). I have reverted the offending commit and merged in the changes from the main repository, so this should be ready to go, but you will need to pull down the changes from this branch (or start a new branch from main after this is merged in) before proceeding with your next steps.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am approving this PR to get this exploratory notebook in. I do think there may be more work to do about finding the cutoff points for marker gene expression, but it is useful to have these explorations recorded in the repository in their current state.

@jashapiro jashapiro merged commit bb58c2d into AlexsLemonade:main Oct 24, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants