-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refine tumor and normal cell annotations across all samples in SCPCP000015 #1027
Refine tumor and normal cell annotations across all samples in SCPCP000015 #1027
Conversation
…rged-annotations-ewings
About to actually look at the notebook itself, but this does seem weird. Follow-up question: Is there a chance this maybe expected to a degree? In mature T cells we expect concentrations of CD3 to be lower in the cytoplasm since it's all getting shuttled to the cell surface, and we are working with single-nuclei data so maybe we'd be biased against some of those? Of course it's the protein, not the transcript, that's getting transported, so this might be a real stretch and something else is going on..... |
I looked at it again after posting this and I do see some expression indicated in the heatmap of CD3E, its just really low. For the dotplot I set a cutoff of the genes being expressed in at least 10% of cells so that's probably why there is no dot for that group of cells. It's only expressed in a low percentage of cells... Those same cells do express PTPRC which should be in all immune cells and do not express MRC1, so they are immune but not macrophages. I think we could dig deeper into the T cells specifically, but this notebook was already getting long, so maybe one more PR after this? |
The Docker failure here is
|
This reverts commit 6697fa9.
Oh definitely separate PR if we want to dig into these more. |
It did, I just didn't run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some initial comments on code and the notebook, but FYI I did not review the whole notebook yet since I got a bit lost navigating all the moving parts and wanted to make sure that I am interpreting the notebook correctly before I keep going. The context I'm missing may not be much though, maybe only 1-2 sentences! Overall though the notebook looks good, and you made a lovely dot plot (review of that code forthcoming)!!
analyses/cell-type-ewings/template_notebooks/utils/plotting-functions.R
Outdated
Show resolved
Hide resolved
umap_df <- umap_df |> | ||
dplyr::left_join(cluster_df, by = join_columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than joining in each if
, you could also build up a list of data frames as you go and do something like this in the end -
purrr::reduce(list_of_dfs, dplyr::left_join, by = join_columns)
very small comment though, not a big deal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh not sure if this code will get made if that list is empty though...
# output file to save final annotations | ||
results_dir <- file.path(module_base, "results", "final-annotations") | ||
fs::dir_create(results_dir) | ||
output_file <- file.path(results_dir, glue::glue("SCPCP000015_celltype-annotations.tsv.gz")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing is being glued here? did you want to glue anything or is this left over from when we were going sample-by-sample?
```{r, fig.height=10} | ||
auc_columns <- colnames(all_info_df)[which(startsWith(colnames(all_info_df), "auc_"))] | ||
cluster_density_plot(all_info_df, auc_columns, "consensus_lumped", "AUC") | ||
``` | ||
|
||
Here we look at the AUC values for each gene set across all `SingleR` cell types. | ||
|
||
```{r, fig.height=10} | ||
cluster_density_plot(all_info_df, auc_columns, "singler_lumped", "AUC") | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are quite hard to read because of size. I'd make the chunks taller for sure, and I'm not sure they need to be quite so wide. There's a chance you can get three on a row, but I'm not sure if you can tease out all patterns that way. Regardless, they definitely need to be taller, especially since they are juxtaposed with a nice big heatmap (which is good imo!).
The following input is used: | ||
|
||
- Annotations obtained by running `SingleR` with tumor cells as a reference output by `aucell-singler-annotation.sh`. | ||
- Consensus cell type annotations output from the `cell-type-consensus` module. | ||
- AUC values as calculated by `AUCell` for a set of Ewing sarcoma specific gene sets in MSigDB output by `run-aucell-ews-signatures.sh`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I missed it somewhere here, but I'd also add to the introduction the specific way in which each of these contributes to the final annotations (and if my understanding here is wrong, that's another good reason to include those details!) -
- SingleR --> tumor cells
- gene sets --> tumor cell states
- Consensus cell types --> normal cells
```{r, fig.height=7, warning=FALSE} | ||
mean_exp_columns <- colnames(all_info_df)[which(endsWith(colnames(all_info_df), "_mean"))] | ||
cluster_density_plot(all_info_df, mean_exp_columns, annotation_column = "consensus_lumped", "mean gene expression") | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same plot size comment - more taller more better
cluster_density_plot(all_info_df, auc_columns, "singler_lumped", "AUC") | ||
``` | ||
|
||
The heatmap below shows the AUC values for all cells and all gene sets and the final refined cell type annotations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by this wording - aren't you doing refining in the bottom section? These are the consensus cell types right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, let's make that a bit clearer throughout - for the two heatmaps with the cell_type
label, can this say which cell type? There's a lot of moving parts so it helps to be reminded that they are consensus!
|
||
### Conclusions based on workflow results | ||
|
||
Looking at these results, it looks like things labeled as "tumor" by `SingleR` and "Unknown" in the consensus cell types have high AUC values for the `EWS-FLI1` upregulated gene sets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm going to pause and return my first review here, since I am a bit lost in the notebook and want to make sure I'm absorbing this all correctly. Specifically, I realize I don't specifically see where the notebook shows the specific relationship between cells that SingleR labeled tumor and EWS-FLI1. Plots looking at expression generally focus on consensus cell types. You do have "tumor" in the final heatmap of the above section, but I'm forgetting how (if at all?) that label related to the SingleR labels themselves.
Punchline: there are many moving parts and a little more context for what comes after the first three UMAPs would be really helpful. In other words, I felt I had a good grasp up until the density plots landed and then I started to get a little unsure that I knew what exactly was what.
@sjspielman thanks for all your comments! I have some ideas to kind of revamp the beginning parts of this notebook to make things more clear. I'll let you know when it's ready for another look! |
@sjspielman let's try this again with hopefully a clearer notebook walking through my thought process for refining the cell types!
I also added CD3G to the list of genes for T cells and we now see expression in mature T cells as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these revisions! The notebook is much easier to follow and walks you through the annotation process really well. For this review, I left various pieces of feedback that mostly are code-oriented; I'll be honest there's a lot of wrangling here and I trust that you have done it right! But before the final.final approval I will a bit more carefully at it.
analyses/cell-type-ewings/exploratory_analysis/08-merged-celltypes.Rmd
Outdated
Show resolved
Hide resolved
The first thing we will do is compare cell type annotations obtained by each method, `SingleR` with tumor cells as a reference and consensus annotations. | ||
|
||
Let's see how similar these annotations are and see if we can create a combined annotation. | ||
Note that `SingleR` will label tumor cells but the consensus cell types only label normal cells. | ||
Consensus cell types are observed when cell types from `SingleR` and `CellAssign` (using only normal tissue references) share a common ancestor. | ||
If no consensus is found, the cells are labeled with "Unknown". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add specific links to where these cell types were annotated for most inquiring minds out there:
- the singler directory in this analysis module
- probably the openscpca-nf module (not the one in this repo) for consensus since that actually generated the cell types
analyses/cell-type-ewings/exploratory_analysis/08-merged-celltypes.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-ewings/exploratory_analysis/08-merged-celltypes.Rmd
Outdated
Show resolved
Hide resolved
- EWS-high proliferative cells show strong expression of proliferation markers. | ||
- Tumor EWS low cells show pretty strong expression of the `EWS-low` markers. | ||
These markers are also present in fibroblasts, but to a lesser degree. | ||
Additionally, `TNC` (which has been published to be important in the metastatic phenotype in Ewing cells) is specific to the low cells and not found in the fibroblasts, which makes me more confident that those are indeed tumor cells. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doi maybe if it's handy?
These markers are also present in fibroblasts, but to a lesser degree. | ||
Additionally, `TNC` (which has been published to be important in the metastatic phenotype in Ewing cells) is specific to the low cells and not found in the fibroblasts, which makes me more confident that those are indeed tumor cells. | ||
- Endothelial cells show strong expression of `PECAM1` and `VWF`, while that is not seen in most of the other cells. | ||
- All immune cell types show `PTPRC` as expected with macrophages also showing `MRC1` and T cells showing expression of `CD3G` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yay gamma subunit!
|
||
|
||
And finally we'll look at our annotations on a UMAP! | ||
Because, why not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
definitely
# create annotation for heatmap | ||
annotation <- ComplexHeatmap::columnAnnotation( | ||
marker_gene_category = validation_markers_df$cell_type, | ||
col = list( | ||
marker_gene_category = category_colors | ||
) | ||
) | ||
|
||
ComplexHeatmap::Heatmap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get the legend ordered in the same way as the annotation bar?
Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
…rged-annotations-ewings
@sjspielman I believe I have addressed all your comments and updated plots/tables accordingly! Here's the updated notebook: 08-merged-celltypes.nb.html.zip I'll address some of your inline comments below, but I also made the following changes:
So I checked and none of them are actually 0 but they are very very low. I also don't think we get very much from these minor cell types, so I decided to only look at cell types in SingleR that have at least 50 cells. I also looked at different cutoffs between 5-50 and you don't really lose anything in the visualization since those other cell numbers are so small, everything is just white. I think with 50 you get the idea that there's agreement, which is what we want to see.
I think you're right so I adjusted the cutoffs. I played around with the numbers until I saw the lines line up as close as possible with the dips in Unknown. Overall numbers are still about the same for context.
Order for the density plots has been updated. Note that memory T cell is no longer in the top cell types with the separation of EWS high/low.
I don't know that at this point we can include a true confidence indicator. I think this is just part of dealing with the biology of Ewing sarcoma. They are going to resemble tumor cells, so we can do our best to separate them based on EWS-FLI1 target expression, but we may miss some. I think if we document how things were classified and the gene sets we used to do that, then that should be sufficient.
I added a table with this and the cell types I show in the plots are found in at least 4 libraries. Perhaps the other cell types are mis-labeled but I think digging into that is outside the scope of this notebook, so I made a note of that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking very good! I left a few more small comments but this is almost certainly the last round!
I added a table with this and the cell types I show in the plots are found in at least 4 libraries. Perhaps the other cell types are mis-labeled but I think digging into that is outside the scope of this notebook, so I made a note of that.
Oh 💯 I agree about scope - this would certainly be separate if pursued at all.
analyses/cell-type-ewings/exploratory_analysis/08-merged-celltypes.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-ewings/exploratory_analysis/08-merged-celltypes.Rmd
Outdated
Show resolved
Hide resolved
|
||
Any cells that are unable to be labeled via consensus cell types are labeled as "Unknown" and I expect these will line up with cells labeled as tumor cells by `SingleR`. | ||
|
||
```{r, fig.height=5} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just add a sentence saying the heatmap shows only >= 50
analyses/cell-type-ewings/exploratory_analysis/08-merged-celltypes.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-ewings/exploratory_analysis/08-merged-celltypes.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-ewings/exploratory_analysis/08-merged-celltypes.Rmd
Outdated
Show resolved
Hide resolved
|
||
# get individual gene counts for all marker genes | ||
gene_cts <- logcounts(sce[genes, ]) |> | ||
as.matrix() |> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need this?
# get total number of cells per final annotation group | ||
total_cells_df <- genes_df |> | ||
dplyr::select(barcodes, final_lumped) |> | ||
unique() |> | ||
dplyr::count(final_lumped, name = "total_cells") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I don't think you need this intermediate to join in but can just tack in an add_counts()
into your piping below where you define gene_summary_df
. Using add_count()
in place of the left_join()
on line 561 for total_cells_df
I think will get you there
patchwork::plot_layout(ncol = 1, heights = c(4, 0.1)) | ||
``` | ||
|
||
A few notes from this plot: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd state that expression in the plot is capped at 2.5 but it might actually be higher
Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
@sjspielman I addressed the remaining small comments, so this should be ready for another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐈 🐈⬛ 🚀
The updated results from these annotations can be found at I also canceled the workflow run since no changes here are in CI. |
Purpose/implementation Section
Please link to the GitHub issue that this pull request addresses.
Closes #1017
What is the goal of this pull request?
The goal of this PR is to compile the cell type annotations obtained from
SingleR
(using theaucell-singler-annotation.sh
workflow), the consensus cell types, and the AUC values from runningAUCell
on a set of Ewing specific gene sets for all samples inSCPCP000015
. As a result of this we are able to assign cell types across all cells and output both a rendered notebook summarizing the cell type annotations for the entire dataset and a TSV file with all assigned cell types.Briefly describe the general approach you took to achieve this goal.
celltype-exploration.Rmd
) as a new exploratory notebook. Much of the set up and first sections that just explore the results are copied from that template notebook. The main difference here is that we are now working with the merged object rather than a single object. I did adjust some of the setup to account for working with the merged object, including:assign-consensus-celltype.sh
from thecell-type-consensus
module) for all samples.SingleR
results for all samplesSingleR
and the consensus cell types. For the AUCell results and custom gene sets I looked at the expression across both cell types. Generally the set of cell types is similar except for the presence of chondrocytes in the SingleR annotations which line up with EWS-FLI1 low cells.combined-validation-markers.tsv
, that has a few genes for each of the expected cell types. The tumor genes are all pulled from previous marker genes we had in our lists. I added MRC1 for macrophages and CD3D and CD3E for T cells. I then looked at the mean expression of these genes across all cells in each cell type in a dot plot and a heatmap. And then finally a UMAP with our "final.final" annotations for all cells!Just a side note that I altered some plot colors while I was here and veered away from our usual viridis coloring scheme so that I could see some of the variation easier. Let me know if you hate it and I can change it back.
If known, do you anticipate filing additional pull requests to complete this analysis module?
Depending on reviews, I am anticipating this to be the last analysis PR for this module for now.
My hope is that this helps "finalize" our annotations across all samples and gets this module to a stopping point until we want to any further cell state classification.
I'm envisioning the next PRs to be clean-up PRs to make sure all the code needed to actually produce these annotations is well-organized and all documentation is updated.
Results
What is the name of your results bucket on S3?
s3://researcher-211125375652-us-east-2/cell-type-ewings/results/final-annotations/SCPCP000015_celltype-annotations.tsv.gz
What types of results does your code produce (e.g., table, figure)?
TSV file with all annotations and a rendered notebook summarizing annotations:
08-merged-celltypes.nb.html.zip
What is your summary of the results?
I included a lot of commentary in the notebook about the results, but generally I think we are able to pull out tumor cells from normal cells. There is some detection of tumor cell markers across all cell types, but it is much lower in the normal cells. We also see pretty much no expression of the normal cell markers in the tumor cells.
Within tumor cells, we're able to pull out EWS-low, EWS-high, and EWS-proliferative. The EWS-low show pretty strong expression of the expected marker genes and the proliferative genes are exclusively expressed in the proliferative subset.
The only questionable finding is that the "mature T cells" show very low expression of CD3... But expression of that gene is generally pretty low, so not totally sure what to think about that. The mature T cell comes from the consensus cell type, so for now I'm good to leave it 🤷♀️
There are a handful of cell types that have 1-8 cells and I did not dive deeper into those to validate them. These are cell types assigned by the consensus labels, so I feel comfortable leaving them alone.
Provide directions for reviewers
What are the software and computational requirements needed to be able to run the code in this PR?
Originally I thought working with the merged object would be an issue, but I was able to run this notebook in like a minute on my laptop without any issues.
Is there anything that you want to discuss further?
Apologies for the larger PR here, but I felt it was more important to get the whole picture together rather than break it up for this particular scenario. I'm sure there are plots we could cut out, but I wanted to include them all at first just so you could see all the data as a reviewer.
Author checklists
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.