celltype/tumor annotation for non-ETP T-ALL (SCPCP000003) #789

UTSouthwesternDSSR · 2024-10-03T00:17:29Z

Purpose/implementation Section

The goal of this PR is to annotate cell types with ScType and identify tumor cells with CopyKat using the annotated B cells in the same sample (if there is any).

Please link to the GitHub issue that this pull request addresses.

Issue: celltype/tumor annotation for non-ETP T-ALL (SCPCP000003) #787
Discussion: Non-early T-cell precursor T-cell acute lymphoblastic leukemia (non-ETP ALL) annotation (SCPCP000003) #630

What is the goal of this pull request?

Annotate cell types with ScType and identify tumor cells with CopyKat using the annotated B cells in the same sample (if there is any).

Briefly describe the general approach you took to achieve this goal.

Create marker genes list by obtaining most cell types from Azimuth human reference - bone marrow(level 1: B, CD4 T, CD8 T, DC, HSPC, Mono, NK, Other T; level 2: Macrophage, Early Eryth, Late Eryth, Plasma, Platelet, Stromal), blast cell from Bhasin et al., and erythroid precursor and cancer cell from ScType database.
Annotate cell type by running ScType with above marker gene list.
Identify tumor cells by running CopyKat, using B cells from the same sample as normal cells.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Out of 11 samples, 7 samples have B cells annotated (except SCPCS000091, SCPCS000092, SCPCS000098, and SCPCS000100). Although SCPCS000099 contains B cells, these cells are not useful for the identification of tumor cells, as shown by the extremely large number of not.defined cells with default parameters.

I may try to merge the samples, and use the annotated B cells as the normal cell to run CopyKat, hoping that we could identify tumor cells from those samples without B cells. This could be another pull request, if it works.

Results

What is the name of your results bucket on S3?

rds objects are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/rds
metadata files and sctype results are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/
umap and dot plots are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots

What types of results does your code produce (e.g., table, figure)?

rds objects
two text files for each sample: _metadata.txt (cell ID, leiden clusters, cell type annotation, low confidence cell type annotation, CopyKat prediction [for the 7 samples]) and _sctype_top10_celltypes_perCluster.txt (top 10 possible cell types with their respective sctype score in each cluster)
umap plots showing leiden clustering, cell type, and CopyKat prediction respectively
dot plots showing the average expression of group of markers for each cell type using AddModuleScore()

What is your summary of the results?

With the default threshold of having sctype score > 25% of ncells in a cluster (sctype_classification), there are a large number of cells being annotated as "Unknown" in each sample, reaching to as high as ~45% (disregarding SCPCS000099: 48% and SCPCS000100: 52%, which have much lower number of features detected).

Therefore, I relaxed the threshold from 25% to 10% (lowConfidence_annot), and the percentage of "Unknown" is now dropped to 30% (disregarding SCPCS000099: 33% and SCPCS000100: 48%), with 5 samples having no "Unknown".

I only ran CopyKat for 7 samples, excluding SCPCS000091, SCPCS000092, SCPCS000098, and SCPCS000100. However,
tumor prediction does not work for SCPCS000099, given that it has 36% of "not.defined" cells, while the other samples have less than 1%. Even if I relaxed the cutoff of ngene.chr from 5 to 1, the number of "not.defined" cells decreases, but they all get annotated as "diploid".

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

The packages are installed and updated in renv.lock and conda.lock.
Analysis could be executed on a Standard-4XL virtual machine via AWS Lightsail for Research, but CopyKat runs pretty slow on this machine with one core. Therefore, I ran the CopyKat on our lab server with 50 cores (The longest time needed for a sample is ~1.5 hr).

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

I am wondering will you take a look at the results part during PR, or it will be reviewed after the deadline?
For those cells that are labeled as "Unknown", do we have to dig it more (as in a manual way) trying to identify their cell types?

Thank you so much for any suggestions/feedback!

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

jaclyn-taroni

Hi @UTSouthwesternDSSR — thank you for this contribution! I am still reviewing the results but wanted to provide some initial feedback on the code.

To summarize my comments at a high level:

Your module assumes a 1:1 relationship between sample and library identifiers, but that isn’t always true in ScPCA data. From what I saw of the ETP-ALL module, I believe you’d like to use this code for other sets of samples, so making this change for safety may make things easier later. I’ve made suggestions about using the library identifier as the primary ID, but I may not have caught every reference to a sample identifier.
analyses/cell-type-nonETP-ALL-03/Azimuth_BM_level1.csv could have more descriptive column names that make it easier for someone looking at the code for the first time to understand what it contains.
I want to discuss whether or not using an Excel file is necessary and moving the gene set list generation outside of individual functions so you can run it once instead of multiple times every time you annotate a sample.
We want to tie the scType code to a particular permalink of that repository’s code.
It would be best to create the results/rds directory programmatically instead of assuming it already exists, which causes failures on other machines.
There is at least one package missing from the renv.lock file, which I think could be captured by changing the library() calls in the annotation script to no longer use lapply().

And then for science and display questions, as I’m still working through that part of the review:

Is the same cell type label being assigned to all cells in a Leiden cluster?
Would it be possible to combine the UMAP plots into one multi-panel plot with something like patchwork? If so, I think folks would have an easier time with the review and interpretation of your results. If this change will take a lot of time, we can skip it.

Thank you again for your contribution. Please let me know if you have any questions or if there’s any way we can help!

jaclyn-taroni · 2024-10-03T13:22:53Z

analyses/cell-type-nonETP-ALL-03/scripts/00-01_processing_rds.R

@@ -43,7 +43,7 @@ run_sam <- function(ind.sample, ind.library){
  final.obj[["ADT"]]@data <- logcounts(altExp(sce, "adt"))
  final.obj[["ADT"]] <- AddMetaData(final.obj[["ADT"]], as.data.frame(rowData(altExp(sce, "adt"))))

-  saveRDS(final.obj, file.path(out_loc,"scratch",paste0(ind.sample,".rds")))
+  saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.sample,".rds")))


I think you need a directory creation step somewhere in this script now that you've changed the output location based on my local testing and potentially the CI error. That can be accomplished with:

dir.create(file.path(out_loc, "results/rds", recursive = TRUE, showWarnings = FALSE))

And will re-run without error or warning if it already exists

In keeping with library ID as the primary identifier:

Suggested change

saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.sample,".rds")))

saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.library,".rds")))

Yes, for the ETP-ALL samples, Sample SCPCS000066 has two libraries, and I actually merge the two rds file without doing any batch effect correction. I write another function of run_multisam for those with multiple libraries. It requires a little bit of manual code to provide input to that function, but it works perfectly fine for ETP-ALL samples, since there is only one case.

That's the reason why I thought saving as sampleID making more sense, because technically we should merge multiple libraries together if they belong to the same sample? Isn't this will make the analysis more powerful, since there are more cells added?

While it seems like the batch effect with that case is not bad, I would still lean toward leaving the two libraries separate. Having more cells may not always result in better results (especially if there were a batch effect), but that is largely an empirical question; I think we would want to evaluate what the tradeoffs are for different parts of the analysis.

To me, keeping the libraries separate seems like the more conservative (and simpler) approach, so it is the one I would start with.

Sure, I will remove the run_multisam function from the ETP-ALL module script.

jaclyn-taroni · 2024-10-03T13:54:27Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+# load gene set preparation function
+#source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
+# load cell type annotation function
+source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")


We probably want to tie this to a specific commit in case any update that's not backward compatible with this code is pushed to master. The following worked for me locally:

Suggested change

source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")

source("https://github.com/IanevskiAleksandr/sc-type/raw/6db9eef49f185cf4d79bfec92a20fcf1edcccafb/R/sctype_score_.R")

jaclyn-taroni · 2024-10-03T14:03:32Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+#Based on the annotated B cells, we provide them as the normal cells for CopyKat to identify tumor cells.
+#The outputs are final rds file and two umap plots showing cell type and tumor/normal cells status.
+
+lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T)


I don't think you'll need openxlsx anymore, so I've removed it in the suggestion. HGNChelper is not in the renv.lock file and doesn't get captured when I tried renv::snapshot() locally, so I think we need to remove the lapply() here for that to work as expected:

Suggested change

lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T)

library(dplyr)

library(Seurat)

library(HGNChelper)

jaclyn-taroni · 2024-10-03T14:05:48Z

analyses/cell-type-nonETP-ALL-03/Azimuth_BM_level1.csv

Based on my cursory reading of the annotation script, I understand that geneSymbolmore1 is a gene set up in a cell type, and geneSymbolmore2 is a gene set down in a cell type. Can we rename these columns to be more descriptive about their contents? For example, ensembl_id_up and ensembl_id_down. We want it to be easy for folks to understand if they are seeing the file for the first time.

I realize this could be because you get the table from Azimuth, but that's nicely documented in your README, so I'm not worried about differing from Azimuth too much.

jaclyn-taroni · 2024-10-03T14:13:42Z

analyses/cell-type-nonETP-ALL-03/README.md

@@ -42,7 +42,7 @@ The `scripts/00-01_processing_rds.R` requires the processed SingleCellExperiment
 ../../download-results.py --projects SCPCP000003 --modules doublet-detection
 ```

-As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `geneSymbolmore1` for each cell type, and *TMEM56* is not detected in our dataset, thus it is being removed as part of the markers for Late Eryth. As of now, there is no negative marker genes provided under `geneSymbolmore2`.
+As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `geneSymbolmore1` for each cell type; *TMEM56* and *CD235a* are not detected in our dataset, thus they are being removed as part of the markers for Late Eryth and Pre Eryth respectively. As of now, there is no negative marker genes provided under `geneSymbolmore2`.


If you take my other suggestion:

Suggested change

As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `geneSymbolmore1` for each cell type; *TMEM56* and *CD235a* are not detected in our dataset, thus they are being removed as part of the markers for Late Eryth and Pre Eryth respectively. As of now, there is no negative marker genes provided under `geneSymbolmore2`.

As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `ensembl_id_up` for each cell type; *TMEM56* and *CD235a* are not detected in our dataset, thus they are being removed as part of the markers for Late Eryth and Pre Eryth respectively. As of now, there is no negative marker genes provided under `ensembl_id_down`.

But having read this, maybe ensembl_id_positive_marker and ensembl_id_negative_marker would be even better!

jaclyn-taroni · 2024-10-03T14:36:46Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+              file = file.path(out_loc,"results",paste0(ind.sam,"_metadata.txt")))
+  final.obj$sctype_classification <- seu$sctype_classification
+  final.obj$lowConfidence_annot <- seu$lowConfidence_annot
+  saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.sam,".rds")))


Suggested change

saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.sam,".rds")))

saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.lib,".rds")))

jaclyn-taroni · 2024-10-03T14:55:16Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+  cL_results <- do.call("rbind", lapply(unique(annot.obj@meta.data$leiden_clusters), function(cl){
+    es.max.cl = sort(rowSums(es.max[ ,rownames(annot.obj@meta.data[annot.obj@meta.data$leiden_clusters==cl, ])]), decreasing = !0)
+    head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum(annot.obj@meta.data$leiden_clusters==cl)), 10)
+  }))


Can you add comments about what this code is doing please?

Yes, I have added the comment as follow:

merge by cluster [repetitively finding top 10 cell types with the highest total ScType score for each cluster]
es.max.cl - stores the descending value of total ScType score for each cell type in a particular cluster
The following line prints out the top 10 cell types with the highest total ScType score for a cluster, along with #cells in that cluster
which is then stored in cL_results (_sctype_top10_celltypes_perCluster.txt) - table with 4 columns (clusterID, cell type, total ScType score, #cells)

jaclyn-taroni · 2024-10-03T15:10:55Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T)
+library(copykat)
+library(ggplot2)
+


You might consider setting the number of cores based on what machine you are on with something like:

n_cores <- parallel::detectCores() - 1

jaclyn-taroni · 2024-10-03T15:11:18Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+    copykat.test <- copykat(rawmat=seu@assays[["RNA"]]@counts, id.type="Ensemble",
+                            ngene.chr=5, win.size=25, KS.cut=0.1, sam.name=ind.sam,
+                            distance="euclidean", norm.cell.names=norm.cells,
+                            output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=1)


Suggested change

output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=1)

output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=n_cores)

jaclyn-taroni · 2024-10-03T15:13:47Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+    seu$copykat.pred <- copykat.test$prediction$copykat.pred[idx]
+    DimPlot(seu, reduction = "Xumap_", group.by = "copykat.pred") +
+      ggtitle(paste0(ind.sam,": copykat prediction"))
+    ggsave(file.path(out_loc,"plots",paste0(ind.sam,"_copykatPred.pdf")), width = 7, height = 7)


Suggested change

ggsave(file.path(out_loc,"plots",paste0(ind.sam,"_copykatPred.pdf")), width = 7, height = 7)

ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.pdf")), width = 7, height = 7)

UTSouthwesternDSSR · 2024-10-03T21:31:10Z

Is the same cell type label being assigned to all cells in a Leiden cluster?

Would it be possible to combine the UMAP plots into one multi-panel plot with something like patchwork? If so, I think folks would have an easier time with the review and interpretation of your results. If this change will take a lot of time, we can skip it.

Yes, ScType labels the entire cluster with one cell type
Sure, I will read in all 11 rds objects, and plot png plots for the 4 umaps (leiden cluster, cell type, low confidence cell type and copykat prediction).

UTSouthwesternDSSR · 2024-10-04T17:14:34Z

I have uploaded the multipanels_ umap plots in the S3 bucket and made all the changes as suggested. Looking forward to the next step!

jaclyn-taroni

Thank you for the updates, @UTSouthwesternDSSR! This looks close to being ready to merge in. I was not clear when I asked for a multipanel plot – I was looking for all UMAPs from an individual library in one multipanel figure. I've added a comment to the multipanel_plot.R file with something that I tested locally and it appeared to work.

I think with that change, we can remove the plotting of individual panels from the annotation script. I've made suggestions to that effect throughout.

I also had a question about whether or not we should be saving more output from CopyKAT, which I think is somewhat related to what I'll lay out below.

To respond to your question:

For those cells that are labeled as "Unknown", do we have to dig it more (as in a manual way) trying to identify their cell types?

We generally want annotations assigned as part of this project to be of higher quality than those we already include on the Single-cell Pediatric Cancer Atlas Portal. Because many cancer types exist in the ScPCA Portal and we thought it was important to supply users with a "first pass" at cell type labels to save a little time, we used two automated methods to assign cell types: SingleR and CellAssign. (More information about that in the ScPCA Portal docs: https://scpca.readthedocs.io/en/stable/processing_information.html#cell-type-annotation and more information about where to find those assignments here: https://scpca.readthedocs.io/en/stable/sce_file_contents.html#singlecellexperiment-cell-metrics and https://scpca.readthedocs.io/en/stable/sce_file_contents.html#anndata-cell-metrics)

So, we'd like to see some demonstration that the labels assigned are an improvement on the SingleR and CellAssign classifications we already have, i.e., more valuable to people who might want to use the data. One way to get an improvement is to pick more appropriate references – that's where something like the blast signature from the literature you've included comes in! At the very least, we want to understand how ScType assignments differ from existing ones.

To directly answer your question, yes, I might expect a complementary (and more manual) approach may be required to meet our goals.

When it comes to ScType assignments and the nature of how those are applied (i.e., to all cells in a cluster with >25% of cells in that cluster having the highest score as some cell type, as I understand it), it appears that the cluster assignments matter a lot. So, we'd probably want to know:

Are the clusters relatively stable? Are they well-separated? You can see some of how we're thinking about cluster evaluation in a package being written in a feature branch here: https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/e2c71af46616547846aa1fa816bbd5c409043ae1/packages/rOpenScPCA/R/evaluate-clusters.R
Will ScType assign cell type labels even if the cluster assignments are not informative? What happens if an overwhelmingly dominant cell type exists in an individual sample with poorly separated clusters? An example of how to get to this question: if one were to shuffle the cluster labels, would you still get high-confidence assignments, and how frequently does that happen?

I would also like to know what cell types CopyKAT calls aneuploid, and it might be interesting to examine the differences in blast module scores grouped by CopyKAT calls.

We have had some worrisome experiences with CopyKAT in our hands (#537), but I'd expect performance to vary very much between cancer types!

I don't expect you to add anything further to this pull request unless you'd like to refactor anything to make it easier to get at these questions in a later pull request!

jaclyn-taroni · 2024-10-06T15:23:14Z

analyses/cell-type-nonETP-ALL-03/scripts/multipanel_plot.R

This is my fault for not being specific! I was looking for all the UMAPs for an individual library in the same plot, so something like the following:

#!/usr/bin/env Rscript #This script combines multiple UMAP plots from an individual library into one multi-panel plot library(Seurat) library(ggplot2) multiplot <- function(annot.obj, library.id, ct.colors, n.row = 2, variables.to.plot = c("leiden_clusters", "sctype_classification", "lowConfidence_annot", "copykat.pred")){ plot.list <- list() for (plot.type in variables.to.plot){ if (plot.type %in% c("sctype_classification", "lowConfidence_annot")) { clrs <- ct.colors } else { clrs <- NULL } tryCatch({ plot.list[[plot.type]] <- DimPlot(annot.obj, reduction = "Xumap_", group.by = plot.type, label = T, cols = clrs, repel = T) + NoLegend() }, error=function(e){}) } cowplot::plot_grid(plotlist = plot.list, nrow = n.row) + cowplot::draw_figure_label(library.id, position = "top", size = 18, fontface = "bold") ggsave(file.path(out_loc,"plots",paste0("multipanels_",library.id,".png")), width = 12, height = 12, bg = "white") } project_root <- rprojroot::find_root(rprojroot::is_git_root) projectID <- "SCPCP000003" out_loc <- file.path(project_root, "analyses/cell-type-nonETP-ALL-03") data_loc <- file.path(project_root, "data/current",projectID) metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = T) metadata <- metadata[which(metadata$scpca_project_id == projectID & metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia"), ] libraryID <- metadata$scpca_library_id ct_color <- c("darkorchid","skyblue2","dodgerblue2","gold","beige","sienna1","green4","navy", "chocolate4","red","darkred","#6A3D9A","maroon","yellow4","grey35","black","lightpink","grey80") names(ct_color) <- c("B","CD4 T","CD8 T","DC","HSPC","Mono","NK","Other T","Macrophage", "Early Eryth","Late Eryth","Plasma","Platelet","Stromal","Blast","Cancer","Pre Eryth","Unknown") seu.list <- list() for (lib_iter in 1:length(libraryID)){ seu.list[[lib_iter]] <- readRDS(file.path(out_loc,"results/rds",paste0(libraryID[lib_iter],".rds"))) names(seu.list)[lib_iter] <- libraryID[lib_iter] } purrr::walk2( seu.list, names(seu.list), ~ multiplot(annot.obj = .x, library.id = .y, ct.colors = ct_color) )

jaclyn-taroni · 2024-10-06T15:25:53Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+  p1 <- DimPlot(seu, reduction = "Xumap_", group.by = "sctype_classification", cols = ct_color) +
+    ggtitle(paste0(ind.lib,": cell type"))
+  p2 <- DimPlot(seu, reduction = "Xumap_", group.by = "lowConfidence_annot", cols = ct_color) +
+    ggtitle("low confidence annotation")
+  p1 + p2
+  ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_celltype.png")), width = 10, height = 4, dpi = 150)


Suggested change

p1 <- DimPlot(seu, reduction = "Xumap_", group.by = "sctype_classification", cols = ct_color) +

ggtitle(paste0(ind.lib,": cell type"))

p2 <- DimPlot(seu, reduction = "Xumap_", group.by = "lowConfidence_annot", cols = ct_color) +

ggtitle("low confidence annotation")

p1 + p2

ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_celltype.png")), width = 10, height = 4, dpi = 150)

jaclyn-taroni · 2024-10-06T15:26:13Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+    DimPlot(seu, reduction = "Xumap_", group.by = "copykat.pred") +
+      ggtitle(paste0(ind.lib,": copykat prediction"))
+    ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.png")), width = 6, height = 6, dpi = 150)


Suggested change

DimPlot(seu, reduction = "Xumap_", group.by = "copykat.pred") +

ggtitle(paste0(ind.lib,": copykat prediction"))

ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.png")), width = 6, height = 6, dpi = 150)

jaclyn-taroni · 2024-10-06T15:26:37Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+ct_color <- c("darkorchid","skyblue2","dodgerblue2","gold","beige","sienna1","green4","navy",
+              "chocolate4","red","darkred","#6A3D9A","maroon","yellow4","grey35","black","lightpink","grey80")
+names(ct_color) <- c("B","CD4 T","CD8 T","DC","HSPC","Mono","NK","Other T","Macrophage",
+                     "Early Eryth","Late Eryth","Plasma","Platelet","Stromal","Blast","Cancer","Pre Eryth","Unknown")


Suggested change

ct_color <- c("darkorchid","skyblue2","dodgerblue2","gold","beige","sienna1","green4","navy",

"chocolate4","red","darkred","#6A3D9A","maroon","yellow4","grey35","black","lightpink","grey80")

names(ct_color) <- c("B","CD4 T","CD8 T","DC","HSPC","Mono","NK","Other T","Macrophage",

"Early Eryth","Late Eryth","Plasma","Platelet","Stromal","Blast","Cancer","Pre Eryth","Unknown")

jaclyn-taroni · 2024-10-06T15:32:31Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+  norm.cells <- colnames(seu)[which(seu$sctype_classification=="B")]
+  n_cores <- parallel::detectCores() - 1
+  if (length(norm.cells) > 0){    #the sample has B cells annotated
+    copykat.test <- copykat(rawmat=seu@assays[["RNA"]]@counts, id.type="Ensemble",


Is there anything besides the final call from copyKAT that we want to save to use in a later analysis?

Based on their test output, I uploaded the _copykat_heatmap.jpeg and _copykat_CNA_results.txt under results/copykat_output (in S3 bucket). I also added _copykat_CNA_raw_results_gene_by_cell.txt. I am not sure if this file will be useful, but there is gene annotation.

I added the following lines in 02-03_annotation.R

dir.create(file.path(out_loc, "results/copykat_output"), showWarnings = FALSE) setwd(file.path(out_loc, "results/copykat_output"))

jaclyn-taroni · 2024-10-06T15:36:00Z

analyses/cell-type-nonETP-ALL-03/scripts/02-03_annotation.R

+  write.table(data.frame(FetchData(seu, vars = voi)), sep = "\t", quote = F,
+              file = file.path(out_loc,"results",paste0(ind.lib,"_metadata.txt")))


Can we save the cell identifiers as a column rather than rownames here? Perhaps using something like the following:

voi_df <- data.frame(FetchData(seu, vars = voi)) |> tibble::rownames_to_column(var = "barcode")

Yes, I have made changes as following:

voi_df <- data.frame(FetchData(seu, vars = voi)) |> tibble::rownames_to_column(var = "barcode") write.table(voi_df, sep = "\t", quote = F, row.names = F, file = file.path(out_loc,"results",paste0(ind.lib,"_metadata.txt")))

jaclyn-taroni · 2024-10-07T12:56:47Z

analyses/cell-type-nonETP-ALL-03/.gitignore

+# multipanel plots are not committed
+/plots/multipanels_*


These + the dot plot are the only ones I would commit. I've suggested a change to the annotation script to remove plotting of individual UMAPs

Just to clarify, do you mean to upload the multipanels_* to the repository? Even if I change the dpi = 150, the size of these mutipanels_*png ranges from 192kb to 676kb (only one file is below 200kb). Would it be fine if I save all of these multipanels plot as JPEG?

Ah, good point! Yes, let's try JPEGs.

Even with JPEGs (dpi = 150), the range is from 320kb to 652kb

Do you have any other ideas on how to make the size lower than 200kb?

We just merged in #792, so if you merge AlexsLemonade:main into this branch, you should be able to commit the larger PNGs. Please let us know if that doesn't work as expected!

I have merged it, but the multipanels_ png files do not show up in the right panel of GitKraken Desktop (the changes section). So I cannot commit them.

If you remove this line from the .gitignore, do they show up in GitKraken?

Yes, it works! Thank you so much.

jaclyn-taroni

This looks ready to go in 👍🏻 Thank you for this contribution! Please let us know if you'd like to discuss the next steps. We are happy to correspond via Discussions or set up a short call if needed.

UTSouthwesternDSSR and others added 13 commits October 1, 2024 21:47

update for cell annotation

f2aa909

Uncomment out pull request and workflow call triggers

610a7d8

Download relevant test data in CI

c3f162e

Fill out Dockerfile using recommendations for renv + Conda

fb68ef2

Uncomment pull request and push steps for nonETP-ALL Docker builder

4a67535

Add step for installing system dependencies

38750f5

Add conda setup steps to workflow

b7b9052

Add running Rscripts to workflow

249f0ec

Newline in Dockerfile

e905fbe

Try activating conda environment

69849ac

Remove activation step and add defaults for job

5d0162a

reticulate code depends on conda env name

f0db47e

minor update on script

cd02ec1

UTSouthwesternDSSR requested a review from jaclyn-taroni as a code owner October 3, 2024 00:17

jaclyn-taroni reviewed Oct 3, 2024

View reviewed changes

UTSouthwesternDSSR added 2 commits October 3, 2024 16:06

update results-readme

21aacb3

updated script for nonETP

79f2c4f

UTSouthwesternDSSR added 3 commits October 3, 2024 21:35

add bg option for dotplot png format

1d62461

updated plots

9db88ea

multipanel plot

55c341a

jaclyn-taroni added 2 commits October 6, 2024 10:38

Update renv.lock file to include HGNChelper

e3317f0

Add multipanel plotting step to CI

4b8eb23

jaclyn-taroni reviewed Oct 7, 2024

View reviewed changes

UTSouthwesternDSSR and others added 4 commits October 7, 2024 09:38

Merge branch 'AlexsLemonade:main' into main

58c6e00

updated script and readme

417ef62

Merge branch 'AlexsLemonade:main' into main

e53bfbd

updated plots

e0e68cf

jaclyn-taroni approved these changes Oct 8, 2024

View reviewed changes

Merge branch 'main' into main

44e7bf8

jaclyn-taroni merged commit ff20cd2 into AlexsLemonade:main Oct 8, 2024
5 checks passed

	saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.sample,".rds")))
	saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.library,".rds")))

	source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
	source("https://github.com/IanevskiAleksandr/sc-type/raw/6db9eef49f185cf4d79bfec92a20fcf1edcccafb/R/sctype_score_.R")

-lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T)
+library(dplyr)
+library(Seurat)
+library(HGNChelper)

	saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.sam,".rds")))
	saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.lib,".rds")))

	output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=1)
	output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=n_cores)

	ggsave(file.path(out_loc,"plots",paste0(ind.sam,"_copykatPred.pdf")), width = 7, height = 7)
	ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.pdf")), width = 7, height = 7)

	DimPlot(seu, reduction = "Xumap_", group.by = "copykat.pred") +
	ggtitle(paste0(ind.lib,": copykat prediction"))
	ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.png")), width = 6, height = 6, dpi = 150)

		write.table(data.frame(FetchData(seu, vars = voi)), sep = "\t", quote = F,
		file = file.path(out_loc,"results",paste0(ind.lib,"_metadata.txt")))

celltype/tumor annotation for non-ETP T-ALL (SCPCP000003) #789

celltype/tumor annotation for non-ETP T-ALL (SCPCP000003) #789

Conversation

UTSouthwesternDSSR commented Oct 3, 2024

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

Briefly describe the general approach you took to achieve this goal.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

What is the name of your results bucket on S3?

What types of results does your code produce (e.g., table, figure)?

What is your summary of the results?

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Analysis module and review

Reproducibility checklist

jaclyn-taroni left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

UTSouthwesternDSSR commented Oct 3, 2024

UTSouthwesternDSSR commented Oct 4, 2024

jaclyn-taroni left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jaclyn-taroni left a comment

Choose a reason for hiding this comment