Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

celltype/tumor annotation for non-ETP T-ALL (SCPCP000003) #789

Merged
merged 25 commits into from
Oct 8, 2024

Conversation

UTSouthwesternDSSR
Copy link
Contributor

Purpose/implementation Section

The goal of this PR is to annotate cell types with ScType and identify tumor cells with CopyKat using the annotated B cells in the same sample (if there is any).

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

Annotate cell types with ScType and identify tumor cells with CopyKat using the annotated B cells in the same sample (if there is any).

Briefly describe the general approach you took to achieve this goal.

  1. Create marker genes list by obtaining most cell types from Azimuth human reference - bone marrow(level 1: B, CD4 T, CD8 T, DC, HSPC, Mono, NK, Other T; level 2: Macrophage, Early Eryth, Late Eryth, Plasma, Platelet, Stromal), blast cell from Bhasin et al., and erythroid precursor and cancer cell from ScType database.
  2. Annotate cell type by running ScType with above marker gene list.
  3. Identify tumor cells by running CopyKat, using B cells from the same sample as normal cells.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Out of 11 samples, 7 samples have B cells annotated (except SCPCS000091, SCPCS000092, SCPCS000098, and SCPCS000100). Although SCPCS000099 contains B cells, these cells are not useful for the identification of tumor cells, as shown by the extremely large number of not.defined cells with default parameters.

I may try to merge the samples, and use the annotated B cells as the normal cell to run CopyKat, hoping that we could identify tumor cells from those samples without B cells. This could be another pull request, if it works.

Results

What is the name of your results bucket on S3?

  • rds objects are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/rds
  • metadata files and sctype results are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/
  • umap and dot plots are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots

What types of results does your code produce (e.g., table, figure)?

  • rds objects
  • two text files for each sample: _metadata.txt (cell ID, leiden clusters, cell type annotation, low confidence cell type annotation, CopyKat prediction [for the 7 samples]) and _sctype_top10_celltypes_perCluster.txt (top 10 possible cell types with their respective sctype score in each cluster)
  • umap plots showing leiden clustering, cell type, and CopyKat prediction respectively
  • dot plots showing the average expression of group of markers for each cell type using AddModuleScore()

What is your summary of the results?

With the default threshold of having sctype score > 25% of ncells in a cluster (sctype_classification), there are a large number of cells being annotated as "Unknown" in each sample, reaching to as high as ~45% (disregarding SCPCS000099: 48% and SCPCS000100: 52%, which have much lower number of features detected).
Screenshot 2024-10-02 at 7 08 24 PM

Therefore, I relaxed the threshold from 25% to 10% (lowConfidence_annot), and the percentage of "Unknown" is now dropped to 30% (disregarding SCPCS000099: 33% and SCPCS000100: 48%), with 5 samples having no "Unknown".

I only ran CopyKat for 7 samples, excluding SCPCS000091, SCPCS000092, SCPCS000098, and SCPCS000100. However,
tumor prediction does not work for SCPCS000099, given that it has 36% of "not.defined" cells, while the other samples have less than 1%. Even if I relaxed the cutoff of ngene.chr from 5 to 1, the number of "not.defined" cells decreases, but they all get annotated as "diploid".

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

  • The packages are installed and updated in renv.lock and conda.lock.
  • Analysis could be executed on a Standard-4XL virtual machine via AWS Lightsail for Research, but CopyKat runs pretty slow on this machine with one core. Therefore, I ran the CopyKat on our lab server with 50 cores (The longest time needed for a sample is ~1.5 hr).

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

  • I am wondering will you take a look at the results part during PR, or it will be reviewed after the deadline?
  • For those cells that are labeled as "Unknown", do we have to dig it more (as in a manual way) trying to identify their cell types?

Thank you so much for any suggestions/feedback!

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @UTSouthwesternDSSR — thank you for this contribution! I am still reviewing the results but wanted to provide some initial feedback on the code.

To summarize my comments at a high level:

  • Your module assumes a 1:1 relationship between sample and library identifiers, but that isn’t always true in ScPCA data. From what I saw of the ETP-ALL module, I believe you’d like to use this code for other sets of samples, so making this change for safety may make things easier later. I’ve made suggestions about using the library identifier as the primary ID, but I may not have caught every reference to a sample identifier.
  • analyses/cell-type-nonETP-ALL-03/Azimuth_BM_level1.csv could have more descriptive column names that make it easier for someone looking at the code for the first time to understand what it contains.
  • I want to discuss whether or not using an Excel file is necessary and moving the gene set list generation outside of individual functions so you can run it once instead of multiple times every time you annotate a sample.
  • We want to tie the scType code to a particular permalink of that repository’s code.
  • It would be best to create the results/rds directory programmatically instead of assuming it already exists, which causes failures on other machines.
  • There is at least one package missing from the renv.lock file, which I think could be captured by changing the library() calls in the annotation script to no longer use lapply().

And then for science and display questions, as I’m still working through that part of the review:

  • Is the same cell type label being assigned to all cells in a Leiden cluster?
  • Would it be possible to combine the UMAP plots into one multi-panel plot with something like patchwork? If so, I think folks would have an easier time with the review and interpretation of your results. If this change will take a lot of time, we can skip it.

Thank you again for your contribution. Please let me know if you have any questions or if there’s any way we can help!

@@ -43,7 +43,7 @@ run_sam <- function(ind.sample, ind.library){
final.obj[["ADT"]]@data <- logcounts(altExp(sce, "adt"))
final.obj[["ADT"]] <- AddMetaData(final.obj[["ADT"]], as.data.frame(rowData(altExp(sce, "adt"))))

saveRDS(final.obj, file.path(out_loc,"scratch",paste0(ind.sample,".rds")))
saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.sample,".rds")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need a directory creation step somewhere in this script now that you've changed the output location based on my local testing and potentially the CI error. That can be accomplished with:

dir.create(file.path(out_loc, "results/rds", recursive = TRUE, showWarnings = FALSE))

And will re-run without error or warning if it already exists

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In keeping with library ID as the primary identifier:

Suggested change
saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.sample,".rds")))
saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.library,".rds")))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for the ETP-ALL samples, Sample SCPCS000066 has two libraries, and I actually merge the two rds file without doing any batch effect correction. I write another function of run_multisam for those with multiple libraries. It requires a little bit of manual code to provide input to that function, but it works perfectly fine for ETP-ALL samples, since there is only one case.
Screenshot 2024-10-03 at 11 12 17 AM

That's the reason why I thought saving as sampleID making more sense, because technically we should merge multiple libraries together if they belong to the same sample? Isn't this will make the analysis more powerful, since there are more cells added?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it seems like the batch effect with that case is not bad, I would still lean toward leaving the two libraries separate. Having more cells may not always result in better results (especially if there were a batch effect), but that is largely an empirical question; I think we would want to evaluate what the tradeoffs are for different parts of the analysis.

To me, keeping the libraries separate seems like the more conservative (and simpler) approach, so it is the one I would start with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will remove the run_multisam function from the ETP-ALL module script.

# load gene set preparation function
#source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to tie this to a specific commit in case any update that's not backward compatible with this code is pushed to master. The following worked for me locally:

Suggested change
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
source("https://github.com/IanevskiAleksandr/sc-type/raw/6db9eef49f185cf4d79bfec92a20fcf1edcccafb/R/sctype_score_.R")

#Based on the annotated B cells, we provide them as the normal cells for CopyKat to identify tumor cells.
#The outputs are final rds file and two umap plots showing cell type and tumor/normal cells status.

lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you'll need openxlsx anymore, so I've removed it in the suggestion. HGNChelper is not in the renv.lock file and doesn't get captured when I tried renv::snapshot() locally, so I think we need to remove the lapply() here for that to work as expected:

Suggested change
lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T)
library(dplyr)
library(Seurat)
library(HGNChelper)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my cursory reading of the annotation script, I understand that geneSymbolmore1 is a gene set up in a cell type, and geneSymbolmore2 is a gene set down in a cell type. Can we rename these columns to be more descriptive about their contents? For example, ensembl_id_up and ensembl_id_down. We want it to be easy for folks to understand if they are seeing the file for the first time.

I realize this could be because you get the table from Azimuth, but that's nicely documented in your README, so I'm not worried about differing from Azimuth too much.

@@ -42,7 +42,7 @@ The `scripts/00-01_processing_rds.R` requires the processed SingleCellExperiment
../../download-results.py --projects SCPCP000003 --modules doublet-detection
```

As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `geneSymbolmore1` for each cell type, and *TMEM56* is not detected in our dataset, thus it is being removed as part of the markers for Late Eryth. As of now, there is no negative marker genes provided under `geneSymbolmore2`.
As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `geneSymbolmore1` for each cell type; *TMEM56* and *CD235a* are not detected in our dataset, thus they are being removed as part of the markers for Late Eryth and Pre Eryth respectively. As of now, there is no negative marker genes provided under `geneSymbolmore2`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you take my other suggestion:

Suggested change
As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `geneSymbolmore1` for each cell type; *TMEM56* and *CD235a* are not detected in our dataset, thus they are being removed as part of the markers for Late Eryth and Pre Eryth respectively. As of now, there is no negative marker genes provided under `geneSymbolmore2`.
As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `ensembl_id_up` for each cell type; *TMEM56* and *CD235a* are not detected in our dataset, thus they are being removed as part of the markers for Late Eryth and Pre Eryth respectively. As of now, there is no negative marker genes provided under `ensembl_id_down`.

But having read this, maybe ensembl_id_positive_marker and ensembl_id_negative_marker would be even better!

file = file.path(out_loc,"results",paste0(ind.sam,"_metadata.txt")))
final.obj$sctype_classification <- seu$sctype_classification
final.obj$lowConfidence_annot <- seu$lowConfidence_annot
saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.sam,".rds")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.sam,".rds")))
saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.lib,".rds")))

Comment on lines +70 to +73
cL_results <- do.call("rbind", lapply(unique(annot.obj@meta.data$leiden_clusters), function(cl){
es.max.cl = sort(rowSums(es.max[ ,rownames(annot.obj@meta.data[annot.obj@meta.data$leiden_clusters==cl, ])]), decreasing = !0)
head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum(annot.obj@meta.data$leiden_clusters==cl)), 10)
}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comments about what this code is doing please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have added the comment as follow:

merge by cluster [repetitively finding top 10 cell types with the highest total ScType score for each cluster]
es.max.cl - stores the descending value of total ScType score for each cell type in a particular cluster
The following line prints out the top 10 cell types with the highest total ScType score for a cluster, along with #cells in that cluster
which is then stored in cL_results (_sctype_top10_celltypes_perCluster.txt) - table with 4 columns (clusterID, cell type, total ScType score, #cells)

lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T)
library(copykat)
library(ggplot2)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might consider setting the number of cores based on what machine you are on with something like:

n_cores <- parallel::detectCores() - 1

copykat.test <- copykat(rawmat=seu@assays[["RNA"]]@counts, id.type="Ensemble",
ngene.chr=5, win.size=25, KS.cut=0.1, sam.name=ind.sam,
distance="euclidean", norm.cell.names=norm.cells,
output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=1)
output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=n_cores)

seu$copykat.pred <- copykat.test$prediction$copykat.pred[idx]
DimPlot(seu, reduction = "Xumap_", group.by = "copykat.pred") +
ggtitle(paste0(ind.sam,": copykat prediction"))
ggsave(file.path(out_loc,"plots",paste0(ind.sam,"_copykatPred.pdf")), width = 7, height = 7)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ggsave(file.path(out_loc,"plots",paste0(ind.sam,"_copykatPred.pdf")), width = 7, height = 7)
ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.pdf")), width = 7, height = 7)

@UTSouthwesternDSSR
Copy link
Contributor Author

  • Is the same cell type label being assigned to all cells in a Leiden cluster?
  • Would it be possible to combine the UMAP plots into one multi-panel plot with something like patchwork? If so, I think folks would have an easier time with the review and interpretation of your results. If this change will take a lot of time, we can skip it.
  • Yes, ScType labels the entire cluster with one cell type
  • Sure, I will read in all 11 rds objects, and plot png plots for the 4 umaps (leiden cluster, cell type, low confidence cell type and copykat prediction).

@UTSouthwesternDSSR
Copy link
Contributor Author

I have uploaded the multipanels_ umap plots in the S3 bucket and made all the changes as suggested. Looking forward to the next step!

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates, @UTSouthwesternDSSR! This looks close to being ready to merge in. I was not clear when I asked for a multipanel plot – I was looking for all UMAPs from an individual library in one multipanel figure. I've added a comment to the multipanel_plot.R file with something that I tested locally and it appeared to work.

I think with that change, we can remove the plotting of individual panels from the annotation script. I've made suggestions to that effect throughout.

I also had a question about whether or not we should be saving more output from CopyKAT, which I think is somewhat related to what I'll lay out below.

To respond to your question:

For those cells that are labeled as "Unknown", do we have to dig it more (as in a manual way) trying to identify their cell types?

We generally want annotations assigned as part of this project to be of higher quality than those we already include on the Single-cell Pediatric Cancer Atlas Portal. Because many cancer types exist in the ScPCA Portal and we thought it was important to supply users with a "first pass" at cell type labels to save a little time, we used two automated methods to assign cell types: SingleR and CellAssign. (More information about that in the ScPCA Portal docs: https://scpca.readthedocs.io/en/stable/processing_information.html#cell-type-annotation and more information about where to find those assignments here: https://scpca.readthedocs.io/en/stable/sce_file_contents.html#singlecellexperiment-cell-metrics and https://scpca.readthedocs.io/en/stable/sce_file_contents.html#anndata-cell-metrics)

So, we'd like to see some demonstration that the labels assigned are an improvement on the SingleR and CellAssign classifications we already have, i.e., more valuable to people who might want to use the data. One way to get an improvement is to pick more appropriate references – that's where something like the blast signature from the literature you've included comes in! At the very least, we want to understand how ScType assignments differ from existing ones.

To directly answer your question, yes, I might expect a complementary (and more manual) approach may be required to meet our goals.

When it comes to ScType assignments and the nature of how those are applied (i.e., to all cells in a cluster with >25% of cells in that cluster having the highest score as some cell type, as I understand it), it appears that the cluster assignments matter a lot. So, we'd probably want to know:

I would also like to know what cell types CopyKAT calls aneuploid, and it might be interesting to examine the differences in blast module scores grouped by CopyKAT calls.

We have had some worrisome experiences with CopyKAT in our hands (#537), but I'd expect performance to vary very much between cancer types!

I don't expect you to add anything further to this pull request unless you'd like to refactor anything to make it easier to get at these questions in a later pull request!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my fault for not being specific! I was looking for all the UMAPs for an individual library in the same plot, so something like the following:

#!/usr/bin/env Rscript

#This script combines multiple UMAP plots from an individual library into one multi-panel plot

library(Seurat)
library(ggplot2)

multiplot <- function(annot.obj, 
                      library.id,
                      ct.colors,
                      n.row = 2,
                      variables.to.plot = c("leiden_clusters", 
                                            "sctype_classification",
                                            "lowConfidence_annot",
                                            "copykat.pred")){
  plot.list <- list()
  for (plot.type in variables.to.plot){
    
    if (plot.type %in% c("sctype_classification", "lowConfidence_annot")) {
      clrs <- ct.colors
    } else {
      clrs <- NULL
    }
    
    tryCatch({
      plot.list[[plot.type]] <- DimPlot(annot.obj, 
                                        reduction = "Xumap_", 
                                        group.by = plot.type, 
                                        label = T, 
                                        cols = clrs, 
                                        repel = T) +
        NoLegend()
    }, error=function(e){})
  
  }
  
  
  cowplot::plot_grid(plotlist = plot.list, nrow = n.row) +
    cowplot::draw_figure_label(library.id, 
                               position = "top",
                               size = 18,
                               fontface = "bold")
  
  ggsave(file.path(out_loc,"plots",paste0("multipanels_",library.id,".png")), 
         width = 12, 
         height = 12, 
         bg = "white")
}

project_root  <- rprojroot::find_root(rprojroot::is_git_root)
projectID <- "SCPCP000003"
out_loc <- file.path(project_root, "analyses/cell-type-nonETP-ALL-03")
data_loc <- file.path(project_root, "data/current",projectID)

metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = T)
metadata <- metadata[which(metadata$scpca_project_id == projectID &
                             metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia"), ]
libraryID <- metadata$scpca_library_id

ct_color <- c("darkorchid","skyblue2","dodgerblue2","gold","beige","sienna1","green4","navy",
              "chocolate4","red","darkred","#6A3D9A","maroon","yellow4","grey35","black","lightpink","grey80")
names(ct_color) <- c("B","CD4 T","CD8 T","DC","HSPC","Mono","NK","Other T","Macrophage",
                     "Early Eryth","Late Eryth","Plasma","Platelet","Stromal","Blast","Cancer","Pre Eryth","Unknown")

seu.list <- list()
for (lib_iter in 1:length(libraryID)){
  seu.list[[lib_iter]] <- readRDS(file.path(out_loc,"results/rds",paste0(libraryID[lib_iter],".rds")))
  names(seu.list)[lib_iter] <- libraryID[lib_iter]
}

purrr::walk2(
  seu.list, 
  names(seu.list),
  ~ multiplot(annot.obj = .x, 
              library.id = .y,
              ct.colors = ct_color)
)

Comment on lines 117 to 122
p1 <- DimPlot(seu, reduction = "Xumap_", group.by = "sctype_classification", cols = ct_color) +
ggtitle(paste0(ind.lib,": cell type"))
p2 <- DimPlot(seu, reduction = "Xumap_", group.by = "lowConfidence_annot", cols = ct_color) +
ggtitle("low confidence annotation")
p1 + p2
ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_celltype.png")), width = 10, height = 4, dpi = 150)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
p1 <- DimPlot(seu, reduction = "Xumap_", group.by = "sctype_classification", cols = ct_color) +
ggtitle(paste0(ind.lib,": cell type"))
p2 <- DimPlot(seu, reduction = "Xumap_", group.by = "lowConfidence_annot", cols = ct_color) +
ggtitle("low confidence annotation")
p1 + p2
ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_celltype.png")), width = 10, height = 4, dpi = 150)

Comment on lines 136 to 138
DimPlot(seu, reduction = "Xumap_", group.by = "copykat.pred") +
ggtitle(paste0(ind.lib,": copykat prediction"))
ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.png")), width = 6, height = 6, dpi = 150)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DimPlot(seu, reduction = "Xumap_", group.by = "copykat.pred") +
ggtitle(paste0(ind.lib,": copykat prediction"))
ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.png")), width = 6, height = 6, dpi = 150)

Comment on lines 166 to 169
ct_color <- c("darkorchid","skyblue2","dodgerblue2","gold","beige","sienna1","green4","navy",
"chocolate4","red","darkred","#6A3D9A","maroon","yellow4","grey35","black","lightpink","grey80")
names(ct_color) <- c("B","CD4 T","CD8 T","DC","HSPC","Mono","NK","Other T","Macrophage",
"Early Eryth","Late Eryth","Plasma","Platelet","Stromal","Blast","Cancer","Pre Eryth","Unknown")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ct_color <- c("darkorchid","skyblue2","dodgerblue2","gold","beige","sienna1","green4","navy",
"chocolate4","red","darkred","#6A3D9A","maroon","yellow4","grey35","black","lightpink","grey80")
names(ct_color) <- c("B","CD4 T","CD8 T","DC","HSPC","Mono","NK","Other T","Macrophage",
"Early Eryth","Late Eryth","Plasma","Platelet","Stromal","Blast","Cancer","Pre Eryth","Unknown")

norm.cells <- colnames(seu)[which(seu$sctype_classification=="B")]
n_cores <- parallel::detectCores() - 1
if (length(norm.cells) > 0){ #the sample has B cells annotated
copykat.test <- copykat(rawmat=seu@assays[["RNA"]]@counts, id.type="Ensemble",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything besides the final call from copyKAT that we want to save to use in a later analysis?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on their test output, I uploaded the _copykat_heatmap.jpeg and _copykat_CNA_results.txt under results/copykat_output (in S3 bucket). I also added _copykat_CNA_raw_results_gene_by_cell.txt. I am not sure if this file will be useful, but there is gene annotation.

I added the following lines in 02-03_annotation.R

dir.create(file.path(out_loc, "results/copykat_output"), showWarnings = FALSE)
setwd(file.path(out_loc, "results/copykat_output"))

Comment on lines 146 to 147
write.table(data.frame(FetchData(seu, vars = voi)), sep = "\t", quote = F,
file = file.path(out_loc,"results",paste0(ind.lib,"_metadata.txt")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we save the cell identifiers as a column rather than rownames here? Perhaps using something like the following:

voi_df <- data.frame(FetchData(seu, vars = voi)) |>
  tibble::rownames_to_column(var = "barcode")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have made changes as following:

voi_df <- data.frame(FetchData(seu, vars = voi)) |> tibble::rownames_to_column(var = "barcode")
write.table(voi_df, sep = "\t", quote = F, row.names = F, 
            file = file.path(out_loc,"results",paste0(ind.lib,"_metadata.txt")))

Comment on lines 5 to 6
# multipanel plots are not committed
/plots/multipanels_*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These + the dot plot are the only ones I would commit. I've suggested a change to the annotation script to remove plotting of individual UMAPs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, do you mean to upload the multipanels_* to the repository? Even if I change the dpi = 150, the size of these mutipanels_*png ranges from 192kb to 676kb (only one file is below 200kb). Would it be fine if I save all of these multipanels plot as JPEG?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point! Yes, let's try JPEGs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with JPEGs (dpi = 150), the range is from 320kb to 652kb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any other ideas on how to make the size lower than 200kb?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just merged in #792, so if you merge AlexsLemonade:main into this branch, you should be able to commit the larger PNGs. Please let us know if that doesn't work as expected!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have merged it, but the multipanels_ png files do not show up in the right panel of GitKraken Desktop (the changes section). So I cannot commit them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove this line from the .gitignore, do they show up in GitKraken?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it works! Thank you so much.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ready to go in 👍🏻 Thank you for this contribution! Please let us know if you'd like to discuss the next steps. We are happy to correspond via Discussions or set up a short call if needed.

@jaclyn-taroni jaclyn-taroni merged commit ff20cd2 into AlexsLemonade:main Oct 8, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants