Celltype annotation for non-ETP T-ALL (SCPCP000003) #767
Conversation
Hello @UTSouthwesternDSSR and thank you for your contribution to the OpenScPCA project!
Overall, this first contribution looks to be structured as we would like, so you have done a great job there, and thank you for providing an informative README! It was very helpful to see the goals of the analysis and the overall steps that you are taking.
Let me start by addressing the question about the "leaks" (thank you for setting up pre-commit!):

> I generated the `environment.yml` via `conda env export -n openscpca > environment.yml`. I am wondering if you have any suggestion in solving this?
The issue here is that the build hashes from conda packages can look like API keys, so pre-commit (specifically gitleaks) flags them. There are a few ways we could get gitleaks to accept these strings, but the inclusion of build IDs is actually something we would rather avoid, as they can often result in cross-system incompatibilities.
In this project, we are trying to take a different approach from simply using `conda env export`, as we have found that even if you include the `--no-builds` option (which would remove this particular issue) it can create incompatibilities across systems in the environment file that can be hard to track down.
Instead, we are recommending the use of `conda-lock`. Rather than repeating too much, I am going to direct you to the page of our docs that describes the approach: https://openscpca.readthedocs.io/en/latest/ensuring-repro/managing-software/using-conda/
In short, you will want to add the packages you need (optionally with version numbers) to the `environment.yml` file. The goal here is to have a high-level list of the packages needed, but without all of the dependencies that may be system-specific. Then you can run `conda-lock --file environment.yml`, which will create a `conda-lock.yml` file that includes all version information, including system-specific dependencies. This can then be installed on a new system with `conda-lock install --name openscpca-cell-type-nonETP-ALL-03 conda-lock.yml`.
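As a concrete illustration, a pared-down `environment.yml` for this module might look something like the sketch below. The exact package list is an assumption for illustration; only `anndata` and `sam-algorithm` are mentioned elsewhere in this thread:

```yaml
# Illustrative sketch only -- add or pin packages as the module actually requires
name: openscpca-cell-type-nonETP-ALL-03
channels:
  - conda-forge
dependencies:
  - python=3.11
  - anndata
  - pip
  - pip:
      - sam-algorithm
```

Running `conda-lock --file environment.yml` against a file like this produces the fully pinned `conda-lock.yml`.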
Other comments
- I noticed that a big part of the first section of your script is converting from SCE to Seurat and then from that to h5ad. I imagine that is a bit slow, as it involves a number of writes to disk. Would using the h5ad files that we provide make that any easier or more efficient? You can get those files with the following command using our download script:
```
./download-data.py --projects SCPCP000003 --format AnnData
```
- I am also somewhat concerned about the use of `SeuratDisk` in general, as it seems like the package has not been updated in quite a while. Would it be possible to avoid that package?
- It also seems like the SAM software can work directly from a matrix; would it be more efficient to write out a matrix directly with `Matrix::writeMM()` (and additional geneID and cellID files)?
- You may be able to handle reading the h5ad file as a Seurat object with https://github.com/cellgeni/schard; I am not sure whether it supports everything you need though.
- I'm a little unclear about what preprocessing and filtering SAM does; does that affect the data as read back in? Would it be possible to write out only the clustering results and read those in directly to add to the existing objects?
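To illustrate the matrix-export idea from the list above, here is a minimal R sketch. The file names are placeholders, and it assumes `sce` is the SingleCellExperiment already loaded in the script:

```r
# Write the sparse counts matrix plus gene/cell identifiers for SAM to read
Matrix::writeMM(counts(sce), "counts.mtx")
writeLines(rownames(sce), "gene_ids.txt")   # gene IDs, one per line
writeLines(colnames(sce), "cell_ids.txt")   # cell barcodes, one per line
```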
I also left some line-level comments, though I suspect if you do some restructuring those may become moot.
Thank you again for your contribution, and I look forward to working with you as we work to get this module added to the project!
```
conda env create -f environment.yml --name openscpca
conda activate openscpca
```
For consistency, and to prevent replacing conda environments from other modules, please name this with the name provided in the `environment.yml` file, which includes the module name: `openscpca-cell-type-nonETP-ALL-03`.

Also, since the Python code and conda environment are run through `reticulate`, does this need to be activated?
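For reference on the activation question: `reticulate` can bind to a conda environment directly from R, so a shell-level `conda activate` should not be necessary. A minimal sketch, assuming the module's environment name:

```r
library(reticulate)
# Selects the conda environment for this R session; no `conda activate` needed
use_condaenv("openscpca-cell-type-nonETP-ALL-03", required = TRUE)
samalg <- import("samalg")
```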
Current:

```
conda env create -f environment.yml --name openscpca
conda activate openscpca
```

Suggested:

```
conda env create -f environment.yml --name openscpca-cell-type-nonETP-ALL-03
```
Note this would also be replaced if you convert to using `conda-lock` as suggested above.
You are welcome to remove this file if you do not need it.
```
#sudo apt install libglpk40
#sudo apt install libcurl4-openssl-dev
```
Could you please add the requirements for system packages to the readme? These particular packages I am aware of, and they will hopefully soon be installed on future Lightsail instances, but for existing instances this will still be required.
Yes, they are now at lines 27 and 28 of the readme.
Thank you, I think I had noticed that they were also there previously and I forgot to delete this comment!
```r
use_condaenv("openscpca")
os <- import("os")
anndata <- import("anndata") #conda install conda-forge::anndata
samalg <- import("samalg") #pip install sam-algorithm
```
If this package is not available via conda, you will need to add lines like the following to the `environment.yml` file:
```yaml
dependencies:
  - python=3.11
  ...
  - pip
  - pip:
      - sam-algorithm
```
```r
out_loc <- "~/OpenScPCA-analysis/analyses/cell-type-nonETP-ALL-03/"
data_loc <- "~/OpenScPCA-analysis/data/2024-08-22/SCPCP000003/"
doublet_loc <- "~/OpenScPCA-analysis/data/2024-08-22/results/doublet-detection/SCPCP000003/"
```
To preserve portability, we recommend using relative paths wherever possible. In addition, all data paths should use the `current` alias rather than specific releases. In this case, you can also take advantage of the `rprojroot` package to find the base of the project:
Current:

```r
out_loc <- "~/OpenScPCA-analysis/analyses/cell-type-nonETP-ALL-03/"
data_loc <- "~/OpenScPCA-analysis/data/2024-08-22/SCPCP000003/"
doublet_loc <- "~/OpenScPCA-analysis/data/2024-08-22/results/doublet-detection/SCPCP000003/"
```

Suggested:

```r
project_root <- rprojroot::find_root(rprojroot::is_git_root)
out_loc <- file.path(project_root, "analyses/cell-type-nonETP-ALL-03")
data_loc <- file.path(project_root, "data/current/SCPCP000003")
doublet_loc <- file.path(project_root, "data/current/results/doublet-detection/SCPCP000003")
```
```r
sce <- readRDS(file.path(data_loc,sampleID[i],paste0(libraryID[i],"_processed.rds")))
seu <- CreateSeuratObject(counts = counts(sce), meta.data = as.data.frame(colData(sce)))
seu@assays[["RNA"]]@data <- logcounts(sce)
seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-")
```
I am not sure that this will work given the rownames in the project are Ensembl ids?
Thank you for pointing it out! I have changed it to the following code:

```r
sce <- readRDS(file.path(data_loc,sampleID[i],paste0(libraryID[i],"_processed.rds")))
geneConversion <- rowData(sce)[,c(1:2)]
mito.features <- geneConversion$gene_ids[which(grepl("^MT-",geneConversion$gene_symbol))]
seu[["percent.mt"]] <- PercentageFeatureSet(seu, features = mito.features)
```
Just to protect against potential reordering and variation in the row names, I would slightly modify this:

```r
mito.features <- rownames(sce)[which(grepl("^MT-", rowData(sce)$gene_symbol))]
```
```r
DimPlot(final.obj, reduction = "umap", group.by = "leiden_clusters", label = T) +
  ggtitle(paste0(sampleID[i],": leiden_clusters"))
ggsave(file.path(out_loc,"plots/00-01_processing_rds/",paste0(sampleID[i],".pdf")))
```
Current:

```r
ggsave(file.path(out_loc,"plots/00-01_processing_rds/",paste0(sampleID[i],".pdf")))
```

Suggested:

```r
ggsave(file.path(out_loc,"plots/00-01_processing_rds",paste0(sampleID[i],".pdf")))
```
```r
SaveH5Seurat(seu, filename = paste0(sampleID[i],".h5Seurat"), overwrite = T)
Convert(paste0(sampleID[i],".h5Seurat"), dest = "h5ad", overwrite = T)
```
Just for clarity and robustness, can you include the package identifier when calling functions that may have somewhat ambiguous names? e.g.:
Current:

```r
SaveH5Seurat(seu, filename = paste0(sampleID[i],".h5Seurat"), overwrite = T)
Convert(paste0(sampleID[i],".h5Seurat"), dest = "h5ad", overwrite = T)
```

Suggested:

```r
SeuratDisk::SaveH5Seurat(seu, filename = paste0(sampleID[i],".h5Seurat"), overwrite = T)
SeuratDisk::Convert(paste0(sampleID[i],".h5Seurat"), dest = "h5ad", overwrite = T)
```
```
#run in terminal
conda activate openscpca #activating conda environment
sudo apt install libglpk40
sudo apt install libcurl4-openssl-dev #before installing Seurat
sudo apt-get install libhdf5-dev #before installing SeuratDisk
conda install conda-forge::anndata
pip install sam-algorithm
sudo apt-get install libxml2-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libtiff5-dev #before installing devtools
```
```r
#run in R
install.packages(c("Seurat","HGNChelper","openxlsx","devtools","renv","remotes"))
BiocManager::install("SingleCellExperiment")
install.packages("hdf5r", configure.args="--with-hdf5=/usr/bin/h5cc") #before installing SeuratDisk
remotes::install_github("mojaveazure/seurat-disk")
devtools::install_github("navinlabcode/copykat")
```
If renv and conda-lock files are set up, these instructions should be able to be simplified to use those tools:
```
# system packages installation
sudo apt-get install libglpk40
...
conda-lock install --name openscpca-cell-type-nonETP-ALL-03 conda-lock.yml
Rscript -e "renv::restore()"
```
```r
os$remove(paste0("sam.",sampleID[i],".h5ad"))
os$remove(paste0("sam.",sampleID[i],".h5seurat"))
os$remove(paste0(sampleID[i],".h5ad"))
os$remove(paste0(sampleID[i],".h5Seurat"))
```
You can save yourself a Python import by doing this directly from R:
Current:

```r
os$remove(paste0("sam.",sampleID[i],".h5ad"))
os$remove(paste0("sam.",sampleID[i],".h5seurat"))
os$remove(paste0(sampleID[i],".h5ad"))
os$remove(paste0(sampleID[i],".h5Seurat"))
```

Suggested:

```r
unlink(paste0("sam.",sampleID[i],".h5ad"))
unlink(paste0("sam.",sampleID[i],".h5seurat"))
unlink(paste0(sampleID[i],".h5ad"))
unlink(paste0(sampleID[i],".h5Seurat"))
```
Hi, I am sorry that I accidentally closed the pull request. Regarding the question about the SAM output: in general, SAM outputs the gene weights, nearest neighbor matrix, and a 2D projection. I need the 2D projection for the better separation of cell clusters. I think it will be better to just keep all the output provided by SAM and remove what I don't need later. As of now, I save the SAM output in anndata format, and then read it back in and save it as the final Seurat rds file.
No worries about the accidental closing!
If you need more storage on an AWS Lightsail system, you will probably want to follow the directions here to attach additional storage: https://openscpca.readthedocs.io/en/latest/aws/lsfr/working-with-volumes/ Another approach that I might recommend is to refactor your scripts a bit to work on a single sample at a time: this will allow you to only keep on disk the files for a single sample, and will also help us out in the future when we adapt the module for our workflow system.
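The per-sample refactor could be sketched roughly as below; the command-line argument handling is a hypothetical pattern rather than an established project convention, and `data_loc` is assumed to be defined as elsewhere in the script:

```r
#!/usr/bin/env Rscript
# Hypothetical single-sample entry point: process one sample, then exit,
# so only one sample's files need to be on disk at a time
args <- commandArgs(trailingOnly = TRUE)
sample_id <- args[1]
library_id <- args[2]
sce <- readRDS(file.path(data_loc, sample_id, paste0(library_id, "_processed.rds")))
# ... run SAM and write results for this sample only ...
```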
I guess I want to confirm that the SAM output still includes the original data and possibly previous normalizations and projections. For context, I am usually quite skeptical of 2D projections as a metric of cell separation; I would personally want to confirm the cluster separation based on a more direct measure of distance. But I know this is a thorny problem! I want to hold off for now on this kind of evaluation, but I want us to be able to look at it later, which is the main reason that I am asking about what the output is. Either way, this should not be too much of an issue, as we can always go back to the original input data as needed.
Actually, the SAM output doesn't retain the original data. I only provide the raw counts, geneID, and cellID. Their log-normalization is different from the previous normalization provided, and I save the SAM log-normalization values and their projections in the rds file.
```yaml
- anndata=0.10.9
- leidenalg=0.10.2
- pip
- pip:
    - sam-algorithm==1.0.2
```
When I tried to install the environment using `conda-lock install`, I ran into an error during the `pip` installation part of the process. This has happened to me in the past, and the usual solution is to try to move a few more dependencies into the conda-managed section of the environment. I also generally recommend leaving the third (patch) version number off in the `environment.yml` file; the full versions will be captured by `conda-lock` either way, but this allows a bit more flexibility in the initial solving of the environment.

The suggestion below moves the dependencies that I could out of `pip` (this also implicitly includes the `scikit-learn` and `numba` packages, which seem most likely to cause installation errors).
Current:

```yaml
- anndata=0.10.9
- leidenalg=0.10.2
- pip
- pip:
    - sam-algorithm==1.0.2
```

Suggested:

```yaml
- anndata=0.10
- leidenalg=0.10
- dill
- harmonypy
- umap-learn
- pip
- pip:
    - sam-algorithm==1.0.2
```
After updating this file, I was able to run `conda-lock --update ''` (the quoted empty string forces re-resolving the entire environment), which resulted in a `conda-lock.yml` that installed without trouble, at least on my end.
I made the suggested changes and ran `conda-lock --update ''`. Then I created a new conda environment with `conda-lock install --name tmp conda-lock.yml`, but it fails to import samalg. Did you manage to import samalg in Python?
Thanks for giving it a shot! I was able to import samalg, but when I tried to recreate my steps I ran into trouble again with the `conda-lock --update ''` command. I was able to get past that by removing (renaming) `conda-lock.yml` and then running the base `conda-lock` command again. This created a clean build which I was able to both install and import from Python, both natively and using `reticulate` in R.
Update with more testing... I did sometimes run into a case where `conda-lock` would report a successful installation but I would not be able to import `samalg`. I am not sure what the final cause is, but it may be related to `pip` caching, so I might try running `python -m pip cache purge` and seeing if that can help fix the installation. It may also help to do a manual install with `pip install sam-algorithm` from within the environment: you should first check that the correct `pip` is active with `which pip`; at least once I have had the environment fail to install its own `pip` for some reason.
It still doesn't work for me. But if you don't mind, could we just do the following for system installations, so it will always work?

```
conda-lock install --name openscpca-cell-type-nonETP-ALL-03 conda-lock.yml
conda activate openscpca-cell-type-nonETP-ALL-03
pip install sam-algorithm
```
Yes, I think that should be fine. I will do some further testing to see if I can figure out why it is not being reliable, but we do not need to have that hold things up for now.
I did a bit more testing and pushed 353bcb7, which removed some of the `pip`-installed packages from the conda-lock file that may have been causing trouble. (I didn't do this manually, but by deleting the `conda-lock.yml` file and regenerating it; the diff is quite clean.)

If you have a chance to test `conda-lock install` with this file, it would be great to know if that solved the problem.
Yes, I managed to install the conda environment perfectly, and I can import samalg. Thank you so much for your help!

Just to clarify, which of the `pip`-installed packages did you remove from the conda-lock file? I am wondering how I could generate a conda-lock file that works if I happen to update the `environment.yml` in the future and thus need to change the `conda-lock.yml`.
You can see the lines that were removed if you click on the link to the commit: 353bcb7. But I think the core thing I learned is that updating an existing `conda-lock.yml` file seems to sometimes leave some extra values, especially when there are `pip` installations involved (at least with the current version of `conda-lock`). It seems to work better, or at least more consistently, to remove the existing `conda-lock.yml` file and then generate the new one fresh.
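The regenerate-from-scratch approach described above amounts to something like the following commands (the backup filename is just illustrative):

```
# Regenerate the lock file fresh rather than updating in place
mv conda-lock.yml conda-lock.yml.bak
conda-lock --file environment.yml
```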
I did not fully finish my review here, but I wanted to pass along the suggestions I have made so far, and to make sure to get back to you about the conda issues we have been seeing. There remains a bit of a mystery about why `samalg` is not always installing correctly in the environment!
```r
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)), r_to_py(as.array(rownames(seu))),
                 r_to_py(as.array(colnames(seu)))))
```
Just reformatting a bit here, but I wanted to say that I do very much like this solution to avoid writing any files!
Current:

```r
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)), r_to_py(as.array(rownames(seu))),
                 r_to_py(as.array(colnames(seu)))))
```

Suggested:

```r
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)),
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
```
```r
use_condaenv("openscpca-cell-type-nonETP-ALL-03")
os <- import("os")
anndata <- import("anndata")
samalg <- import("samalg")
```
It might be worth adding a comment here as well to allow easy reference:
Current:

```r
samalg <- import("samalg")
```

Suggested:

```r
samalg <- import("samalg") # https://github.com/atarashansky/self-assembling-manifold/tree/master
```
```r
project_root <- rprojroot::find_root(rprojroot::is_git_root)
out_loc <- file.path(project_root, "analyses/cell-type-nonETP-ALL-03")
data_loc <- file.path(project_root, "data/current/SCPCP000003")
```
I am suggesting making a variable for the project ID, as you can use it here and in the next section (filtering the metadata) to ensure parity.
Current:

```r
data_loc <- file.path(project_root, "data/current/SCPCP000003")
```

Suggested:

```r
projectID <- "SCPCP000003"
data_loc <- file.path(project_root, "data/current", projectID)
```
```r
metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = T)
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```
For future-proofing, we probably want to filter the metadata here as well (we could get other NonETP-ALL samples, which would require other changes to accommodate, but we can at least prevent breaking this script!)
Current:

```r
metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = T)
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```

Suggested:

```r
metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = TRUE)
metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```
```r
options(Seurat.object.assay.version = 'v3')
```
My guess is that this was here for compatibility with `SeuratDisk`. Is it still needed, or can we move to Seurat v5 objects for future compatibility?
```r
for (i in 1:length(sampleID)){
  sce <- readRDS(file.path(data_loc,sampleID[i],paste0(libraryID[i],"_processed.rds")))
```
To make this a bit more portable and modular, I think it might be worth defining the steps in this loop as a function or a set of functions. Something like:

```r
run_sam <- function(sample, library){
  sce <- readRDS(file.path(data_loc, sample, paste0(library, "_processed.rds")))
  ...
}
```

which you could then call using the `purrr` package, replacing the explicit loop as follows:

```r
purrr::walk2(sampleID, libraryID, run_sam)
```
Note that this will go through the `sampleID` and `libraryID` vectors in parallel, just as your previous version used `i` to track the position.
Sure, I have changed them to a `run_sam` function, and I call the function via `purrr::walk2(sampleID, libraryID, run_sam)`.
I think with all of the updates you have made, this is looking almost ready. I have just a few cleanup comments at this point, and one somewhat larger question about the reference file, but I think the reference question can wait for a later PR, when it is actually being used, though that is really up to you.
And thank you for testing the conda-lock environment! I am glad that it did work out, and hopefully will not be too much trouble to maintain!
```r
os <- import("os")
anndata <- import("anndata")
```
I don't think either of these is used any more? Unless `anndata` is required for `samalg`? The suggestion is to remove these lines:

```r
os <- import("os")
anndata <- import("anndata")
```
```r
library(dplyr)
library(cowplot)
library(ggplot2)
library(future)
```
It looks like some number of these are unused? It is a bit difficult to say, because there may be non-distinctive functions being used, but if a function is only used once it may be preferable to use `::` syntax for clarity. (`ggplot2` and `dplyr` are possible exceptions, given their ubiquity.)
I removed these libraries, but I am keeping `library(ggplot2)` for creating and saving the plots.
```r
library(Seurat)
```
Please include `#!` lines for all scripts where possible. An opening comment that describes what this script does would also be helpful here. You probably don't want to use what I wrote below exactly, but something like that, including a comment about the output files, would be helpful for later users.
Current:

```r
library(Seurat)
```

Suggested:

```r
#!/usr/bin/env Rscript
# This script uses SAM to perform clustering for the nonETP-ALL samples in SCPCP000003 and outputs...
library(Seurat)
```
We first aim to annotate the cell types in non-ETP T-ALL, and use the annotated B cells in the sample as the "normal" cells to identify tumor cells, since T-ALL is caused by the clonal proliferation of immature T-cells [<https://www.nature.com/articles/s41375-018-0127-8>].

- We use the cell type marker (`Azimuth_BM_level1.xlsx`) from [Azimuth Human Bone Marrow reference](https://azimuth.hubmapconsortium.org/references/#Human%20-%20Bone%20Marrow). In total, there are 14 cell types: B, CD4T, CD8T, Other T, DC, Monocytes, Macrophages, NK, Early Erythrocytes, Late Erythrocytes, Plasma, Platelet, Stromal, and Hematopoietic Stem and Progenitor Cells (HSPC). Based on the exploratory analysis, we believe that most of the cells in these samples do not express adequate markers to be distinguished at a finer cell type level (e.g. naive vs memory, CD14 vs CD16, etc.), and the majority of the cells should belong to T-cells. In addition, we include the marker genes for cancer cells in the immune system from the [ScType](https://sctype.app/database.php) database.
I think this can probably wait for the next PR (or for when you are actually using the file), but I am wondering how this file was compiled. Was it downloaded directly or created from the website table? If the former, we probably want to include the direct URL to the file, if we can.

If the latter, it would probably be better to have this as a text table for easier tracing through changes. I would also recommend including the Cell Ontology IDs for the cell types if possible, which we can use to track across different naming conventions.
It was created from the website table (a combination of `celltype.l2` and `celltype.l1`), and I converted the markers from gene symbols to Ensembl IDs. Do you mean to provide the URL (https://azimuth.hubmapconsortium.org/references/#Human%20-%20Bone%20Marrow) as part of the header of the file?
I am saving it as an Excel file because that is the input format required by ScType, unless I go and change their code. If I understand the Cell Ontology IDs correctly, they provide an ID for a specific cell type (e.g., CL:0000904 for central memory CD4+ T cell). This is the reason I am using `celltype.l1`: I need the cell type annotation to be very general (e.g., just CD4+ T, instead of specifying naive or memory).
I did try finer cell type labels, but I found that the cells (e.g., different types of CD4 and CD8, and even progenitors) all cluster together (fail to be separated), so they cannot be annotated at the cluster level. I also tried annotation at the cell level using anchor transfer, etc. The prediction results were really poor. Thus, I suspect that these cells don't express enough markers to be separated at the finer level.
Thank you for the clarification. I don't think you need to provide the URL in the header, as you already have it in the readme, but I am surprised that ScType cannot take a `.csv` file! (I just looked at the code and I do have an idea for how to get around this: I think you should be able to run something like `gene_sets_prepare(openxlsx::buildWorkbook(marker_df), "Immune system")` to work from any data frame, but this is not something we really need to address now.)
As for Cell Ontology IDs, they are hierarchical, so you should be able to use a higher-level term if needed. But I think you can use the values given as links in the Azimuth table to start: for example, for B cells they link to CL:0000945, and for the CD4 T cells they link to CL:0000624, which is a more general ID; the central memory CD4 cells, CL:0000904, can be found as a child term of the more general term.
This is the primary benefit of using the ontology, in fact! We can use these terms to harmonize across annotations even if different samples are annotated to different depths by various groups.
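As a toy sketch of how hierarchical CL terms could be rolled up for harmonization (the parent links below follow the Azimuth examples just mentioned; the `generalize` helper and the hard-coded mapping are illustrative, not part of any package):

```python
# Minimal "is-a" links, following the examples above (illustrative only)
parents = {
    "CL:0000904": "CL:0000624",  # central memory CD4+ T cell -> CD4+ T cell
    "CL:0000624": "CL:0000084",  # CD4+ T cell -> T cell
}

def generalize(term, target_terms):
    """Walk up the hierarchy until we reach one of the target-level terms."""
    while term not in target_terms:
        term = parents[term]  # raises KeyError if no ancestor matches
    return term

print(generalize("CL:0000904", {"CL:0000624"}))  # CL:0000624
```

With a real ontology, the same roll-up can be done against the full CL hierarchy, which is how annotations made at different depths could be compared at a common level.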
I made the changes as suggested for using a `.csv` file (the code will show up in `scripts/02-03_annotation.R` in the next pull request).
As of now, I put the Cell Ontology IDs under the `shortName` column in `Azimuth_BM_level1.csv`. I don't think there is an ID for "Other T", since it is such a broad category, and I could not find IDs for Early and Late Eryth, so I have left those as they are for now.
Actually, I have another concern regarding SAM and ScType. Each time I run SAM, the clustering output changes, although the seed is set to 0 (by default) in `sam$run()`. The log is shown in the image below.
Since ScType annotates by cluster, the cell annotation may change if cells end up in another cluster with a different annotation. This is actually quite likely to happen, especially for those cells that don't have clear markers expressed. E.g., depending on the clustering results, they may be annotated as `Unknown` if the score of a cluster is less than a quarter of the number of cells in that cluster.
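The thresholding rule described above can be sketched as follows (a paraphrase of ScType's behavior as stated here, not its actual code; the function name and signature are made up for illustration):

```python
def cluster_label(best_type, best_score, n_cells):
    # ScType-style rule: a cluster whose top marker score is below a
    # quarter of its cell count is left unannotated ("Unknown")
    if best_score < n_cells / 4:
        return "Unknown"
    return best_type

print(cluster_label("CD4T", 20, 100))  # Unknown (20 < 25)
print(cluster_label("CD4T", 30, 100))  # CD4T
```

This is why unstable clustering matters: a borderline group of cells can flip between a named label and `Unknown` depending on which cluster it lands in.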
```r
metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```
A little simplification:

```diff
- metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
- sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
- libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
+ metadata <- metadata[which(metadata$scpca_project_id == projectID & metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia"), ]
+ sampleID <- metadata$scpca_sample_id
+ libraryID <- metadata$scpca_library_id
```
In my own work, this is actually a place where I would use `dplyr` to make things a bit simpler (at least in my mind), as shown below, but either method is fine!

```diff
- metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
- sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
- libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
+ metadata <- metadata |>
+   dplyr::filter(
+     scpca_project_id == projectID,
+     diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia"
+   )
+ sampleID <- metadata$scpca_sample_id
+ libraryID <- metadata$scpca_library_id
```
This seems like it may be a pretty significant concern! I was able to find this issue in the SAM GitHub repository, which suggests that the problem lies in the distance calculations, which cannot be seeded with the default setting. Changing to `distance='correlation'` should make the results reproducible, though I think this limitation may actually be quite out of date.
One other thought I had was whether you need the UMAP visualizations and clusters that are calculated by SAM? Can you recalculate those with Seurat? The reason I ask is that if the cell assignments in ScType are sensitive to the clustering, you may well want to explore how different clustering parameters affect the quality of the assignments. If you want to go down that route, I would probably start by breaking up the SAM run and the Seurat object creation.
I did some testing to confirm that using `distance="correlation"` does result in reproducible results. I also suggested a few changes to create only one Seurat object at the end, and to make sure it has all of the metadata from the original object, in case that is useful.
I do still think it may be worth breaking up the SAM run and the Seurat object creation, but with the changes here I think it may be in a good enough state to move on. Clustering exploration can still happen if SAM does a first pass!
```r
seu <- CreateSeuratObject(counts = counts(sce), data = logcounts(sce), meta.data = as.data.frame(colData(sce)))
# create a new assay to store ADT information
seu[["ADT"]] <- CreateAssayObject(counts = counts(altExp(sce)))
seu[["ADT"]]@data <- logcounts(altExp(sce))
mito.features <- rownames(sce)[which(grepl("^MT-", rowData(sce)$gene_symbol))]
seu[["percent.mt"]] <- PercentageFeatureSet(seu, features = mito.features)

# detecting doublets
doublets.data <- read.table(file = file.path(doublet_loc, sample, paste0(library, "_processed_scdblfinder.tsv")), sep = "\t", header = T)
idx <- match(colnames(seu), doublets.data$barcodes)
seu$doublet_class <- doublets.data$class[idx]
seu <- subset(seu, subset = percent.mt < 25)

# step 01 feature selection/dimensionality using SAM
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@layers[["counts"]])),
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
```
I realized that for this section there is no need to do the Seurat conversion, so this should be a bit more efficient:
```diff
- seu <- CreateSeuratObject(counts = counts(sce), data = logcounts(sce), meta.data = as.data.frame(colData(sce)))
- # create a new assay to store ADT information
- seu[["ADT"]] <- CreateAssayObject(counts = counts(altExp(sce)))
- seu[["ADT"]]@data <- logcounts(altExp(sce))
- mito.features <- rownames(sce)[which(grepl("^MT-", rowData(sce)$gene_symbol))]
- seu[["percent.mt"]] <- PercentageFeatureSet(seu, features = mito.features)
- #detecting doublets
- doublets.data <- read.table(file = file.path(doublet_loc,sample,paste0(library,"_processed_scdblfinder.tsv")),sep = "\t", header = T)
- idx <- match(colnames(seu), doublets.data$barcodes)
- seu$doublet_class <- doublets.data$class[idx]
- seu <- subset(seu, subset = percent.mt < 25)
- #step 01 feature selection/dimensionality using SAM
- sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@layers[["counts"]])),
-                             r_to_py(as.array(rownames(seu))),
-                             r_to_py(as.array(colnames(seu)))))
+ doublets.data <- read.table(
+   file = file.path(doublet_loc, sample, paste0(library, "_processed_scdblfinder.tsv")),
+   sep = "\t", header = T
+ )
+ idx <- match(colnames(sce), doublets.data$barcodes)
+ sce$doublet_class <- doublets.data$class[idx]
+ sce <- sce[, which(sce$subsets_mito_percent < 25)]
+ #step 01 feature selection/dimensionality using SAM
+ sam = samalg$SAM(counts = c(r_to_py(t(counts(sce))),
+                             r_to_py(as.array(rownames(sce))),
+                             r_to_py(as.array(colnames(sce)))))
```
```r
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
sam$preprocess_data()
sam$run()
```
```diff
- sam$run()
+ # using correlation for the distance measure allows for reproducible results
+ sam$run(distance='correlation')
```
```r
final.obj <- AddMetaData(final.obj, seu@meta.data)
final.obj@assays[["RNA"]]@counts <- seu@assays[["RNA"]]@layers[["counts"]]
# create a new assay to store ADT information
final.obj[["ADT"]] <- CreateAssayObject(counts = seu@assays[["ADT"]]@counts)
final.obj@assays[["ADT"]]@data <- seu@assays[["ADT"]]@data
```
If we do not create the Seurat object, we do need to modify this a bit to get the data from the SCE object:
```diff
- final.obj <- AddMetaData(final.obj, seu@meta.data)
- final.obj@assays[["RNA"]]@counts <- seu@assays[["RNA"]]@layers[["counts"]]
- # create a new assay to store ADT information
- final.obj[["ADT"]] <- CreateAssayObject(counts = seu@assays[["ADT"]]@counts)
- final.obj@assays[["ADT"]]@data <- seu@assays[["ADT"]]@data
+ final.obj <- AddMetaData(final.obj, as.data.frame(colData(sce)))
+ final.obj[["RNA"]]@counts <- counts(sce)
+ final.obj[["RNA"]] <- AddMetaData(final.obj[["RNA"]], as.data.frame(rowData(sce)))
+ final.obj[["percent.mt"]] <- final.obj[["subsets_mito_percent"]] # add Seurat expected column name
+ # create a new assay to store ADT information
+ final.obj[["ADT"]] <- CreateAssayObject(counts = counts(altExp(sce, "adt")))
+ final.obj[["ADT"]]@data <- logcounts(altExp(sce, "adt"))
+ final.obj[["ADT"]] <- AddMetaData(final.obj[["ADT"]], as.data.frame(rowData(altExp(sce, "adt"))))
```
Thank you for simplifying it! Since `percent.mt` and `subsets_mito_percent` store the same information, I only keep `subsets_mito_percent`. Just to check with you, are the `nCount_RNA` and `nFeature_RNA` metadata stored in the `sce` object? These two metadata columns give different values depending on whether the Seurat object is created via `CreateSeuratObject` (named `seu`) or via `schard::h5ad2seurat` (named `final.obj`).
It seems like `CreateSeuratObject` is giving the right values, since the `min_gene_cutoff > 200`, so the minimum value of `nFeature_RNA` should be larger than 200, right?
> Just to check with you, are `nCount_RNA` and `nFeature_RNA` metadata stored in the `sce` object?

The equivalent values should be stored as `sce$sum` and `sce$detected`, respectively.
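For intuition, both QC values can be computed directly from a genes-by-cells counts matrix. A minimal NumPy sketch (illustrative only; the real per-cell QC columns in the SCE object are computed by the processing pipeline, not by this code):

```python
import numpy as np

counts = np.array([[5, 0, 1],
                   [0, 3, 1],
                   [2, 2, 2]])  # rows = genes, columns = cells

# per-cell total counts: analogous to sce$sum / Seurat's nCount_RNA
total_counts = counts.sum(axis=0)
# per-cell number of genes with nonzero counts: analogous to sce$detected / nFeature_RNA
genes_detected = (counts > 0).sum(axis=0)

print(total_counts)    # [7 5 4]
print(genes_detected)  # [2 2 3]
```

If a tool recomputes these after filtering genes (or zeroes out masked genes), the values will differ from the originals, which may explain discrepancies like the one discussed here.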
> It seems like `CreateSeuratObject` is giving the right value, since the `min_gene_cutoff > 200`, so the minimum value of `nFeature_RNA` should be larger than 200, right?

That sounds correct to me. I am not sure how SAM is populating that field, as it is certainly doing some preprocessing. Perhaps it is only including the features/genes that it has not filtered (as denoted by `mask_genes`); I think the filtered genes may have their values set to zero for all samples.
Thank you for the clarification! I have now set them as:

```r
final.obj$nCount_RNA <- final.obj$sum
final.obj$nFeature_RNA <- final.obj$detected
```
```r
DimPlot(final.obj, reduction = "Xumap_", group.by = "leiden_clusters", label = T) +
  ggtitle(paste0(sample, ": leiden_clusters"))
ggsave(file.path(out_loc, "plots/00-01_processing_rds", paste0(sample, ".pdf")))
```
One other minor comment is that we should probably set a size here to ensure consistency:

```diff
- ggsave(file.path(out_loc,"plots/00-01_processing_rds",paste0(sample,".pdf")))
+ ggsave(file.path(out_loc,"plots/00-01_processing_rds",paste0(sample,".pdf")),
+        width = 7, height = 7)
```
```diff
@@ -0,0 +1,16 @@
+ tissueType,cellName,geneSymbolmore1,geneSymbolmore2,shortName
```
Thank you for updating this file to one that we can more easily track changes to!
I wonder if it makes more sense to have a separate column for the cell ontology? I am not sure if that would upset ScType, but I suspect it will not matter, or we can remove it before passing it along.
As to some of the ambiguous cell ontology values, I think it probably makes the most sense to just use whatever was provided by Azimuth as the best equivalent. In the case of early and late erythrocytes, this will just mean that they have the same CL value, `CL_0000764`, and for "Other T" that would just be the generic T cell, `CL_0000084`.
I think the separate column is probably the way to go here, so you can have both the full names and short names be human readable.

```diff
- tissueType,cellName,geneSymbolmore1,geneSymbolmore2,shortName
+ tissueType,cellName,geneSymbolmore1,geneSymbolmore2,shortName,ontologyID
```
@UTSouthwesternDSSR I wasn't sure if you have completed the changes that you intended to make here, but things look good for now. I think my final suggestion is the one made in a comment here, which is just to make the Cell Ontology ID a separate column in the `.csv` file. I might also make the `cellName` column unabbreviated, leaving the abbreviated labels for `shortName`, though I admit that, as I am unfamiliar with the output of ScType, that might not be a great idea...
In the future, when you are ready for a review, please don't hesitate to let me know, either with a comment or just by clicking the re-request review button in the upper right of the PR page.
Sure, I will make changes to the `.csv` file.
I don't think it really matters, as long as the names are clear. At some point we will almost certainly have to do some harmonization across different references, but that will be well down the line (and not part of this module).
Yes, I have completed all the changes that I have intended for this part (including editing the `.csv` file). I will have to re-request review for it to be merged into the main branch, right?
Actually, I have started working on the second part too. I have some technical questions: it seems like RStudio will crash when the memory reaches about 23GB, and the whole virtual machine freezes when I try to run `CopyKat`. I plan to download the rds files from the S3 bucket to my local server for analysis with `./sync-results.py cell-type-nonETP-ALL-03 --bucket researcher-650251722463-us-east-2 --download --profile my-dev-profile`, but it doesn't work; it is looking for a git directory. I am wondering if I should clone the repository folder, or whether I will have to wait until my pull request gets merged into the main branch?
> Yes, I have completed all the changes that I have intended for this part (including editing the `.csv` file). I will have to re-request review for it to be merged into the main branch, right?

You do not necessarily have to re-request review; that is just a good signal to me that you have completed the changes you intended to make, so I know when a good time to look at it is.

> Actually I have started working on the second part too. I have some technical questions, it seems like the Rstudio will crash when the memory reaches about 23GB, and the whole virtual machine freezes when I try to run `CopyKat`. I plan to download the rds file from S3 bucket to my local server for analysis with `./sync-results.py cell-type-nonETP-ALL-03 --bucket researcher-650251722463-us-east-2 --download --profile my-dev-profile`, but it doesn't work. It is looking for git directory. I am wondering if I should clone the repository folder, or I will have to wait until my pull request get merged into the main branch?

The `sync-results.py` script does expect that you have cloned the repository, but you should not have to wait for the pull request to be merged; you can simply check out whatever branch has the next set of changes in it (or make a new branch).
Working on your local machine is fine and may be more convenient, but I wonder if there may be other ways to reduce the memory usage. I obviously can't make any recommendations without seeing the code, but I don't seem to recall CopyKat itself being particularly memory-intensive in our previous use.
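As a sketch, the clone-then-sync workflow described here might look like the following (a workflow fragment: the repository URL and branch name are assumptions for illustration; the `sync-results.py` invocation is taken from the comment above):

```shell
# clone the analysis repository and check out the branch with your changes
git clone https://github.com/AlexsLemonade/OpenScPCA-analysis.git
cd OpenScPCA-analysis
git checkout my-feature-branch  # no need to wait for the PR to be merged

# download the module's results from the S3 bucket
./sync-results.py cell-type-nonETP-ALL-03 \
  --bucket researcher-650251722463-us-east-2 \
  --download --profile my-dev-profile
```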
```diff
@@ -0,0 +1,16 @@
+ tissueType,cellName,geneSymbolmore1,geneSymbolmore2,fullName,ontologyID
```
Thank you for adding the extra columns. Having the full name seems like it could be useful for reference, so I appreciate having those.
Purpose/implementation Section

In this PR section, we will initialize an analysis module skeleton for the non-ETP T-ALL dataset in SCPCP000003 and provide scripts for processing the provided `sce` objects for review.

Please link to the GitHub issue that this pull request addresses.
What is the goal of this pull request?

This PR files scripts for processing the provided `sce` objects using the self-assembling manifold (SAM) algorithm as a soft feature selection strategy for better separation of more homogeneous populations. The saved `rds` objects will be used in the following analysis in my next PR.

Briefly describe the general approach you took to achieve this goal.
The raw counts are provided to the SAM algorithm with default parameters, which preprocesses the data by log-normalizing the counts to the median total read count per cell and retaining the genes expressed in fewer than 96% of total cells. It then performs feature selection and dimensionality reduction, using 3000 genes selected by their SAM weights prior to PCA, with 150 principal components and 20 nearest neighbors to generate the UMAP embedding. Leiden clustering is performed to provide cluster IDs. This workflow is applied to all 11 samples in the dataset.
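For intuition, the two preprocessing steps described above can be sketched in NumPy (this mimics only the two stated steps on a toy matrix and is not the SAM implementation; the function name is made up):

```python
import numpy as np

def sam_like_preprocess(counts):
    """counts: genes x cells raw count matrix."""
    # 1) retain genes expressed in fewer than 96% of cells
    kept = counts[(counts > 0).mean(axis=1) < 0.96, :]
    # 2) normalize each cell to the median total count, then log-transform
    totals = kept.sum(axis=0)
    return np.log1p(kept / totals * np.median(totals))

out = sam_like_preprocess(np.array([[5, 0, 1],
                                    [0, 3, 1],
                                    [2, 2, 2]]))
print(out.shape)  # (2, 3) -- the ubiquitous third gene is dropped
```

After this normalization, every cell's (pre-log) total equals the median library size, which is the equalizing effect the preprocessing is after.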
If known, do you anticipate filing additional pull requests to complete this analysis module?

Cell type annotation and identification of tumor cells will be filed soon in other PRs.

Results
Results
The
scripts/00-01_processing_rds.R
will generaterds
files inscratch/
and umap plots showing leiden clustering results inplot/00-01_processing_rds/
.What is the name of your results bucket on S3?
`s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots/00-01_processing_rds`

What types of results does your code produce (e.g., table, figure)?

Intermediate `rds` files are saved in `scratch/` for further analysis, and the UMAP plots showing Leiden clustering are saved in `plot/00-01_processing_rds/`.

What is your summary of the results?

The intermediate `rds` files are ready for cell type annotation and subsequently for tumor cell identification.

Provide directions for reviewers
In this section, tell reviewers what kind of feedback you are looking for. This information will help guide their review.

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particular areas you'd like reviewers to have a close look at?

For the first PR, we would like to check whether our setup of the module skeleton, computing environment, and documentation has met the need.

Is there anything that you want to discuss further?

I failed to push the `environment.yml` file to GitHub, since it gives a warning/error of 3 leaks found in the pre-commit stage, as shown in the image below. These are the lines in `environment.yml` causing the error. I generated the `environment.yml` via `conda env export -n openscpca > environment.yml`. I am wondering if you have any suggestions for solving this?

Thank you for reviewing, and much appreciation for any suggestions provided!
Author checklists

Check all those that apply. Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

- `README.md` has been updated to reflect code changes in this pull request.

Reproducibility checklist

- `Dockerfile`
- `environment.yml` file
- `renv.lock` file