Celltype annotation for non-ETP T-ALL (SCPCP000003) #767
Conversation
Hello @UTSouthwesternDSSR and thank you for your contribution to the OpenScPCA project!
Overall, this first contribution looks to be structured as we would like, so you have done a great job there, and thank you for providing an informative README! It was very helpful to see the goals of the analysis and the overall steps that you are taking.
Let me start by addressing the question about the "leaks" (thank you for setting up pre-commit!):

> I generated the `environment.yml` via `conda env export -n openscpca > environment.yml`. I am wondering if you have any suggestion in solving this?
The issue here is that the build hashes from conda packages can look like API keys, so pre-commit (specifically gitleaks) flags them. There are a few ways we could get gitleaks to accept these strings, but the inclusion of build IDs is actually something we would rather avoid, as they can often result in cross-system incompatibilities.
In this project, we are trying to take a different approach from simply using `conda env export`, as we have found that even if you include the `--no-builds` option (which would remove this particular issue) it can create incompatibilities across systems in the environment file that can be hard to track down.
Instead, we are recommending the use of `conda-lock`. Rather than repeating too much, I am going to direct you to the page of our docs that describes the approach: https://openscpca.readthedocs.io/en/latest/ensuring-repro/managing-software/using-conda/
In short, you will want to add the packages you need (optionally with version numbers) to the `environment.yml` file. The goal here is to have a high-level list of the packages needed, but without all of the dependencies that may be system-specific. Then you can run `conda-lock --file environment.yml`, which will create a `conda-lock.yml` file that includes all version information, including system-specific dependencies. This can then be installed on a new system with `conda-lock install --name openscpca-cell-type-nonETP-ALL-03 conda-lock.yml`.
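As a concrete illustration, a pared-down `environment.yml` for this module might look something like the sketch below. The exact package list is an assumption for illustration; only `anndata` and `sam-algorithm` are mentioned elsewhere in this thread:

```yaml
# Illustrative sketch only -- add or pin packages as the module actually requires
name: openscpca-cell-type-nonETP-ALL-03
channels:
  - conda-forge
dependencies:
  - python=3.11
  - anndata
  - pip
  - pip:
      - sam-algorithm
```

Running `conda-lock --file environment.yml` against a file like this produces the fully pinned `conda-lock.yml`.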
Other comments
- I noticed that a big part of the first section of your script is converting from SCE to Seurat and then from that to h5ad. I imagine that is a bit slow, as it involves a number of writes to disk. Would using the h5ad files that we provide make that any easier or more efficient? You can get those files with the following command using our download script:
```
./download-data.py --projects SCPCP000003 --format AnnData
```
- I am also somewhat concerned about the use of `SeuratDisk` in general, as it seems like the package has not been updated in quite a while. Would it be possible to avoid that package?
- It also seems like the SAM software can work directly from a matrix; would it be more efficient to write out a matrix directly with `Matrix::writeMM()` (and additional geneID and cellID files)?
- You may be able to handle reading the h5ad file as a Seurat object with https://github.com/cellgeni/schard; I am not sure whether it supports everything you need though.
- I'm a little unclear about what preprocessing and filtering SAM does; does that affect the data as read back in? Would it be possible to write out only the clustering results and read those in directly to add to the existing objects?
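To illustrate the matrix-export idea from the list above, here is a minimal R sketch. The file names are placeholders, and it assumes `sce` is the SingleCellExperiment already loaded in the script:

```r
# Write the sparse counts matrix plus gene/cell identifiers for SAM to read
Matrix::writeMM(counts(sce), "counts.mtx")
writeLines(rownames(sce), "gene_ids.txt")   # gene IDs, one per line
writeLines(colnames(sce), "cell_ids.txt")   # cell barcodes, one per line
```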
I also left some line-level comments, though I suspect if you do some restructuring those may become moot.
Thank you again for your contribution, and I look forward to working with you as we work to get this module added to the project!
```
conda env create -f environment.yml --name openscpca
conda activate openscpca
```
For consistency, and to prevent replacing conda environments from other modules, please name this with the name provided in the `environment.yml` file, which includes the module name: `openscpca-cell-type-nonETP-ALL-03`.

Also, since the Python code and conda environment are run through `reticulate`, does this need to be activated?
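For reference on the activation question: `reticulate` can bind to a conda environment directly from R, so a shell-level `conda activate` should not be necessary. A minimal sketch, assuming the module's environment name:

```r
library(reticulate)
# Selects the conda environment for this R session; no `conda activate` needed
use_condaenv("openscpca-cell-type-nonETP-ALL-03", required = TRUE)
samalg <- import("samalg")
```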
Current:

```
conda env create -f environment.yml --name openscpca
conda activate openscpca
```

Suggested:

```
conda env create -f environment.yml --name openscpca-cell-type-nonETP-ALL-03
```
Note this would also be replaced if you convert to using `conda-lock` as suggested above.
You are welcome to remove this file if you do not need it.
```
#sudo apt install libglpk40
#sudo apt install libcurl4-openssl-dev
```
Could you please add the requirements for system packages to the readme? These particular packages I am aware of, and they will hopefully soon be installed on future Lightsail instances, but for existing instances this will still be required.
Yes, they are now at lines 27 and 28 of the readme.
Thank you, I think I had noticed that they were also there previously and I forgot to delete this comment!
```r
use_condaenv("openscpca")
os <- import("os")
anndata <- import("anndata") #conda install conda-forge::anndata
samalg <- import("samalg") #pip install sam-algorithm
```
If this package is not available via conda, you will need to add lines like the following to the `environment.yml` file:
```yaml
dependencies:
  - python=3.11
  ...
  - pip
  - pip:
      - sam-algorithm
```
```r
out_loc <- "~/OpenScPCA-analysis/analyses/cell-type-nonETP-ALL-03/"
data_loc <- "~/OpenScPCA-analysis/data/2024-08-22/SCPCP000003/"
doublet_loc <- "~/OpenScPCA-analysis/data/2024-08-22/results/doublet-detection/SCPCP000003/"
```
To preserve portability, we recommend using relative paths wherever possible. In addition, all data paths should use the `current` alias rather than specific releases. In this case, you can also take advantage of the `rprojroot` package to find the base of the project:
Current:

```r
out_loc <- "~/OpenScPCA-analysis/analyses/cell-type-nonETP-ALL-03/"
data_loc <- "~/OpenScPCA-analysis/data/2024-08-22/SCPCP000003/"
doublet_loc <- "~/OpenScPCA-analysis/data/2024-08-22/results/doublet-detection/SCPCP000003/"
```

Suggested:

```r
project_root <- rprojroot::find_root(rprojroot::is_git_root)
out_loc <- file.path(project_root, "analyses/cell-type-nonETP-ALL-03")
data_loc <- file.path(project_root, "data/current/SCPCP000003")
doublet_loc <- file.path(project_root, "data/current/results/doublet-detection/SCPCP000003")
```
```r
sce <- readRDS(file.path(data_loc,sampleID[i],paste0(libraryID[i],"_processed.rds")))
seu <- CreateSeuratObject(counts = counts(sce), meta.data = as.data.frame(colData(sce)))
seu@assays[["RNA"]]@data <- logcounts(sce)
seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-")
```
I am not sure that this will work given the rownames in the project are Ensembl ids?
Thank you for pointing it out! I have changed it to the following code:

```r
sce <- readRDS(file.path(data_loc,sampleID[i],paste0(libraryID[i],"_processed.rds")))
geneConversion <- rowData(sce)[,c(1:2)]
mito.features <- geneConversion$gene_ids[which(grepl("^MT-",geneConversion$gene_symbol))]
seu[["percent.mt"]] <- PercentageFeatureSet(seu, features = mito.features)
```
Just to protect against potential reordering and variation in the row names, I would slightly modify this:

```r
mito.features <- rownames(sce)[which(grepl("^MT-", rowData(sce)$gene_symbol))]
```
```r
DimPlot(final.obj, reduction = "umap", group.by = "leiden_clusters", label = T) +
  ggtitle(paste0(sampleID[i],": leiden_clusters"))
ggsave(file.path(out_loc,"plots/00-01_processing_rds/",paste0(sampleID[i],".pdf")))
```
Current:

```r
ggsave(file.path(out_loc,"plots/00-01_processing_rds/",paste0(sampleID[i],".pdf")))
```

Suggested:

```r
ggsave(file.path(out_loc,"plots/00-01_processing_rds",paste0(sampleID[i],".pdf")))
```
```r
SaveH5Seurat(seu, filename = paste0(sampleID[i],".h5Seurat"), overwrite = T)
Convert(paste0(sampleID[i],".h5Seurat"), dest = "h5ad", overwrite = T)
```
Just for clarity and robustness, can you include the package identifier when calling functions that may have somewhat ambiguous names? e.g.:
Current:

```r
SaveH5Seurat(seu, filename = paste0(sampleID[i],".h5Seurat"), overwrite = T)
Convert(paste0(sampleID[i],".h5Seurat"), dest = "h5ad", overwrite = T)
```

Suggested:

```r
SeuratDisk::SaveH5Seurat(seu, filename = paste0(sampleID[i],".h5Seurat"), overwrite = T)
SeuratDisk::Convert(paste0(sampleID[i],".h5Seurat"), dest = "h5ad", overwrite = T)
```
```
#run in terminal
conda activate openscpca #activating conda environment
sudo apt install libglpk40
sudo apt install libcurl4-openssl-dev #before installing Seurat
sudo apt-get install libhdf5-dev #before installing SeuratDisk
conda install conda-forge::anndata
pip install sam-algorithm
sudo apt-get install libxml2-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libtiff5-dev #before installing devtools
```
```r
#run in R
install.packages(c("Seurat","HGNChelper","openxlsx","devtools","renv","remotes"))
BiocManager::install("SingleCellExperiment")
install.packages("hdf5r", configure.args="--with-hdf5=/usr/bin/h5cc") #before installing SeuratDisk
remotes::install_github("mojaveazure/seurat-disk")
devtools::install_github("navinlabcode/copykat")
```
If renv and conda-lock files are set up, these instructions should be able to be simplified to use those tools:
```
# system packages installation
sudo apt-get install libglpk40
...
conda-lock install --name openscpca-cell-type-nonETP-ALL-03 conda-lock.yml
Rscript -e "renv::restore()"
```
```r
os$remove(paste0("sam.",sampleID[i],".h5ad"))
os$remove(paste0("sam.",sampleID[i],".h5seurat"))
os$remove(paste0(sampleID[i],".h5ad"))
os$remove(paste0(sampleID[i],".h5Seurat"))
```
You can save yourself a Python import by doing this directly from R:
Current:

```r
os$remove(paste0("sam.",sampleID[i],".h5ad"))
os$remove(paste0("sam.",sampleID[i],".h5seurat"))
os$remove(paste0(sampleID[i],".h5ad"))
os$remove(paste0(sampleID[i],".h5Seurat"))
```

Suggested:

```r
unlink(paste0("sam.",sampleID[i],".h5ad"))
unlink(paste0("sam.",sampleID[i],".h5seurat"))
unlink(paste0(sampleID[i],".h5ad"))
unlink(paste0(sampleID[i],".h5Seurat"))
```
Hi, I am sorry that I accidentally closed the pull request. Regarding the question about the SAM output: in general, SAM outputs the gene weights, nearest neighbor matrix, and a 2D projection. I need the 2D projection for the better separation of cell clusters. I think it will be better to just keep all the output provided by SAM and remove what I don't need later. As of now, I save the SAM output in anndata format, and then read it back in and save it as the final Seurat rds file.
No worries about the accidental closing!
If you need more storage on an AWS Lightsail system, you will probably want to follow the directions here to attach additional storage: https://openscpca.readthedocs.io/en/latest/aws/lsfr/working-with-volumes/ Another approach that I might recommend is to refactor your scripts a bit to work on a single sample at a time: this will allow you to only keep on disk the files for a single sample, and will also help us out in the future when we adapt the module for our workflow system.
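The per-sample refactor could be sketched roughly as below; the command-line argument handling is a hypothetical pattern rather than an established project convention, and `data_loc` is assumed to be defined as elsewhere in the script:

```r
#!/usr/bin/env Rscript
# Hypothetical single-sample entry point: process one sample, then exit,
# so only one sample's files need to be on disk at a time
args <- commandArgs(trailingOnly = TRUE)
sample_id <- args[1]
library_id <- args[2]
sce <- readRDS(file.path(data_loc, sample_id, paste0(library_id, "_processed.rds")))
# ... run SAM and write results for this sample only ...
```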
I guess I want to confirm that the SAM output still includes the original data and possibly previous normalizations and projections. For context, I am usually quite skeptical of 2D projections as a metric of cell separation; I would personally want to confirm the cluster separation based on a more direct measure of distance. But I know this is a thorny problem! I want to hold off for now on this kind of evaluation, but I want us to be able to look at it later, which is the main reason that I am asking about what the output is. Either way, this should not be too much of an issue, as we can always go back to the original input data as needed.
Actually, the SAM output doesn't retain the original data. I only provide the raw counts, geneID, and cellID. Their log-normalization is different from the previous normalization provided, and I save the SAM log-normalization values and their projections in the rds file.
```yaml
- anndata=0.10.9
- leidenalg=0.10.2
- pip
- pip:
    - sam-algorithm==1.0.2
```
When I tried to install the environment using `conda-lock install`, I ran into an error during the `pip` installation part of the process. This has happened to me in the past, and the usual solution is to try to move a few more dependencies into the conda-managed section of the environment. I also generally recommend leaving the third (patch) version number off in the `environment.yml` file; the full versions will be captured by `conda-lock` either way, but this allows a bit more flexibility in the initial solving of the environment.

The suggestion below moves the dependencies that I could out of `pip` (this also implicitly includes the `scikit-learn` and `numba` packages, which seem most likely to cause installation errors).
Current:

```yaml
- anndata=0.10.9
- leidenalg=0.10.2
- pip
- pip:
    - sam-algorithm==1.0.2
```

Suggested:

```yaml
- anndata=0.10
- leidenalg=0.10
- dill
- harmonypy
- umap-learn
- pip
- pip:
    - sam-algorithm==1.0.2
```
After updating this file, I was able to run `conda-lock --update ''` (the quoted empty string forces re-resolving the entire environment), which resulted in a `conda-lock.yml` that installed without trouble, at least on my end.
I made the suggested changes and ran `conda-lock --update ''`. Then I created a new conda environment with `conda-lock install --name tmp conda-lock.yml`, but it fails to import samalg. Did you manage to import samalg in Python?
Thanks for giving it a shot! I was able to import samalg, but when I tried to recreate my steps I ran into trouble again with the `conda-lock --update ''` command. I was able to get past that by removing (renaming) `conda-lock.yml` and then running the base `conda-lock` command again. This created a clean build which I was able to both install and import from Python, both natively and using `reticulate` in R.
Update with more testing... I did sometimes run into a case where `conda-lock` would report a successful installation but I would not be able to import `samalg`. I am not sure what the final cause is, but it may be related to `pip` caching, so I might try running `python -m pip cache purge` and seeing if that can help fix the installation. It may also help to do a manual install with `pip install sam-algorithm` from within the environment: you should first check that the correct `pip` is active with `which pip`; at least once I have had the environment fail to install its own `pip` for some reason.
It still doesn't work for me. But if you don't mind, could we just do the following for system installations, so it will always work?

```
conda-lock install --name openscpca-cell-type-nonETP-ALL-03 conda-lock.yml
conda activate openscpca-cell-type-nonETP-ALL-03
pip install sam-algorithm
```
Yes, I think that should be fine. I will do some further testing to see if I can figure out why it is not being reliable, but we do not need to have that hold things up for now.
I did a bit more testing and pushed 353bcb7, which removed some of the `pip`-installed packages from the conda-lock file that may have been causing trouble. (I didn't do this manually, but by deleting the `conda-lock.yml` file and regenerating it; the diff is quite clean.)

If you have a chance to test `conda-lock install` with this file, it would be great to know if that solved the problem.
Yes, I managed to install the conda environment perfectly, and I can import samalg. Thank you so much for your help!

Just to clarify, which of the `pip`-installed packages did you remove from the conda-lock file? I am wondering how I could generate a conda-lock file that works if I happen to update the `environment.yml` in the future and thus need to change the `conda-lock.yml`.
You can see the lines that were removed if you click on the link to the commit: 353bcb7. But I think the core thing I learned is that updating an existing `conda-lock.yml` file seems to sometimes leave some extra values, especially when there are `pip` installations involved (at least with the current version of `conda-lock`). It seems to work better, or at least more consistently, to remove the existing `conda-lock.yml` file and then generate the new one fresh.
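The regenerate-from-scratch approach described above amounts to something like the following commands (the backup filename is just illustrative):

```
# Regenerate the lock file fresh rather than updating in place
mv conda-lock.yml conda-lock.yml.bak
conda-lock --file environment.yml
```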
I did not fully finish my review here, but I wanted to pass along the suggestions I have made so far, and to make sure to get back to you about the conda issues we have been seeing. There remains a bit of a mystery about why `samalg` is not always installing correctly in the environment!
```r
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)), r_to_py(as.array(rownames(seu))),
                 r_to_py(as.array(colnames(seu)))))
```
Just reformatting a bit here, but I wanted to say that I do very much like this solution to avoid writing any files!
Current:

```r
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)), r_to_py(as.array(rownames(seu))),
                 r_to_py(as.array(colnames(seu)))))
```

Suggested:

```r
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)),
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
```
```r
use_condaenv("openscpca-cell-type-nonETP-ALL-03")
os <- import("os")
anndata <- import("anndata")
samalg <- import("samalg")
```
It might be worth adding a comment here as well to allow easy reference:
Current:

```r
samalg <- import("samalg")
```

Suggested:

```r
samalg <- import("samalg") # https://github.com/atarashansky/self-assembling-manifold/tree/master
```
```r
project_root <- rprojroot::find_root(rprojroot::is_git_root)
out_loc <- file.path(project_root, "analyses/cell-type-nonETP-ALL-03")
data_loc <- file.path(project_root, "data/current/SCPCP000003")
```
I am suggesting making a variable for the project ID, as you can use it here and in the next section (filtering the metadata) to ensure parity.
Current:

```r
data_loc <- file.path(project_root, "data/current/SCPCP000003")
```

Suggested:

```r
projectID <- "SCPCP000003"
data_loc <- file.path(project_root, "data/current", projectID)
```
```r
metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = T)
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```
For future-proofing, we probably want to filter the metadata here as well (we could get other NonETP-ALL samples, which would require other changes to accommodate, but we can at least prevent breaking this script!)
Current:

```r
metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = T)
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```

Suggested:

```r
metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = TRUE)
metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```
```r
options(Seurat.object.assay.version = 'v3')
```
My guess is that this was here for compatibility with `SeuratDisk`. Is it still needed, or can we move to Seurat v5 objects for future compatibility?
```r
for (i in 1:length(sampleID)){
  sce <- readRDS(file.path(data_loc,sampleID[i],paste0(libraryID[i],"_processed.rds")))
```
To make this a bit more portable and modular, I think it might be worth defining the steps in this loop as a function or a set of functions. Something like:

```r
run_sam <- function(sample, library){
  sce <- readRDS(file.path(data_loc, sample, paste0(library, "_processed.rds")))
  ...
}
```

which you could then call using the `purrr` package, replacing the explicit loop as follows:

```r
purrr::walk2(sampleID, libraryID, run_sam)
```
Note that this will go through the `sampleID` and `libraryID` vectors in parallel, just as your previous version used `i` to track the position.
Sure, I have changed them to a `run_sam` function, and I call the function via `purrr::walk2(sampleID, libraryID, run_sam)`.
I think with all of the updates you have made, this is looking almost ready. I have just a few cleanup comments at this point, and one somewhat larger question about the reference file, but I think the reference question can wait for a later PR, when it is actually being used, though that is really up to you.
And thank you for testing the conda-lock environment! I am glad that it did work out, and hopefully will not be too much trouble to maintain!
```r
os <- import("os")
anndata <- import("anndata")
```
I don't think either of these is used any more? Unless `anndata` is required for `samalg`? The suggestion is to remove these lines:

```r
os <- import("os")
anndata <- import("anndata")
```
```r
library(dplyr)
library(cowplot)
library(ggplot2)
library(future)
```
It looks like some number of these are unused? It is a bit difficult to say, because there may be non-distinctive functions being used, but if a function is only used once it may be preferable to use `::` syntax for clarity. (`ggplot2` and `dplyr` are possible exceptions, given their ubiquity.)
I removed these libraries, but I am keeping `library(ggplot2)` for creating and saving the plots.
```r
library(Seurat)
```
Please include `#!` lines for all scripts where possible. An opening comment that describes what this script does would also be helpful here. You probably don't want to use what I wrote below exactly, but something like that, including a comment about the output files, would be helpful for later users.
Current:

```r
library(Seurat)
```

Suggested:

```r
#!/usr/bin/env Rscript
# This script uses SAM to perform clustering for the nonETP-ALL samples in SCPCP000003 and outputs...
library(Seurat)
```
We first aim to annotate the cell types in non-ETP T-ALL, and use the annotated B cells in the sample as the "normal" cells to identify tumor cells, since T-ALL is caused by the clonal proliferation of immature T-cells [<https://www.nature.com/articles/s41375-018-0127-8>].

- We use the cell type marker (`Azimuth_BM_level1.xlsx`) from [Azimuth Human Bone Marrow reference](https://azimuth.hubmapconsortium.org/references/#Human%20-%20Bone%20Marrow). In total, there are 14 cell types: B, CD4T, CD8T, Other T, DC, Monocytes, Macrophages, NK, Early Erythrocytes, Late Erythrocytes, Plasma, Platelet, Stromal, and Hematopoietic Stem and Progenitor Cells (HSPC). Based on the exploratory analysis, we believe that most of the cells in these samples do not express adequate markers to be distinguished at a finer cell type level (e.g. naive vs memory, CD14 vs CD16, etc.), and the majority of the cells should belong to T-cells. In addition, we include the marker genes for cancer cells in the immune system from the [ScType](https://sctype.app/database.php) database.
I think this can probably wait for the next PR (or for when you are actually using the file), but I am wondering how this file was compiled. Was it downloaded directly or created from the website table? If the former, we probably want to include the direct URL to the file, if we can.

If the latter, it would probably be better to have this as a text table for easier tracing through changes. I would also recommend including the Cell Ontology IDs for the cell types if possible, which we can use to track across different naming conventions.
It was created from the website table (a combination of `celltype.l2` and `celltype.l1`), and I converted the markers from gene symbols to Ensembl IDs. Do you mean to provide the URL (https://azimuth.hubmapconsortium.org/references/#Human%20-%20Bone%20Marrow) as part of the header of the file?
I am saving it as an Excel file because that is the input format required by ScType, unless I go and change their code. If I understand the Cell Ontology IDs correctly, they provide an ID for a specific cell type (e.g., CL:0000904 for central memory CD4+ T cell). This is the reason I am using `celltype.l1`: I need the cell type annotation to be very general (e.g., just CD4+ T, instead of specifying naive or memory).
I did try finer cell type labels, but I found that the cells (e.g., different types of CD4 and CD8, and even progenitors) all cluster together (fail to be separated), so they cannot be annotated at the cluster level. I also tried annotation at the cell level using anchor transfer, etc. The prediction results were really poor. Thus, I suspect that these cells don't express enough markers to be separated at the finer level.
Thank you for the clarification. I don't think you need to provide the URL in the header, as you already have it in the readme, but I am surprised that ScType cannot take a `.csv` file! (I just looked at the code and I do have an idea for how to get around this: I think you should be able to run something like `gene_sets_prepare(openxlsx::buildWorkbook(marker_df), "Immune system")` to work from any data frame, but this is not something we really need to address now.)
As for Cell Ontology IDs, they are hierarchical, so you should be able to use a higher-level term if needed. But I think you can use the values given as links in the Azimuth table to start: for example, for B cells they link to CL:0000945, and for the CD4 T cells they link to CL:0000624, which is a more general ID; the central memory CD4 cells, CL:0000904, can be found as a child term of the more general term.
This is the primary benefit of using the ontology, in fact! We can use these terms to harmonize across annotations even if different samples are annotated to different depths by various groups.
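As a toy sketch of how hierarchical CL terms could be rolled up for harmonization (the parent links below follow the Azimuth examples just mentioned; the `generalize` helper and the hard-coded mapping are illustrative, not part of any package):

```python
# Minimal "is-a" links, following the examples above (illustrative only)
parents = {
    "CL:0000904": "CL:0000624",  # central memory CD4+ T cell -> CD4+ T cell
    "CL:0000624": "CL:0000084",  # CD4+ T cell -> T cell
}

def generalize(term, target_terms):
    """Walk up the hierarchy until we reach one of the target-level terms."""
    while term not in target_terms:
        term = parents[term]  # raises KeyError if no ancestor matches
    return term

print(generalize("CL:0000904", {"CL:0000624"}))  # CL:0000624
```

With a real ontology, the same roll-up can be done against the full CL hierarchy, which is how annotations made at different depths could be compared at a common level.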
I made the changes as suggested for using a `.csv` file (the code will show up in `scripts/02-03_annotation.R` in the next pull request).
As of now, I put the Cell Ontology IDs under the `shortName` column in `Azimuth_BM_level1.csv`. I don't think there is an ID for "Other T", since it is such a broad category, and I could not find IDs for Early and Late Eryth, so I have left those as they are for now.
Actually, I have another concern regarding SAM and ScType. Each time I run SAM, the clustering output changes, although the seed is set to 0 (by default) in `sam$run()`. The log is shown in the image below.
Since ScType annotates by cluster, the cell annotation may change if cells end up in another cluster with a different annotation. This is actually quite likely to happen, especially for those cells that don't have clear markers expressed. E.g., depending on the clustering results, they may be annotated as `Unknown` if the score of a cluster is less than a quarter of the number of cells in that cluster.
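The thresholding rule described above can be sketched as follows (a paraphrase of ScType's behavior as stated here, not its actual code; the function name and signature are made up for illustration):

```python
def cluster_label(best_type, best_score, n_cells):
    # ScType-style rule: a cluster whose top marker score is below a
    # quarter of its cell count is left unannotated ("Unknown")
    if best_score < n_cells / 4:
        return "Unknown"
    return best_type

print(cluster_label("CD4T", 20, 100))  # Unknown (20 < 25)
print(cluster_label("CD4T", 30, 100))  # CD4T
```

This is why unstable clustering matters: a borderline group of cells can flip between a named label and `Unknown` depending on which cluster it lands in.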
```r
metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```
A little simplification:

```diff
- metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
- sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
- libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
+ metadata <- metadata[which(metadata$scpca_project_id == projectID & metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia"), ]
+ sampleID <- metadata$scpca_sample_id
+ libraryID <- metadata$scpca_library_id
```
In my own work, this is actually a place where I would use `dplyr` to make things a bit simpler (at least in my mind), as shown below, but either method is fine!

```diff
- metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
- sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
- libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
+ metadata <- metadata |>
+   dplyr::filter(
+     scpca_project_id == projectID,
+     diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia"
+   )
+ sampleID <- metadata$scpca_sample_id
+ libraryID <- metadata$scpca_library_id
```
This seems like it may be a pretty significant concern! I was able to find this issue in the SAM GitHub repository, which suggests that the problem lies in the distance calculations, which cannot be seeded with the default setting. Changing to `distance='correlation'` should make the results reproducible, though I think this limitation may actually be quite out of date.
One other thought I had was whether you need the UMAP visualizations and clusters that are calculated by SAM? Can you recalculate those with Seurat? The reason I ask is that if the cell assignments in ScType are sensitive to the clustering, you may well want to explore how different clustering parameters affect the quality of the assignments. If you want to go down that route, I would probably start by breaking up the SAM run and the Seurat object creation.
I did some testing to confirm that using `distance="correlation"` does result in reproducible results. I also suggested a few changes to create only one Seurat object at the end, and to make sure it has all of the metadata from the original object, in case that is useful.
I do still think it may be worth breaking up the SAM run and the Seurat object creation, but with the changes here I think it may be in a good enough state to move on. Clustering exploration can still happen if SAM does a first pass!
```r
seu <- CreateSeuratObject(counts = counts(sce), data = logcounts(sce), meta.data = as.data.frame(colData(sce)))
# create a new assay to store ADT information
seu[["ADT"]] <- CreateAssayObject(counts = counts(altExp(sce)))
seu[["ADT"]]@data <- logcounts(altExp(sce))
mito.features <- rownames(sce)[which(grepl("^MT-", rowData(sce)$gene_symbol))]
seu[["percent.mt"]] <- PercentageFeatureSet(seu, features = mito.features)

# detecting doublets
doublets.data <- read.table(file = file.path(doublet_loc, sample, paste0(library, "_processed_scdblfinder.tsv")), sep = "\t", header = T)
idx <- match(colnames(seu), doublets.data$barcodes)
seu$doublet_class <- doublets.data$class[idx]
seu <- subset(seu, subset = percent.mt < 25)

# step 01 feature selection/dimensionality using SAM
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@layers[["counts"]])),
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
```
I realized that for this section there is no need to do the Seurat conversion, so this should be a bit more efficient:
```diff
- seu <- CreateSeuratObject(counts = counts(sce), data = logcounts(sce), meta.data = as.data.frame(colData(sce)))
- # create a new assay to store ADT information
- seu[["ADT"]] <- CreateAssayObject(counts = counts(altExp(sce)))
- seu[["ADT"]]@data <- logcounts(altExp(sce))
- mito.features <- rownames(sce)[which(grepl("^MT-", rowData(sce)$gene_symbol))]
- seu[["percent.mt"]] <- PercentageFeatureSet(seu, features = mito.features)
- #detecting doublets
- doublets.data <- read.table(file = file.path(doublet_loc,sample,paste0(library,"_processed_scdblfinder.tsv")),sep = "\t", header = T)
- idx <- match(colnames(seu), doublets.data$barcodes)
- seu$doublet_class <- doublets.data$class[idx]
- seu <- subset(seu, subset = percent.mt < 25)
- #step 01 feature selection/dimensionality using SAM
- sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@layers[["counts"]])),
-                             r_to_py(as.array(rownames(seu))),
-                             r_to_py(as.array(colnames(seu)))))
+ doublets.data <- read.table(
+   file = file.path(doublet_loc, sample, paste0(library, "_processed_scdblfinder.tsv")),
+   sep = "\t", header = T
+ )
+ idx <- match(colnames(sce), doublets.data$barcodes)
+ sce$doublet_class <- doublets.data$class[idx]
+ sce <- sce[, which(sce$subsets_mito_percent < 25)]
+ #step 01 feature selection/dimensionality using SAM
+ sam = samalg$SAM(counts = c(r_to_py(t(counts(sce))),
+                             r_to_py(as.array(rownames(sce))),
+                             r_to_py(as.array(colnames(sce)))))
```
```r
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
sam$preprocess_data()
sam$run()
```
```diff
- sam$run()
+ # using correlation for the distance measure allows for reproducible results
+ sam$run(distance='correlation')
```
```r
final.obj <- AddMetaData(final.obj, seu@meta.data)
final.obj@assays[["RNA"]]@counts <- seu@assays[["RNA"]]@layers[["counts"]]
# create a new assay to store ADT information
final.obj[["ADT"]] <- CreateAssayObject(counts = seu@assays[["ADT"]]@counts)
final.obj@assays[["ADT"]]@data <- seu@assays[["ADT"]]@data
```
If we do not create the Seurat object, we do need to modify this a bit to get the data from the SCE object:
```diff
- final.obj <- AddMetaData(final.obj, seu@meta.data)
- final.obj@assays[["RNA"]]@counts <- seu@assays[["RNA"]]@layers[["counts"]]
- # create a new assay to store ADT information
- final.obj[["ADT"]] <- CreateAssayObject(counts = seu@assays[["ADT"]]@counts)
- final.obj@assays[["ADT"]]@data <- seu@assays[["ADT"]]@data
+ final.obj <- AddMetaData(final.obj, as.data.frame(colData(sce)))
+ final.obj[["RNA"]]@counts <- counts(sce)
+ final.obj[["RNA"]] <- AddMetaData(final.obj[["RNA"]], as.data.frame(rowData(sce)))
+ final.obj[["percent.mt"]] <- final.obj[["subsets_mito_percent"]] # add Seurat expected column name
+ # create a new assay to store ADT information
+ final.obj[["ADT"]] <- CreateAssayObject(counts = counts(altExp(sce, "adt")))
+ final.obj[["ADT"]]@data <- logcounts(altExp(sce, "adt"))
+ final.obj[["ADT"]] <- AddMetaData(final.obj[["ADT"]], as.data.frame(rowData(altExp(sce, "adt"))))
```
Thank you for simplifying it! Since `percent.mt` and `subsets_mito_percent` store the same information, I only keep `subsets_mito_percent`. Just to check with you, are the `nCount_RNA` and `nFeature_RNA` metadata stored in the `sce` object? These two metadata columns give different values depending on whether the Seurat object is created via `CreateSeuratObject` (named `seu`) or via `schard::h5ad2seurat` (named `final.obj`).
It seems like `CreateSeuratObject` is giving the right values, since the `min_gene_cutoff > 200`, so the minimum value of `nFeature_RNA` should be larger than 200, right?
> Just to check with you, are `nCount_RNA` and `nFeature_RNA` metadata stored in the `sce` object?

The equivalent values should be stored as `sce$sum` and `sce$detected`, respectively.
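For intuition, both QC values can be computed directly from a genes-by-cells counts matrix. A minimal NumPy sketch (illustrative only; the real per-cell QC columns in the SCE object are computed by the processing pipeline, not by this code):

```python
import numpy as np

counts = np.array([[5, 0, 1],
                   [0, 3, 1],
                   [2, 2, 2]])  # rows = genes, columns = cells

# per-cell total counts: analogous to sce$sum / Seurat's nCount_RNA
total_counts = counts.sum(axis=0)
# per-cell number of genes with nonzero counts: analogous to sce$detected / nFeature_RNA
genes_detected = (counts > 0).sum(axis=0)

print(total_counts)    # [7 5 4]
print(genes_detected)  # [2 2 3]
```

If a tool recomputes these after filtering genes (or zeroes out masked genes), the values will differ from the originals, which may explain discrepancies like the one discussed here.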
> It seems like `CreateSeuratObject` is giving the right value, since the `min_gene_cutoff > 200`, so the minimum value of `nFeature_RNA` should be larger than 200, right?

That sounds correct to me. I am not sure how SAM is populating that field, as it is certainly doing some preprocessing. Perhaps it is only including the features/genes that it has not filtered (as denoted by `mask_genes`); I think the filtered genes may have their values set to zero for all samples.
Thank you for the clarification! I have now set them as:

```r
final.obj$nCount_RNA <- final.obj$sum
final.obj$nFeature_RNA <- final.obj$detected
```
```r
DimPlot(final.obj, reduction = "Xumap_", group.by = "leiden_clusters", label = T) +
  ggtitle(paste0(sample, ": leiden_clusters"))
ggsave(file.path(out_loc, "plots/00-01_processing_rds", paste0(sample, ".pdf")))
```
One other minor comment is that we should probably set a size here to ensure consistency:

```diff
- ggsave(file.path(out_loc,"plots/00-01_processing_rds",paste0(sample,".pdf")))
+ ggsave(file.path(out_loc,"plots/00-01_processing_rds",paste0(sample,".pdf")),
+        width = 7, height = 7)
```
```diff
@@ -0,0 +1,16 @@
+ tissueType,cellName,geneSymbolmore1,geneSymbolmore2,shortName
```
Thank you for updating this file to one that we can more easily track changes to!
I wonder if it makes more sense to have a separate column for the cell ontology? I am not sure if that would upset ScType, but I suspect it will not matter, or we can remove it before passing it along.
As to some of the ambiguous cell ontology values, I think it probably makes the most sense to just use whatever was provided by Azimuth as the best equivalent. In the case of early and late erythrocytes, this will just mean that they have the same CL value, `CL_0000764`, and for "Other T" that would just be the generic T cell, `CL_0000084`.
I think the separate column is probably the way to go here, so you can have both the full names and short names be human readable.

```diff
- tissueType,cellName,geneSymbolmore1,geneSymbolmore2,shortName
+ tissueType,cellName,geneSymbolmore1,geneSymbolmore2,shortName,ontologyID
```
@UTSouthwesternDSSR I wasn't sure if you have completed the changes that you intended to make here, but things look good for now. I think my final suggestion is the one made in a comment here, which is just to make the Cell Ontology ID a separate column in the `.csv` file. I might also make the `cellName` column unabbreviated, leaving the abbreviated labels for `shortName`, though I admit that, as I am unfamiliar with the output of ScType, that might not be a great idea...
In the future, when you are ready for a review, please don't hesitate to let me know, either with a comment or just by clicking the re-request review button in the upper right of the PR page.
Sure, I will make changes to the `.csv` file.
I don't think it really matters, as long as the names are clear. At some point we will almost certainly have to do some harmonization across different references, but that will be well down the line (and not part of this module).
Yes, I have completed all the changes that I have intended for this part (including editing the `.csv` file). I will have to re-request review for it to be merged into the main branch, right?
Actually, I have started working on the second part too. I have some technical questions: it seems like RStudio will crash when the memory reaches about 23GB, and the whole virtual machine freezes when I try to run `CopyKat`. I plan to download the rds files from the S3 bucket to my local server for analysis with `./sync-results.py cell-type-nonETP-ALL-03 --bucket researcher-650251722463-us-east-2 --download --profile my-dev-profile`, but it doesn't work; it is looking for a git directory. I am wondering if I should clone the repository folder, or whether I will have to wait until my pull request gets merged into the main branch?
> Yes, I have completed all the changes that I have intended for this part (including editing the `.csv` file). I will have to re-request review for it to be merged into the main branch, right?

You do not necessarily have to re-request review; that is just a good signal to me that you have completed the changes you intended to make, so I know when a good time to look at it is.

> Actually I have started working on the second part too. I have some technical questions, it seems like the Rstudio will crash when the memory reaches about 23GB, and the whole virtual machine freezes when I try to run `CopyKat`. I plan to download the rds file from S3 bucket to my local server for analysis with `./sync-results.py cell-type-nonETP-ALL-03 --bucket researcher-650251722463-us-east-2 --download --profile my-dev-profile`, but it doesn't work. It is looking for git directory. I am wondering if I should clone the repository folder, or I will have to wait until my pull request get merged into the main branch?

The `sync-results.py` script does expect that you have cloned the repository, but you should not have to wait for the pull request to be merged; you can simply check out whatever branch has the next set of changes in it (or make a new branch).
Working on your local machine is fine and may be more convenient, but I wonder if there may be other ways to reduce the memory usage. I obviously can't make any recommendations without seeing the code, but I don't seem to recall CopyKat itself being particularly memory-intensive in our previous use.
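As a sketch, the clone-then-sync workflow described here might look like the following (a workflow fragment: the repository URL and branch name are assumptions for illustration; the `sync-results.py` invocation is taken from the comment above):

```shell
# clone the analysis repository and check out the branch with your changes
git clone https://github.com/AlexsLemonade/OpenScPCA-analysis.git
cd OpenScPCA-analysis
git checkout my-feature-branch  # no need to wait for the PR to be merged

# download the module's results from the S3 bucket
./sync-results.py cell-type-nonETP-ALL-03 \
  --bucket researcher-650251722463-us-east-2 \
  --download --profile my-dev-profile
```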
```diff
@@ -0,0 +1,16 @@
+ tissueType,cellName,geneSymbolmore1,geneSymbolmore2,fullName,ontologyID
```
Thank you for adding the extra columns. Having the full name seems like it could be useful for reference, so I appreciate having those.
Purpose/implementation Section

In this PR section, we will initialize an analysis module skeleton for the non-ETP T-ALL dataset in SCPCP000003 and provide scripts for processing the provided `sce` objects for review.

Please link to the GitHub issue that this pull request addresses.
What is the goal of this pull request?

This PR files scripts for processing the provided `sce` objects using the self-assembling manifold (SAM) algorithm as a soft feature selection strategy for better separation of more homogeneous populations. The saved `rds` objects will be used in the following analysis in my next PR.

Briefly describe the general approach you took to achieve this goal.
The raw counts are provided to the SAM algorithm with default parameters, which preprocesses the data by log-normalizing the counts to the median total read count per cell and retaining the genes expressed in fewer than 96% of total cells. It then performs feature selection and dimensionality reduction, using 3000 genes selected by their SAM weights prior to PCA, with 150 principal components and 20 nearest neighbors to generate the UMAP embedding. Leiden clustering is performed to provide cluster IDs. This workflow is applied to all 11 samples in the dataset.
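For intuition, the two preprocessing steps described above can be sketched in NumPy (this mimics only the two stated steps on a toy matrix and is not the SAM implementation; the function name is made up):

```python
import numpy as np

def sam_like_preprocess(counts):
    """counts: genes x cells raw count matrix."""
    # 1) retain genes expressed in fewer than 96% of cells
    kept = counts[(counts > 0).mean(axis=1) < 0.96, :]
    # 2) normalize each cell to the median total count, then log-transform
    totals = kept.sum(axis=0)
    return np.log1p(kept / totals * np.median(totals))

out = sam_like_preprocess(np.array([[5, 0, 1],
                                    [0, 3, 1],
                                    [2, 2, 2]]))
print(out.shape)  # (2, 3) -- the ubiquitous third gene is dropped
```

After this normalization, every cell's (pre-log) total equals the median library size, which is the equalizing effect the preprocessing is after.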
If known, do you anticipate filing additional pull requests to complete this analysis module?

Cell type annotation and identification of tumor cells will be filed soon in other PRs.

Results
Results
The
scripts/00-01_processing_rds.R
will generaterds
files inscratch/
and umap plots showing leiden clustering results inplot/00-01_processing_rds/
.What is the name of your results bucket on S3?
`s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots/00-01_processing_rds`

What types of results does your code produce (e.g., table, figure)?

Intermediate `rds` files are saved in `scratch/` for further analysis, and the UMAP plots showing Leiden clustering are saved in `plot/00-01_processing_rds/`.

What is your summary of the results?

The intermediate `rds` files are ready for cell type annotation and subsequently for tumor cell identification.

Provide directions for reviewers
In this section, tell reviewers what kind of feedback you are looking for. This information will help guide their review.

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particular areas you'd like reviewers to have a close look at?

For the first PR, we would like to check whether our setup of the module skeleton, computing environment, and documentation has met the need.

Is there anything that you want to discuss further?

I failed to push the `environment.yml` file to GitHub, since it gives a warning/error of 3 leaks found in the pre-commit stage, as shown in the image below. These are the lines in `environment.yml` causing the error. I generated the `environment.yml` via `conda env export -n openscpca > environment.yml`. I am wondering if you have any suggestions for solving this?

Thank you for reviewing, and much appreciation for any suggestions provided!
Author checklists

Check all those that apply. Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

- `README.md` has been updated to reflect code changes in this pull request.

Reproducibility checklist

- `Dockerfile`
- `environment.yml` file
- `renv.lock` file