
Celltype annotation for non-ETP T-ALL (SCPCP000003) #767

Merged

Conversation

UTSouthwesternDSSR (Contributor):

Purpose/implementation Section

In this PR, we initialize an analysis module skeleton for the non-ETP T-ALL dataset in SCPCP000003 and provide scripts for processing the provided SCE objects for review.

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

This PR adds scripts for processing the provided SCE objects using the self-assembling manifold (SAM) algorithm as a soft feature selection strategy, for better separation of more homogeneous populations. The saved RDS objects will be used in the follow-up analysis in my next PR.

Briefly describe the general approach you took to achieve this goal.

Raw counts are provided to the SAM algorithm with default parameters. SAM preprocesses the data by log-normalizing the counts to the median total read count per cell and retaining genes expressed in fewer than 96% of cells. It then performs feature selection and dimensionality reduction, using the 3,000 genes with the highest SAM weights prior to PCA, and 150 principal components with 20 nearest neighbors to generate the UMAP embedding. Leiden clustering is performed to assign cluster IDs. This workflow is applied to all 11 samples in the dataset.
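In code, the core SAM calls look roughly like this (a condensed sketch via reticulate, assuming a loaded SingleCellExperiment object `sce`; it mirrors the calls in scripts/00-01_processing_rds.R as revised later in this thread):

```r
library(SingleCellExperiment)
library(reticulate)
use_condaenv("openscpca-cell-type-nonETP-ALL-03")
samalg <- import("samalg")

# SAM takes raw counts (cells x genes) plus gene and cell identifiers
sam <- samalg$SAM(counts = c(r_to_py(t(counts(sce))),
                             r_to_py(as.array(rownames(sce))),
                             r_to_py(as.array(colnames(sce)))))
sam$preprocess_data()              # log-normalize; drop genes expressed in >=96% of cells
sam$run()                          # SAM weights -> PCA (150 PCs) -> UMAP (20 neighbors)
sam$clustering(method = "leiden")  # assign cluster IDs
```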

If known, do you anticipate filing additional pull requests to complete this analysis module?

Cell type annotation and identification of tumor cells will be filed soon in other PRs.

Results

Running scripts/00-01_processing_rds.R will generate RDS files in scratch/ and UMAP plots showing Leiden clustering results in plots/00-01_processing_rds/.

What is the name of your results bucket on S3?

s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots/00-01_processing_rds

What types of results does your code produce (e.g., table, figure)?

Intermediate RDS files are saved in scratch/ for further analysis, and the UMAP plots showing Leiden clustering are saved in plots/00-01_processing_rds/.

What is your summary of the results?

The intermediate RDS files are ready for cell type annotation and, subsequently, tumor cell identification.

Provide directions for reviewers

In this section, tell reviewers what kind of feedback you are looking for.
This information will help guide their review.

What are the software and computational requirements needed to be able to run the code in this PR?

  • The analysis can be executed on a standard 4XL virtual machine via AWS Lightsail for Research.
  • To set up the computing environment, run:
```sh
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
cd /home/lightsail-user/OpenScPCA-analysis/analyses/cell-type-nonETP-ALL-03
conda env create -f environment.yml -y -n openscpca
conda activate openscpca
Rscript scripts/00-01_processing_rds.R
```

Are there particular areas you'd like reviewers to have a close look at?

For this first PR, we would like to check whether our setup of the module skeleton, computing environment, and documentation meets the project's needs.

Is there anything that you want to discuss further?

I failed to push the environment.yml file to GitHub, since it fails with a warning/error of 3 leaks found at the pre-commit stage, as shown in the image below. These are the lines in environment.yml causing the error:

```
line 99:  - keyring=25.3.0=pyha804496_0
line 249: - r-rstudioapi=0.16.0=r44hc72bb7e_1
```

I generated the environment.yml via conda env export -n openscpca > environment.yml. I am wondering if you have any suggestions for solving this?
[screenshot: pre-commit gitleaks output showing 3 leaks found]
Thank you for reviewing, and much appreciation for any suggestions provided!

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

jashapiro (Member) left a comment:

Hello @UTSouthwesternDSSR and thank you for your contribution to the OpenScPCA project!

Overall, this first contribution looks to be structured as we would like, so you have done a great job there, and thank you for providing an informative README! It was very helpful to see the goals of the analysis and the overall steps that you are taking.

Let me start by addressing the question about the "leaks" (thank you for setting up pre-commit!)

> I generated the environment.yml via conda env export -n openscpca > environment.yml. I am wondering if you have any suggestions for solving this?

The issue here is that sometimes the build hashes from conda packages look like API keys, so pre-commit (specifically gitleaks) is flagging them. There are a few solutions that could allow gitleaks to accept these strings, but in fact the inclusion of build IDs is something we would rather avoid, as they can often result in cross-system incompatibilities.

In this project, we are trying to take a different approach from simply using conda env export, as we have found that even if you include the --no-builds option (which would remove this particular issue) it can create incompatibilities across systems in the environment file that can be hard to track down.

Instead, we are recommending the use of conda-lock. Rather than repeating too much, I am going to direct you to the page of our docs that describes the approach: https://openscpca.readthedocs.io/en/latest/ensuring-repro/managing-software/using-conda/

In short, you will want to add the packages you need (optionally with version numbers) to the environment.yml file. The goal here is to have a high-level list of the packages needed, without all of the dependencies that may be system-specific. Then you can run conda-lock --file environment.yml, which will create a conda-lock.yml file that includes all version information, including system-specific dependencies. This can then be installed on a new system with conda-lock install --name openscpca-cell-type-nonETP-ALL-03 conda-lock.yml
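Putting those commands together (the environment name is the module name used in this thread):

```sh
# environment.yml stays high-level: direct dependencies only, no build hashes
conda-lock --file environment.yml   # solves and writes a fully pinned conda-lock.yml
# install the locked environment on any system under the module's name
conda-lock install --name openscpca-cell-type-nonETP-ALL-03 conda-lock.yml
```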

Other comments

  • I noticed that a big part of the first section of your script is converting from SCE to Seurat and then from that to h5ad. I imagine that is a bit slow, as it involves a number of writes to disk. Would using the h5ad files that we provide make that any easier or more efficient? You can get those files with the following command using our download script:
```sh
./download-data.py --projects SCPCP000003 --format AnnData
```
  • I am also somewhat concerned about the use of SeuratDisk in general, as it seems like the package has not been updated in quite a while. Would it be possible to avoid that package?
  • It also seems like the SAM software can work directly from a matrix; would it be more efficient to write out a matrix directly with Matrix::writeMM() (and additional geneID and cellID files)
  • You may be able to handle reading the h5ad file as a Seurat object with https://github.com/cellgeni/schard; I am not sure whether it supports everything you need though.
  • I'm a little unclear about what the preprocessing and filtering is that SAM does; does that affect the data as read back in? Would it be possible to write out only the clustering results and read those in directly to add to the existing objects?
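A rough sketch of that writeMM route, assuming an SCE object sce as in the processing script (the file names here are illustrative):

```r
library(SingleCellExperiment)
library(Matrix)

mat <- counts(sce)                             # genes x cells raw counts
Matrix::writeMM(Matrix::t(mat), "counts.mtx")  # SAM expects cells x genes
writeLines(rownames(mat), "geneID.txt")        # gene identifiers, one per line
writeLines(colnames(mat), "cellID.txt")        # cell barcodes, one per line
```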

I also left some line-level comments, though I suspect if you do some restructuring those may become moot.

Thank you again for your contribution, and I look forward to working with you as we work to get this module added to the project!

Comment on lines 94 to 97
```
conda env create -f environment.yml --name openscpca
conda activate openscpca
```
jashapiro (Member) commented Sep 19, 2024:

For consistency and to prevent replacement of other conda environments from other modules, please name this with the name provided in the environment.yml file, which includes the module name: openscpca-cell-type-nonETP-ALL-03
Also, since the python code and conda environment are run through reticulate, does this need to be activated?

Suggested change
```
conda env create -f environment.yml --name openscpca
conda activate openscpca
```
```
conda env create -f environment.yml --name openscpca-cell-type-nonETP-ALL-03
```

Note this would also be replaced if you convert to using conda-lock as suggested above.

jashapiro (Member):

You are welcome to remove this file if you do not need it.

Comment on lines 1 to 2
```sh
#sudo apt install libglpk40
#sudo apt install libcurl4-openssl-dev
```
jashapiro (Member):

Could you please add the requirements for system packages to the README? I am aware of these particular packages, and they will hopefully soon be installed on future Lightsail instances, but on existing instances this will still be required.

UTSouthwesternDSSR (Contributor):

Yes, they are now at lines 27 and 28 of the README.

jashapiro (Member):

Thank you! I think I had noticed that they were already there previously and simply forgot to delete this comment.

```r
use_condaenv("openscpca")
os <- import("os")
anndata <- import("anndata") # conda install conda-forge::anndata
samalg <- import("samalg")   # pip install sam-algorithm
```
jashapiro (Member):

If this package is not available via conda, you will need to add lines like the following to the environment.yml file:

```yaml
dependencies:
  - python=3.11
  # ...
  - pip
  - pip:
      - sam-algorithm
```

Comment on lines 21 to 23
```r
out_loc <- "~/OpenScPCA-analysis/analyses/cell-type-nonETP-ALL-03/"
data_loc <- "~/OpenScPCA-analysis/data/2024-08-22/SCPCP000003/"
doublet_loc <- "~/OpenScPCA-analysis/data/2024-08-22/results/doublet-detection/SCPCP000003/"
```
jashapiro (Member):

To preserve portability, we recommend using relative paths wherever possible. In addition, all data paths should use the current alias rather than specific releases. In this case, you can also take advantage of the rprojroot package to find the base of the project:

Suggested change

```r
out_loc <- "~/OpenScPCA-analysis/analyses/cell-type-nonETP-ALL-03/"
data_loc <- "~/OpenScPCA-analysis/data/2024-08-22/SCPCP000003/"
doublet_loc <- "~/OpenScPCA-analysis/data/2024-08-22/results/doublet-detection/SCPCP000003/"
```

```r
project_root <- rprojroot::find_root(rprojroot::is_git_root)
out_loc <- file.path(project_root, "analyses/cell-type-nonETP-ALL-03")
data_loc <- file.path(project_root, "data/current/SCPCP000003")
doublet_loc <- file.path(project_root, "data/current/results/doublet-detection/SCPCP000003")
```

```r
sce <- readRDS(file.path(data_loc, sampleID[i], paste0(libraryID[i], "_processed.rds")))
seu <- CreateSeuratObject(counts = counts(sce), meta.data = as.data.frame(colData(sce)))
seu@assays[["RNA"]]@data <- logcounts(sce)
seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-")
```
jashapiro (Member):

I am not sure that this will work, given that the rownames in the project are Ensembl IDs?

UTSouthwesternDSSR (Contributor):

Thank you for pointing it out! I have changed it to the following code:

```r
sce <- readRDS(file.path(data_loc, sampleID[i], paste0(libraryID[i], "_processed.rds")))
geneConversion <- rowData(sce)[, c(1:2)]
mito.features <- geneConversion$gene_ids[which(grepl("^MT-", geneConversion$gene_symbol))]
seu[["percent.mt"]] <- PercentageFeatureSet(seu, features = mito.features)
```

jashapiro (Member):

Just to protect against potential reordering and variation in the row names, I would slightly modify this:

```r
mito.features <- rownames(sce)[which(grepl("^MT-", rowData(sce)$gene_symbol))]
```


```r
DimPlot(final.obj, reduction = "umap", group.by = "leiden_clusters", label = T) +
  ggtitle(paste0(sampleID[i], ": leiden_clusters"))
ggsave(file.path(out_loc, "plots/00-01_processing_rds/", paste0(sampleID[i], ".pdf")))
```
jashapiro (Member):

Suggested change

```r
ggsave(file.path(out_loc,"plots/00-01_processing_rds/",paste0(sampleID[i],".pdf")))
```

```r
ggsave(file.path(out_loc,"plots/00-01_processing_rds",paste0(sampleID[i],".pdf")))
```

Comment on lines 43 to 44
```r
SaveH5Seurat(seu, filename = paste0(sampleID[i], ".h5Seurat"), overwrite = T)
Convert(paste0(sampleID[i], ".h5Seurat"), dest = "h5ad", overwrite = T)
```
jashapiro (Member):

Just for clarity and robustness, can you include the package identifier when calling functions that may have somewhat ambiguous names? e.g.:

Suggested change

```r
SaveH5Seurat(seu, filename = paste0(sampleID[i],".h5Seurat"), overwrite = T)
Convert(paste0(sampleID[i],".h5Seurat"), dest = "h5ad", overwrite = T)
```

```r
SeuratDisk::SaveH5Seurat(seu, filename = paste0(sampleID[i],".h5Seurat"), overwrite = T)
SeuratDisk::Convert(paste0(sampleID[i],".h5Seurat"), dest = "h5ad", overwrite = T)
```

Comment on lines 35 to 49
```sh
# run in terminal
conda activate openscpca # activating conda environment
sudo apt install libglpk40
sudo apt install libcurl4-openssl-dev # before installing Seurat
sudo apt-get install libhdf5-dev # before installing SeuratDisk
conda install conda-forge::anndata
pip install sam-algorithm
sudo apt-get install libxml2-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libtiff5-dev # before installing devtools
```

```r
# run in R
install.packages(c("Seurat","HGNChelper","openxlsx","devtools","renv","remotes"))
BiocManager::install("SingleCellExperiment")
install.packages("hdf5r", configure.args="--with-hdf5=/usr/bin/h5cc") # before installing SeuratDisk
remotes::install_github("mojaveazure/seurat-disk")
devtools::install_github("navinlabcode/copykat")
```
jashapiro (Member):

If the renv and conda-lock files are set up, these instructions can be simplified to use those tools:

```sh
# system packages installation
sudo apt-get install libglpk40
# ...

conda-lock install --name openscpca-cell-type-nonETP-ALL-03 conda-lock.yml
Rscript -e "renv::restore()"
```

Comment on lines 63 to 66
```r
os$remove(paste0("sam.", sampleID[i], ".h5ad"))
os$remove(paste0("sam.", sampleID[i], ".h5seurat"))
os$remove(paste0(sampleID[i], ".h5ad"))
os$remove(paste0(sampleID[i], ".h5Seurat"))
```
jashapiro (Member):

You can save yourself a Python import by doing this directly from R:

Suggested change

```r
os$remove(paste0("sam.",sampleID[i],".h5ad"))
os$remove(paste0("sam.",sampleID[i],".h5seurat"))
os$remove(paste0(sampleID[i],".h5ad"))
os$remove(paste0(sampleID[i],".h5Seurat"))
```

```r
unlink(paste0("sam.",sampleID[i],".h5ad"))
unlink(paste0("sam.",sampleID[i],".h5seurat"))
unlink(paste0(sampleID[i],".h5ad"))
unlink(paste0(sampleID[i],".h5Seurat"))
```

UTSouthwesternDSSR (Contributor):

Hi,

I am sorry that I accidentally closed the pull request. Regarding the SeuratDisk package, I managed to avoid it by providing the matrix directly to SAM and using schard to read in the h5ad file:

```r
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)),
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
sam$preprocess_data()
sam$run()
sam$clustering(method = "leiden") # leiden clustering is the default algorithm in SAM
sam$save_anndata(paste0("sam.", sampleID[i], ".h5ad"))
final.obj <- schard::h5ad2seurat(paste0("sam.", sampleID[i], ".h5ad"))
```

In general, SAM outputs the gene weights, the nearest neighbor matrix, and a 2D projection. I need the 2D projection for better separation of cell clusters. I think it is better to keep all the output provided by SAM and remove what I don't need later. As of now, I save the SAM output in AnnData format, then read it back in and save it as the final Seurat RDS file.
I deleted the AnnData files due to a disk storage issue, but I would need more storage for the other 30 ETP T-ALL samples.

jashapiro (Member):

No worries about the accidental closing!

> I deleted the AnnData files due to a disk storage issue, but I would need more storage for the other 30 ETP T-ALL samples.

If you need more storage on an AWS Lightsail system, you will probably want to follow the directions here to attach additional storage: https://openscpca.readthedocs.io/en/latest/aws/lsfr/working-with-volumes/
One limitation is that you can only attach 128GB per volume, but you can attach multiple volumes if needed.

Another approach that I might recommend is to refactor your scripts a bit to work on a single sample at a time: this will allow you to keep on disk only the files for a single sample, and will also help us out in the future when we adapt the module for our workflow system.

jaclyn-taroni removed their request for review September 20, 2024 19:02
jashapiro (Member):

> In general, SAM outputs the gene weights, the nearest neighbor matrix, and a 2D projection. I need the 2D projection for better separation of cell clusters. I think it is better to keep all the output provided by SAM and remove what I don't need later. As of now, I save the SAM output in AnnData format, then read it back in and save it as the final Seurat RDS file.

I guess I want to confirm that the SAM output still includes the original data and possibly previous normalizations and projections. For context, I am usually quite skeptical of 2D projections as a metric of cell separation; I would personally want to confirm the cluster separation based on a more direct measure of distance. But I know this is a thorny problem! I want to hold off a bit for now on this kind of evaluation, but I want us to be able to look at it later, which is the main reason that I am asking about what the output is. This should not be too much of an issue either way, as we can always go back to the original input data as needed.

UTSouthwesternDSSR (Contributor):

Actually, the SAM output doesn't retain the original data. I only provide the raw counts, gene IDs, and cell IDs. SAM's log-normalization is different from the previously provided normalization, and I save the SAM log-normalized values and projections in the RDS file.

Comment on lines 16 to 20
```yaml
  - anndata=0.10.9
  - leidenalg=0.10.2
  - pip
  - pip:
      - sam-algorithm==1.0.2
```
jashapiro (Member):

When I tried to install the environment using conda-lock install, I ran into an error during the pip installation part of the process. This has happened to me in the past, and the usual solution is to move a few more dependencies into the conda-managed section of the environment. I also generally recommend leaving the patch version off in the environment.yml file; it will be captured by conda-lock either way, but this allows a bit more flexibility in the initial solving of the environment.

The suggestion below moves the dependencies that I could out of pip (this also implicitly includes the scikit-learn and numba packages, which seem most likely to cause installation errors).

Suggested change

```yaml
  - anndata=0.10.9
  - leidenalg=0.10.2
  - pip
  - pip:
      - sam-algorithm==1.0.2
```

```yaml
  - anndata=0.10
  - leidenalg=0.10
  - dill
  - harmonypy
  - umap-learn
  - pip
  - pip:
      - sam-algorithm==1.0.2
```

After updating this file, I was able to run conda-lock --update '' (the quoted empty string forces re-resolving the entire environment), which resulted in a conda-lock.yml that installed without trouble, at least on my end.

UTSouthwesternDSSR (Contributor):

I made the suggested changes and ran conda-lock --update ''. Then I created a new conda environment with conda-lock install --name tmp conda-lock.yml, but it fails to import samalg. Did you manage to import samalg in Python?

jashapiro (Member):

Thanks for giving it a shot! I was able to import samalg, but when I tried to recreate my steps I ran into trouble again with the conda-lock --update '' command. I was able to get past that by removing (renaming) conda-lock.yml and then running the base conda-lock command again. This created a clean build which I was able to both install and import from Python, both natively and using reticulate in R.

jashapiro (Member):

Update with more testing... I did sometimes run into a case where conda-lock would report a successful installation but I would not be able to import samalg. I am not sure what the final cause is, but it may be related to pip caching, so I might try running python -m pip cache purge and seeing if that can help fix the installation. It may also help to do a manual install with pip install sam-algorithm from within the environment: you should first check that the correct pip is active with which pip; at least once I have had the environment fail to install its own pip for some reason.
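Collected as commands, those checks would look like this (assuming the module environment from this thread is the one being debugged):

```sh
conda activate openscpca-cell-type-nonETP-ALL-03
which pip                   # confirm the environment's own pip is in use
python -m pip cache purge   # rule out a stale pip wheel cache
pip install sam-algorithm   # manual fallback install inside the environment
```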

UTSouthwesternDSSR (Contributor):

It still doesn't work for me. But if you don't mind, could we just do the following for the environment setup, so it will always work?

```sh
conda-lock install --name openscpca-cell-type-nonETP-ALL-03 conda-lock.yml
conda activate openscpca-cell-type-nonETP-ALL-03
pip install sam-algorithm
```

jashapiro (Member):

Yes, I think that should be fine. I will do some further testing to see if I can figure out why it is not being reliable, but we do not need to have that hold things up for now.

jashapiro (Member):

I did a bit more testing and pushed 353bcb7, which removed some of the pip-installed packages from the conda-lock file that may have been causing trouble. (I didn't do this manually, but by deleting the conda-lock.yml file and regenerating it; the diff is quite clean.)

If you have a chance to test conda-lock install with this file, it would be great to know if that solved the problem.

UTSouthwesternDSSR (Contributor):

Yes, I managed to install the conda environment perfectly, and I can import samalg. Thank you so much for your help!
Just to clarify, which of the pip-installed packages did you remove from the conda-lock file? I am wondering how I could generate a conda-lock file that works if I update the environment.yml in the future and thus need to change the conda-lock.yml.

jashapiro (Member):

You can see the lines that were removed if you click on the link to the commit: 353bcb7, but I think the core thing I learned is that updating an existing conda-lock.yml file seems to sometimes leave some extra values, especially when there are pip installations involved (at least with the current version of conda-lock). It seems to work better, or at least more consistently, to remove the existing conda-lock.yml file and then generate the new one fresh.

jashapiro (Member) left a comment:

I did not fully finish my review here, but I wanted to share the suggestions I have made so far, and to make sure to get back to you about the conda issues we have been seeing. There remains a bit of a mystery about why samalg is not always installing correctly in the environment!

Comment on lines 42 to 43
```r
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)), r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
```
jashapiro (Member):

Just reformatting a bit here, but I wanted to say that I do very much like this solution to avoid writing any files!

Suggested change

```r
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)), r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
```

```r
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)),
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
```

```r
use_condaenv("openscpca-cell-type-nonETP-ALL-03")
os <- import("os")
anndata <- import("anndata")
samalg <- import("samalg")
```
jashapiro (Member):

It might be worth adding a comment here as well to allow easy reference:

Suggested change

```r
samalg <- import("samalg")
```

```r
samalg <- import("samalg") # https://github.com/atarashansky/self-assembling-manifold/tree/master
```


```r
project_root <- rprojroot::find_root(rprojroot::is_git_root)
out_loc <- file.path(project_root, "analyses/cell-type-nonETP-ALL-03")
data_loc <- file.path(project_root, "data/current/SCPCP000003")
```
jashapiro (Member):

I am suggesting making a variable for the project ID, as you can use it here and in the next section (filtering the metadata) to ensure parity.

Suggested change

```r
data_loc <- file.path(project_root, "data/current/SCPCP000003")
```

```r
projectID <- "SCPCP000003"
data_loc <- file.path(project_root, "data/current", projectID)
```

Comment on lines 20 to 22
```r
metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = T)
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```
jashapiro (Member):

For future-proofing, we probably want to filter the metadata here as well (we could get other NonETP-ALL samples, which would require other changes to accommodate, but we can at least prevent breaking this script!)

Suggested change

```r
metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = T)
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```

```r
metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = TRUE)
metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```

```r
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]

options(Seurat.object.assay.version = 'v3')
```
jashapiro (Member):

My guess is that this was here for compatibility with SeuratDisk. Is it still needed, or can we move to Seurat v5 objects for future compatibility?

Comment on lines 25 to 26
```r
for (i in 1:length(sampleID)){
  sce <- readRDS(file.path(data_loc,sampleID[i],paste0(libraryID[i],"_processed.rds")))
```
jashapiro (Member):

To make this a bit more portable and modular, I think it might be worth defining the steps in this loop as a function or a set of functions.

Something like:

```r
run_sam <- function(sample, library){
  sce <- readRDS(file.path(data_loc, sample, paste0(library,"_processed.rds")))
  # ...
}
```

You could then call this using the purrr package, replacing the explicit loop, as follows:

```r
purrr::walk2(sampleID, libraryID, run_sam)
```

Note that this will go through the sampleID and libraryID vectors in parallel, just as your previous version used i to track the position.

UTSouthwesternDSSR (Contributor):

Sure, I have moved them into a run_sam function and call it via purrr::walk2(sampleID, libraryID, run_sam).


jashapiro (Member) left a comment:

I think with all of the updates you have made, this is looking almost ready. I have just a few cleanup comments at this point, and one somewhat larger question about the reference file, but I think the reference question can wait for a later PR, when it is actually being used, though that is really up to you.

And thank you for testing the conda-lock environment! I am glad that it did work out, and hopefully will not be too much trouble to maintain!

Comment on lines 11 to 12
```r
os <- import("os")
anndata <- import("anndata")
```
jashapiro (Member):

I don't think either of these is used any more? Unless anndata is required for samalg?

Suggested change

```r
os <- import("os")
anndata <- import("anndata")
```

(Remove these lines.)

Comment on lines 3 to 6
```r
library(dplyr)
library(cowplot)
library(ggplot2)
library(future)
```
jashapiro (Member):

It looks like some number of these are unused? It is a bit difficult to say, because there may be functions in use whose names are not distinctive, but if a function is only used once it may be preferable to use :: syntax for clarity (ggplot2 and dplyr are possible exceptions given their ubiquity).
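For instance, a one-off cowplot call (p1 and p2 here are hypothetical ggplot objects) can use the namespace directly instead of library(cowplot):

```r
# combine two existing ggplot objects side by side without attaching cowplot
combined <- cowplot::plot_grid(p1, p2, ncol = 2)
```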

UTSouthwesternDSSR (Contributor):

I removed these libraries, but I am keeping library(ggplot2) for creating and saving the plots.

@@ -0,0 +1,64 @@
library(Seurat)
jashapiro (Member):

Please include #! lines for all scripts where possible. An opening comment that describes what this script does would also be helpful here. You probably don't want to use exactly what I wrote below, but something like that, including a comment about the output files, would be helpful for later users.

Suggested change

```r
library(Seurat)
```

```r
#!/usr/bin/env Rscript
# This script uses SAM to perform clustering for the non-ETP T-ALL samples in SCPCP000003 and outputs...
library(Seurat)
```



We first aim to annotate the cell types in non-ETP T-ALL, and to use the annotated B cells in each sample as the "normal" cells for identifying tumor cells, since T-ALL is caused by the clonal proliferation of immature T-cells [<https://www.nature.com/articles/s41375-018-0127-8>].

- We use the cell type markers (`Azimuth_BM_level1.xlsx`) from the [Azimuth Human Bone Marrow reference](https://azimuth.hubmapconsortium.org/references/#Human%20-%20Bone%20Marrow). In total, there are 14 cell types: B, CD4T, CD8T, Other T, DC, Monocytes, Macrophages, NK, Early Erythrocytes, Late Erythrocytes, Plasma, Platelet, Stromal, and Hematopoietic Stem and Progenitor Cells (HSPC). Based on the exploratory analysis, we believe that most of the cells in these samples do not express adequate markers to be distinguished at a finer cell type level (e.g., naive vs. memory, CD14 vs. CD16), and the majority of the cells should be T cells. In addition, we include the marker genes for cancer cells in the immune system from the [ScType](https://sctype.app/database.php) database.
jashapiro (Member):

I think this can probably wait for the next PR (or until you are actually using the file), but I am wondering how this file was compiled. Was it downloaded directly or created from the website table? If the former, we probably want to include the direct URL to the file, if we can.

If the latter, it would probably be better to have this as a text table for easier tracing through changes. I would also recommend including the Cell Ontology IDs for the cell types if possible, which we can use to track across different naming conventions.

UTSouthwesternDSSR (Contributor):

It was created from the website table (a combination of celltype.l2 and celltype.l1), and I converted the markers from gene symbols to Ensembl IDs. Do you mean to provide the URL (https://azimuth.hubmapconsortium.org/references/#Human%20-%20Bone%20Marrow) as part of the header of the file?

I am saving it as an Excel file because that is the input format required by ScType, unless I go and change their code. If I understand the Cell Ontology IDs correctly, they provide an ID for a specific cell type (e.g., CL:0000904 for central memory CD4+ T cell). This is the reason I am using celltype.l1: I need the cell type annotation to be very general (e.g., just CD4+ T, instead of specifying naive or memory).
I did try finer cell type labels, but I found that the cells (e.g., different types of CD4 and CD8 and even progenitors) all cluster together (fail to be separated), so they cannot be annotated at the cluster level. I also tried annotation at the cell level using anchor transfer, etc., but the prediction results were really bad. Thus, I suspect that these cells don't express enough markers to be separated at the finer level.

jashapiro (Member) commented Sep 24, 2024:

Thank you for the clarification. I don't think you need to provide the URL in the header, as you already have it in the README, but I am surprised that ScType cannot take a .csv file! (I just looked at the code, and I have an idea for how to get around this: I think you should be able to run something like gene_sets_prepare(openxlsx::buildWorkbook(marker_df), "Immune system") to work from any data frame, but this is not something we really need to address now.)

As for Cell Ontology IDs, they are hierarchical, so you should be able to use a higher-level term if needed. But I think you can use the values given as links in the Azimuth table to start: for example, for B cells it links to CL:0000945, and for the CD4 T cells it links to CL:0000624, which is a more general ID; the central memory CD4 cells, CL:0000904, can be found as a child term of the more general term.

This is the primary benefit of using the ontology, in fact! We can use these terms to harmonize across annotations even if different samples are annotated to different depths by various groups.

UTSouthwesternDSSR (Contributor):

I made the suggested changes for using a .csv file (the code will show up in scripts/02-03_annotation.R in the next pull request).
As of now, I put the Cell Ontology IDs in the shortName column of Azimuth_BM_level1.csv. I don't think there is an ID for "Other T", since it is such a broad category, and I could not find IDs for Early and Late Eryth, so I have left them as they are for now.

UTSouthwesternDSSR (Contributor):

Actually, I have another concern regarding SAM and ScType. Each time I run SAM, the clustering output changes, although the seed is set to 0 (by default) in sam$run(). The log is shown in the image below.
[screenshot: SAM run log]

Since ScType annotates by cluster, the cell annotation may change if cells land in a different cluster with a different annotation. This is quite likely to happen, especially for cells that don't express clear markers: depending on the clustering results, they may be annotated as Unknown if the score of a cluster is less than a quarter of the number of cells in that cluster.
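As a sketch of that thresholding rule (not ScType's actual code; the function and names are illustrative):

```r
# A cluster whose best marker score falls below a quarter of its cell count
# is labeled "Unknown"; otherwise it gets the top-scoring cell type.
assign_cluster_label <- function(scores, n_cells) {
  best <- which.max(scores)
  if (scores[best] < n_cells / 4) "Unknown" else names(scores)[best]
}

assign_cluster_label(c(B = 120, CD4T = 85), n_cells = 600)  # returns "Unknown" (120 < 150)
```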

Comment on lines 60 to 62
```r
metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```
jashapiro (Member):

A little simplification.

Suggested change

```r
metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```

```r
metadata <- metadata[which(metadata$scpca_project_id == projectID & metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia"), ]
sampleID <- metadata$scpca_sample_id
libraryID <- metadata$scpca_library_id
```

In my own work, this is actually a place I would use dplyr to make things a bit simpler (at least in my mind), as shown below, but either method is fine!

Suggested change

```r
metadata <- metadata[which(metadata$scpca_project_id == projectID), ]
sampleID <- metadata$scpca_sample_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
libraryID <- metadata$scpca_library_id[which(metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia")]
```

```r
metadata <- metadata |>
  dplyr::filter(
    scpca_project_id == projectID,
    diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia"
  )
sampleID <- metadata$scpca_sample_id
libraryID <- metadata$scpca_library_id
```

jashapiro (Member) commented Sep 25, 2024:

> Actually, I have another concern regarding SAM and ScType. Each time I run SAM, the clustering output changes, although the seed is set to 0 (by default) in sam$run().
>
> Since ScType annotates by cluster, the cell annotation may change if cells land in a different cluster with a different annotation. This is quite likely to happen, especially for cells that don't express clear markers: depending on the clustering results, they may be annotated as Unknown if the score of a cluster is less than a quarter of the number of cells in that cluster.

This seems like it may be a pretty significant concern! I was able to find this issue in the SAM GitHub repository, which suggests that the problem lies in the distance calculations, which cannot be seeded with the default setting. Changing to sam.run(distance='correlation', seed=0) may make the calculations consistent.

I think this limitation is actually quite out of date, and the hnswlib library now supports seeding, but fixing that would require changes to samalg, and the package unfortunately seems not to be regularly updated. I might consider forking the package and testing whether the implementation can be updated to support the newer versions of the library, but that would require a fair amount of effort!

jashapiro (Member):

> This seems like it may be a pretty significant concern! I was able to find this issue in the SAM GitHub repository, which suggests that the problem lies in the distance calculations, which cannot be seeded with the default setting. Changing to sam.run(distance='correlation', seed=0) may make the calculations consistent.

One other thought I had was whether you need the UMAP visualizations and clusters that are calculated by SAM? Can you recalculate those with Seurat? The reason I ask is that if the cell assignments in scType are sensitive to the clustering, you may well want to explore how different clustering parameters affect the quality of the assignments.

If you want to go down that route, I would probably start by breaking up the run_sam function to make it a bit more modular. One function could take the SCE object, pass it to SAM, and write the h5ad file; a separate function would then read the h5ad file and perform downstream steps like clustering. This will make it easier to modify those steps for exploration, assuming that the data required by the clustering algorithm is properly read in from the h5ad file.
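One possible shape for that split, sketched from the code already in this thread (the function names and file locations are illustrative; reticulate, samalg, and data_loc are assumed to be set up as earlier in the script):

```r
# Step 1: run SAM on one sample's SCE object and persist the output as h5ad
run_sam <- function(sample, library) {
  sce <- readRDS(file.path(data_loc, sample, paste0(library, "_processed.rds")))
  sam <- samalg$SAM(counts = c(r_to_py(t(counts(sce))),
                               r_to_py(as.array(rownames(sce))),
                               r_to_py(as.array(colnames(sce)))))
  sam$preprocess_data()
  sam$run(distance = "correlation")  # seedable distance for reproducible results
  sam$save_anndata(paste0("sam.", sample, ".h5ad"))
}

# Step 2: read the SAM output back for downstream steps (clustering, plotting),
# which can then be re-run with different parameters without re-running SAM
build_final_obj <- function(sample) {
  schard::h5ad2seurat(paste0("sam.", sample, ".h5ad"))
}
```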

jashapiro (Member) left a comment:

I did some testing to confirm that using distance="correlation" does result in reproducible results. I also suggested a few changes to only create one Seurat object at the end, and to make sure it has all of the metadata from the original object, in case that is useful.

I do still think it may be worth breaking up the SAM run and the Seurat object creation, but I think with the changes here it may be in a good enough state to move on. Clustering exploration can still happen if SAM does a first pass!

Comment on lines 17 to 33
```r
seu <- CreateSeuratObject(counts = counts(sce), data = logcounts(sce), meta.data = as.data.frame(colData(sce)))
# create a new assay to store ADT information
seu[["ADT"]] <- CreateAssayObject(counts = counts(altExp(sce)))
seu[["ADT"]]@data <- logcounts(altExp(sce))
mito.features <- rownames(sce)[which(grepl("^MT-", rowData(sce)$gene_symbol))]
seu[["percent.mt"]] <- PercentageFeatureSet(seu, features = mito.features)

#detecting doublets
doublets.data <- read.table(file = file.path(doublet_loc,sample,paste0(library,"_processed_scdblfinder.tsv")),sep = "\t", header = T)
idx <- match(colnames(seu), doublets.data$barcodes)
seu$doublet_class <- doublets.data$class[idx]
seu <- subset(seu, subset = percent.mt < 25)

#step 01 feature selection/dimensionality using SAM
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@layers[["counts"]])),
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
```
jashapiro (Member):

I realized that for this section there is no need to do the Seurat conversion, so this should be a bit more efficient:

Suggested change

```r
seu <- CreateSeuratObject(counts = counts(sce), data = logcounts(sce), meta.data = as.data.frame(colData(sce)))
# create a new assay to store ADT information
seu[["ADT"]] <- CreateAssayObject(counts = counts(altExp(sce)))
seu[["ADT"]]@data <- logcounts(altExp(sce))
mito.features <- rownames(sce)[which(grepl("^MT-", rowData(sce)$gene_symbol))]
seu[["percent.mt"]] <- PercentageFeatureSet(seu, features = mito.features)
#detecting doublets
doublets.data <- read.table(file = file.path(doublet_loc,sample,paste0(library,"_processed_scdblfinder.tsv")),sep = "\t", header = T)
idx <- match(colnames(seu), doublets.data$barcodes)
seu$doublet_class <- doublets.data$class[idx]
seu <- subset(seu, subset = percent.mt < 25)
#step 01 feature selection/dimensionality using SAM
sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@layers[["counts"]])),
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
```

```r
doublets.data <- read.table(
  file = file.path(doublet_loc, sample, paste0(library,"_processed_scdblfinder.tsv")),
  sep = "\t", header = T
)
idx <- match(colnames(sce), doublets.data$barcodes)
sce$doublet_class <- doublets.data$class[idx]
sce <- sce[, which(sce$subsets_mito_percent < 25)]
#step 01 feature selection/dimensionality using SAM
sam = samalg$SAM(counts = c(r_to_py(t(counts(sce))),
                            r_to_py(as.array(rownames(sce))),
                            r_to_py(as.array(colnames(sce)))))
```

```r
                            r_to_py(as.array(rownames(seu))),
                            r_to_py(as.array(colnames(seu)))))
sam$preprocess_data()
sam$run()
```
jashapiro (Member):

Suggested change

```r
sam$run()
```

```r
# using correlation for the distance measure allows for reproducible results
sam$run(distance='correlation')
```

Comment on lines 40 to 44
```r
final.obj <- AddMetaData(final.obj, seu@meta.data)
final.obj@assays[["RNA"]]@counts <- seu@assays[["RNA"]]@layers[["counts"]]
# create a new assay to store ADT information
final.obj[["ADT"]] <- CreateAssayObject(counts = seu@assays[["ADT"]]@counts)
final.obj@assays[["ADT"]]@data <- seu@assays[["ADT"]]@data
```
jashapiro (Member):

If we do not create the Seurat object, we do need to modify this a bit to get the data from the SCE object:

Suggested change

```r
final.obj <- AddMetaData(final.obj, seu@meta.data)
final.obj@assays[["RNA"]]@counts <- seu@assays[["RNA"]]@layers[["counts"]]
# create a new assay to store ADT information
final.obj[["ADT"]] <- CreateAssayObject(counts = seu@assays[["ADT"]]@counts)
final.obj@assays[["ADT"]]@data <- seu@assays[["ADT"]]@data
```

```r
final.obj <- AddMetaData(final.obj, as.data.frame(colData(sce)))
final.obj[["RNA"]]@counts <- counts(sce)
final.obj[["RNA"]] <- AddMetaData(final.obj[["RNA"]], as.data.frame(rowData(sce)))
final.obj[["percent.mt"]] <- final.obj[["subsets_mito_percent"]] # add Seurat expected column name
# create a new assay to store ADT information
final.obj[["ADT"]] <- CreateAssayObject(counts = counts(altExp(sce, "adt")))
final.obj[["ADT"]]@data <- logcounts(altExp(sce, "adt"))
final.obj[["ADT"]] <- AddMetaData(final.obj[["ADT"]], as.data.frame(rowData(altExp(sce, "adt"))))
```

UTSouthwesternDSSR (Contributor):

Thank you for simplifying it! Since percent.mt and subsets_mito_percent store the same information, I only keep subsets_mito_percent. Just to check with you, are the nCount_RNA and nFeature_RNA metadata stored in the SCE object? These two metadata fields give different values between the Seurat object created via CreateSeuratObject (named seu) and the one from schard::h5ad2seurat (named final.obj).

[screenshot: nCount_RNA/nFeature_RNA summary comparison]

It seems like CreateSeuratObject is giving the right value, since min_gene_cutoff > 200, so the minimum value of nFeature_RNA should be larger than 200, right?

jashapiro (Member):

> Just to check with you, are the nCount_RNA and nFeature_RNA metadata stored in the SCE object?

The equivalent values should be stored as sce$sum and sce$detected, respectively.

> It seems like CreateSeuratObject is giving the right value, since min_gene_cutoff > 200, so the minimum value of nFeature_RNA should be larger than 200, right?

That sounds correct to me. I am not sure how SAM is populating that field, as it is certainly doing some preprocessing. Perhaps it is only including the features/genes that it has not filtered (as denoted by mask_genes); I think the filtered genes may have their values set to zero for all samples.

UTSouthwesternDSSR (Contributor):

Thank you for the clarification! I have now set them as:

```r
final.obj$nCount_RNA <- final.obj$sum
final.obj$nFeature_RNA <- final.obj$detected
```


```r
DimPlot(final.obj, reduction = "Xumap_", group.by = "leiden_clusters", label = T) +
  ggtitle(paste0(sample,": leiden_clusters"))
ggsave(file.path(out_loc,"plots/00-01_processing_rds",paste0(sample,".pdf")))
```
jashapiro (Member):

One other minor comment is that we should probably set a size here to ensure consistency:

Suggested change

```r
ggsave(file.path(out_loc,"plots/00-01_processing_rds",paste0(sample,".pdf")))
```

```r
ggsave(file.path(out_loc,"plots/00-01_processing_rds",paste0(sample,".pdf")),
       width = 7, height = 7)
```

@@ -0,0 +1,16 @@
tissueType,cellName,geneSymbolmore1,geneSymbolmore2,shortName
jashapiro (Member):

Thank you for updating this file to one that we can more easily track changes to!

I wonder if it makes more sense to have a separate column for the cell ontology? I am not sure if that would upset ScType, but I suspect it will not matter, or we can remove it before passing it along.

As to some of the ambiguous cell ontology values, I think it probably makes the most sense to just use whatever was provided by Azimuth as the best equivalent. In the case of early and late erythrocytes this will just mean that they have the same CL value: CL_0000764, and for "Other T" that would just be the generic T cell: CL_0000084.

jashapiro (Member):

I think the separate column is probably the way to go here, so you can have both the full names and short names be human readable.

Suggested change

```
tissueType,cellName,geneSymbolmore1,geneSymbolmore2,shortName
```

```
tissueType,cellName,geneSymbolmore1,geneSymbolmore2,shortName,ontologyID
```

jashapiro (Member) left a comment:

@UTSouthwesternDSSR I wasn't sure if you have completed the changes you intended to make here, but things look good for now. I think my final suggestion is the one made in a comment here, which is just to make the Cell Ontology ID a separate column in the .csv file. I might also make the cellName column unabbreviated, leaving the abbreviated labels for shortName, though I admit that, as I am unfamiliar with the output of ScType, that might not be a great idea...

In the future, when you are ready for a review, please don't hesitate to let me know either with a comment or just by clicking the re-request review button in the upper right of the PR page:
[screenshot: the re-request review button]


UTSouthwesternDSSR (Contributor):

Sure, I will make the changes to the .csv file as suggested. ScType provides labels by returning the cellName column. As for the shortName column, it is used in another part of ScType, which I am not using. Just to check with you: which label would you all prefer for the cell type annotation, or does it not really matter?

jashapiro (Member):

> ScType provides labels by returning the cellName column. As for the shortName column, it is used in another part of ScType, which I am not using. Just to check with you: which label would you all prefer for the cell type annotation, or does it not really matter?

I don't think it really matters, as long as the names are clear. At some point we will almost certainly have to do some harmonization across different references, but that will be well down the line (and not part of this module).

UTSouthwesternDSSR (Contributor) commented Sep 27, 2024:

Yes, I have completed all the changes that I intended for this part (including editing the .csv file), and I am ready for a review. I will have to re-request review for it to be merged into the main branch, right?

Actually, I have started working on the second part too, and I have some technical questions. It seems like RStudio crashes when memory usage reaches about 23 GB, and the whole virtual machine freezes when I try to run CopyKat. I plan to download the RDS files from the S3 bucket to my local server for analysis with ./sync-results.py cell-type-nonETP-ALL-03 --bucket researcher-650251722463-us-east-2 --download --profile my-dev-profile, but it doesn't work; it is looking for a git directory. I am wondering if I should clone the repository, or whether I have to wait until my pull request is merged into the main branch?

jashapiro (Member) left a comment:

> Yes, I have completed all the changes that I intended for this part (including editing the .csv file). I will have to re-request review for it to be merged into the main branch, right?

You do not necessarily have to re-request review; that is just a good signal to me that you have completed the changes you intended to make, so I know when it is a good time to look at it.

> Actually, I have started working on the second part too, and I have some technical questions. It seems like RStudio crashes when memory usage reaches about 23 GB, and the whole virtual machine freezes when I try to run CopyKat. I plan to download the RDS files from the S3 bucket to my local server for analysis with ./sync-results.py cell-type-nonETP-ALL-03 --bucket researcher-650251722463-us-east-2 --download --profile my-dev-profile, but it doesn't work; it is looking for a git directory. I am wondering if I should clone the repository, or whether I have to wait until my pull request is merged into the main branch?

The sync-results.py script does expect that you have cloned the repository, but you should not have to wait for the pull request to be merged; you can simply check out whatever branch has the next set of changes in it (or make a new branch).

Working on your local machine is fine and may be more convenient, but I wonder if there may be other ways to reduce the memory usage. I obviously can't make any recommendations without seeing the code, but I don't seem to recall CopyKat itself being particularly memory-intensive in our previous use.

@@ -0,0 +1,16 @@
tissueType,cellName,geneSymbolmore1,geneSymbolmore2,fullName,ontologyID
jashapiro (Member):

Thank you for adding the extra columns. Having the full name seems like it could be useful for reference, so I appreciate having those.

jashapiro merged commit baee0e6 into AlexsLemonade:main Sep 27, 2024
2 checks passed