Adding SCTransform as an alternative to anchor transfer-Wilms tumor annotation (SCPCP000014) #836

JingxuanChen7 · 2024-10-21T16:30:25Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Proposal discussion: #628
Issue: #782

What is the goal of this pull request?

Normalization method can make a difference in anchor transfer strategy. In my previous PR, I filed results generated by standard Seurat workflow, i.e, NormalizeData, FindVariableFeatures, followed by ScaleData. Here, I tried out an alternative normalization method called SCTransform for anchor transfer. In addition, the default UMAP for initial visualization was generated with SCTransform.

Briefly describe the general approach you took to achieve this goal.

I modified my previous code in following aspects:

Created Kidney fetal atlas object with SCTransform.
In pre-processing, include both layers RNA for standard Seurat workflow and SCT for SCTransform.
Initial UMAP was switched to SCTransform-based instead of standard Seurat workflow.
Run anchor transfer with both normalization strategies.
Since majority of cells are labeled as fetal_nephronin my samples, I removed subtype annotation for stroma, immune and endothelium for cleaner annotation. Please let me know if you still want to keep these sub-levels for minor compartments.

If known, do you anticipate filing additional pull requests to complete this analysis module?

NA

Results

What is the name of your results bucket on S3?

researcher-009160072044-us-east-2

What types of results does your code produce (e.g., table, figure)?

Results are uploaded to s3://researcher-009160072044-us-east-2/cell-type-wilms-tumor-14/results/01_anchor_transfer_seurat

.
├── [sample_id]_celltype.csv
└── [sample_id]_compartment.csv

The label transfer analysis was performed in two levels: celltype and compartment.
[sample_id]_[level].csv label transfer result table including cell ID, predicted cell type, along with predicted scores.
Anchor transfer was performed with two normalization methods in subfolders:
- results/01_anchor_transfer_seurat/RNA: Results generated by normalization method LogNormalize.
- results/01_anchor_transfer_seurat/SCT: Results generated by normalization method SCTransform.

What is your summary of the results?

Overally, there is only small differences in terms of anchor transfer results.

E.g., in library SCPCL000850

Anchor transfer with LogNormalize
Anchor transfer with SCTransform

It looks that LogNormalize method shows an overall higher score, with more anchors identified.

Anchor transfer with LogNormalize
Anchor transfer with SCTransform

However, in UMAP visualization made from SCTransform, I was able to "fix" the strange visualization for library SCPCL001109 (old version here).
The anchor transfer results with SCTransform for this library also make more sense to me, since I think it is problematic to observe scattered immune within fetal_nephron.

Anchor transfer with LogNormalize
Anchor transfer with SCTransform
In this case, I think it would be useful to keep codes/results for SCTransform.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Analysis could be executed on a virtual computer (Standard-4XL) via AWS Lightsail for Research.
Required R packages checked in renv.lock file.

Are there particularly areas you'd like reviewers to have a close look at?

NA

Is there anything that you want to discuss further?

Author checklists

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

…nto JingxuanChen7/wilms14_anchor2

JingxuanChen7 · 2024-10-21T18:07:06Z

Regarding the error in CI, it could due to missing library(glmGamPoi) to run SCTransform. It could also due to out of RAM or other resources, since running SCTransform on fetal reference dataset could be computationally intensive. Any suggestions would be highly appreciated on this issue!

FYI, I was able to finish the computing on a lightsail 4XL machine only after I put the conserve.memory = TRUE option for SCTransform.

jashapiro · 2024-10-25T17:54:01Z

Hi @JingxuanChen7: I wanted to let you know that I have started to look at this submission. I am starting with a bit of debugging to see why it might have failed in CI, so you may see a few changes pushed to the branch. I will then be looking more directly at the changes you made; it does look like SCTransform may be performing better for this application.

jashapiro

Overall, I think this looks pretty good. I made some suggestions in a few places, including using purrr::pwalk to allow you to remove some duplicated code. I haven't fully tested that code though, so there may well be a couple of typos in there!

I also pushed in some changes to actually skip the SCT transformation of the reference during CI, as I suspect that was what was causing the CI to fail. I have not yet finished testing that, but hopefully it will work out! I am also working on a separate branch to add a docker image for this module, which will hopefully speed up testing as well.

analyses/cell-type-wilms-tumor-14/exploratory_analysis/03_cnv/03_runCopyKat.R

analyses/cell-type-wilms-tumor-14/exploratory_analysis/03_cnv/03_runInferCNV.R

...ll-type-wilms-tumor-14/plots/01_anchor_transfer_seurat/archive/SCPCL000846_celltype_core.png

analyses/cell-type-wilms-tumor-14/scripts/01_anchor_transfer_seurat.R

analyses/cell-type-wilms-tumor-14/scripts/utils/01_anchor_transfer_seurat_functions.R

jashapiro · 2024-10-25T19:19:47Z

analyses/cell-type-wilms-tumor-14/scripts/utils/01_anchor_transfer_seurat_functions.R

+      mutate(annot = celltype) %>%
+      mutate(annot = case_when(compartment == "stroma" ~ "stroma",
+                              compartment == "immune" ~ "immune",
+                              compartment == "endothelium" ~ "endothelium",
+                              TRUE ~ annot))


I'm not sure I understand the logic here. While I can understand that at the end you might not want to retain the fine annotations, I feel like you would want to keep them during the anchor transfer step, and maybe combine the two labels afterward?

Thanks for the comment.

My goal here is to improve the prediction scores for stroma, immune, and endothelium and make the annotation cleaner in visualization. If detailed cell types are included here, we may end up getting no cell types related to stroma, immune, or endothelium other than Unknown, for two reasons: (i) In general too few stroma, immune, and endothelium cells are present in the dataset; (ii) Prediction scores for detailed cell types are getting lower which creates a lot of Unknown.

However, I'm not sure if this is the correct thing to do. As you mentioned, I could also try to keep cell types during anchor transfer step, and combine labels later on (or even keep all Unknowns as is?). @jashapiro Any suggestions?

JingxuanChen7 · 2024-10-28T15:28:34Z

I also pushed in some changes to actually skip the SCT transformation of the reference during CI, as I suspect that was what was causing the CI to fail.

Thanks for helping troubleshoot the CI. It makes sense to skip SCT when processing reference datasets. I agree that it may cause CI failure.

I will then be looking more directly at the changes you made; it does look like SCTransform may be performing better for this application.

Actually, LogNormalize generates overally higher prediction scores in most samples, while SCTransform performs better in UMAP visualization especially in the low-quality sample. Both sets of results could be useful.

I made some suggestions in a few places, including using purrr::pwalk to allow you to remove some duplicated code. I haven't fully tested that code though, so there may well be a couple of typos in there!

Thanks for the suggestion! I applied it and finished re-run the step!

@jashapiro I think I finished dealing with all review suggestions so far (07fa1a5). Please let me know how you would like to proceed with the last conversation (#836 (comment)). Thanks again for all the help!

jashapiro · 2024-10-28T20:13:07Z

I will then be looking more directly at the changes you made; it does look like SCTransform may be performing better for this application.

Actually, LogNormalize generates overally higher prediction scores in most samples, while SCTransform performs better in UMAP visualization especially in the low-quality sample. Both sets of results could be useful.

@JingxuanChen7 I think what I meant by this was that SCTransform is giving results that make more sense with respect to the immune cells scatterered within the fetal_nephron. But when I was just looking into this a bit more, I wonder if there might be an error in the code?

Specifically, I see that when you call FindTransferAnchors() in the code below you are not including the normalization.method argument, which I would expect to be used to be set to "SCT" in the case of SCTransform data. The docs don't seem to indicate that this is set automatically?

https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/07fa1a51bbf1ad66f3101ab186ecaab9744e2097/analyses/cell-type-wilms-tumor-14/scripts/utils/01_anchor_transfer_seurat_functions.R#L32C3-L33C49

I do not know if this would make a difference in the results, but it seems worth looking at?

As to #836 (comment)), I think that what you are doing by combining the compartment and finer level detail might be useful, but I think I would probably call that a fully separate method: so you might want to test the original compartment and celltype annotations as you had them originally, but add a combined method that uses the annotation value that you created here.

Without testing and comparison, I don't think we can say which of these would be best, if doing the combination in the reference annotation works better or worse than combining the results after transfer.

JingxuanChen7 · 2024-10-29T14:55:47Z

Hi @jashapiro , thank you so much for the comments!

Specifically, I see that when you call FindTransferAnchors() in the code below you are not including the normalization.method argument, which I would expect to be used to be set to "SCT" in the case of SCTransform data. The docs don't seem to indicate that this is set automatically?

When I tested my codes, it looked like as long as the default assay is set to "SCT", FindTransferAnchors would automatically use SCT data for transforming. In this case, I put two lines to set default assays of the object.

DefaultAssay(sample_obj) <- obj_assay
DefaultAssay(ref_obj) <- obj_assay

In my latest commit (9e334c0), I specified normalization.method option to make sure everything is working as expected & avoid confusion of codes.

As to #836 (comment)), I think that what you are doing by combining the compartment and finer level detail might be useful, but I think I would probably call that a fully separate method: so you might want to test the original compartment and celltype annotations as you had them originally, but add a combined method that uses the annotation value that you created here.

Thank you so much for the suggestion. I agreed that the cell types can be merged later on as needed. In my latest commit, I changed back to original compartment and celltype annotations.

In addition, I realized that since the submission due date is approaching very soon, I'm not sure if I have time to merge another PR. Therefore, I also added a script to create a summary table for results in this PR.

Besides, I have some other codes (notebooks) aiming to explore feature selection & clustering methods, which wouldn't affect cell type results and not part of the final table. Would that be fine to file them after due date (probably as exploratory analysis as well)?

Thank you again for all the efforts!

jashapiro

I saw this come in and I had a small suggestion for the concatenation function.

analyses/cell-type-wilms-tumor-14/scripts/summary_results.R

jashapiro · 2024-10-29T15:19:43Z

In my latest commit (9e334c0), I specified normalization.method option to make sure everything is working as expected & avoid confusion of codes.

Thank you for adding the explicit setting; I saw that you were changing the default assay, but I was unsure if there might be some internal differences in how the function is applied that this setting enables. It does seem like the results may have changed a bit, but I am not sure. The main thing I was curious about was if the prediction scores might have changed with the explicit setting.

In addition, I realized that since the submission due date is approaching very soon, I'm not sure if I have time to merge another PR. Therefore, I also added a script to create a summary table for results in this PR.

Yes, thank you for doing this! I would just ask that you include this script as well in the main analysis shell script so it can be tested (though I realize that this may require changes to skip the SCT files which won't exist in testing)

Can you also please include this table in the PR? To do this, you may want to modify the .gitignore file to add a section like the one below. This gets a bit annoying with nested folders, as you have to not ignore the folder first, then ignore its contents again, then finally not ignore the file you want to keep. (If this is the only file in the directory, you can include the full directory in one line)

# include cell type summary table
!/results/01_anchor_transfer_seurat
/results/01_anchor_transfer_seurat/*
!/results/01_anchor_transfer_seurat/summary_results.csv

Besides, I have some other codes (notebooks) aiming to explore feature selection & clustering methods, which wouldn't affect cell type results and not part of the final table. Would that be fine to file them after due date (probably as exploratory analysis as well)?

Abosolutely! We would be very happy for you to continue to work to improve your annotations after the deadline. We hope that you can stay involved as much as possible as we continue the project and add more analyses!

JingxuanChen7 · 2024-10-29T16:18:15Z

Can you also please include this table in the PR? To do this, you may want to modify the .gitignore file to add a section like the one below. This gets a bit annoying with nested folders, as you have to not ignore the folder first, then ignore its contents again, then finally not ignore the file you want to keep. (If this is the only file in the directory, you can include the full directory in one line)

Hi @jashapiro , I tried to commit the summary table. However, the original file size is ~7MB, even the gzipped version is 600KB+. In this case, pre-commit of git-kraken doesn't allow me to commit it since its size exceeds limit (200KB). However, I think some of my previous png files are larger than 200kb as well. Any suggestions? Thanks!

General file size limit..................................................Failed
- hook id: check-added-large-files
- exit code: 1

analyses/cell-type-wilms-tumor-14/results/01_anchor_transfer_seurat/summary_results.csv.gz (666 KB) exceeds 200 KB.

Large file size limit....................................................Passed
detect aws credentials...............................(no files to check)Skipped
detect private key...................................(no files to check)Skipped
forbid submodules....................................(no files to check)Skipped
check for case conflicts.................................................Passed
check for merge conflicts............................(no files to check)Skipped
Detect hardcoded secrets.................................................Passed
Check for included environments......................(no files to check)Skipped
Hook process exited.

jashapiro · 2024-10-29T16:30:29Z

Hi @jashapiro , I tried to commit the summary table. However, the original file size is ~7MB, even the gzipped version is 600KB+. In this case, pre-commit of git-kraken doesn't allow me to commit it since its size exceeds limit (200KB). However, I think some of my previous png files are larger than 200kb as well. Any suggestions? Thanks!

General file size limit..................................................Failed
- hook id: check-added-large-files
- exit code: 1

analyses/cell-type-wilms-tumor-14/results/01_anchor_transfer_seurat/summary_results.csv.gz (666 KB) exceeds 200 KB.

Large file size limit....................................................Passed
detect aws credentials...............................(no files to check)Skipped
detect private key...................................(no files to check)Skipped
forbid submodules....................................(no files to check)Skipped
check for case conflicts.................................................Passed
check for merge conflicts............................(no files to check)Skipped
Detect hardcoded secrets.................................................Passed
Check for included environments......................(no files to check)Skipped
Hook process exited.

I guess I was forgetting that this table has a row for every cell in the whole project, and barcodes don't compress very well! Please ignore my suggestion of including the table! Instead, maybe just put a note in the readme with the path to the final table on S3, for convenience in retrieving that file.

jashapiro

@JingxuanChen7 I think this all looks good for this stage.

For the future, I would be interested to hear how you might want to proceed toward refining and improving these annotations, especially if you have more thoughts on identifying tumor cells in particular. It would be great to have some more discussion posts on that front, if you have anything you might want to share.

We might also want to do some comparisons between your annotations and the annotations that were provided as part of our original pipeline using SingleR and CellAssign with our default references.

Finally, I do want to re-emphasize that as I said earlier, we would love it if you are able to continue to contribute to the project beyond the deadline for these annotations. We obviously have a good deal more work to do, and your efforts and expertise will be welcome as we continue!

jashapiro · 2024-10-31T13:11:36Z

Hi @JingxuanChen7,

I wanted to just check in with you about whether you were expecting to submit any more PRs with a final set of cell type assignments? In particular, I want to note that the submission guidelines require specific table headings, and in particular require a tumor_cell_classification column, which you do not yet have. Do you expect to add that or further update the table before the Nov 1 deadline?

JingxuanChen7 · 2024-10-31T14:24:52Z

Hi @JingxuanChen7,

I wanted to just check in with you about whether you were expecting to submit any more PRs with a final set of cell type assignments? In particular, I want to note that the submission guidelines require specific table headings, and in particular require a tumor_cell_classification column, which you do not yet have. Do you expect to add that or further update the table before the Nov 1 deadline?

Good morning @jashapiro , thank you so much for asking!

Since the due date is tomorrow, I don't think I have sufficient time to complete another PR.

Regarding tumor_cell_classification column, an idea I got from the community and also a biologist from our institution is that we could regard all Cap mesenchyme as tumor cells since CM should have gone since born. And probably regard immune & endothelium cells as normal cells. However, I think this is a relatively rough classification and I haven't found a way to validate it (inferCNV is one possibility but couldn't apply to all samples especially those without reference). I could make a quick update on result table if the rough classification makes sense to you.

jashapiro · 2024-10-31T14:32:44Z

I think a rough classification would still be useful, so I would encourage you to add it. Having a complete table, even if the results are subject to refinement, is going to be required based on the submission guidelines.

If you are able to also run copyKAT, rough as it is, that might be worth doing as well; using immune cells as reference when possible is something that seems to have worked in some other cases.

JingxuanChen7 · 2024-10-31T14:38:16Z

I think a rough classification would still be useful, so I would encourage you to add it. Having a complete table, even if the results are subject to refinement, is going to be required based on the submission guidelines.

If you are able to also run copyKAT, rough as it is, that might be worth doing as well; using immune cells as reference when possible is something that seems to have worked in some other cases.

@jashapiro Thank you so much for the prompt reply! Should I open another short PR to make this modification?

Regarding CopyKat, in my case the result seems not as good as inferCNV in the sample I used. I decided not to go ahead with this software since it could be misleading.

jashapiro · 2024-10-31T14:43:51Z

@jashapiro Thank you so much for the prompt reply! Should I open another short PR to make this modification?

Regarding CopyKat, in my case the result seems not as good as inferCNV in the sample I used. I decided not to go ahead with this software since it could be misleading.

Yes, please do make a PR with this addition.

JingxuanChen7 · 2024-10-31T15:45:33Z

@jashapiro Thank you so much for the prompt reply! Should I open another short PR to make this modification?
Regarding CopyKat, in my case the result seems not as good as inferCNV in the sample I used. I decided not to go ahead with this software since it could be misleading.

Yes, please do make a PR with this addition.

Hello @jashapiro , I have opened a PR for adding a tumor classification to the result table. Thanks!

JingxuanChen7 and others added 6 commits October 18, 2024 21:50

add SCT as an option

9dffec4

edit prediction path due to folder structure change

c72f4e9

check png plots

8a63836

Merge branch 'AlexsLemonade:main' into JingxuanChen7/wilms14_anchor2

4730e9f

re arrange old plots & update doc

eb453ff

Merge remote-tracking branch 'origin/JingxuanChen7/wilms14_anchor2' i…

bc1ff1f

…nto JingxuanChen7/wilms14_anchor2

JingxuanChen7 requested a review from jaclyn-taroni as a code owner October 21, 2024 16:30

JingxuanChen7 added 2 commits October 21, 2024 16:38

update doc

39c57f3

add glmGamPoi for SCTransform

ef846b9

jaclyn-taroni requested review from jashapiro and removed request for jaclyn-taroni October 23, 2024 10:35

JingxuanChen7 mentioned this pull request Oct 24, 2024

Modify/explore clustering for Wilms tumor annotation (SCPCP000014) #840

Open

jashapiro added 2 commits October 25, 2024 13:54

Add separate zellkonverter intialization

449d210

Don't use SCT in testing

30bf144

jashapiro reviewed Oct 25, 2024

View reviewed changes

jashapiro mentioned this pull request Oct 25, 2024

Add docker image for cell-type-wilms-tumor-14 #841

Merged

JingxuanChen7 and others added 2 commits October 28, 2024 14:23

Merge branch 'main' into JingxuanChen7/wilms14_anchor2

4c3da3d

apply suggestions

07fa1a5

apply suggestions & create sum table

9e334c0

jashapiro reviewed Oct 29, 2024

View reviewed changes

analyses/cell-type-wilms-tumor-14/scripts/summary_results.R Outdated Show resolved Hide resolved

analyses/cell-type-wilms-tumor-14/scripts/summary_results.R Outdated Show resolved Hide resolved

update doc

35184c4

apply review comments

9451fae

JingxuanChen7 and others added 2 commits October 29, 2024 16:36

update docs

47e6ae4

Merge branch 'main' into JingxuanChen7/wilms14_anchor2

b6d4c07

jashapiro approved these changes Oct 29, 2024

View reviewed changes

Merge branch 'main' into JingxuanChen7/wilms14_anchor2

d6cd1cd

jashapiro merged commit d446cf3 into AlexsLemonade:main Oct 30, 2024
3 checks passed

JingxuanChen7 deleted the JingxuanChen7/wilms14_anchor2 branch October 30, 2024 14:21

JingxuanChen7 mentioned this pull request Oct 31, 2024

Add rough tumor classification into the result table-Wilms tumor annotation (SCPCP000014) #846

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding SCTransform as an alternative to anchor transfer-Wilms tumor annotation (SCPCP000014) #836

Adding SCTransform as an alternative to anchor transfer-Wilms tumor annotation (SCPCP000014) #836

JingxuanChen7 commented Oct 21, 2024 •

edited

Loading

JingxuanChen7 commented Oct 21, 2024

jashapiro commented Oct 25, 2024

jashapiro left a comment

jashapiro Oct 25, 2024

JingxuanChen7 Oct 28, 2024

JingxuanChen7 commented Oct 28, 2024

jashapiro commented Oct 28, 2024

JingxuanChen7 commented Oct 29, 2024

jashapiro left a comment

jashapiro commented Oct 29, 2024

JingxuanChen7 commented Oct 29, 2024 •

edited

Loading

jashapiro commented Oct 29, 2024

jashapiro left a comment •

edited

Loading

jashapiro commented Oct 31, 2024

JingxuanChen7 commented Oct 31, 2024

jashapiro commented Oct 31, 2024

JingxuanChen7 commented Oct 31, 2024

jashapiro commented Oct 31, 2024

JingxuanChen7 commented Oct 31, 2024

Adding SCTransform as an alternative to anchor transfer-Wilms tumor annotation (SCPCP000014) #836

Adding SCTransform as an alternative to anchor transfer-Wilms tumor annotation (SCPCP000014) #836

Conversation

JingxuanChen7 commented Oct 21, 2024 • edited Loading

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

Briefly describe the general approach you took to achieve this goal.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

What is the name of your results bucket on S3?

What types of results does your code produce (e.g., table, figure)?

What is your summary of the results?

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Analysis module and review

Reproducibility checklist

JingxuanChen7 commented Oct 21, 2024

jashapiro commented Oct 25, 2024

jashapiro left a comment

Choose a reason for hiding this comment

jashapiro Oct 25, 2024

Choose a reason for hiding this comment

JingxuanChen7 Oct 28, 2024

Choose a reason for hiding this comment

JingxuanChen7 commented Oct 28, 2024

jashapiro commented Oct 28, 2024

JingxuanChen7 commented Oct 29, 2024

jashapiro left a comment

Choose a reason for hiding this comment

jashapiro commented Oct 29, 2024

JingxuanChen7 commented Oct 29, 2024 • edited Loading

jashapiro commented Oct 29, 2024

jashapiro left a comment • edited Loading

Choose a reason for hiding this comment

jashapiro commented Oct 31, 2024

JingxuanChen7 commented Oct 31, 2024

jashapiro commented Oct 31, 2024

JingxuanChen7 commented Oct 31, 2024

jashapiro commented Oct 31, 2024

JingxuanChen7 commented Oct 31, 2024

JingxuanChen7 commented Oct 21, 2024 •

edited

Loading

JingxuanChen7 commented Oct 29, 2024 •

edited

Loading

jashapiro left a comment •

edited

Loading