Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding SCTransform as an alternative to anchor transfer-Wilms tumor annotation (SCPCP000014) #836

Merged

Conversation

JingxuanChen7
Copy link
Contributor

@JingxuanChen7 JingxuanChen7 commented Oct 21, 2024

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Proposal discussion: #628
Issue: #782

What is the goal of this pull request?

Normalization method can make a difference in anchor transfer strategy. In my previous PR, I filed results generated by standard Seurat workflow, i.e, NormalizeData, FindVariableFeatures, followed by ScaleData. Here, I tried out an alternative normalization method called SCTransform for anchor transfer. In addition, the default UMAP for initial visualization was generated with SCTransform.

Briefly describe the general approach you took to achieve this goal.

I modified my previous code in following aspects:

  • Created Kidney fetal atlas object with SCTransform.
  • In pre-processing, include both layers RNA for standard Seurat workflow and SCT for SCTransform.
  • Initial UMAP was switched to SCTransform-based instead of standard Seurat workflow.
  • Run anchor transfer with both normalization strategies.
  • Since majority of cells are labeled as fetal_nephronin my samples, I removed subtype annotation for stroma, immune and endothelium for cleaner annotation. Please let me know if you still want to keep these sub-levels for minor compartments.

If known, do you anticipate filing additional pull requests to complete this analysis module?

NA

Results

What is the name of your results bucket on S3?

researcher-009160072044-us-east-2

What types of results does your code produce (e.g., table, figure)?

  • Results are uploaded to s3://researcher-009160072044-us-east-2/cell-type-wilms-tumor-14/results/01_anchor_transfer_seurat
.
├── [sample_id]_celltype.csv
└── [sample_id]_compartment.csv
  • The label transfer analysis was performed in two levels: celltype and compartment.
  • [sample_id]_[level].csv label transfer result table including cell ID, predicted cell type, along with predicted scores.
  • Anchor transfer was performed with two normalization methods in subfolders:
    • results/01_anchor_transfer_seurat/RNA: Results generated by normalization method LogNormalize.
    • results/01_anchor_transfer_seurat/SCT: Results generated by normalization method SCTransform.

What is your summary of the results?

Overally, there is only small differences in terms of anchor transfer results.

E.g., in library SCPCL000850

  • Anchor transfer with LogNormalize
    image
  • Anchor transfer with SCTransform
    image

It looks that LogNormalize method shows an overall higher score, with more anchors identified.

  • Anchor transfer with LogNormalize
    image
  • Anchor transfer with SCTransform
    image

However, in UMAP visualization made from SCTransform, I was able to "fix" the strange visualization for library SCPCL001109 (old version here).
The anchor transfer results with SCTransform for this library also make more sense to me, since I think it is problematic to observe scattered immune within fetal_nephron.

  • Anchor transfer with LogNormalize
    image

  • Anchor transfer with SCTransform
    image

  • In this case, I think it would be useful to keep codes/results for SCTransform.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

  • Analysis could be executed on a virtual computer (Standard-4XL) via AWS Lightsail for Research.
  • Required R packages checked in renv.lock file.

Are there particularly areas you'd like reviewers to have a close look at?

NA

Is there anything that you want to discuss further?

Author checklists

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@JingxuanChen7
Copy link
Contributor Author

Regarding the error in CI, it could due to missing library(glmGamPoi) to run SCTransform. It could also due to out of RAM or other resources, since running SCTransform on fetal reference dataset could be computationally intensive. Any suggestions would be highly appreciated on this issue!

FYI, I was able to finish the computing on a lightsail 4XL machine only after I put the conserve.memory = TRUE option for SCTransform.

@jashapiro
Copy link
Member

Hi @JingxuanChen7: I wanted to let you know that I have started to look at this submission. I am starting with a bit of debugging to see why it might have failed in CI, so you may see a few changes pushed to the branch. I will then be looking more directly at the changes you made; it does look like SCTransform may be performing better for this application.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think this looks pretty good. I made some suggestions in a few places, including using purrr::pwalk to allow you to remove some duplicated code. I haven't fully tested that code though, so there may well be a couple of typos in there!

I also pushed in some changes to actually skip the SCT transformation of the reference during CI, as I suspect that was what was causing the CI to fail. I have not yet finished testing that, but hopefully it will work out! I am also working on a separate branch to add a docker image for this module, which will hopefully speed up testing as well.

Comment on lines 48 to 52
mutate(annot = celltype) %>%
mutate(annot = case_when(compartment == "stroma" ~ "stroma",
compartment == "immune" ~ "immune",
compartment == "endothelium" ~ "endothelium",
TRUE ~ annot))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the logic here. While I can understand that at the end you might not want to retain the fine annotations, I feel like you would want to keep them during the anchor transfer step, and maybe combine the two labels afterward?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment.

My goal here is to improve the prediction scores for stroma, immune, and endothelium and make the annotation cleaner in visualization. If detailed cell types are included here, we may end up getting no cell types related to stroma, immune, or endothelium other than Unknown, for two reasons: (i) In general too few stroma, immune, and endothelium cells are present in the dataset; (ii) Prediction scores for detailed cell types are getting lower which creates a lot of Unknown.

However, I'm not sure if this is the correct thing to do. As you mentioned, I could also try to keep cell types during anchor transfer step, and combine labels later on (or even keep all Unknowns as is?). @jashapiro Any suggestions?

@JingxuanChen7
Copy link
Contributor Author

I also pushed in some changes to actually skip the SCT transformation of the reference during CI, as I suspect that was what was causing the CI to fail.

Thanks for helping troubleshoot the CI. It makes sense to skip SCT when processing reference datasets. I agree that it may cause CI failure.

I will then be looking more directly at the changes you made; it does look like SCTransform may be performing better for this application.

Actually, LogNormalize generates overally higher prediction scores in most samples, while SCTransform performs better in UMAP visualization especially in the low-quality sample. Both sets of results could be useful.

I made some suggestions in a few places, including using purrr::pwalk to allow you to remove some duplicated code. I haven't fully tested that code though, so there may well be a couple of typos in there!

Thanks for the suggestion! I applied it and finished re-run the step!

@jashapiro I think I finished dealing with all review suggestions so far (07fa1a5). Please let me know how you would like to proceed with the last conversation (#836 (comment)). Thanks again for all the help!

@jashapiro
Copy link
Member

I will then be looking more directly at the changes you made; it does look like SCTransform may be performing better for this application.

Actually, LogNormalize generates overally higher prediction scores in most samples, while SCTransform performs better in UMAP visualization especially in the low-quality sample. Both sets of results could be useful.

@JingxuanChen7 I think what I meant by this was that SCTransform is giving results that make more sense with respect to the immune cells scatterered within the fetal_nephron. But when I was just looking into this a bit more, I wonder if there might be an error in the code?

Specifically, I see that when you call FindTransferAnchors() in the code below you are not including the normalization.method argument, which I would expect to be used to be set to "SCT" in the case of SCTransform data. The docs don't seem to indicate that this is set automatically?

https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/07fa1a51bbf1ad66f3101ab186ecaab9744e2097/analyses/cell-type-wilms-tumor-14/scripts/utils/01_anchor_transfer_seurat_functions.R#L32C3-L33C49

I do not know if this would make a difference in the results, but it seems worth looking at?

As to #836 (comment)), I think that what you are doing by combining the compartment and finer level detail might be useful, but I think I would probably call that a fully separate method: so you might want to test the original compartment and celltype annotations as you had them originally, but add a combined method that uses the annotation value that you created here.

Without testing and comparison, I don't think we can say which of these would be best, if doing the combination in the reference annotation works better or worse than combining the results after transfer.

@JingxuanChen7
Copy link
Contributor Author

Hi @jashapiro , thank you so much for the comments!

Specifically, I see that when you call FindTransferAnchors() in the code below you are not including the normalization.method argument, which I would expect to be used to be set to "SCT" in the case of SCTransform data. The docs don't seem to indicate that this is set automatically?

When I tested my codes, it looked like as long as the default assay is set to "SCT", FindTransferAnchors would automatically use SCT data for transforming. In this case, I put two lines to set default assays of the object.

DefaultAssay(sample_obj) <- obj_assay
DefaultAssay(ref_obj) <- obj_assay

In my latest commit (9e334c0), I specified normalization.method option to make sure everything is working as expected & avoid confusion of codes.

As to #836 (comment)), I think that what you are doing by combining the compartment and finer level detail might be useful, but I think I would probably call that a fully separate method: so you might want to test the original compartment and celltype annotations as you had them originally, but add a combined method that uses the annotation value that you created here.

Thank you so much for the suggestion. I agreed that the cell types can be merged later on as needed. In my latest commit, I changed back to original compartment and celltype annotations.

In addition, I realized that since the submission due date is approaching very soon, I'm not sure if I have time to merge another PR. Therefore, I also added a script to create a summary table for results in this PR.

Besides, I have some other codes (notebooks) aiming to explore feature selection & clustering methods, which wouldn't affect cell type results and not part of the final table. Would that be fine to file them after due date (probably as exploratory analysis as well)?

Thank you again for all the efforts!

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw this come in and I had a small suggestion for the concatenation function.

@jashapiro
Copy link
Member

In my latest commit (9e334c0), I specified normalization.method option to make sure everything is working as expected & avoid confusion of codes.

Thank you for adding the explicit setting; I saw that you were changing the default assay, but I was unsure if there might be some internal differences in how the function is applied that this setting enables. It does seem like the results may have changed a bit, but I am not sure. The main thing I was curious about was if the prediction scores might have changed with the explicit setting.

In addition, I realized that since the submission due date is approaching very soon, I'm not sure if I have time to merge another PR. Therefore, I also added a script to create a summary table for results in this PR.

Yes, thank you for doing this! I would just ask that you include this script as well in the main analysis shell script so it can be tested (though I realize that this may require changes to skip the SCT files which won't exist in testing)

Can you also please include this table in the PR? To do this, you may want to modify the .gitignore file to add a section like the one below. This gets a bit annoying with nested folders, as you have to not ignore the folder first, then ignore its contents again, then finally not ignore the file you want to keep. (If this is the only file in the directory, you can include the full directory in one line)

# include cell type summary table
!/results/01_anchor_transfer_seurat
/results/01_anchor_transfer_seurat/*
!/results/01_anchor_transfer_seurat/summary_results.csv

Besides, I have some other codes (notebooks) aiming to explore feature selection & clustering methods, which wouldn't affect cell type results and not part of the final table. Would that be fine to file them after due date (probably as exploratory analysis as well)?

Abosolutely! We would be very happy for you to continue to work to improve your annotations after the deadline. We hope that you can stay involved as much as possible as we continue the project and add more analyses!

@JingxuanChen7
Copy link
Contributor Author

JingxuanChen7 commented Oct 29, 2024

Can you also please include this table in the PR? To do this, you may want to modify the .gitignore file to add a section like the one below. This gets a bit annoying with nested folders, as you have to not ignore the folder first, then ignore its contents again, then finally not ignore the file you want to keep. (If this is the only file in the directory, you can include the full directory in one line)

Hi @jashapiro , I tried to commit the summary table. However, the original file size is ~7MB, even the gzipped version is 600KB+. In this case, pre-commit of git-kraken doesn't allow me to commit it since its size exceeds limit (200KB). However, I think some of my previous png files are larger than 200kb as well. Any suggestions? Thanks!

General file size limit..................................................Failed
- hook id: check-added-large-files
- exit code: 1

analyses/cell-type-wilms-tumor-14/results/01_anchor_transfer_seurat/summary_results.csv.gz (666 KB) exceeds 200 KB.

Large file size limit....................................................Passed
detect aws credentials...............................(no files to check)Skipped
detect private key...................................(no files to check)Skipped
forbid submodules....................................(no files to check)Skipped
check for case conflicts.................................................Passed
check for merge conflicts............................(no files to check)Skipped
Detect hardcoded secrets.................................................Passed
Check for included environments......................(no files to check)Skipped
Hook process exited.

@jashapiro
Copy link
Member

Hi @jashapiro , I tried to commit the summary table. However, the original file size is ~7MB, even the gzipped version is 600KB+. In this case, pre-commit of git-kraken doesn't allow me to commit it since its size exceeds limit (200KB). However, I think some of my previous png files are larger than 200kb as well. Any suggestions? Thanks!

General file size limit..................................................Failed
- hook id: check-added-large-files
- exit code: 1

analyses/cell-type-wilms-tumor-14/results/01_anchor_transfer_seurat/summary_results.csv.gz (666 KB) exceeds 200 KB.

Large file size limit....................................................Passed
detect aws credentials...............................(no files to check)Skipped
detect private key...................................(no files to check)Skipped
forbid submodules....................................(no files to check)Skipped
check for case conflicts.................................................Passed
check for merge conflicts............................(no files to check)Skipped
Detect hardcoded secrets.................................................Passed
Check for included environments......................(no files to check)Skipped
Hook process exited.

I guess I was forgetting that this table has a row for every cell in the whole project, and barcodes don't compress very well! Please ignore my suggestion of including the table! Instead, maybe just put a note in the readme with the path to the final table on S3, for convenience in retrieving that file.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JingxuanChen7 I think this all looks good for this stage.

For the future, I would be interested to hear how you might want to proceed toward refining and improving these annotations, especially if you have more thoughts on identifying tumor cells in particular. It would be great to have some more discussion posts on that front, if you have anything you might want to share.

We might also want to do some comparisons between your annotations and the annotations that were provided as part of our original pipeline using SingleR and CellAssign with our default references.

Finally, I do want to re-emphasize that as I said earlier, we would love it if you are able to continue to contribute to the project beyond the deadline for these annotations. We obviously have a good deal more work to do, and your efforts and expertise will be welcome as we continue!

@jashapiro jashapiro merged commit d446cf3 into AlexsLemonade:main Oct 30, 2024
3 checks passed
@JingxuanChen7 JingxuanChen7 deleted the JingxuanChen7/wilms14_anchor2 branch October 30, 2024 14:21
@jashapiro
Copy link
Member

Hi @JingxuanChen7,

I wanted to just check in with you about whether you were expecting to submit any more PRs with a final set of cell type assignments? In particular, I want to note that the submission guidelines require specific table headings, and in particular require a tumor_cell_classification column, which you do not yet have. Do you expect to add that or further update the table before the Nov 1 deadline?

@JingxuanChen7
Copy link
Contributor Author

Hi @JingxuanChen7,

I wanted to just check in with you about whether you were expecting to submit any more PRs with a final set of cell type assignments? In particular, I want to note that the submission guidelines require specific table headings, and in particular require a tumor_cell_classification column, which you do not yet have. Do you expect to add that or further update the table before the Nov 1 deadline?

Good morning @jashapiro , thank you so much for asking!

Since the due date is tomorrow, I don't think I have sufficient time to complete another PR.

Regarding tumor_cell_classification column, an idea I got from the community and also a biologist from our institution is that we could regard all Cap mesenchyme as tumor cells since CM should have gone since born. And probably regard immune & endothelium cells as normal cells. However, I think this is a relatively rough classification and I haven't found a way to validate it (inferCNV is one possibility but couldn't apply to all samples especially those without reference). I could make a quick update on result table if the rough classification makes sense to you.

@jashapiro
Copy link
Member

I think a rough classification would still be useful, so I would encourage you to add it. Having a complete table, even if the results are subject to refinement, is going to be required based on the submission guidelines.

If you are able to also run copyKAT, rough as it is, that might be worth doing as well; using immune cells as reference when possible is something that seems to have worked in some other cases.

@JingxuanChen7
Copy link
Contributor Author

I think a rough classification would still be useful, so I would encourage you to add it. Having a complete table, even if the results are subject to refinement, is going to be required based on the submission guidelines.

If you are able to also run copyKAT, rough as it is, that might be worth doing as well; using immune cells as reference when possible is something that seems to have worked in some other cases.

@jashapiro Thank you so much for the prompt reply! Should I open another short PR to make this modification?

Regarding CopyKat, in my case the result seems not as good as inferCNV in the sample I used. I decided not to go ahead with this software since it could be misleading.

@jashapiro
Copy link
Member

@jashapiro Thank you so much for the prompt reply! Should I open another short PR to make this modification?

Regarding CopyKat, in my case the result seems not as good as inferCNV in the sample I used. I decided not to go ahead with this software since it could be misleading.

Yes, please do make a PR with this addition.

@JingxuanChen7
Copy link
Contributor Author

@jashapiro Thank you so much for the prompt reply! Should I open another short PR to make this modification?
Regarding CopyKat, in my case the result seems not as good as inferCNV in the sample I used. I decided not to go ahead with this software since it could be misleading.

Yes, please do make a PR with this addition.

Hello @jashapiro , I have opened a PR for adding a tumor classification to the result table. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants