-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exploratory results for inferCNV on non-ETP samples (SCPCP000003) #838
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @UTSouthwesternDSSR, thanks for this contribution! We appreciate your patience while the organizing team was out of the office (#832).
My ten thousand-foot view comments are:
- Unfortunately, I don’t know how much more you can automate. From your results, I expect some examination of individual samples to be necessary, and I think that’s what you got started in Exploratory results for inferCNV on non-ETP samples (SCPCP000003) #838 (comment).
- The losses on chr 6 observed for every sample seem like they are likely to be an artifact (MHC genes; infercnv docs) and could be due to the use of B cells as the reference. That would make me hesitant to use the total number of CNVs alone (although that may not be possible when using
analysis_mode = “samples”
from your comment?).
Other comments that primarily influence the review of analyses/cell-type-nonETP-ALL-03/scripts/run_infercnv.R
and analyses/cell-type-nonETP-ALL-03/scripts/00-make-gene-order-file.R
:
- I understand that
analysis_mode = “subclusters”
takes a long time (and possibly longer than you want to invest!), but do you think the error only occurs when samples have very few new B cells? Could we handle that error by only proceeding when the number of reference cells exceeds the kNN setting? That doesn’t solve the time problem, of course, but that could potentially be mitigated by running samples in parallel instead of sequentially.- I am not totally convinced that the results for
SCPCL000076
are equivalent regardless of the analysis mode choice.
- I am not totally convinced that the results for
- Is there a reason not to use the
infercnv
script to set up the gene order file? - Does the
infercnv::run()
output inscratch
give us enough information to check for biology (asking about themap_metadata_from_infercnv.txt
file specifically)? Does the choice of analysis mode greatly influence our ability to dig into the alterations observed in individual libraries?
I’ll also add that I would be interested in what genes are overrepresented in the disease gene sets. If I understand you correctly, the expression of these genes is lower in the cluster you expect to include non-malignant cells. This is probably a comment for a future PR.
Please let us know if there’s anything you’d like to discuss or if there is any way we can help! Thanks again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try using the gtf_to_position_file.py
script that is part of infercnv
? That appears to be the recommendation in the infercnv
docs: https://github.com/broadinstitute/inferCNV/wiki/instructions-create-genome-position-file
Is there a reason a custom R script is required? Could you write a shell script that handles the download and runs the Python script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually got the script from the Ewing modules and followed the approach there. I guess this is not the best way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, let's see if someone more involved with the Ewing module can weigh in! Cc: @jashapiro
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there was any specific reason for the script; it should be a fairly simple transformation either way. The gtf_to_position_file.py
script is not distributed with infercnv
, which means that it would need to be downloaded and included somehow here... If we can confirm that both are doing the same thing, I am not sure we really need to worry too much about the method we use, but we might want to move this script to somewhere that multiple analyses could access it rather than making copies.
Thanks for your comments, @UTSouthwesternDSSR! I would say, in general, my willingness to trust the inferCNV results for malignant vs. non-malignant calls hinges on our ability to use it to detect — at least for some samples — changes we’d expect to see in T-ALL rather than the clusters from inferCNV. I am concerned that those clusters may be influenced by the chromosome 6 losses, which may be an artifact (and might show up in all human blood samples regardless of choice of reference — I don’t have any experience to say either way!). My thoughts about the marker genes are similar. Do we see the presence of known T-ALL marker genes if such gene sets exist? (I guess another way to get at this would be to understand how the chromosome 6 losses impact the clustering results, but that seems like more work to me, not less!) I’m not sure if digging into known/previously observed alterations is technically feasible, but it is why I am asking about our ability to look at individual samples’ biology and if our choice of analysis mode impacts that. |
I guess it is hard to decide based on the biology, since the detected copy number alterations (deletion in chr 9, LOH in chr6 etc) do not present in all tumor samples. So when these are not detected, I cannot verify whether is the method not working, or these are not tumor cells, or these are tumor cells that don't have CNA. I tried Since the due date is coming, I will just re-run the CopyKat with the stringent set of B cells as the normal, although I don't think the results will be changed significantly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few changes need to be made to the analyses/cell-type-nonETP-ALL-03/scripts/writeout_submission.R
script before we can accept this submission.
I share your skepticism about solely using a tool to look at CNAs for tumor vs. normal calls in these samples – and I haven’t seen the SCEVAN results that agree – but I would like to get this in before the eligibility deadline.
writeout <- function(ind.lib, ct.colors = ct_color, project.ID = projectID, n.row = 1){ | ||
seu <- readRDS(file.path(out_loc,"results/rds",paste0(ind.lib,".rds"))) | ||
voi <- c('newB.copykat.pred','sctype_classification') | ||
changeName.voi <- c('tumor_cell_classification','cell_type_assignment') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the tumor_cell_classification
column in the submission guidelines:
The values of this column should be either "tumor" or "normal."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the kind reminder! Is it okay, if I use "tumor", "normal", or "unknown", because there are some cells labeled as "not.defined" from CopyKat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, including "unknown" sounds good.
final.df <- data.frame(scpca_sample_id=rep(project.ID, nrow(voi_df)), voi_df, | ||
CL_ontology_id=gene.df$ontologyID[match(voi_df$cell_type_assignment,gene.df$cellName)]) | ||
write.table(final.df, sep = "\t", quote = F, row.names = F, | ||
file = file.path(out_loc,"results/submission_table",paste0(ind.lib,"_metadata.tsv"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include creating results/submission_table
earlier in the script. I believe this is the cause of the current CI failure.
Rscript scripts/02-03_annotation.R | ||
Rscript scripts/04_multipanel_plot.R | ||
Rscript scripts/05_cluster_evaluation.R | ||
Rscript scripts/06_sctype_exploration.R | ||
Rscript scripts/07_run_copykat.R | ||
Rscript scripts/markerGenes_submission.R | ||
Rscript scripts/writeout_submission.R |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at your PR trying to figure out the previous CI failure. It may be due to the small number of cells in the test data, but it may also be a stochastic failure, as I was able to run the whole workflow on a separate machine with the same test data. Nonetheless, it might be helpful for future debugging to add a few info messages like the ones below to help figure out where we are in the CI process.
Since you just added a small change, I will wait to see if that passes before proceeding too far!
Rscript scripts/02-03_annotation.R | |
Rscript scripts/04_multipanel_plot.R | |
Rscript scripts/05_cluster_evaluation.R | |
Rscript scripts/06_sctype_exploration.R | |
Rscript scripts/07_run_copykat.R | |
Rscript scripts/markerGenes_submission.R | |
Rscript scripts/writeout_submission.R | |
printf "\n\nRunning 02-03_annotation.R\n" | |
Rscript scripts/02-03_annotation.R | |
printf "\n\nRunning 04_multipanel_plot.R\n" | |
Rscript scripts/04_multipanel_plot.R | |
printf "\n\nRunning 05_cluster_evaluation.R\n" | |
Rscript scripts/05_cluster_evaluation.R | |
printf "\n\nRunning 06_sctype_exploration.R\n" | |
Rscript scripts/06_sctype_exploration.R | |
printf "\n\nRunning 07_run_copykat.R\n" | |
Rscript scripts/07_run_copykat.R | |
printf "\n\nRunning markerGenes_submission.R\n" | |
Rscript scripts/markerGenes_submission.R | |
printf "\n\nRunning writeout_submission.R\n" | |
Rscript scripts/writeout_submission.R |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is working now, so I just merged in the main
branch. Whether you want to include the changes suggested above is up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thank you! I think it is good to add some comments too. I am still doing some minor change on the script and output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the submission tables in the results bucket, and this is passing CI. I will approve this to mark it as approved before tomorrow's deadline.
Purpose/implementation Section
This is the follow-up of previous PR (#815) on exploring the results of inferCNV on non-ETP samples, using the new B cells identified from the stringent ScType B cell strategy (i.e. passing the cutoff of 99 percentile of non-B ScType score on B cell clusters) as the normal cells in running inferCNV.
Please link to the GitHub issue that this pull request addresses.
#811
What is the goal of this pull request?
Trying to distinguish tumor cells from normal using inferCNV method
Briefly describe the general approach you took to achieve this goal.
I used the new B cells identified from the stringent ScType B cell strategy, as the normal cells in running inferCNV. I tried to annotate the tumor cells following the approach implemented in Ewing samples (
04-infercnv.html
) [the script is not shown in the repository], but it doesn't really work on my samples (result is shown in next section). So I am considering to look for non-malignant cells from theinfercnv.png
, although I am not really sure how to automate it.If known, do you anticipate filing additional pull requests to complete this analysis module?
Results
What is the name of your results bucket on S3?
s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/infercnv_output
What types of results does your code produce (e.g., table, figure)?
_infercnv.png
and_run.final.infercnv_obj
for each sample (exceptSCPCL000082
, which does not have any new B cells)What is your summary of the results?
I ran the inferCNV using the new B cells, following the codes implemented in Ewing:
But it gives me error in step 15 when running plot_cnv() as shown below. I found out that the above code is running with
analysis_mode = "subclusters"
, and I manage to pick up from where it left off when I reran the code.When I ran with
analysis_mode = "samples"
, I did not encounter any error. I actually ran inferCNV on my own lab server, with 47 cores. It took ~10 hours (very rough estimate) to runsubclusters
mode vs ~2-3 hours forsamples
mode, and here is the results forSCPCL000076
. I don't think there is much difference between these 2 different modes.However, when I am trying to assign tumor/normal label based on inferCNV results, using both total number of CNVs and mean proportion, only the one ran with
subclusters
mode showed variation in the number of CNVs across observation cells. All observation cells have the same number of CNVs forsamples
mode. Following the similar approach, if we were to annotate tumor cells based on the total number of CNVs or mean proportion, Late Eryth will considered as tumor, but if we were to look at theinfercnv.png
, I believe that the Late Eryth (in blue parenthesis) are showing relatively more copy-number neutral profile than the other observations.Here I am showing the 4 samples that were not doing well (due to lack of B cells). Using the relatively few
new B
cells identified,SCPCL000704
andSCPCL000077
are showing some sort of copy-number calls, as compared toSCPCL000710
andSCPCL000706
.As for
SCPCL000703
, indeed it is the cleanest sample, and we observe consistent predictions as from the CopyKat. I annotate the tumor/normal by visualizing on theinfercnv.png
and decide on the number of clusters, to tease out thenon-malignant cells
from the observations (in this case, it is cluster 3).This involves some sort of manual work, but I am also not really sure how to proceed with inferCNV (
samples
mode), since thesubclusters
mode is taking too long to run, and there is no variation observed in the number of CNV for all observations when usingsamples
mode.Based on the biology of T-ALL, the known copy-number variations are deletion of chr13q, CDKN2A/B (chr9p), and loss of heterozygosity in chr6q (maybe also SUZ12 deletion [located on chr17q]) for some samples. Generally, I observed copy-number loss in chr6 in all samples, and
SCPCL000703
shows chr5q deletion (observed in some subtypes of T-ALL). I haven't gone through the copy-number loss/gain in all samples in detail to check for their biology.Please let me know what you think about this. Thank you!
Provide directions for reviewers
In this section, tell reviewers what kind of feedback you are looking for.
This information will help guide their review.
What are the software and computational requirements needed to be able to run the code in this PR?
This information will help reviewers run the code during review, if applicable.
For software, how should reviewers set up their environment (e.g.,
renv
orconda
) to run this code?For compute, can reviewers run this code on their laptop, or do they need additional computational resources such as RAM or storage?
Please make sure this information, if applicable, is documented in the README.md file.
Are there particularly areas you'd like reviewers to have a close look at?
Is there anything that you want to discuss further?
Author checklists
Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.