Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

06 infercnv pr1013 14 #1026

Merged
merged 13 commits into from
Feb 6, 2025
Merged

Conversation

maud-p
Copy link
Contributor

@maud-p maud-p commented Feb 5, 2025

Purpose/implementation Section

@sjspielman , to avoid conflicts, I decided to go a step back and re-start from the current approved analysis .

Please link to the GitHub issue that this pull request addresses.

I basically copy/paste your script in maud-p#14 and added few steps at the end of the table preparation to have the input required for infercnv, with no additional column (only 4 columns gene ID, chromosome+arm, start of the gene, end of the gene)

This PR is linked to and will replace the PR#1013

My few addition here

%>%
  # Define chromosome arm order
  mutate(chrom_arm = factor(chrom_arm, levels = c(paste0("chr", rep(1:22, each = 2), c("p", "q")),
                                                  "chrXp", "chrXq", "chrYp", "chrYq"))) %>%
  # Sort genes by Chromosome arm and Start position
  arrange(chrom_arm, gene_start)  %>%
  # Select only relevant column for infercnv
  select(ensembl_id, chrom_arm, gene_start, gene_end) %>%
  # Remove ENSG duplicated (genes that are both on X and Y chromosome need to be remove before infercnv)
  distinct(ensembl_id, .keep_all = TRUE)

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this separate PR! It all looks good to me here, but I do have a question about duplicated genes. I think that the distinct() line can be removed since there do not appear to be duplicate genes, unless you think there's actually a bug and there should be duplicate genes!

Comment on lines 94 to 95
distinct(ensembl_id, .keep_all = TRUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't actually seem like there are any duplicated ENSG ids here. I compared the number of rows with and without this line, and it's the same. So, one question is: How are some of these genes you're thinking of shown in this data frame? Are they on the correct chromosome, or do we have a different parsing problem?

If the data looks fine, then this line can be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, I had previously problems with genes common on X and Y chromosomes, but it seems that there are not in the gene position file downloaded from aws.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the data looks fine to me, there is only one gene that I had previously and we are missing now, no idea why but I don't think it will impact the following analysis:
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comparing the number of genes per arms with my previous code and the one here, we only have this one gene on chr9p difference
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, so we can remove this line! I had a look at this gene, and I can't imagine it will cause a problem: https://www.genecards.org/cgi-bin/carddisp.pl?gene=GXYLT1P5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and finally comparing random positions in the two tables, we find the exact same gene/arm/coordinates:
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I am quite confident about the gene position file created 😄

@sjspielman sjspielman mentioned this pull request Feb 5, 2025
8 tasks
maud-p and others added 7 commits February 5, 2025 15:31
…on.R

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
…on.R

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
…on.R

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
…on.R

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
…on.R

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
…on.R

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
@sjspielman
Copy link
Member

Ok, so let's see if this passes in CI - if not, we can comment out the final step in the script to not run the 07 notebook in CI here, and we'll turn it back on in the next PR.

@maud-p
Copy link
Contributor Author

maud-p commented Feb 5, 2025

Great 🥳 I will re-run the 06_infercnv.R step to update the results s3 bucket by the end of the week!

comment the step ´07_combined_annotation_across_samples_exploration´
@sjspielman
Copy link
Member

Hi @maud-p, I see you introduced a lot of changes to the notebook in this commit: 0c4c77a

If possible, I'd prefer if we leave these changes to the next PR and instead keep that notebook commented out in the workflow script as you did in the next commit. If you are able to revert that change, I can approve this PR now.

You can revert with the command git revert 0c4c77aa32a26c0f120975665eb278b261a780df, or using these instructions if you're in GitKraken https://www.gitkraken.com/learn/git/problems/revert-git-commit

Those exact same changes then be "cherry-picked" in a fresh branch for a new PR where we update that notebook. You can run git cherry-pick 0c4c77aa32a26c0f120975665eb278b261a780df in a new branch to apply those exact changes again so they won't be lost and can be reviewed.

@maud-p
Copy link
Contributor Author

maud-p commented Feb 6, 2025

Hi @sjspielman , sorry, I just tried a bit to make the check passed... I now have reset the notebook 07_combined_annotation_across_samples_exploration.Rmd to the version that is in the main OpenScPCA-analysis now.
Thank you!

@sjspielman
Copy link
Member

Thanks for reverting that! I totally understand why you did it too, all good :)

Once this passes, I'll approve and we can get back to notebook 07. It's up to you if you prefer to start a new PR or merge and work with the existing PR if merge conflicts aren't too tricky!

@maud-p
Copy link
Contributor Author

maud-p commented Feb 6, 2025

Great 🥳
I think I will start a new PR, refering to the old one for track record! I am still working on the new version of the notebook 07_combined_annotation_across_samples_exploration.Rmd, I am quite close to get the average CNV profile into it 🤞
Thank you!

@sjspielman sjspielman merged commit 2377f61 into AlexsLemonade:main Feb 6, 2025
3 checks passed
@sjspielman sjspielman mentioned this pull request Feb 6, 2025
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants