Whole Genome normalization question #12

Closed
Rashesh7 opened this issue Mar 8, 2017 · 5 comments


Rashesh7 commented Mar 8, 2017

Hi,

I am a bit confused about when to set tri.counts.method to 'genome'.
For the 70-30 simulation dataset, tri.counts.method should be 'default', right?

So if I want to test how the tool does with genome normalization, which simulated dataset should I use? Or is there a way to build the 70-30 simulation dataset based on tri.counts.genome?

Collaborator

jherrero commented Mar 8, 2017

Hi Rashesh

Normalization is required when the signatures have been estimated from exome data and the mutation counts correspond to whole-genome data, or vice versa. Please refer to the help of whichSignatures, where this is explained in more detail.

For the simulated data under the test directory, there isn't any need to normalize the data as they don't refer to actual exome/genome counts, but simply to a linear combination of the signatures.
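To make the idea of normalization concrete, here is a minimal Python sketch (not the package's R code). It assumes that an exome-to-genome normalization multiplies each count by the ratio of that trinucleotide's occurrence in the genome to its occurrence in the exome, and then converts to fractions. The function names and all numbers below are made up for illustration; they are not the real tri.counts values.

```python
def to_fractions(counts):
    """Convert raw counts to proportions (no opportunity correction)."""
    total = sum(counts.values())
    return {ctx: n / total for ctx, n in counts.items()}


def trinucleotide(ctx):
    """Extract the reference trinucleotide, e.g. 'A[C>A]A' -> 'ACA'."""
    return ctx[0] + ctx[2] + ctx[6]


def exome_to_genome(counts, exome_occurrence, genome_occurrence):
    """Rescale exome counts by genome/exome occurrence, then normalize.

    Assumption: this mirrors the idea behind an exome-to-genome
    normalization; check the whichSignatures help for the exact behaviour.
    """
    scaled = {
        ctx: n * genome_occurrence[trinucleotide(ctx)] / exome_occurrence[trinucleotide(ctx)]
        for ctx, n in counts.items()
    }
    return to_fractions(scaled)


# Toy, made-up inputs for three mutation contexts.
counts = {"A[C>A]A": 30, "A[C>T]G": 50, "T[C>T]A": 20}
exome_occurrence = {"ACA": 2.0, "ACG": 1.0, "TCA": 4.0}
genome_occurrence = {"ACA": 4.0, "ACG": 1.0, "TCA": 4.0}

default_profile = to_fractions(counts)
rescaled_profile = exome_to_genome(counts, exome_occurrence, genome_occurrence)
```

With these toy numbers only the ACA context changes its relative weight, because it is the only one whose genome/exome ratio differs from 1. Simulated data drawn directly from the signatures have no such opportunity difference to correct.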

I am not sure I fully understand your question, though.

Author

Rashesh7 commented Mar 8, 2017

Thank you Javier for the quick reply.

@Rashesh7 Rashesh7 closed this as completed Mar 8, 2017
@Rashesh7 Rashesh7 reopened this Mar 8, 2017
Author

Rashesh7 commented Mar 8, 2017

Thank you Javier for the quick reply. Sorry, closed the issue by mistake.

So I have 2 questions:

  1. If I use a VCF file with somatic mutations from a WGS sample, would I need to normalize using tri.counts.method set to 'genome'?

  2. As you mentioned, the simulated data is a linear combination of the signatures. Is there a way to generate simulated data that mimics real data (basically, taking the trinucleotide counts into account)?

Sorry if I am a bit confusing, I am still not adept at simulations. The thing is, I am testing a few signature tools, and SigneR is one of them. They provide a simulated dataset of 21 breast cancer tumors with and without opportunity, but they did not provide a truth set or a script. Since you have been helpful enough to provide details about the simulation, I was wondering if I could generate a simulated dataset with opportunity.

Collaborator

jherrero commented Mar 8, 2017

From the whichSignatures() help: "The method of normalization chosen should match how the input signatures were normalized. For exome data, the 'exome2genome' method is appropriate for the signatures included in this package. For whole genome data, use the 'default' method to obtain consistent results."

  1. No, use the "default" method, which leaves the proportions unchanged (you are comparing WGS data to signatures calculated for whole-genome data).

  2. It depends on what you mean by that. The simulations in the test directory do create simulated counts of mutations in their context based on the signatures. Each simulated sample has approximately 500 mutations in different trinucleotide contexts, drawn according to the probabilities defined by the signatures. You could potentially simulate VCF files by generating mutations across the whole genome based on those signatures, but you would end up with the same result.
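The kind of simulation described in point 2 can be sketched as follows. This is an illustration in Python, not the script in the test directory: the two toy signatures, the four contexts, and the function names are made up, and reading "70-30" as mixture weights of 0.7 and 0.3 is an assumption. The sketch draws exactly 500 mutations per sample, whereas the real simulations have approximately 500.

```python
import random
from collections import Counter


def mix_signatures(signatures, weights):
    """Linear combination of signatures: sum over weight * probability."""
    contexts = list(next(iter(signatures.values())))
    return {
        ctx: sum(weights[name] * signatures[name][ctx] for name in weights)
        for ctx in contexts
    }


def simulate_sample(probabilities, n_mutations=500, seed=None):
    """Draw mutation contexts at random according to the given probabilities."""
    rng = random.Random(seed)
    contexts = list(probabilities)
    draws = rng.choices(
        contexts,
        weights=[probabilities[ctx] for ctx in contexts],
        k=n_mutations,
    )
    tally = Counter(draws)
    return {ctx: tally.get(ctx, 0) for ctx in contexts}


# Toy, made-up signatures over four contexts (real ones have 96).
signatures = {
    "sig_a": {"A[C>A]A": 0.1, "A[C>T]G": 0.6, "T[C>T]A": 0.2, "T[T>C]T": 0.1},
    "sig_b": {"A[C>A]A": 0.4, "A[C>T]G": 0.1, "T[C>T]A": 0.1, "T[T>C]T": 0.4},
}
weights = {"sig_a": 0.7, "sig_b": 0.3}  # assumed meaning of "70-30"

mixture = mix_signatures(signatures, weights)
sample = simulate_sample(mixture, n_mutations=500, seed=1)
```

Because the counts are drawn directly from the signature probabilities, the resulting sample is already on the same scale as the signatures, which is why no further normalization is needed for this kind of data.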

If you want to compare WGS and WES data, have a look at issue #2, where this was discussed and assessed. Essentially, we took the result of using WGS data and compared it to the result of using only the mutations in the exome for the same real samples.

@jherrero jherrero closed this as completed Mar 8, 2017
@Rashesh7
Author

Hi Javier,

I was just looking at the COSMIC site (http://cancer.sanger.ac.uk/cosmic/signatures). They mention that "Mutational signatures are displayed and reported based on the observed trinucleotide frequency of the human genome, i.e., representing the relative proportions of mutations generated by each signature based on the actual trinucleotide frequencies of the reference human genome version GRCh37"

So shouldn't that mean we should use the 'genome' method when using WGS samples?

Sorry for the repetitive question, I am just trying to make sure I understand the difference.

Thanks,
Rashesh
