
normalization question for whole exome sequencing data #2

Closed
YingYa opened this issue Mar 22, 2016 · 13 comments

@YingYa

YingYa commented Mar 22, 2016

Hi,

This is a useful tool for mutational signature analysis.

I have some whole exome/genome sequencing data that I want to compare against the signatures.nature2013 (or signatures.cosmic) signatures. I am a bit confused about which tri.counts.method I should set: 'exome' or 'exome2genome' for WES, and 'genome' for WGS?

thanks

@raerose01
Owner

Hi,

According to the published notes about the original signatures.nature2013, they reflect absolute frequencies as they would occur across the entire genome. If you are using exome data as input, I would suggest using either the 'default' or 'exome2genome' option for tri.counts.method. In my own experience, I have had more success with the 'default' method, which does not add any additional normalization.

However, if you would like to use the 'exome' option, which normalizes based on the number of times each trinucleotide sequence is found in the exome, then you would also have to re-normalize the input signatures.
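
For reference, a minimal sketch of how the option is passed (this assumes the count matrix was built with mut.to.sigs.input; the sample ID is a hypothetical placeholder):

```r
library(deconstructSigs)

# sigs.input: samples-by-96-contexts count matrix from mut.to.sigs.input();
# "Sample_1" is a hypothetical sample ID.
fit <- whichSignatures(tumor.ref = sigs.input,
                       signatures.ref = signatures.nature2013,
                       sample.id = "Sample_1",
                       contexts.needed = TRUE,          # TRUE when the input holds raw counts
                       tri.counts.method = "default")   # or "exome2genome" for exome input
fit$weights                                             # per-signature contributions
```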

I hope this helps.

@YingYa
Author

YingYa commented Mar 23, 2016

Thanks a lot! This is very helpful!

@ohofmann

Rachel,

still trying to wrap my head around this one. We are testing the framework for a number of projects here, and I've started looking at the concordance between deconstructSigs and the signatures called originally on a number of WGS samples. This part is looking great!

We've then subset the WGS data to an in silico exome to check how stable the signatures are for different cancer types as the number of calls decreases. Somewhat surprisingly, I am seeing better correlation between the virtual exome VCF signatures and the WGS signatures when using no normalisation in deconstructSigs for the 'exome' data; when normalising with 'exome', the correlation decreases. Since we are not looking at a whole-genome background in this case, shouldn't the normalised data work better? And how does 'exome2genome' differ from the 'exome' option?

Thanks for all your work on this!

Best, Oliver

@raerose01
Owner

Hi Oliver,

The 'exome' normalization uses the trinucleotide counts calculated across the exome. The original signatures were not normalized in this way (the respective plots with/without such normalization can be found in the supplement to their 2013 Nature paper), so if you do use that normalization method, then you would also have to re-normalize the input signatures. I would suspect that's why the correlation is currently worse.

The 'exome2genome' normalization takes exome data and scales it to what the counts would be in whole-genome data, using the corresponding trinucleotide frequencies as counted in the exome and in the genome. This is how the original signatures are normalized, according to the authors. However, I have had more success using the 'default' method, without any additional normalization.
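
In case the arithmetic helps, here is a rough sketch of the difference, using the tri.counts.exome and tri.counts.genome tables that ship with the package (the counts vector and the context mapping are hypothetical placeholders):

```r
library(deconstructSigs)

# counts: hypothetical vector of mutation counts in the 96 mutation categories
# for one exome sample; tri.context: hypothetical vector giving, for each of the
# 96 categories, its trinucleotide context (a rowname of the tri.counts tables).
ratio <- tri.counts.genome[tri.context, ] / tri.counts.exome[tri.context, ]

frac.default      <- counts / sum(counts)                    # 'default': raw fractions
frac.exome2genome <- (counts * ratio) / sum(counts * ratio)  # 'exome2genome': rescale to
                                                             # genome-wide context frequencies,
                                                             # then take fractions
```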

I hope this is somewhat more clear, but let me know if you have any more questions!

Rachel

@ohofmann

Rachel,

that's super helpful, thank you -- we'll re-run with exome2genome and test again on a large ICGC cohort. I'll report back.

Cheers,

Oliver

@benbfly

benbfly commented Apr 21, 2016

Rachel,

We were also very confused by this when reading the sections of the paper and the README that deal with normalization. It seems like there should be a statistically "right" answer, rather than one option just empirically seeming to work better than another. Are the original mutation counts available for the public signature data? If so, we could re-process them all uniformly so that there would be a single right answer. Many users (including us) will be working with exome data and will therefore assume that "exome" is the correct setting. It's important to make this really clear in the documentation so that people will adopt your very useful tool.

Thanks!
Ben.

@jherrero
Collaborator

Dear Ben, all

The normalisation is indeed a confusing aspect of this analysis. This is further compounded by the fact that the signatures provided in the 2013 Nature paper are a merge of several signatures. In their original work, the authors determined the signatures independently for each cancer type and data type (exome vs whole genome) and then merged the results, so it is more difficult to compare the results we get with what was originally found.

Nevertheless, according to the information provided in the original 2013 Nature paper and personal communication with colleagues at the WT Sanger Institute, the right thing to do is to use the fractions of mutations in each context as they are when using whole-genome data, and to normalise from exome to genome fractions when using whole-exome data.

We have been looking more closely at some of the data sets to confirm this. Indeed, if you take the Medulloblastoma data (all whole-genome data; ftp://ftp.sanger.ac.uk/pub/cancer/AlexandrovEtAl/somatic_mutation_data/Medulloblastoma/Medulloblastoma_clean_somatic_mutations_for_signature_analysis.txt) and look for the signatures, using the "default" normalisation (i.e. simply dividing the number of mutations in each context by the total number of mutations) provides the best match to the original results. Note again that the signatures used in the deconstructSigs package are slightly different from the ones used in each cancer type, as the former are the consensus of the signatures across all cancer types. You can do a similar experiment with the Myeloma dataset (all whole exome), and in that case the "exome2genome" option is the one that works best.

I see your point about the naming of the options and the fact that they could be misunderstood as referring to the data type of the input. We wanted the package to be as flexible as possible and to allow the use of any other set of signatures, not necessarily defined for whole-genome data. We will consider ways to clarify this further, ideally without losing any flexibility.

Thanks for your feedback!

Regards

Javier

@ohofmann

Javier,

thanks for the feedback. I've been playing with the framework a bit to look at a subset of ICGC cancer genomes -- some with a mix of signatures (breast cancer, pancreatic cancer), and some that should have a strong single signature (melanoma, adenocarcinoma).

Running deconstructSigs on the WGS-derived VCFs gives reasonable results. Smoking/UV shows up for lung/skin; a fair number of the breast cancer samples have BRCA signatures. To test for stability, we took the same VCFs and subset them to ENSEMBL GRCh37 coding regions only, to approximate an exome analysis.
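
A minimal sketch of that subsetting step, assuming Bioconductor's VariantAnnotation and rtracklayer packages (the file names are hypothetical):

```r
library(VariantAnnotation)
library(rtracklayer)

vcf <- readVcf("sample_wgs.vcf.gz", genome = "GRCh37")   # WGS somatic calls
targets <- import("coding_regions_GRCh37.bed")           # coding/target regions as BED
seqlevelsStyle(targets) <- "NCBI"                        # hypothetical choice; match the VCF's
                                                         # chromosome naming ("1" vs "chr1")
vcf.exome <- subsetByOverlaps(vcf, targets)              # keep only calls inside the regions
writeVcf(vcf.exome, "sample_insilico_exome.vcf")
```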

Running the signatures again without any normalisation, that is in the same way as for the WGS data, gives reasonable correlation, at least for samples with >500 somatic variants or at least one strong signal. Using the 'exome2genome' option, on the other hand, all but removes any correlation and results in a quite different distribution of signatures.

Based on your Medulloblastoma example that does sound unexpected, or did I get this wrong?

Best, Oliver

@jherrero
Collaborator

Hi Oliver

Despite your results, it doesn't make much sense that the "default" normalisation produces similar results for exome and whole genome data.

There might be some discrepancy between the exome trinucleotide contexts available in the package and the strict set of coding regions that you are using for your in-silico exome data. You may want to calculate the trinucleotide frequencies for the regions you are sub-sampling. The ones in the package are based on the Agilent SureSelectV5 target regions.
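
A minimal sketch of how those frequencies could be counted for a custom set of regions, assuming Bioconductor's rtracklayer, Biostrings and BSgenome packages (the BED file name is hypothetical):

```r
library(rtracklayer)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)

targets <- import("insilico_exome_regions.bed")         # regions used for the subsetting
seqlevelsStyle(targets) <- "UCSC"                        # BSgenome uses "chr1", "chr2", ...
seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg19, targets)     # sequence of each target region
tri.counts <- colSums(trinucleotideFrequency(seqs))      # counts for all 64 trinucleotides

# Note: the tri.counts tables in deconstructSigs use the 32 pyrimidine-centred
# contexts, so these 64 counts would still need to be collapsed by reverse complement.
```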

Have you also checked that the number of mutations in the "exome" dataset is large enough (at the very least 50 mutations)?

Best, Javier

@ohofmann

Javier,

apologies for the radio silence, I got sidetracked. I finally found the time to put together a report at https://dl.dropboxusercontent.com/u/407047/Work/WWCRC/Signatures/testSigs.html. In brief, we took 80 samples from ICGC from four different cancers, checked that the WGS-derived signatures make sense, then limited the calls to those falling into the Agilent SureSelect v5 target regions and repeated the process.

As you can see from the report, at least for this data set I'd ask for a minimum of 300+ variants before trusting the signatures. Samples without at least one strong signature, or with too many signatures, seem to correlate poorly.
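
For what it's worth, a tiny sketch of such a cut-off before fitting, assuming sigs.input is the samples-by-96-contexts matrix from mut.to.sigs.input (the 300 threshold just follows the observation above):

```r
min.muts <- 300
per.sample <- rowSums(sigs.input)                        # total somatic mutations per sample
trusted <- rownames(sigs.input)[per.sample >= min.muts]  # samples worth fitting signatures for
```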

Does this roughly match your expectations? I did not see much difference in overall signatures or correlation scores when moving from default to exome2genome normalisation. Hope this helps; happy to provide additional information, as chances are I got some steps wrong during processing.

Best, Oliver

@ohofmann

Javier, Rachel, any thoughts on this? We will be adding deconstructSigs to bcbio as a default method for somatic WGS projects, but we are reluctant to enable it for exomes at this point.

Cheers,

Oliver

@jherrero
Collaborator

Hi Oliver

Very interesting analysis, thanks for sharing. Indeed, having just 50 somatic mutations to find signatures from is very limiting, especially when you consider that we are playing with 96 different contexts. We chose that lower threshold as a limit below which it is quite hopeless to find any significant result. As you show in your analysis, strong signatures are nevertheless still detectable.

I am not sure what to recommend in terms of what threshold to use for bcbio. This is the eternal discussion on whether it is best to provide some hint even if not fully reliable vs not providing anything at all. You would hope that the person interpreting the results would have read the documentation and/or warning messages that you may add, but I tend to be slightly more realistic...

On a related note, bear in mind that some signatures are more difficult to estimate than others. For instance, Signature.5 is often underestimated probably because it has a fairly flat profile.

Javier

@ohofmann

Javier,

agree with your assessment. We'll limit it to WGS analysis for the time being; I'd rather not set false expectations.

Best, Oliver
