-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Why are there so many reads not containing one of the expected bases at known SNP positions ? #11
Comments
And when I use bismark v0.18.0, there are too many "This read pair has a worse alignment score than the best alignment so far and will be ignored even though it is ambiguous in itself" printing on screen. I took a glance at the script, I think it should be annotation ? |
To start with the second question first, I accidentally left a warning message in there which I have corrected in the v0.18.1 release (https://github.com/FelixKrueger/Bismark/releases). Just ignore them for the time being. |
Maybe first of all thanks for your feedback, your project sounds quite exciting! I am assuming you performed genome re-sequencing, called SNPs on this data and then used the VCF file with the SNPsplit genome preparation? Did you get a chance to also sequence the maternal and paternal genomes as well in order to consistently phase the genomes? I would assume that unphased SNP data would also work in the SNPsplit genome preparation process and you should be able to get allele-specific methylation information for reads covering individual SNPs, but you would not be able to assign this information consistently to the maternal or paternal alleles. I just had a look at the code which is responsible for assigning reads to the did not contain expected bases counter, and found that this only occurs if a read is overlapping known SNPs but returns as neither genome 1 nor genome 2 (both have a count of 0). As such these reads are a subset of the Unassignable fraction of reads. If SNP positions involve a C in either the reference or the alternative locus the position is skipped for the scoring entirely. Therefore I would assume that the reads do indeed have one of the two remaining (neither reference nor SNP) bases, or potentially Ns, at the positions marked in bold. Theoretically it should work to run SNPsplit with the option I just had a look at the example SNPsplit report for bisulfites data which we put up on the Babraham website (https://www.bioinformatics.babraham.ac.uk/projects/SNPsplit/BS-seq_report.pdf) it does actually look like we saw a similar value of unexpected bases. Maybe the bisulfite treatment has something to do with this I am wondering? Regarding the stringency of Bismark when aligning to a genome that is known to involve SNPs, or when the genome is not quite as finished as the human or mouse genomes are, I think it would be fair to relax the --score_min function a little to allow a few more 'errors' during the alignment, e.g. I would probably start by using I hope we will get to the bottom of this, All the best, Felix |
Sorry for the late reply.
The majority of the bold part is condition 1, which accounts for almost 99% of bold part. Condition 2, 3,4 are normal. I think I do find something about condition5. SNPsplit running info: We can see that the read is from OB strand. In fact this read can be assigned to G2. Condition 5 accounts for about 0.5 %. I notice that SNPsplit do take SNP with C or G into account. But the judgement statement of score_bisulfite_SNPs may result in this error ? I have modified score_bisulfite_SNPs function in SNPsplit: score_bisulfite_SNPs. I think there is a more simple judgement statement but I write all in order not to miss any condition. Thank you very much. (I hope I express myself clearly, haha~) |
Just to say that I haven't forgotten about this, but I will need some time to read this carefully and will hopefully get to it soon. |
Hi iss1227, Apologies for the delay, but I have now looked at this issue in more detail. I agree that there were some sub-cases which had been overlooked previously. In more detail: only for C>G/G>C SNPs, and there only for either the top or bottom strand (depending on the position), and there only the bisulfite converted form the reads had not been assigned to one of the alleles. I have now tried to add this extra case to the logic that was already in place, and here are the new results using a 500000 line test case (129/Cast mouse hybrid):
It seems that this rescues a few extra reads even though the overall gain was only around 0.2% per allele. I have also tried out the code in the subroutine you linked above and this produced slightly different results:
I did not go into further detail understanding why your results differ (but one is going up the other one down...). Initially I was a little puzzled that the total number of reads did not contain one of the expected bases at known SNP positions was still around 10%, but when I had a second look I realised that C>T or G>A SNPs (which are being ignored for the allele-assignment) would also result in reads that can't be assigned to either allele. Thus, the name of the counter is probably not ideal as it is confounding C>T SNPs (which are being ignored intentionally) and SNPs that really didn't contain one of the expected bases. For a future release I should be looking into a way of counting these reads differently. So for the time being, would you mind cloning the latest development version and testing whether it resolves the issues you have seen so far? Many thanks! Felix |
I have now had another go at the Allele-tagging report. I have introduced a new counter for the
These two changes have now made it into a new release v0.3.3 because I believe it is important not to introduce the bias in allele-assignment. Thanks for bringing this to my attention, and I still welcome any further comments on this issue. Best, Felix |
Hi, I've been using SNPsplit recently with our monkey data. The input is the output of bismark (v 0.18.0,default parameters). SNP information is generated by genome resequencing. And this is part of my allele-tagging report. SNPsplit seems working well.
What I am confused about is the bold text. Why are there so many reads not containing one of the expected bases at known SNP positions (Well I think it's a little much )? Is it because of mismatches allowed by bismark? I also wonder if there are recommended parameters for bismark when the output is used for phasing ?
Thank you very much for comments.
Allele-tagging report
Processed 366283612 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
288619297 reads were unassignable (78.80%)
35559649 reads were specific for genome 1 (9.71%)
34895376 reads were specific for genome 2 (9.53%)
30805929 reads did not contain one of the expected bases at known SNP positions (8.41%)
7209290 contained conflicting allele-specific SNPs (1.97%)
The text was updated successfully, but these errors were encountered: