Implement mixture model compatibility in full pipeline #155
base: cb
Conversation
The GRAND-SLAM vs. SLAMDUNK + bakR fraction new comparison looks about as correlated as I had hoped.

I also compared the information content of the standard tcount.tsv file to the new cB file. Read counts are identical, which is good to see, but raw ConversionRates are sometimes slightly different, with the _tcount.tsv file always having the same or higher ConversionRate than the new cB file. Looking at the tables reveals that this is due to there being slightly more Ts in the cB than in the _tcount file.

The problem was that I was using the T count after it had been incremented by the number of mutations, so I changed it accordingly (a rough sketch of the kind of change is below):
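In sketch form (hypothetical names; the actual change in the PR may look different), the correction is just to use the T count from before the conversion-based increment:

```python
def n_t_for_cb(t_count_after_increment: int, tc_conversions: int) -> int:
    """Sketch with hypothetical names, not the actual slamdunk/PR code.

    The buggy version wrote the T count after it had already been incremented
    by the number of T>C conversions; the corrected nT value is the count
    prior to that increment.
    """
    return t_count_after_increment - tc_conversions
```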
|
Sorry for the delay in replying; I was on a course all week and will look into it next week. |
It's no problem; I hope the course went well. I am actually still dealing with a discrepancy between the total conversionRate reported in the _tcount.tsv and in the cB file. The change I described in my last message has a minimal impact on the Tcoverage and thus on the conversionRates, so it is still the case that the conversionRate computed from the data in the cB file is often lower than that reported in the _tcount.tsv. I need to spend some time working through the details of [...]. To be clear, I am getting each read's T count from the getTcount() method of the readInRegion class, and the total mutation counts in the cB exactly match the counts reported in _tcount.tsv. |
Hi @isaacvock, so to get this right: you are getting identical T>C conversions for each read from my code and your new cB code, but the T count is off? |
Exactly, with the new code often counting more Ts (see the very last plot in my most recent long message). |
Hm, that's a bit odd, because to get the Ts/As for a given read we really just take the read's sequence property from the pysam read object and count the Ts, so there should actually be no hidden layer in there that could deviate.
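In essence, something like this (a simplified sketch using pysam's query_sequence attribute; the real code has more bookkeeping around it):

```python
import pysam

def count_read_ts(read: pysam.AlignedSegment, utr_strand: str = "+") -> int:
    """Count Ts (plus-strand UTRs) or As (minus-strand UTRs) in the read sequence.

    Simplified sketch: just take the read sequence reported by pysam and count
    the relevant base; masking and other details are omitted.
    """
    seq = read.query_sequence or ""
    base = "T" if utr_strand == "+" else "A"
    return seq.count(base)
```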
How are you counting the Ts? |
Yeah, though I am technically comparing the [...]. Maybe I am just misunderstanding the [...]. |
Maybe here is the confusion - the [...] |
Fair point, and that definitely is a bug in my current code that I will fix (i.e., I want the nT column of the cB to track the number of Us in the original RNA prior to chemical conversion), but barring lots of A/C/G > T conversions in the read, that should cause me to undercount the number of Ts in reads, not overcount it.

I think the problem is that the way I am counting mutations, and the way you calculate coverageOnTs and conversionsOnTs, all filter out nucleotides in the read that do not lie completely within the annotated UTR; in other words, there is always some if statement that restricts counting to positions inside the UTR.

So I'll work on fixing both what I am counting (the number of Ts that would have existed if there were no mismatches) and when I am counting it (don't count nucleotides not aligning to the annotated UTR). |
Actually, I'm thinking of just counting the total number of covered reference Ts and the total number of T-to-C mutations in each read, not removing counts for nucleotides aligning to regions outside of the annotated UTR. So the relevant code to get the TC and nT columns of the cB is:
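In sketch form (illustrative only, not the exact implementation in the PR; it assumes the MD tag is present so pysam can reconstruct reference bases, and it omits strand handling and base-quality filtering):

```python
import pysam

def tc_and_nt(read: pysam.AlignedSegment):
    """Return (TC, nT) for one read: T>C conversions and covered reference Ts.

    Sketch only: every aligned position is considered, with no check on
    whether it lies inside the annotated UTR; strand handling (A>G on
    minus-strand features) and quality filtering are omitted.
    """
    tc = 0   # T-to-C conversions
    n_t = 0  # covered reference Ts (Ts that existed before any conversion)
    seq = read.query_sequence
    for q_pos, _r_pos, ref_base in read.get_aligned_pairs(matches_only=True, with_seq=True):
        if ref_base is None or ref_base.upper() != "T":
            continue
        n_t += 1
        if seq[q_pos] == "C":
            tc += 1
    return tc, n_t
```

Tallying reads by their (TC, nT) values per feature then gives the rows of the cB.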
I'll run some tests in the next couple days, but let me know if you have any objections to this strategy. |
Hi @t-neumann,

All tests have passed (I ran the full dataset I previously mentioned as well as the small test dataset you host on the repo, and made sure cB file creation worked with [...]). Is there anything else you would like me to check, or tests you would like me to run? If not, then it is ready to officially merge into the cb branch when you are ready.

Also, can you point me to where the documentation is hosted? I didn't find it within the slamdunk repo, and I'd be happy to make a pull request to update the docs as necessary.

Best, |
Hi @isaacvock, yes, please just submit the remaining changes. I would also like to test a few things on my end with a second pair of eyes, but the changes sound good to me.

For the documentation, you actually need to create a different PR on the [...].

Cheers, Tobi |
Btw, did you resolve the issue with the mismatched numbers of covered Ts? |
There are no additional changes to submit. And thank you for the information regarding the docs; I will try to make the necessary PR to update them sometime this week.

As I mentioned in my previous messages, I ended up just counting all of the T-to-C conversions and reference Ts in a read, regardless of whether the nucleotides were contained within the annotated UTR. Thus, the numbers of covered Ts and T-to-C conversions will be higher overall in the cB than in the _tcount.tsv file, but the two files agree on the overall mutational content, which is what matters. See the conversion rate comparison plot in my last message for additional context; the problem before was that the _tcount conversion rates were consistently higher than those computed from the cB. |
OK, thanks a lot. Please let me know if you need further directions for the docs. Please bear with me on merging this PR, as I may want to incorporate a few bugfixes that have been lying around and properly test the whole thing so I can release a whole new slamdunk version. I guess that works for you, since you already have a local working version with your changes, correct? |
Thank you, will do. Yeah no rush, I appreciate your help throughout this process and any time you can find to work on eventually getting this out. Let me know if you have any further questions about the changes I made. Best, |
Fully addresses #153 and extends the work done in #154.
Tests are looking good. I ended up analyzing a fastq from the "classic" BRD4-MYC paper to test the new pipeline. Confirmed that it works and that the fraction new (or new-to-total ratio, NTR) estimates are as expected given a 1-hour feed (i.e., a little under 10% labeled, or a logit(NTR) of < -2.5):
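For context on those numbers, logit(NTR) = ln(NTR / (1 - NTR)), so a logit(NTR) below -2.5 corresponds to an NTR below roughly 1 / (1 + e^2.5) ≈ 0.076, i.e. a bit under 10% new. A quick check:

```python
import math

def ntr_from_logit(x):
    """Invert logit(NTR) = ln(NTR / (1 - NTR))."""
    return 1.0 / (1.0 + math.exp(-x))

print(ntr_from_logit(-2.5))   # ~0.076, a bit under 10% new
print(math.log(0.1 / 0.9))    # logit(0.10) ~ -2.20, for comparison
```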
The same distribution obtained from GRAND-SLAM is:
In addition (a bit of a niche analysis, but nice to see), transcripts from the X chromosome are on average more stable (lower NTR) than those from other chromosomes. This was originally published here, and is something I have confirmed in multiple independent datasets collected in human cell lines:
Work left to do: