
Implement mixture model compatibility in full pipeline #155

Open
wants to merge 8 commits into base: cb

Conversation


isaacvock commented May 1, 2024

Fully addresses #153 and extends the work done in #154.

Tests are looking good. I ended up analyzing a fastq from the "classic" BRD4-MYC paper to test the new pipeline. I confirmed that it works and that the fraction new (or new-to-total ratio (NTR)) estimates are as expected given a 1 hour feed (i.e., a little under 10% labeled, or a logit(NTR) of < -2.5):

[Figure: SLAMDUNK + EZbakR fraction new distribution]
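(For reference, logit here is the standard log-odds transform; a minimal sketch of the numbers quoted above:)

    import numpy as np

    def logit(p):
        # log-odds transform applied to fraction new (NTR) estimates
        return np.log(p / (1.0 - p))

    print(logit(0.10))   # ~ -2.20
    print(logit(0.075))  # ~ -2.51, i.e. "a little under 10%" labeled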

The same distribution obtained from GRAND-SLAM:

[Figure: GRAND-SLAM fraction new distribution]

In addition, a bit of a niche analysis but nice to see: transcripts from the X chromosome are on average more stable (lower NTR) than those on other chromosomes. This was originally published here, and is something I have confirmed in multiple independent datasets collected in human cell lines:

[Figure: fraction new estimates stratified by X-chromosome vs. other transcripts]

Work left to do:

  • Compare gene-wise average logit(NTR) estimates between GRAND-SLAM and SLAMDUNK + bakR directly to make sure estimates are gene-by-gene consistent
  • Run the test dataset that is in the SLAMDUNK repository
  • Anything else?


isaacvock commented May 2, 2024

GRAND-SLAM vs. SLAMDUNK + bakR fraction new comparison looks about as correlated as I hoped:

I also compared the information content of the standard tcount.tsv file to the new cB file. Read counts are identical, which is good to see:

but the raw ConversionRates are sometimes slightly different, with the _tcount.tsv file always having the same or a higher ConversionRate than the new cB file:

Looking at the tables reveals that this is due to there being slightly more Ts in the cB than the _tcount file:

The problem was that I was using the T count after it had already been incremented by the number of mismatches:

            # Read-level T count; getTcount() already counts every T in the
            # read sequence
            testN = read.getTcount()
            testk = 0
            for mismatch in read.mismatches:
                if(mismatch.referencePosition >= 0 and mismatch.referencePosition < utr.getLength()):
                    # BUG: testN is incremented again for T mismatches here,
                    # inflating the T count used in the cB key below
                    if(mismatch.isT(read.direction == ReadDirection.Reverse)):
                        testN += 1
                    if(mismatch.isTCMismatch(read.direction == ReadDirection.Reverse)):
                        testk += 1

            if (makeCB):

                # cB maps (T>C count, T count) -> number of reads
                key = (testk, testN)

                if key in cB:
                    cB[key] += 1

                else:
                    cB[key] = 1

so I changed it accordingly to:

            ...

            if (makeCB):

                # Use the raw read-level T count rather than the incremented testN
                key = (testk, read.getTcount())

                if key in cB:
                    cB[key] += 1

                else:
                    cB[key] = 1
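(For context, the cB table is just these per-UTR (T>C count, T count) tallies written long-form. A minimal, self-contained sketch with hypothetical names and a toy dictionary, not the actual slamdunk output code:)

    # Each (TC, nT) key counts the reads in a UTR with TC T>C conversions
    # out of nT Ts; the written n column is the number of such reads.
    cB = {(0, 20): 95, (1, 20): 3, (2, 19): 2}  # toy tallies for one UTR

    with open("cB_example.tsv", "w") as out:
        out.write("UTR\tTC\tnT\tn\n")
        for (tc, n_t), n_reads in sorted(cB.items()):
            out.write(f"example_utr\t{tc}\t{n_t}\t{n_reads}\n")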

@t-neumann (Owner)

Sorry for the delay in replying; I was on a course all week, but I will check it next week.


isaacvock commented May 3, 2024

It's no problem, I hope the course went well.

I am actually still dealing with a discrepancy between the total conversionRate reported in the _tcount.tsv and in the cB file. The change I described in my last message has a minimal impact on the T coverage and thus on the conversionRates. So it is still the case that the conversionRate computed from the data in the cB file is often lower than the one reported in the _tcount.tsv. I need to spend some time working through the details of computeTconversions() more carefully, but if you have any insights into the source of the discrepancy, they would be much appreciated.

To be clear, I am getting each read's T count from the getTcount() method of the readInRegion class, and the total mutation counts in the cB exactly match the counts reported in the _tcount.tsv.
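For reference, the quantity I am comparing is the per-UTR conversion rate, computed from each file as (the cB summarization is the grouping I describe below):

    conversionRate_tcount = conversionsOnTs / coverageOnTs
    conversionRate_cB     = sum(TC * n) / sum(nT * n)   # per UTR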

@t-neumann (Owner)

Hi @isaacvock

So, just to get this right: you are getting identical T>C conversion counts for each read from my code and your new cB code, but the T count is off?

@isaacvock (Author)

Exactly, with the new code often counting more Ts (see the very last plot in my most recent long message).

@t-neumann (Owner)

Hm, that's a bit odd, because to get the Ts/As for a given read we really just take the read.sequence property from the pysam read object and count the Ts, so there should be no hidden layer in there that could deviate.

    def getTcount(self):
        # For reads mapping to the minus strand, a transcript T shows up
        # as an A in the (forward-oriented) read sequence
        if(self.direction == ReadDirection.Reverse):
            return self.sequence.count("a") + self.sequence.count("A")
        else:
            return self.sequence.count("t") + self.sequence.count("T")

How are you counting the Ts?

@isaacvock (Author)

Yeah, though I am technically comparing the coverageOnTs column of the *_tcount.tsv output to the total number of Ts counted for each UTR in the cB file (in dplyr syntax: cB %>% group_by(UTR) %>% summarise(cB_CoverageOnTs = sum(nT*n)), where UTR is shorthand for all of the UTR specification columns). When I say the mutational content is identical, I mean that conversionsOnTs is equivalent to the same summarization of the cB file but with sum(TC*n).
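(The same summarization in pandas, for concreteness; the file name is hypothetical, and a single UTR column stands in for the full set of UTR specification columns:)

    import pandas as pd

    cB = pd.read_csv("cB.tsv", sep="\t")

    per_utr = (
        cB.assign(cB_CoverageOnTs=cB["nT"] * cB["n"],
                  cB_ConversionsOnTs=cB["TC"] * cB["n"])
          .groupby("UTR")[["cB_CoverageOnTs", "cB_ConversionsOnTs"]]
          .sum()
          .reset_index()
    )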

Maybe I am just misunderstanding the coverageOnTs output, though. Would you expect it to be equal to the sum of the T counts of all of the reads from a given UTR?

@t-neumann (Owner)

Maybe here is the confusion: coverageOnTs should be the number of times a reference T is covered by a read. That does not imply that the read itself has a T at this position; it could also be a T>C conversion (or any other mutation).
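To make the distinction concrete, a toy illustration (hypothetical sequences, not the slamdunk API):

    ref  = "ATTGCT"   # three reference Ts covered by the read
    read = "ATCGCT"   # the middle reference T reads as C (a T>C conversion)

    coverageOnTs = ref.count("T")   # 3: conversions still count as covered Ts
    read_Ts      = read.count("T")  # 2: counting read Ts misses the converted position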


isaacvock commented May 6, 2024

Fair point, and that definitely is a bug in my current code that I will fix (i.e., I want the nT column of the cB to track the number of Us in the original RNA prior to chemical conversion). But barring lots of A/C/G > T conversions in the read, that should cause me to undercount the number of Ts in reads, not overcount it.

I think the problem is that the way I am counting mutations and the way you calculate coverageOnTs and conversionsOnTs both filter out nucleotides in the read that do not lie completely within the annotated UTR. In other words, there is always some if statement of the form if(position >= 0 and position < utr.getLength()) that needs to be True in order for the relevant counter to be incremented. For counting Ts, though, I was just counting the number of Ts in each read, regardless of whether they fall outside the annotated UTR bounds.

So I'll work on fixing both what I am counting (the number of Ts that would have existed if there were no mismatches) and when I am counting it (don't count nucleotides that do not align to the annotated UTR region).


isaacvock commented May 6, 2024

Actually, I am now thinking of just counting the total number of covered reference Ts and the total number of T-to-C mutations in each read, without removing counts for nucleotides that align to regions outside the annotated UTR. The relevant code to get the TC and nT columns of the cB is then:

for read in readIterator:

    # Total number of T-to-C mutations in the read
    totalTCcnt = read.tcCount

    # Total number of reference Ts overlapped (minus those mutated to something other than C)
    totalTcnt = read.getTcount()

    ...

    # Increment nT count by A/C/G/N > T mismatches
    for mismatch in read.mismatches:
        # Mismatch lies within the annotated UTR bounds
        if(mismatch.referencePosition >= 0 and mismatch.referencePosition < utr.getLength()):
            if(mismatch.isT(read.direction == ReadDirection.Reverse)):
                ...
                totalTcnt += 1
                ...

        # Mismatch lies outside the UTR bounds: still counted for the cB
        else:
            if(mismatch.isT(read.direction == ReadDirection.Reverse)):
                totalTcnt += 1

    # Increment TC and nT counts for cB
    if (makeCB):

        key = (totalTCcnt, totalTcnt)

        if key in cB:
            cB[key] += 1

        else:
            cB[key] = 1

I'll run some tests in the next couple days, but let me know if you have any objections to this strategy.

@isaacvock (Author)

Hi @t-neumann,

All tests have passed (I ran on the full dataset I previously mentioned as well as the small test dataset you have hosted on the repo; I made sure cB file creation worked with slamdunk all and slamdunk count, and also made sure that omitting the new cB file flag meant the cB file did not get created), and things look good. Conversion rates inferred from the cB agree well with the SLAMDUNK _tcount.tsv conversion rates (left plot), and new-to-total ratios (NTRs) estimated with GRAND-SLAM agree well with those estimated with the new cB file (right plot).

Is there anything else you would like me to check, or tests you would like me to run? If not, then it is ready to officially merge into the cb branch whenever you are ready.

Also, can you point me to where the documentation is hosted? I didn't find it within the slamdunk repo, and I'd be happy to make a pull request to update the docs as necessary.

Best,
Isaac

@t-neumann (Owner)

Hi @isaacvock

Yes, please just submit the remaining changes. I would also like to test a few things on my end as a second pair of eyes, but the changes sound good to me.

For the documentation, you actually need to create a separate PR against the gh-pages branch, which hosts the documentation (it is rebuilt automatically). You basically just need to edit the *.rst files in doc/source/.

Cheers,

Tobi

@t-neumann (Owner)

Btw, did you resolve the issue with the mismatched numbers of covered Ts?


isaacvock commented May 13, 2024

There are no additional changes to submit. And thank you for the information regarding the docs; I will try to make the necessary PR to update them sometime this week.

As I mentioned in my previous messages, I ended up just counting all of the T-to-C conversions and reference Ts in a read, regardless of whether the nucleotides were contained within the annotated UTR. Thus, the numbers of covered Ts and T-to-C conversions will be somewhat higher overall in the cB than in the _tcount file, but the two files agree on the overall mutational content, which is what matters. See the conversion rate comparison plot in my last message for additional context; the problem before was that the _tcount conversion rates were consistently higher than those computed from the cB.

@t-neumann (Owner)

OK thanks a lot. Please let me know if you need further directions for the docs.

Please be patient with me in merging this PR, as I would like to incorporate a few bugfixes that have been lying around and properly test the whole thing in order to release a whole new slamdunk version. I guess that works for you, since you already have a local working version with your changes, correct?

@isaacvock (Author)

Thank you, will do.

Yeah no rush, I appreciate your help throughout this process and any time you can find to work on eventually getting this out. Let me know if you have any further questions about the changes I made.

Best,
Isaac
