GeneExpressionEvaluation tool #6602

kachulis · 2020-05-13T14:57:00Z

Adds a tool for evaluating gene expression from RNA-seq reads aligned to a whole genome.

Requires samtools/htsjdk#1480.

droazen · 2020-06-10T14:35:32Z

@lbergelson How soon is samtools/htsjdk#1480 going to trickle down into the GATK?

@kachulis Can you nominate a reviewer for this branch?

lbergelson · 2020-06-10T16:38:48Z

There are a few other PRs I want in htsjdk first, but we can do a release whenever, really.

meganshand

This looks like a very easy to use tool. I'm curious what the specific use case you have for it is. Just a couple of small comments/questions. Also, thanks for the thorough unit testing :)

meganshand · 2020-07-01T14:11:57Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+ *  <li>Reads are inward-facing</li>
+ *</p>
+ *
+ * <p>Reads can be either from spliced are unspliced RNA.  If from spliced RNA, alignment blocks of reads are taken as their coverage.  


Suggested change

- * <p>Reads can be either from spliced are unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage.
+ * <p>Reads can be either from spliced or unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage.

meganshand · 2020-07-01T14:35:35Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+ *         -I input.bam
+ *         -G geneAnnotations.gff3
+ *         -O output.tsv
+ *         --spliced true


Suggested change

- * --spliced true
+ * --spliced false

meganshand · 2020-07-01T14:36:26Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+ *</p>
+ *
+ * <p>Reads can be either from spliced are unspliced RNA.  If from spliced RNA, alignment blocks of reads are taken as their coverage.  
+ * If from unspliced RNA, the entire region from the start of the earliest read to the end of the latest read is taken as the fragment


I'm not sure I understand what "earliest" and "latest" read means here.

changed to describe via 5'/3', so hopefully clearer.
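
The spliced/unspliced distinction quoted above can be illustrated with a minimal sketch. It is not the tool's code: `CoverageSketch`, `Span`, and `fragmentCoverage` are hypothetical names, and the sketch assumes both reads of a fragment lie on the same contig.

```java
import java.util.ArrayList;
import java.util.List;

final class CoverageSketch {
    /** Closed interval [start, end] on a single contig (hypothetical helper type). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns the regions a fragment is taken to cover.
     * readBlocks and mateBlocks are the alignment blocks of the two reads.
     */
    static List<Span> fragmentCoverage(List<Span> readBlocks, List<Span> mateBlocks, boolean spliced) {
        List<Span> all = new ArrayList<>(readBlocks);
        all.addAll(mateBlocks);
        if (spliced || all.isEmpty()) {
            // Spliced RNA: only the aligned blocks count, so gaps between blocks are not covered.
            // Blocks are returned as-is, without merging overlaps.
            return all;
        }
        // Unspliced RNA: a single span from the leftmost aligned base to the rightmost.
        int start = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (Span s : all) {
            start = Math.min(start, s.start);
            end = Math.max(end, s.end);
        }
        List<Span> result = new ArrayList<>();
        result.add(new Span(start, end));
        return result;
    }
}
```

With read blocks 100-150 and 300-350 and a mate block 400-450, the spliced case yields three separate spans, while the unspliced case yields the single span 100-450.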

meganshand · 2020-07-01T14:41:38Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+ *         
+ * </p>
+ *
+ * <p>Multi-overlapping fragments (fragment alignments which overlap multiple grouping features) can be handled in two ways.  Equals weight can be given to each grouping feature,


Suggested change

- * <p>Multi-overlapping fragments (fragment alignments which overlap multiple grouping features) can be handled in two ways. Equals weight can be given to each grouping feature,
+ * <p>Multi-overlapping fragments (fragment alignments which overlap multiple grouping features) can be handled in two ways. Equal weight can be given to each grouping feature,
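
As a rough illustration of the equal-weight option only, the sketch below splits each fragment's single count evenly among the features it overlaps. The quoted Javadoc line is cut off, so the 1/N weighting is an assumption here, and `EqualWeightSketch` and `addFragment` are hypothetical names rather than the tool's API.

```java
import java.util.List;
import java.util.Map;

final class EqualWeightSketch {
    /**
     * Adds one fragment's contribution to the running per-feature totals.
     * Assumption: a fragment overlapping N grouping features gives weight 1/N to each.
     */
    static void addFragment(Map<String, Double> counts, List<String> overlappingFeatures) {
        if (overlappingFeatures.isEmpty()) {
            return; // the fragment overlaps nothing, so nothing is counted
        }
        double weight = 1.0 / overlappingFeatures.size();
        for (String feature : overlappingFeatures) {
            counts.merge(feature, weight, Double::sum);
        }
    }
}
```

Under this assumption, a fragment overlapping only geneA adds 1.0 to geneA, and a fragment overlapping geneA and geneB adds 0.5 to each.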

meganshand · 2020-07-01T14:45:14Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+ *     </p>
+ * </p>
+ *
+ * <p>Multi-mapping fragments (fragments whose reads map to multiple locations) can also be handled in two ways.  They can be ignored, in which case only fragments with a single mapping are counted.


Does Multi-mapping here mean a secondary alignment (both reads with MQ0)?

meganshand · 2020-07-01T18:06:53Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+
+    static boolean inGoodPair(final GATKRead read, int minimumMappingQuality) {
+
+        boolean ret = !read.mateIsUnmapped() && read.isProperlyPaired() && read.getContig().equals(read.getMateContig()) &&


Does properlyPaired not cover the rest of these checks?

In practice, probably yes, but "properlyPaired" is not particularly strongly specified (the spec says this is "each segment properly aligned according to the aligner"). In general I want to trust the aligner's decision about what this means, but the logic I use later on fails if either the mate is unmapped or the read and mate are on different contigs. So, on the technical possibility that the aligner could mark an alignment as properlyPaired when either of those conditions is true, I want to check them explicitly here. I will remove the strand checks, though, in light of @barkasn's comment that strandedness is protocol dependent.
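
The defensive check described in this reply can be sketched as follows. This is a simplified stand-in, not the PR's method: `ReadInfo` is a hypothetical type replacing `GATKRead`, and the mapping-quality argument and any conditions after the truncated `&&` in the quoted line are deliberately left out.

```java
final class GoodPairSketch {
    /** Minimal stand-in for the read fields used below (hypothetical type, not GATKRead). */
    static final class ReadInfo {
        final String contig;
        final String mateContig;
        final boolean mateUnmapped;
        final boolean properlyPaired;

        ReadInfo(String contig, String mateContig, boolean mateUnmapped, boolean properlyPaired) {
            this.contig = contig;
            this.mateContig = mateContig;
            this.mateUnmapped = mateUnmapped;
            this.properlyPaired = properlyPaired;
        }
    }

    /**
     * Trusts the aligner's "properly paired" flag, but also checks explicitly the two
     * conditions the downstream logic depends on: mate mapped, and mate on the same contig.
     */
    static boolean inGoodPair(ReadInfo read) {
        return !read.mateUnmapped
                && read.properlyPaired
                && read.contig != null
                && read.contig.equals(read.mateContig);
    }
}
```

The explicit checks are redundant for most aligners, but they keep the later mate-based logic safe even if an aligner sets the flag on an unusual pair.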

meganshand · 2020-07-01T18:17:16Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+        return ret;
+    }
+
+    static List<Interval> getAlignmentIntervals(final GATKRead read, final boolean spliced, final int minimumMappingQuality) {


Can you add a comment somewhere here to mention that this should only be called on one of the reads in a pair?

I think it actually could be called on an unpaired read, and in fact is called on reads which are not in good pairs. All the logic which relies on a mate sits inside an inGoodPair check.

I think it would be helpful to have a comment here that if a read is paired it should only be called on one read (otherwise you'd be double counting).

meganshand · 2020-07-01T18:34:24Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+
+    @Override
+    public void apply(GATKRead read, ReferenceContext referenceContext, FeatureContext featureContext) {
+        if ((!read.isReverseStrand() || !inGoodPair(read, mappingQualityFilter.minMappingQualityScore))) {


Why do you want to include reverse reads that are not in good pairs here?

The idea is that for good pairs, we only need to look at one of the reads, since we can get its mate information from it. However, for non-good pairs, we want to consider each read separately. For example, if we are using unspliced data and the two reads aligned 10000 bases away from each other, so that the pair is marked as not properly paired, we don't want to treat that as a fragment that is 10000 bases long and thus covers every gene between the two alignments.
I suppose it would also be reasonable (as I think is your point) to just ignore all data that isn't properly paired, on the assumption that something funny has happened with those reads and so they shouldn't be trusted. So I will add an option to use "non-good pair" data or not.

I don't think you necessarily need to add that option. It's hard to say without looking at data if you should be ignoring "non good" pairs completely or not, so I'd wait until someone asks for that specific use case (unless you've already done it because I've commented here too late 😃).

Yeah, I was also just realizing as I was thinking about how to implement it that the effect can mostly be achieved through standard command line read-filters, so I'm going to leave as-is for now.
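
The rule discussed in this thread (handle a good pair once, via its forward read, and handle every read of a non-good pair on its own) might be sketched like this. `CountOnceSketch` and `Read` are hypothetical names, and the sketch assumes a good pair consists of one forward-strand and one reverse-strand read.

```java
import java.util.List;

final class CountOnceSketch {
    /** Hypothetical stand-in carrying only the two properties the rule needs. */
    static final class Read {
        final boolean reverseStrand;
        final boolean inGoodPair;

        Read(boolean reverseStrand, boolean inGoodPair) {
            this.reverseStrand = reverseStrand;
            this.inGoodPair = inGoodPair;
        }
    }

    /** Mirrors the quoted condition: skip only reverse-strand reads that are in a good pair. */
    static boolean shouldProcess(Read read) {
        return !read.reverseStrand || !read.inGoodPair;
    }

    /** Number of reads that would be processed, i.e. fragments plus separately counted reads. */
    static long countProcessed(List<Read> reads) {
        return reads.stream().filter(CountOnceSketch::shouldProcess).count();
    }
}
```

For one good pair plus one non-good pair, three reads are processed: the forward read of the good pair (standing in for the whole fragment) and both reads of the non-good pair.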

meganshand · 2020-07-01T18:40:40Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+/**
+ * Evaluate gene expression from RNA-seq reads aligned to genome.
+ *
+ * <p>This tool evaluates gene expression from RNA-seq reads aligned to genome.  Features to evaluate expression over are defined in an input annotation file in gff3 fomat


Can you add here what you're counting to get gene expression?

Do you mean just specify that this is counting fragments?

yeah, I think that would be helpful.

meganshand · 2020-07-01T18:49:08Z

.../broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluationIntegrationTest.java

+    // If true, update the expected outputs in tests that assert a match vs. prior output,
+    // instead of actually running the tests. Can be used with "./gradlew test -Dtest.single=GeneExpressionEvaluationIntegrationTest"
+    // to update all of the exact-match tests at once. After you do this, you should look at the
+    // diffs in the new expected outputs in git to confirm that they are consistent with expectations.


This feels dangerous. We used to have methods for updating all the md5 checks in GATK3 and I thought we were moving away from this style test in GATK4. Are there examples of this kind of quick update for other GATK4 integration tests? (What I remember happening in GATK3 was that you'd end up with so many tests changing that you'd run the script to update them and not actually look at the diff for hundreds of changed files.)

I stole this from HaplotypeCallerIntegrationTest, but I'm not sure it's used anywhere else. I agree it's probably better to make people work a little harder to update expected test results, so I will remove this.

Really? I mean, if it's already in HC then it's probably fine. It certainly makes it easier, and it doesn't become a problem until you have hundreds of tests.
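
For readers unfamiliar with the pattern under discussion, here is a minimal sketch of an "update expected outputs" switch. It is an illustration under assumptions, not the code in either integration test: `ExpectedOutputSketch`, `updateExpectedOutputs`, and `checkOrUpdate` are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ExpectedOutputSketch {
    // Hypothetical switch: must stay false in committed code, otherwise nothing is checked.
    static boolean updateExpectedOutputs = false;

    /**
     * Compares the tool's output with the stored expected file, or, when the switch is on,
     * overwrites the expected file with the new output instead of comparing.
     */
    static void checkOrUpdate(Path actual, Path expected) throws IOException {
        if (updateExpectedOutputs) {
            Files.copy(actual, expected, StandardCopyOption.REPLACE_EXISTING);
            return;
        }
        String actualText = new String(Files.readAllBytes(actual), StandardCharsets.UTF_8);
        String expectedText = new String(Files.readAllBytes(expected), StandardCharsets.UTF_8);
        if (!actualText.equals(expectedText)) {
            throw new AssertionError("Output differs from expected file " + expected);
        }
    }
}
```

The risk raised in the review is visible in the first branch: with the switch on, every test passes trivially, so the only remaining safeguard is someone reading the resulting diff of the expected files.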

barkasn

Looks good overall. I made some comments as per our discussion and I will review again when I can run it. I suggest adding a one-line comment to document each function, as well as parameter descriptions. Also consider auto-formatting; there is some inconsistent spacing and some long lines.

barkasn · 2020-07-01T20:04:00Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+                    alignmentIntervals.add(alignmentBlockInterval);
+
+                    if (!overlapsMate && read.overlaps(alignmentBlockInterval)) {
+                        overlapsMate = true;


This doesn't seem like it's used.

good catch!

barkasn · 2020-07-01T20:04:20Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/rnaseq/GeneExpressionEvaluation.java

+ *     </p>
+ * </p>
+ * 
+ *<p>Reads are assumed to be paired-end.  Reads which are in a "good pair" are counted once together as a single fragment.  Reads which are not in a "good pair" are each counted separately.


Consider adding a fragment mate size distance cutoff for this. Document that you currently rely on the aligner.

Added documentation. I want to avoid a simple fragment size cutoff since it can be easily thwarted by different protocols or splicing. And I believe most aligners handle fragment sizes in a more sophisticated manner when setting the properly paired flag.

kachulis · 2020-07-07T18:58:04Z

@meganshand @barkasn thanks for the reviews! I have made some changes in response to your comments. I will ping you both for final review once necessary changes trickle down from htsjdk.

gatk-bot · 2020-07-20T15:04:28Z

Travis reported job failures from build 30984
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	openjdk11	30984.12	logs
integration	openjdk8	30984.2	logs

gatk-bot · 2020-07-20T16:10:53Z

Travis reported job failures from build 30988
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	openjdk11	30988.12	logs
integration	openjdk8	30988.2	logs

kachulis · 2020-07-21T14:46:33Z

@meganshand @barkasn the necessary changes have now made their way up through htsjdk, so this is ready for another round of review.

meganshand

Looks good, thanks @kachulis!

barkasn

Approved!

droazen assigned kachulis Jun 17, 2020

kachulis requested review from meganshand and barkasn June 26, 2020 17:15

meganshand reviewed Jul 1, 2020

barkasn reviewed Jul 1, 2020

kachulis added 22 commits July 20, 2020 09:57

f1d8aaf  gtf
d02c051  gff3
cb4791d  stuff
da43856  CollectFragmentCounts
7a9d946  count mq 0
4f3d947  using htsjdk gff3codec
9fc75e2  stuff
ab1146b  stuff
ec675be  Adding CountFeatureCoverage
b4ba254  remove print statement
3754e3d  overlap detector
2aa8507  barebones features
8c9b927  overlap and group features
917fd87  output 2 decimals
20be835  test
22513aa  testing
ceee6a7  stuff
3939f27  testing some things
f0820ab  stuff
15726c9  agreement with featureCounts
4081279  MultiOverlap method options
db956ec  weight code into MultiOverlap emun methods

kachulis added 12 commits July 20, 2020 09:58

f2fc100  integration tests
4adefca  cleanup
54e44dd  tests
9de170f  unit tests
6bde359  documentation
fee8f47  cleanup
37aa449  Gff3 score field
1e36a65  adjust for htsjdk breaking change
072eb96  responding to comments
882f418  float changed to double
6dca2c3  small documentation change
e5848b1  small fixes after rebase

kachulis force-pushed the ck_count_fragments_features branch from 034463a to e5848b1 Compare July 20, 2020 14:17

099c3ea  test files
919a066  test files

kachulis marked this pull request as ready for review July 21, 2020 14:45

7c84daa  bug in output file validation

meganshand approved these changes Jul 22, 2020

barkasn approved these changes Aug 11, 2020

kachulis merged commit 0cb632a into master Aug 11, 2020

kachulis deleted the ck_count_fragments_features branch August 11, 2020 22:49

kachulis restored the ck_count_fragments_features branch August 11, 2020 22:49

jamesemery mentioned this pull request Aug 31, 2020

Lingering Improvements to DepthOfCoverage tool beta version #6491

Open

7 tasks

mwalker174 pushed a commit that referenced this pull request Nov 3, 2020

GeneExpressionEvaluation tool (#6602)

05d3d59

Adds a tool for evaluating gene expression from rna-seq aligned to whole genome.

