-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GeneExpressionEvaluation tool #6602
Conversation
@lbergelson How soon is samtools/htsjdk#1480 going to trickle down into the GATK? @kachulis Can you nominate a reviewer for this branch? |
There are a few other prs I want in in htsjdk first but we can do a release whenever really. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a very easy to use tool. I'm curious what the specific use case you have for it is. Just a couple of small comments/questions. Also, thanks for the thorough unit testing :)
* <li>Reads are inward-facing</li> | ||
*</p> | ||
* | ||
* <p>Reads can be either from spliced are unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* <p>Reads can be either from spliced are unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage. | |
* <p>Reads can be either from spliced or unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage. |
* -I input.bam | ||
* -G geneAnnotations.gff3 | ||
* -O output.tsv | ||
* --spliced true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* --spliced true | |
* --spliced false |
*</p> | ||
* | ||
* <p>Reads can be either from spliced are unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage. | ||
* If from unspliced RNA, the entire region from the start of the earliest read to the end of the latest read is taken as the fragment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what "earliest" and "latest" read means here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to describe via 5'/3', so hopefully clearer.
* | ||
* </p> | ||
* | ||
* <p>Multi-overlapping fragments (fragment alignments which overlap multiple grouping features) can be handled in two ways. Equals weight can be given to each grouping feature, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* <p>Multi-overlapping fragments (fragment alignments which overlap multiple grouping features) can be handled in two ways. Equals weight can be given to each grouping feature, | |
* <p>Multi-overlapping fragments (fragment alignments which overlap multiple grouping features) can be handled in two ways. Equal weight can be given to each grouping feature, |
* </p> | ||
* </p> | ||
* | ||
* <p>Multi-mapping fragments (fragments whose reads map to multiple locations) can also be handled in two ways. They can be ignored, in which case only fragments with a single mapping are counted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Multi-mapping here mean a secondary alignment (both reads with MQ0)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
|
||
static boolean inGoodPair(final GATKRead read, int minimumMappingQuality) { | ||
|
||
boolean ret = !read.mateIsUnmapped() && read.isProperlyPaired() && read.getContig().equals(read.getMateContig()) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does properlyPaired not cover the rest of these checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practicality probably yes, but "properlyPaired" is not particularly strongly specified (the spec says this is "each segment properly aligned according to the aligner"). In general I want to trust the aligner's decision about what this means, but the logic I use later on fails if either the mate is unmapped or the read and mate are on different contigs. So on the technical possibility that the aligner could choose to mark an alignment as properlyPaired when either of those conditions are true, I want to explicitly check them here. I will remove the strand checks though, in light of @barkasn 's comment that it is protocol dependent.
return ret; | ||
} | ||
|
||
static List<Interval> getAlignmentIntervals(final GATKRead read, final boolean spliced, final int minimumMappingQuality) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment somewhere here to mention that this should only be called on one of the reads in a pair.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it actually could be called on an unpaired read, and in fact is called on reads which are not in good pairs. All the logic which relies on a mate sits inside an inGoodPair
check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to have a comment here that if a read is paired it should only be called on one read (otherwise you'd be double counting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
@Override | ||
public void apply(GATKRead read, ReferenceContext referenceContext, FeatureContext featureContext) { | ||
if ((!read.isReverseStrand() || !inGoodPair(read, mappingQualityFilter.minMappingQualityScore))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to include reverse reads that are not in good pairs here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that for good pairs, we only need to look at one of the reads, since we can get its mate information from it. However, for non good pairs, we want to consider each read separately. For example if we are using unspliced data and the two reads aligned 10000 bases away from each other, and is thus marked as not properly paired, we don't want to consider that a fragment that is 10000 bases long and thus covers every gene between the two alignments.
I suppose it would also be reasonable (as I think is your point), to just ignore all data that isn't properly paired on the assumption that something funny has happened with those reads, and so they shouldn't be trusted. So I will add an option to use "non good pair" data or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you necessarily need to add that option. It's hard to say without looking at data if you should be ignoring "non good" pairs completely or not, so I'd wait until someone asks for that specific use case (unless you've already done it because I've commented here too late 😃).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was also just realizing as I was thinking about how to implement it that the effect can mostly be achieved through standard command line read-filters, so I'm going to leave as-is for now.
/** | ||
* Evaluate gene expression from RNA-seq reads aligned to genome. | ||
* | ||
* <p>This tool evaluates gene expression from RNA-seq reads aligned to genome. Features to evaluate expression over are defined in an input annotation file in gff3 fomat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add here what you're counting to get gene expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean just specify that this is counting fragments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I think that would be helpful.
// If true, update the expected outputs in tests that assert a match vs. prior output, | ||
// instead of actually running the tests. Can be used with "./gradlew test -Dtest.single=GeneExpressionEvaluationIntegrationTest" | ||
// to update all of the exact-match tests at once. After you do this, you should look at the | ||
// diffs in the new expected outputs in git to confirm that they are consistent with expectations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels dangerous. We used to have methods for updating all the md5 checks in GATK3 and I thought we were moving away from this style test in GATK4. Are there examples of this kind of quick update for other GATK4 integration tests? (What I remember happening in GATK3 was that you'd end up with so many tests changing that you'd run the script to update them and not actually look at the diff for hundreds of changed files.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stole this from HaplotypeCallerIntegrationTest
, but I'm not sure it's used anywhere else. I agree it's probably better to make people work a little harder to update expected test results, so will remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really? I mean, if it's already in HC then it's probably fine. It certainly makes it easier, and it doesn't become a problem until you have hundreds of tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. I made some comments as per our discussion and I will review again when I can run it. I suggest adding a one line comment to document each function as well as parameter descriptions. Also consider auto-formatting, there is some inconsistent spacing and long lines.
alignmentIntervals.add(alignmentBlockInterval); | ||
|
||
if (!overlapsMate && read.overlaps(alignmentBlockInterval)) { | ||
overlapsMate = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem like its used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
* </p> | ||
* </p> | ||
* | ||
*<p>Reads are assumed to be paired-end. Reads which are in a "good pair" are counted once together as a single fragment. Reads which are not in a "good pair" are each counted separately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a fragment mate size distance cutoff for this. Document that you currently rely on the aligner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added documentation. I want to avoid a simple fragment size cutoff since it can be easily thwarted by different protocols or splicing. And I believe most aligners handle fragment sizes in a more sophisticated manner when setting the properly paired flag.
@meganshand @barkasn thanks for the reviews! I have made some changes in response to your comments. I will ping you both for final review once necessary changes trickle down from htsjdk. |
034463a
to
e5848b1
Compare
@meganshand @barkasn the necessary changes have now made their way up through htsjdk, so this is now ready for another round of review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks @kachulis!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved!
Adds a tool for evaluating gene expression from rna-seq aligned to whole genome.
Adds a tool for evaluating gene expression from RNA-seq reads aligned to whole genome.
requires samtools/htsjdk#1480