Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeneExpressionEvaluation tool #6602

Merged
merged 37 commits into from
Aug 11, 2020
Merged

GeneExpressionEvaluation tool #6602

merged 37 commits into from
Aug 11, 2020

Conversation

kachulis
Copy link
Contributor

Adds a tool for evaluating gene expression from RNA-seq reads aligned to whole genome.

requires samtools/htsjdk#1480

@droazen
Copy link
Contributor

droazen commented Jun 10, 2020

@lbergelson How soon is samtools/htsjdk#1480 going to trickle down into the GATK?

@kachulis Can you nominate a reviewer for this branch?

@lbergelson
Copy link
Member

There are a few other prs I want in in htsjdk first but we can do a release whenever really.

Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a very easy to use tool. I'm curious what the specific use case you have for it is. Just a couple of small comments/questions. Also, thanks for the thorough unit testing :)

* <li>Reads are inward-facing</li>
*</p>
*
* <p>Reads can be either from spliced are unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* <p>Reads can be either from spliced are unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage.
* <p>Reads can be either from spliced or unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage.

* -I input.bam
* -G geneAnnotations.gff3
* -O output.tsv
* --spliced true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* --spliced true
* --spliced false

*</p>
*
* <p>Reads can be either from spliced are unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage.
* If from unspliced RNA, the entire region from the start of the earliest read to the end of the latest read is taken as the fragment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what "earliest" and "latest" read means here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to describe via 5'/3', so hopefully clearer.

*
* </p>
*
* <p>Multi-overlapping fragments (fragment alignments which overlap multiple grouping features) can be handled in two ways. Equals weight can be given to each grouping feature,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* <p>Multi-overlapping fragments (fragment alignments which overlap multiple grouping features) can be handled in two ways. Equals weight can be given to each grouping feature,
* <p>Multi-overlapping fragments (fragment alignments which overlap multiple grouping features) can be handled in two ways. Equal weight can be given to each grouping feature,

* </p>
* </p>
*
* <p>Multi-mapping fragments (fragments whose reads map to multiple locations) can also be handled in two ways. They can be ignored, in which case only fragments with a single mapping are counted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Multi-mapping here mean a secondary alignment (both reads with MQ0)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes


static boolean inGoodPair(final GATKRead read, int minimumMappingQuality) {

boolean ret = !read.mateIsUnmapped() && read.isProperlyPaired() && read.getContig().equals(read.getMateContig()) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does properlyPaired not cover the rest of these checks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practicality probably yes, but "properlyPaired" is not particularly strongly specified (the spec says this is "each segment properly aligned according to the aligner"). In general I want to trust the aligner's decision about what this means, but the logic I use later on fails if either the mate is unmapped or the read and mate are on different contigs. So on the technical possibility that the aligner could choose to mark an alignment as properlyPaired when either of those conditions are true, I want to explicitly check them here. I will remove the strand checks though, in light of @barkasn 's comment that it is protocol dependent.

return ret;
}

static List<Interval> getAlignmentIntervals(final GATKRead read, final boolean spliced, final int minimumMappingQuality) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment somewhere here to mention that this should only be called on one of the reads in a pair.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it actually could be called on an unpaired read, and in fact is called on reads which are not in good pairs. All the logic which relies on a mate sits inside an inGoodPair check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful to have a comment here that if a read is paired it should only be called on one read (otherwise you'd be double counting).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


@Override
public void apply(GATKRead read, ReferenceContext referenceContext, FeatureContext featureContext) {
if ((!read.isReverseStrand() || !inGoodPair(read, mappingQualityFilter.minMappingQualityScore))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to include reverse reads that are not in good pairs here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that for good pairs, we only need to look at one of the reads, since we can get its mate information from it. However, for non good pairs, we want to consider each read separately. For example if we are using unspliced data and the two reads aligned 10000 bases away from each other, and is thus marked as not properly paired, we don't want to consider that a fragment that is 10000 bases long and thus covers every gene between the two alignments.
I suppose it would also be reasonable (as I think is your point), to just ignore all data that isn't properly paired on the assumption that something funny has happened with those reads, and so they shouldn't be trusted. So I will add an option to use "non good pair" data or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you necessarily need to add that option. It's hard to say without looking at data if you should be ignoring "non good" pairs completely or not, so I'd wait until someone asks for that specific use case (unless you've already done it because I've commented here too late 😃).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was also just realizing as I was thinking about how to implement it that the effect can mostly be achieved through standard command line read-filters, so I'm going to leave as-is for now.

/**
* Evaluate gene expression from RNA-seq reads aligned to genome.
*
* <p>This tool evaluates gene expression from RNA-seq reads aligned to genome. Features to evaluate expression over are defined in an input annotation file in gff3 fomat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add here what you're counting to get gene expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean just specify that this is counting fragments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think that would be helpful.

// If true, update the expected outputs in tests that assert a match vs. prior output,
// instead of actually running the tests. Can be used with "./gradlew test -Dtest.single=GeneExpressionEvaluationIntegrationTest"
// to update all of the exact-match tests at once. After you do this, you should look at the
// diffs in the new expected outputs in git to confirm that they are consistent with expectations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels dangerous. We used to have methods for updating all the md5 checks in GATK3 and I thought we were moving away from this style test in GATK4. Are there examples of this kind of quick update for other GATK4 integration tests? (What I remember happening in GATK3 was that you'd end up with so many tests changing that you'd run the script to update them and not actually look at the diff for hundreds of changed files.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stole this from HaplotypeCallerIntegrationTest, but I'm not sure it's used anywhere else. I agree it's probably better to make people work a little harder to update expected test results, so will remove this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really? I mean, if it's already in HC then it's probably fine. It certainly makes it easier, and it doesn't become a problem until you have hundreds of tests.

Copy link

@barkasn barkasn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. I made some comments as per our discussion and I will review again when I can run it. I suggest adding a one line comment to document each function as well as parameter descriptions. Also consider auto-formatting, there is some inconsistent spacing and long lines.

alignmentIntervals.add(alignmentBlockInterval);

if (!overlapsMate && read.overlaps(alignmentBlockInterval)) {
overlapsMate = true;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem like its used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

* </p>
* </p>
*
*<p>Reads are assumed to be paired-end. Reads which are in a "good pair" are counted once together as a single fragment. Reads which are not in a "good pair" are each counted separately.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a fragment mate size distance cutoff for this. Document that you currently rely on the aligner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added documentation. I want to avoid a simple fragment size cutoff since it can be easily thwarted by different protocols or splicing. And I believe most aligners handle fragment sizes in a more sophisticated manner when setting the properly paired flag.

@kachulis
Copy link
Contributor Author

kachulis commented Jul 7, 2020

@meganshand @barkasn thanks for the reviews! I have made some changes in response to your comments. I will ping you both for final review once necessary changes trickle down from htsjdk.

@kachulis kachulis force-pushed the ck_count_fragments_features branch from 034463a to e5848b1 Compare July 20, 2020 14:17
@gatk-bot
Copy link

gatk-bot commented Jul 20, 2020

Travis reported job failures from build 30984
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30984.12 logs
integration openjdk8 30984.2 logs

@gatk-bot
Copy link

gatk-bot commented Jul 20, 2020

Travis reported job failures from build 30988
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30988.12 logs
integration openjdk8 30988.2 logs

@kachulis kachulis marked this pull request as ready for review July 21, 2020 14:45
@kachulis
Copy link
Contributor Author

@meganshand @barkasn the necessary changes have now made their way up through htsjdk, so this is now ready for another round of review

Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @kachulis!

Copy link

@barkasn barkasn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved!

@kachulis kachulis merged commit 0cb632a into master Aug 11, 2020
@kachulis kachulis deleted the ck_count_fragments_features branch August 11, 2020 22:49
@kachulis kachulis restored the ck_count_fragments_features branch August 11, 2020 22:49
@kachulis kachulis deleted the ck_count_fragments_features branch August 11, 2020 22:49
@kachulis kachulis restored the ck_count_fragments_features branch August 11, 2020 22:49
mwalker174 pushed a commit that referenced this pull request Nov 3, 2020
Adds a tool for evaluating gene expression from rna-seq aligned to whole genome.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants