Adding utilities for read trimming. #248

Merged 1 commit into bigdatagenomics:master on May 29, 2014

fnothaft
Member

This pull request adds a few utilities for read trimming, as a first step toward read error correction. It contains:

  • An update to the ADAMRecord schema that adds fields for the number of bases trimmed from the read start and end. After alignment, these fields are redundant with hard clipping in the CIGAR, but they matter for trimming performed before alignment. The fields are added in the SAMRecordConverter.
  • Two new RDD functions. The first applies a fixed-length trim to all reads. The second trims the prefix/suffix of all reads in a read group based on the geometric mean of the error probability at each position.
  • An update to the Transform CLI module that exposes these two functions as possible transforms.
  • A fix to the CycleCovariate so that it correctly calculates the sequencing cycle for hard-clipped reads.
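
To make the two trimming strategies concrete, here is a minimal sketch in plain Scala collections rather than Spark RDDs. It is not the code from this pull request: the `Read` case class, the function names, the Phred+33 encoding, and the threshold-based cutoff rule are all assumptions made for illustration. Only the two field names, `basesTrimmedFromStart` and `basesTrimmedFromEnd`, mirror the getters that appear in the test discussed below.

```scala
object TrimSketch {
  // Hypothetical stand-in for a read; this is NOT the ADAMRecord schema.
  final case class Read(
    sequence: String,
    qualities: String, // assumed Phred+33, same length as sequence
    basesTrimmedFromStart: Int = 0,
    basesTrimmedFromEnd: Int = 0)

  // Fixed-length trim: drop `start` bases from the front and `end` bases
  // from the back, recording how many bases were removed from each side.
  def trimFixed(read: Read, start: Int, end: Int): Read = {
    require(start >= 0 && end >= 0, "trim lengths must be non-negative")
    require(start + end < read.sequence.length, "trim would remove the whole read")
    val keepUntil = read.sequence.length - end
    read.copy(
      sequence = read.sequence.substring(start, keepUntil),
      qualities = read.qualities.substring(start, keepUntil),
      basesTrimmedFromStart = read.basesTrimmedFromStart + start,
      basesTrimmedFromEnd = read.basesTrimmedFromEnd + end)
  }

  // Phred+33 character to error probability: p = 10^(-Q/10).
  def errorProbability(qual: Char): Double =
    math.pow(10.0, -(qual - 33) / 10.0)

  // Geometric mean of the error probability at each position, computed in
  // log space. Assumes every read in the group has the same length.
  def meanErrorByPosition(reads: Seq[Read]): Seq[Double] = {
    require(reads.nonEmpty, "need at least one read")
    val len = reads.head.qualities.length
    require(reads.forall(_.qualities.length == len), "reads must share a length")
    (0 until len).map { i =>
      val logSum = reads.map(r => math.log(errorProbability(r.qualities(i)))).sum
      math.exp(logSum / reads.size)
    }
  }

  // Assumed cutoff rule: trim the leading and trailing runs of positions
  // whose mean error probability exceeds `threshold`.
  def trimLengths(meanError: Seq[Double], threshold: Double): (Int, Int) = {
    val start = meanError.takeWhile(_ > threshold).length
    val end = meanError.drop(start).reverse.takeWhile(_ > threshold).length
    (start, end)
  }
}
```

Because the error probability is 10^(-Q/10), the geometric mean of the probabilities at a position equals the probability implied by the arithmetic mean of the Phred scores there. The pair returned by `trimLengths` can be fed to `trimFixed` for every read in the group. If every position exceeds the threshold, the whole read would be trimmed and `trimFixed` rejects the request.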

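The CycleCovariate item can be illustrated with a small sketch of how hard clipping shifts the sequencing cycle. This is an assumption about the idea behind the fix, not the ADAM implementation: the object, function, and parameter names are hypothetical, and conventions such as treating the second read of a pair differently are left out.

```scala
object CycleSketch {
  // Hypothetical helper; none of these names come from ADAM.
  //   offset           0-based index of the base in the stored sequence
  //   readLength       number of bases stored after hard clipping
  //   hardClippedStart bases hard clipped before the stored sequence
  //   hardClippedEnd   bases hard clipped after the stored sequence
  //   negativeStrand   true if the stored sequence is reverse-complemented,
  //                    so the sequencer read it from the far end
  def sequencingCycle(
    offset: Int,
    readLength: Int,
    hardClippedStart: Int,
    hardClippedEnd: Int,
    negativeStrand: Boolean): Int = {
    require(offset >= 0 && offset < readLength, "offset out of range")
    if (negativeStrand) {
      // Cycles count from the stored end, past the bases clipped there.
      hardClippedEnd + (readLength - offset)
    } else {
      // Cycles count from the stored start, past the bases clipped there.
      hardClippedStart + offset + 1
    }
  }
}
```

Under these assumptions, a 100-base read with 5 bases hard clipped from each end stores 90 bases whose cycles run from 6 to 95 rather than from 1 to 90.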
@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/327/

@@ -197,7 +199,7 @@ record ADAMGenotype {
   union { null, VariantCallingAnnotations } variantCallingAnnotations = null;

   // Sample-level data, i.e. data specific to this particular sample
-  union { null, string } sampleId = null;
+  string sampleId;
Member

Shouldn't this be an optional field?

Member Author

Hmmm, something went screwy with the rebase here. Let me fix this...

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/328/

// we should trim the first and last 5 bases off all reads
trimmed.collect.foreach(r => {
assert(r.getBasesTrimmedFromStart === 5)
assert(r.getBasesTrimmedFromStart === 5)
Member

Do you mean getBasesTrimmedFromEnd here?

Member Author

Yes, yes, I do... I'll fix this.

@tdanford
Contributor

Are we still maintaining updates to the CHANGES file(s) as part of a pull request?

@fnothaft
Member Author

@tdanford I believe we've dropped the CHANGES.txt, and are just keeping the CHANGES.md. However, CHANGES.md should probably be updated via the script whenever we do a release.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/332/

@tdanford
Contributor

Frank, I think you've got some rebasing to do -- it looks like there are some commits in this branch that are duplicated from master, right?

@fnothaft
Member Author

@tdanford correct; not sure how that happened...

@fnothaft
Member Author

@tdanford this is rebased now, thanks for the heads up!

@tdanford
Contributor

Again! I just merged Matt's edit to CONTRIBUTING.md, you're still out of date! :-)

@fnothaft
Member Author

@tdanford u r rebase troll :'(

Re-rebased...

@tdanford
Contributor

I like my commit histories like I like my spaces: linear.

@tdanford
Contributor

Okay, a more substantive comment: Frank, do you have somewhere (comments? a change file?) where you can write a little bit about the motivation behind these trimming additions? Are you trying to recreate specific functionality from Picard (I'm thinking the MergeBamAlignments command, but maybe something else)? From somewhere else? Does this satisfy a requirement or need for avocado or variant calling in particular? I understand there are a lot of reasons why one might "trim" reads, I'm just hoping for a little context here.

@fnothaft
Member Author

@tdanford Good point; I don't have this documented yet. When I started on this, I was planning to implement read error correction first (à la Quake), but I then decided that read trimming was a better first step. The need is indeed for variant calling/assembly.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/336/

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/337/

massie added a commit that referenced this pull request May 29, 2014
Adding utilities for read trimming.
@massie merged commit abe6834 into bigdatagenomics:master on May 29, 2014
@massie
Member

massie commented May 29, 2014

Thanks, Frank!

@fnothaft deleted the read-trimming branch on July 10, 2014 13:15