Fastq record converter #1185

Closed
zyxue wants to merge 41 commits

@zyxue
Contributor

zyxue commented Sep 28, 2016

There are still problems:

  • The description of convertRead doesn't match what it does (e.g. the return type). The method name is vague, and I'm not sure what behavior is actually intended. The refactoring tries not to change its behavior.
  • I am also a bit confused about what default values should be used when creating an AlignmentRecord instance from a Fastq entry. The original code set many fields to null. I thought leaving them at their defaults would make more sense, e.g. ReadNegativeStrand is set to null while the default is false. ProperPair is set to true, which I don't think makes sense, since you can't really tell from Fastq alone; the default is also false. It would be helpful if someone could clarify the most sensible values for the following fields, for both paired-end and single-end fastq entries.
.setReadPaired(readPaired)
.setReadInFragment(readInFragment)
.setReadNegativeStrand(null)
.setMateNegativeStrand(null)
.setPrimaryAlignment(null)
.setSecondaryAlignment(null)
.setSupplementaryAlignment(null)
  • Also, what should happen when the read length and the length of the qualities don't match? There is a stringency parameter involved in convertRead. If it's STRICT, the qualities must exist (CANNOT be *) and must match the read length. When it's NOT STRICT, the qualities are padded with B if they are missing or shorter than the read. If they are longer than the read, NotImplementedError is thrown. This behavior seems quite arbitrary and doesn't make much sense to me, and it doesn't apply to convertPair and convertFragment, in which the read length and qualities must match regardless of stringency.
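
For reference, the quality handling described in the last point can be sketched as below. This is a minimal illustration, not the actual ADAM code: the object name, the `Mode` type and `normalizeQualities` are hypothetical stand-ins, and the over-long case throws IllegalArgumentException here purely as a choice for the sketch.

```scala
object QualityPaddingSketch {
  // Hypothetical stand-in for the stringency parameter, so the sketch needs no external library.
  sealed trait Mode
  case object Strict extends Mode
  case object NonStrict extends Mode

  /** Returns a quality string with the same length as `sequence`, or throws. */
  def normalizeQualities(sequence: String, rawQual: String, mode: Mode): String =
    mode match {
      case Strict =>
        // Strict: qualities must be present and match the read length exactly.
        require(rawQual != "*", "Qualities are missing ('*')")
        require(
          rawQual.length == sequence.length,
          s"Quality length ${rawQual.length} != sequence length ${sequence.length}")
        rawQual
      case NonStrict =>
        // Non-strict: missing or short qualities are padded with 'B'.
        if (rawQual == "*") "B" * sequence.length
        else if (rawQual.length > sequence.length)
          throw new IllegalArgumentException("Qualities are longer than the sequence")
        else rawQual + ("B" * (sequence.length - rawQual.length))
    }
}
```

With these assumptions, `normalizeQualities("ACGT", "II", NonStrict)` pads the qualities to "IIBB", while the same input under `Strict` is rejected.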

@AmplabJenkins

Can one of the admins verify this patch?

@heuermh
Member

heuermh commented Sep 28, 2016

Jenkins, test this please.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1505/

Build result: FAILURE

GitHub pull request #1185 of commit d4c5ad6 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1185/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains f280ecc19296a5841548d95e94b1bc3b986b0012 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1185/merge^{commit} # timeout=10
Checking out Revision f280ecc19296a5841548d95e94b1bc3b986b0012 (origin/pr/1185/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f f280ecc19296a5841548d95e94b1bc3b986b0012
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'


private def parseReadPairInFastq(input: String): (String, String, String, String, String, String) = {
val lines = input.toString.split('\n')
require(lines.length == 8,
Member

I've mentioned this before; perhaps now is the time to fix it? FASTQ format allows for hard line wrapping, so there may be new line characters at any place in the record.

Contributor Author

Can you give an example for hard line wrapping? What does it look like?

Member

I think we should make this work for the simple case first (what we have currently implemented --> fastq record is 4 lines, interleaved read pair is 8 lines). In a follow on, we can make the arbitrary wrapping case work. In my experience, "simply" formatted files are much more common than arbitrarily formatted files.

Contributor Author

zyxue commented Oct 1, 2016

I tried to implement parsing for wrapped lines. I found that it requires the sequence length and quality length to match exactly; otherwise it's ambiguous where the quality lines stop. That rules out padding with B when length(qual line) < length(seq line), is that right? E.g. error_short_qual.fastq from biojava is an error.
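
To make the ambiguity concrete, here is a minimal sketch of parsing a single line-wrapped record. It is an illustration under simplifying assumptions, not the code in this PR: `WrappedFastqSketch` and `parseWrapped` are hypothetical names, the input is assumed to hold exactly one record, and sequence lines are assumed never to start with '+'.

```scala
object WrappedFastqSketch {
  /** Parses one possibly line-wrapped FASTQ record into (name, sequence, qualities). */
  def parseWrapped(record: String): (String, String, String) = {
    val lines = record.split('\n').toList
    require(lines.nonEmpty && lines.head.startsWith("@"), "Record must start with '@'")
    val name = lines.head.drop(1)

    // Sequence lines run until the '+' separator line.
    val (seqLines, rest) = lines.tail.span(line => !line.startsWith("+"))
    require(rest.nonEmpty, "Missing '+' separator line")
    val sequence = seqLines.mkString

    // A quality line may itself begin with '@' or '+', so only the total length
    // can tell where the qualities end; hence the exact-match requirement.
    val qualities = rest.tail.mkString
    require(
      qualities.length == sequence.length,
      s"Quality length ${qualities.length} != sequence length ${sequence.length}")
    (name, sequence, qualities)
  }
}
```

Because the end of the quality block is found by length alone, a record with short qualities cannot be padded in this scheme; it can only be rejected.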

s"Input must have 4 lines (${lines.length.toString} found):\n${input}")

val readName = lines(0).drop(1)
if (readName.endsWith("/1") && setSecondOfPair)
Member

heuermh commented Sep 28, 2016

I've seen files in the wild that use " 1" and " 2" (with a leading space) instead of /1 and /2. Should we add that here?
See e.g. PairedEndFastqReader.java#L59

Contributor Author

I have seen both, as well. I am not aware if there is a specification that lists all possibilities. I am thinking of using regex to account for all of them gradually.

Member

There isn't a specification, only convention. See http://dx.doi.org/10.1093/nar/gkp1137

Member

+1, I've also seen _1/2

Contributor Author

zyxue commented Oct 1, 2016

addressed by regex like [/ +_]1$
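
As a rough sketch of how such a pattern behaves (the object and method names below are hypothetical, chosen only for illustration), the suffix conventions mentioned in this thread, "/1", " 1", "+1" and "_1", can be detected and trimmed like this:

```scala
object ReadSuffixSketch {
  // Inside a character class '+' is a literal, so this matches "/1", " 1", "+1" or "_1" at the end.
  private val firstReadSuffix = """[/ +_]1$""".r
  private val secondReadSuffix = """[/ +_]2$""".r
  private val anyReadSuffix = """[/ +_][12]$""".r

  def isFirstOfPair(readName: String): Boolean =
    firstReadSuffix.findFirstIn(readName).isDefined

  def isSecondOfPair(readName: String): Boolean =
    secondReadSuffix.findFirstIn(readName).isDefined

  /** Drops a trailing read-number suffix, if any. */
  def trimSuffix(readName: String): String =
    anyReadSuffix.replaceFirstIn(readName, "")
}
```

Keeping the patterns together in one object means the detection and trimming logic cannot drift apart.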

else {
if (readQualitiesRaw == "*") "B" * readSequence.length
else if (readQualitiesRaw.length < readSequence.length) readQualitiesRaw + ("B" * (readSequence.length - readQualitiesRaw.length))
else if (readQualitiesRaw.length > readSequence.length) throw new NotImplementedError("Not implemented")
Member

NotImplementedError doesn't seem right, should be IllegalArgumentException. These length checks should also happen with strict stringency.

Contributor Author

zyxue commented Sep 28, 2016

Also, what's the reason for padding B in case qualities information is incomplete?

Member

IIRC, B is the code for "unknown" quality. CC @ryan-williams

Contributor Author

https://en.wikipedia.org/wiki/FASTQ_format

Also, in Illumina runs using PhiX controls, the character 'B' was observed to represent an "unknown quality score". The error rate of 'B' reads was roughly 3 phred scores lower the mean observed score of a given run.

Contributor Author

fixed NotImplementedError => IllegalArgumentException

.setSequence(sequence)
.setQual(qual)
.setReadPaired(readPaired)
.setProperPair(null)
Member

I'm not sure why these are explicitly set to null.

Contributor Author

Explicitly setting null has been dropped in newer commits. If you rerun the tests, they should all pass.

import org.scalatest.FunSuite

/**
* Created by zyxue on 2016-09-27.
Member

Unit test suite for FastqRecordConverter.

Contributor Author

What does this comment mean?

Member

Sorry, am suggesting a doc comment change.

Contributor Author

ok, removed the unnecessary comment.

/**
* Created by zyxue on 2016-09-27.
*/
class FastqConverterSuite extends FunSuite {
Member

Should be FastqRecordConverterSuite.

Contributor Author

zyxue commented Sep 28, 2016

Are you sure? I followed the convention in FastaConverterSuite.

Contributor Author

OK, you're right. I realized it's

FastaConverter.scala & FastaConverterSuite

so it should be

FastqRecordConverter & FastqRecordConverterSuite

Member

Yep, although I find the word "record" redundant almost everywhere, so we could possibly drop it here

Contributor Author

I will stick to the original convention for now.

@heuermh
Member

heuermh commented Sep 28, 2016

Thanks @zyxue! I've left some review comments but they may not completely answer your questions. Hopefully others will chime in.

Note there are two unit test failures in AlignmentRecordRDDSuite, see https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1505/HADOOP_VERSION=2.6.0,SCALAVER=2.10,SPARK_VERSION=1.5.2,label=centos/testReport/

FastqConverterSuite.scala => FastqRecordConverterSuite.scala
@zyxue
Contributor Author

zyxue commented Sep 28, 2016

I have located the cause of at least the first test failure. As mentioned, it's due to inconsistent default values. The original convertRead calls setReadNegativeStrand(false); setting it to null, as convertPair and convertFragment do, causes a java.lang.NullPointerException when getReadNegativeStrand is called. I suggest dropping all the explicitly set nulls, which seem irrelevant to a fastq file, and just using the default values instead. What do you think?
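
The failure mode can be illustrated in isolation. The sketch below is an assumption-laden stand-in rather than the ADAM or Avro code: `Record` is a hypothetical class whose getter returns a boxed, nullable Boolean, and the unboxing is written out explicitly with `booleanValue()` so the behavior does not depend on implicit conversions.

```scala
object NullFlagSketch {
  // Hypothetical stand-in for a generated record whose getter returns a nullable boxed Boolean.
  final class Record(flag: java.lang.Boolean) {
    def getReadNegativeStrand: java.lang.Boolean = flag
  }

  // Unboxing a null reference throws java.lang.NullPointerException.
  def isNegativeStrand(record: Record): Boolean =
    record.getReadNegativeStrand.booleanValue()

  // Wrapping in Option treats null as "not set" and falls back to false.
  def isNegativeStrandSafe(record: Record): Boolean =
    Option(record.getReadNegativeStrand).exists(_.booleanValue())
}
```

Leaving the field at a non-null default avoids the problem at the source; the `Option` wrapper is only a defensive alternative on the reading side.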

@heuermh
Member

heuermh commented Sep 30, 2016

@fnothaft I don't see any of your review comments. Were they lost or resolved?

secondReadName,
secondReadSequence,
secondReadQualities
) = this.parseReadPairInFastq(element._2.toString)
Member

You don't need the this. here or below.

.setSecondaryAlignment(null)
.setSupplementaryAlignment(null)
.build()
this.makeAlignmentRecord(firstReadName, firstReadSequence, firstReadQualities, 0),
Member

Don't need the this..

secondReadName,
secondReadSequence,
secondReadQualities
) = this.parseReadPairInFastq(element._2.toString)
Member

Don't need the this..

firstReadName == secondReadName,
"Reads %s and %s in Fragment have different names.".format(
firstReadName,
secondReadName
)
)

val alignments = List(
this.makeAlignmentRecord(firstReadName, firstReadSequence, firstReadQualities, 0),
Member

Don't need the this..

val readName = trimTrailingReadNumber(lines(0).drop(1))
val readSequence = lines(1)
val (readName, readSequence, readQualities) =
this.parseReadInFastq(element._2.toString, setFirstOfPair, setSecondOfPair, stringency)
Member

Don't need the this..

recordGroupOpt.foreach(builder.setRecordGroupName)

builder.build()
this.makeAlignmentRecord(
Member

Don't need the this..

Contributor Author

I have removed every "this." prefix. When would "this." ever be necessary?

Member

"this" is only necessary when there is a name collision, say between a field and a local variable with the same name.
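
A small sketch of such a collision, using a hypothetical class that is unrelated to the converter:

```scala
class ShadowingCounter(start: Int) {
  private var count: Int = start

  def add(count: Int): Int = {
    // The parameter `count` shadows the field, so the field is only reachable as `this.count`.
    this.count = this.count + count
    this.count
  }
}
```

Without the qualifier, `count = count + count` would refer to the parameter on both sides and fail to compile, since method parameters cannot be reassigned.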

@zyxue
Contributor Author

zyxue commented Sep 30, 2016

Where can I find more information on ValidationStringency.STRICT, what should the behavior be when it's STRICT or not, please?

@zyxue
Contributor Author

zyxue commented Oct 4, 2016

@heuermh, Can you test it again, please? If there is no further request, I think it's ready to be merged.

@heuermh
Member

heuermh commented Oct 4, 2016

Jenkins, test this please.

@heuermh
Member

heuermh commented Oct 4, 2016

Where can I find more information on ValidationStringency.STRICT, what should the behavior be when it's STRICT or not, please?

We're borrowing the enum and concept from htsjdk. Briefly, on errors strict should throw exceptions, lenient should log warnings, and silent should not complain.
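
As a rough sketch of that contract: the `Level` type below is a hypothetical stand-in for the htsjdk enum so the example is self-contained, and the `warnings` buffer stands in for a real logger.

```scala
object StringencySketch {
  // Hypothetical stand-in for htsjdk's ValidationStringency (STRICT, LENIENT, SILENT).
  sealed trait Level
  case object StrictLevel extends Level
  case object LenientLevel extends Level
  case object SilentLevel extends Level

  // Stand-in for a logger, so that the effect of LENIENT is observable.
  val warnings = scala.collection.mutable.ListBuffer.empty[String]

  /** Reports a validation problem according to the requested stringency. */
  def report(level: Level, message: String): Unit = level match {
    case StrictLevel  => throw new IllegalArgumentException(message)
    case LenientLevel => warnings += message; ()
    case SilentLevel  => ()
  }
}
```

Routing every validation failure through one helper like this is one way to keep the three modes consistent across convertRead, convertPair and convertFragment.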

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1516/

Build result: FAILURE

GitHub pull request #1185 of commit ce5e3a0 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1185/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 5a913a0e1c5cb9874266ae2a108db3822c19c7b1 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1185/merge^{commit} # timeout=10
Checking out Revision 5a913a0e1c5cb9874266ae2a108db3822c19c7b1 (origin/pr/1185/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 5a913a0e1c5cb9874266ae2a108db3822c19c7b1
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@heuermh
Member

heuermh commented Oct 5, 2016

Jenkins, retest this please

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1520/

@fnothaft
Member

fnothaft commented Oct 7, 2016

LGTM! Ping @heuermh for a review pass.
I'm thinking that we should merge this manually, since the commit history is pretty long. @heuermh let me know if/when it looks good to you, and I will merge it.

@fnothaft
Member

fnothaft commented Oct 7, 2016

@heuermh is this still pending changes from your side?

@heuermh
Member

heuermh commented Oct 8, 2016

Will complete review on Monday

@@ -41,6 +41,114 @@ import scala.collection.JavaConversions._
private[adam] class FastqRecordConverter extends Serializable with Logging {

/**
* Parse 4 lines at a time
Member

This doc comment doesn't match the method.

Perhaps something like "Return true if the read name suffix and flags match."

val match2 = secondReadSuffix.findAllIn(readName)

if (match1.nonEmpty && isSecondOfPair)
throw new IllegalArgumentException(
Member

These exceptions are thrown without considering ValidationStringency. If the stringency is lenient or silent, is it possible to continue processing?

val readName = lines(0).drop(1)
if (setFirstOfPair || setSecondOfPair) readNameSuffixAndIndexOfPairMustMatch(readName, setFirstOfPair)

val suffix = """([/ +_]1$)|([/ +_]2$)""".r
Member

I wonder if this regex and those above on line 50 and 51 might be combined as static private fields so that the two methods don't get out of sync

val readName = trimTrailingReadNumber(lines(0).drop(1))
val readSequence = lines(1)
if (setFirstOfPair && setSecondOfPair)
throw new IllegalArgumentException("setFirstOfPair and setSecondOfPair cannot be true at the same time")
Member

This exception is also thrown without considering ValidationStringency. If the stringency is lenient or silent, is it possible to continue processing?

.setReadPaired(readPaired)
.setReadInFragment(readInFragment)

if (recordGroupOpt != None)
Member

With opt.foreach you don't also need to check against None
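
A minimal sketch of the point, using a hypothetical `Builder` stand-in rather than the real AlignmentRecord builder:

```scala
object OptionForeachSketch {
  // Hypothetical stand-in for a record builder with a fluent setter.
  final class Builder {
    var recordGroupName: Option[String] = None
    def setRecordGroupName(name: String): Builder = {
      recordGroupName = Some(name)
      this
    }
  }

  def build(recordGroupOpt: Option[String]): Builder = {
    val builder = new Builder
    // foreach runs the setter only when a value is present, so no `!= None` guard is needed.
    recordGroupOpt.foreach(builder.setRecordGroupName)
    builder
  }
}
```

For `None` the setter is simply never called, which is exactly what the explicit check was guarding against.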

@heuermh heuermh modified the milestone: 0.20.0 Oct 13, 2016
@fnothaft fnothaft mentioned this pull request Oct 13, 2016
@fnothaft
Member

Moved over to #1208.

@fnothaft fnothaft closed this Oct 13, 2016