-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fastq record converter #1185
Fastq record converter #1185
Conversation
tested FastqRecordConverter.convertPair
when fastq lines number doesn't match
Refactored out the common part between FastqRecordConverter.convertPair & FastqRecordConverter.convertFragment
shared by convertPair & convertFragment
factoring out common code from convertRead
There is still quite a bit of problem with default values when creating a AlignmentRecord instance from a Fastq entry
Can one of the admins verify this patch? |
Jenkins, test this please. |
Test FAILed. Build result: FAILUREGitHub pull request #1185 of commit d4c5ad6 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1185/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains f280ecc19296a5841548d95e94b1bc3b986b0012 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1185/merge^{commit} # timeout=10Checking out Revision f280ecc19296a5841548d95e94b1bc3b986b0012 (origin/pr/1185/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f280ecc19296a5841548d95e94b1bc3b986b0012First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
|
||
private def parseReadPairInFastq(input: String): (String, String, String, String, String, String) = { | ||
val lines = input.toString.split('\n') | ||
require(lines.length == 8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've mentioned this before; perhaps now is the time to fix it? FASTQ format allows for hard line wrapping, so there may be new line characters at any place in the record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give an example for hard line wrapping? What does it look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are all kinds of correct and incorrect examples here https://github.com/biojava/biojava/tree/master/biojava-sequencing/src/test/resources/org/biojava/nbio/sequencing/io/fastq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make this work for the simple case first (what we have currently implemented --> fastq record is 4 lines, interleaved read pair is 8 lines). In a follow on, we can make the arbitrary wrapping case work. In my experience, "simply" formatted files are much more common than arbitrarily formatted files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to implement parsing for wrapped lines. Then I found that it would require exact match of sequence length and quality length. Otherwise, it's ambiguous to tell when the quality lines stop. This makes padding with B
when length(qual line) < length(seq line), is that right? e.g. error_short_qual.fastq from biojava is an error.
s"Input must have 4 lines (${lines.length.toString} found):\n${input}") | ||
|
||
val readName = lines(0).drop(1) | ||
if (readName.endsWith("/1") && setSecondOfPair) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen files in the wild that use 1
and 2
(with space) instead of /1
and /2
. Should we add that here?
See e.g. PairedEndFastqReader.java#L59
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have seen both, as well. I am not aware if there is a specification that lists all possibilities. I am thinking of using regex to account for all of them gradually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't a specification, only convention. See http://dx.doi.org/10.1093/nar/gkp1137
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I've also seen _1/2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed by regex like [/ +_]1$
else { | ||
if (readQualitiesRaw == "*") "B" * readSequence.length | ||
else if (readQualitiesRaw.length < readSequence.length) readQualitiesRaw + ("B" * (readSequence.length - readQualitiesRaw.length)) | ||
else if (readQualitiesRaw.length > readSequence.length) throw new NotImplementedError("Not implemented") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NotImplementedError
doesn't seem right, should be IllegalArgumentException
. These length checks should also happen with strict stringency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what's the reason for padding B
in case qualities information is incomplete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, B
is the code for "unknown" quality. CC @ryan-williams
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://en.wikipedia.org/wiki/FASTQ_format
Also, in Illumina runs using PhiX controls, the character 'B' was observed to represent an "unknown quality score". The error rate of 'B' reads was roughly 3 phred scores lower the mean observed score of a given run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed NotImplementedError
=> IllegalArgumentException
.setSequence(sequence) | ||
.setQual(qual) | ||
.setReadPaired(readPaired) | ||
.setProperPair(null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why these are explicitly set to null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicitly setting null is dropped in newer commits. If you rerun the tests, it should all pass.
import org.scalatest.FunSuite | ||
|
||
/** | ||
* Created by zyxue on 2016-09-27. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit test suite for FastqRecordConverter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this comment mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, am suggesting a doc comment change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, removed the unnecessary comment.
/** | ||
* Created by zyxue on 2016-09-27. | ||
*/ | ||
class FastqConverterSuite extends FunSuite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be FastqRecordConverterSuite
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? I followed convention in FastaConverterSuite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, you're right. I realized it's
FastaConverter.scala
& FastaConverterSuite
so it should be
FastqRecordConverter
& FastqRecordConverterSuite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, although I find the word "record" redundant almost everywhere, so we could possibly drop it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will stick to the original convention for now.
Thanks @zyxue! I've left some review comments but they may not completely answer your questions. Hopefully others will chime in. Note there are two unit test failures in |
FastqConverterSuite.scala => FastqRecordConverterSuite.scala
I have located the reason at least for the first test failure. As mentioned, it's due to inconsistent default values. In the original |
@fnothaft I don't see any of your review comments. Were they lost or resolved? |
s"Input must have 4 lines (${lines.length.toString} found):\n${input}") | ||
|
||
val readName = lines(0).drop(1) | ||
if (readName.endsWith("/1") && setSecondOfPair) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I've also seen _1/2
else { | ||
if (readQualitiesRaw == "*") "B" * readSequence.length | ||
else if (readQualitiesRaw.length < readSequence.length) readQualitiesRaw + ("B" * (readSequence.length - readQualitiesRaw.length)) | ||
else if (readQualitiesRaw.length > readSequence.length) throw new NotImplementedError("Not implemented") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, B
is the code for "unknown" quality. CC @ryan-williams
|
||
private def parseReadPairInFastq(input: String): (String, String, String, String, String, String) = { | ||
val lines = input.toString.split('\n') | ||
require(lines.length == 8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make this work for the simple case first (what we have currently implemented --> fastq record is 4 lines, interleaved read pair is 8 lines). In a follow on, we can make the arbitrary wrapping case work. In my experience, "simply" formatted files are much more common than arbitrarily formatted files.
secondReadName, | ||
secondReadSequence, | ||
secondReadQualities | ||
) = this.parseReadPairInFastq(element._2.toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the this.
here or below.
.setSecondaryAlignment(null) | ||
.setSupplementaryAlignment(null) | ||
.build() | ||
this.makeAlignmentRecord(firstReadName, firstReadSequence, firstReadQualities, 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need the this.
.
secondReadName, | ||
secondReadSequence, | ||
secondReadQualities | ||
) = this.parseReadPairInFastq(element._2.toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need the this.
.
firstReadName == secondReadName, | ||
"Reads %s and %s in Fragment have different names.".format( | ||
firstReadName, | ||
secondReadName | ||
) | ||
) | ||
|
||
val alignments = List( | ||
this.makeAlignmentRecord(firstReadName, firstReadSequence, firstReadQualities, 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need the this.
.
val readName = trimTrailingReadNumber(lines(0).drop(1)) | ||
val readSequence = lines(1) | ||
val (readName, readSequence, readQualities) = | ||
this.parseReadInFastq(element._2.toString, setFirstOfPair, setSecondOfPair, stringency) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need the this.
.
recordGroupOpt.foreach(builder.setRecordGroupName) | ||
|
||
builder.build() | ||
this.makeAlignmentRecord( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need the this.
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed all this.
, wondering when will this.
ever be necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this
is only necessary if there is a collision between say a field and a local variable with the same name
Where can I find more information on |
throw IllegalArgumentException instead of NotImplementedError when quality length is longer than read length in FASTQ record under STRICT mode
@heuermh, Can you test it again, please? If there is no further request, I think it's ready to be merged. |
Jenkins, test this please. |
We're borrowing the enum and concept from htsjdk. Briefly, on errors strict should throw exceptions, lenient should log warnings, and silent should not complain. |
Test FAILed. Build result: FAILUREGitHub pull request #1185 of commit ce5e3a0 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1185/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 5a913a0e1c5cb9874266ae2a108db3822c19c7b1 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1185/merge^{commit} # timeout=10Checking out Revision 5a913a0e1c5cb9874266ae2a108db3822c19c7b1 (origin/pr/1185/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 5a913a0e1c5cb9874266ae2a108db3822c19c7b1First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Jenkins, retest this please |
Test PASSed. |
@heuermh is this still pending changes from your side? |
Will complete review on Monday |
@@ -41,6 +41,114 @@ import scala.collection.JavaConversions._ | |||
private[adam] class FastqRecordConverter extends Serializable with Logging { | |||
|
|||
/** | |||
* Parse 4 lines at a time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc comment doesn't match the method.
Perhaps something like Return true if the read name suffix and flags match.
val match2 = secondReadSuffix.findAllIn(readName) | ||
|
||
if (match1.nonEmpty && isSecondOfPair) | ||
throw new IllegalArgumentException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These exceptions are thrown without considering ValidationStringency
. If the stringency is lenient or silent, is it possible to continue processing?
val readName = lines(0).drop(1) | ||
if (setFirstOfPair || setSecondOfPair) readNameSuffixAndIndexOfPairMustMatch(readName, setFirstOfPair) | ||
|
||
val suffix = """([/ +_]1$)|([/ +_]2$)""".r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this regex and those above on line 50 and 51 might be combined as static private fields so that the two methods don't get out of sync
val readName = trimTrailingReadNumber(lines(0).drop(1)) | ||
val readSequence = lines(1) | ||
if (setFirstOfPair && setSecondOfPair) | ||
throw new IllegalArgumentException("setFirstOfPair and setSecondOfPair cannot be true at the same time") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exception is also thrown without considering ValidationStringency. If the stringency is lenient or silent, is it possible to continue processing?
.setReadPaired(readPaired) | ||
.setReadInFragment(readInFragment) | ||
|
||
if (recordGroupOpt != None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With opt.foreach
you don't also need to check against None
Moved over to #1208. |
There are still problems:
convertRead
does don't match with what it does (e.g. the return type). The method name is kind of vague, not sure what behavior is actually intended. The refactoring tries not to change its behavior.null
. I thought leaving them as default would make more sense. e.g.ReadNegativeStrand
is set tonull
while the default isfalse
.ProperPair
is set totrue
, which I don't think make sense since you can't really tell just based on Fastq, the default is alsofasle
. It would be helpful if someone can clarify what's the most sensible values for the following fields for both paired-end and single-end fastq entries.convertRead
. If it's STRICT, then qualities must exist (CANNOT be*
) and also match the length of reads. When it's NOT STRICT, qualities will be padded withB
if not exist or shorter than read length. If it's longer than read length,NotImplementedError
will be thrown. Such behavior seems quite arbitrary and doesn't make much sense to me, and it doesn't apply toconvertPair
andconvertFragment
, in which the read length and qualities must match without consideration for stringency.