scala.MatchError RegExp does not catch colons in value part properly #1061

Closed
pauca opened this issue Jun 30, 2016 · 1 comment

pauca commented Jun 30, 2016

The line

val attrRegex = RegExp("([^:]{2,4}):([AifZHB]):([cCiIsSf]{1},)?(.*)")

does not properly handle alignment records with attributes like
OQ:Z:C55/15D:::::::.7GFFAFDA442.40F=AGHHE
i.e., attributes that have colons in the value part.

Some problematic reads are contained in the GATK bundle file CEUTrio.HiSeq.WGS.b37.NA12878.bam:

scala> BamWriter.adamSAMSave( "output.bam", bam.sequences, bam.recordGroups , true, true ,false)
2016-06-30 17:01:41 ERROR Utils:95 - Aborting task
scala.MatchError: Z:C, (of class java.lang.String)
    at org.bdgenomics.adam.util.AttributeUtils$.createAttribute(AttributeUtils.scala:92)
    at org.bdgenomics.adam.util.AttributeUtils$.parseAttribute(AttributeUtils.scala:74)
    at org.bdgenomics.adam.util.AttributeUtils$$anonfun$parseAttributes$2.apply(AttributeUtils.scala:61)
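
To see what the pattern does in isolation, here is a minimal Scala sketch that applies it with `scala.util.matching.Regex` directly. It assumes that `RegExp` in the quoted line behaves like a plain Scala `Regex`; the object and method names are made up for illustration and are not part of ADAM. The second input is a hypothetical value (not taken from the BAM file) chosen to start with a letter from `[cCiIsSf]` followed by a comma, which is the shape suggested by the `Z:C,` in the error message.

```scala
import scala.util.matching.Regex

object AttrRegexDemo {
  // Same pattern as the line quoted in the report.
  val attrRegex: Regex = "([^:]{2,4}):([AifZHB]):([cCiIsSf]{1},)?(.*)".r

  // Returns (tag, type, optional array-type group, value), or None if no match.
  def groups(s: String): Option[(String, String, Option[String], String)] = s match {
    case attrRegex(tag, typ, arrayType, value) =>
      // An optional group that did not participate in the match is bound to null.
      Some((tag, typ, Option(arrayType), value))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    // The attribute quoted in the report: the colons in the value are kept by (.*).
    println(groups("OQ:Z:C55/15D:::::::.7GFFAFDA442.40F=AGHHE"))
    // Hypothetical string value starting with "C,": the optional group captures it.
    println(groups("OQ:Z:C,55/15D"))
    // An array attribute (hypothetical tag name), where the group is meant to match.
    println(groups("XB:B:i,1,2,3"))
  }
}
```

Run on its own, the pattern keeps the colons of the quoted `OQ` value inside the last group, while the hypothetical `C,` value loses its first two characters to the optional array-type group even though the type is `Z`.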

fnothaft commented Jul 1, 2016

Thanks for reporting this @pauca! We will look into this in the next week. We have some separate logic to extract the OQ field, and I think this isn't getting handled properly.

fnothaft added the bug label Jul 1, 2016
fnothaft added this to the 0.20.0 milestone Jul 1, 2016
fnothaft self-assigned this Jul 16, 2016
fnothaft added a commit to fnothaft/adam that referenced this issue Jul 17, 2016
We had a bug in `org.bdgenomics.adam.util.AttributeUtils` where the regex for
splitting out the formatting string for array attributes was applied to all
attributes. In an array attribute (SAM "B" tags), the type of the array elements
is encoded before the attribute values, and is split off by commas. E.g.,
"B:i,1,2,3". If the attribute is a string (SAM "Z" tags), commas are allowed.
To resolve this, I split this regex into two regexes. We only apply the
regex for splitting out the array type if we are working on an array
attribute. This resolves bigdatagenomics#1061.
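
The two-regex approach described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the `Parsed` case class, the `parse` method, the exact patterns, and the `XB` tag name are all hypothetical. The first regex only splits tag, type, and raw value; the second is applied only when the type is `B`.

```scala
import scala.util.matching.Regex

object SplitRegexSketch {
  // Hypothetical result type for illustration; not ADAM's Attribute class.
  final case class Parsed(tag: String, typ: String, arrayType: Option[String], value: String)

  // First regex: tag, type, and the raw value, with no array handling.
  val attrRegex: Regex = "([^:]{2,4}):([AifZHB]):(.*)".r
  // Second regex: array element type, then the comma-separated elements.
  val arrayRegex: Regex = "([cCiIsSf]),(.*)".r

  def parse(encoded: String): Option[Parsed] = encoded match {
    // Only array ("B") attributes have their value split into type and elements.
    case attrRegex(tag, "B", rest) =>
      rest match {
        case arrayRegex(elemType, values) => Some(Parsed(tag, "B", Some(elemType), values))
        case _                            => None // malformed array value
      }
    // Every other type keeps its value untouched, commas and colons included.
    case attrRegex(tag, typ, value) => Some(Parsed(tag, typ, None, value))
    case _                          => None
  }
}
```

With this split, a string attribute such as the hypothetical `OQ:Z:C,55/15D` keeps its full value `C,55/15D`, while `XB:B:i,1,2,3` is still separated into element type `i` and elements `1,2,3`.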
fnothaft added a commit to fnothaft/adam that referenced this issue Jul 18, 2016