
Conceptual fix for duplicate marking and sorting stragglers #624

Merged 3 commits into bigdatagenomics:master on Mar 31, 2015

Conversation

fnothaft
Member

Resolves #574, depends on #623. In this pull request, we associate unmapped reads with a reference position that is derived from either their read name or their read sequence during the sort and duplicate marking phases. Specifically, sort uses ZZZ<read_name> and duplicate marking uses <read_sequence> as the contig name. Unmapped reads are placed at position 0 on the contig. Conceptually, this change should also enable duplicate marking on unmapped reads. I haven't tested this yet, but will spin a cluster up sometime soon.
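As a minimal sketch of this keying scheme (a plain case class stands in for ADAM's ReferencePosition; the helper names are illustrative, not the actual implementation):

object UnmappedKeySketch {
  // Illustrative stand-in for ADAM's ReferencePosition.
  case class RefPos(referenceName: String, pos: Long)

  // Sort phase: unmapped reads get a synthetic contig named "ZZZ" + read name,
  // pinned to position 0, so they sort after reads on real contigs.
  def sortKey(readMapped: Boolean,
              referenceName: String,
              start: Long,
              readName: String): RefPos =
    if (readMapped) RefPos(referenceName, start)
    else RefPos("ZZZ" + readName, 0L)

  // Duplicate marking phase: unmapped reads are keyed by their read sequence
  // instead, again at position 0, so identical sequences group together.
  def duplicateKey(readMapped: Boolean,
                   referenceName: String,
                   start: Long,
                   sequence: String): RefPos =
    if (readMapped) RefPos(referenceName, start)
    else RefPos(sequence, 0L)
}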

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/643/
Test PASSed.

@hannes-ucsc
Contributor

Getting this when running against NA12878.hiseq.wgs.bwa.raw.bam on 98 slaves:

java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at org.bdgenomics.adam.models.ReferenceRegion.<init>(ReferenceRegion.scala:127)
    at org.bdgenomics.adam.models.ReferencePosition.<init>(ReferencePosition.scala:86)
    at org.bdgenomics.adam.models.ReferencePosition$.apply(ReferencePosition.scala:82)
    at org.bdgenomics.adam.rich.RichAlignmentRecord.fivePrimeReferencePosition(RichAlignmentRecord.scala:134)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.org$bdgenomics$adam$models$ReferencePositionPair$$anonfun$$getPos$1(ReferencePositionPair.scala:38)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:45)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:45)
    at scala.Option.map(Option.scala:145)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:45)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:30)
    at org.apache.spark.rdd.Timer.time(Timer.scala:57)
    at org.bdgenomics.adam.models.ReferencePositionPair$.apply(ReferencePositionPair.scala:30)
    at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$apply$1.apply(MarkDuplicates.scala:79)
    at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$apply$1.apply(MarkDuplicates.scala:79)
    at org.apache.spark.rdd.RDD$$anonfun$keyBy$1.apply(RDD.scala:1227)
    at org.apache.spark.rdd.RDD$$anonfun$keyBy$1.apply(RDD.scala:1227)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

@fnothaft
Member Author

@hannes-ucsc looking now.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/646/
Test PASSed.

@hannes-ucsc
Contributor

Confirmed working on CCLE exome.

10 r3.xlarge, s.d.p 40

Stage 2 task duration: average 40.75, stdev 5.74, median 41

@fnothaft
Member Author

@hannes-ucsc Excellent! Any chance you'd know what the variance is without the patch?

@hannes-ucsc
Contributor

Stage 2 task duration in min:

             this PR          0.16.0
Median       6                5.6
Average      5.9875           5.67
StdDev       0.20778132       0.2523733498
Variance     0.04317307692    0.06369230769

@hannes-ucsc
Contributor

I don't think this is conclusive. I mainly ran this to catch bugs. Will run WGS next week.

BTW: I used this BAM: https://browser.cghub.ucsc.edu/details/ebdb53ae-6386-4bc4-90b1-4f249ff9fcdf/

@fnothaft
Member Author

Yeah, this doesn't appear to be demonstrating stragglers. Let's wait for a WGS run.

@hannes-ucsc
Contributor

I'm happy to report that the WGS run finished successfully. The per-stage running times in minutes were 32, 51, and 34, respectively. All tasks in the last stage (the one that always had the two stragglers) finished within a minute of each other.

So this is great news since it marks the first successful WGS run for us. Can I just say one thing: the "ZZZ" smells bad to me. Is there a way that instead of sorting by the tuple ( library, position ) we can sort by the triple ( is_mapped, if is_mapped library else read_name, if is_mapped position else 0 )?

@fnothaft
Member Author

@hannes-ucsc w00t! Great to hear. There is a similar straggler problem in INDEL realignment; I'll have a PR for that coming up (possibly by the end of the week, not sure).

Can I just say one thing: the "ZZZ" smells bad to me. Is there a way that instead of sorting by the tuple ( library, position ) we can sort by the triple ( is_mapped, if is_mapped library else read_name, if is_mapped position else 0 )?

I agree that the ZZZ is a smell; I think your approach is reasonable. Let me see if there's a clean way to refactor it.
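A rough sketch of that key (field names are illustrative, not the actual AlignmentRecord accessors):

object SortKeySketch {
  // Proposed composite sort key: mapped reads keep their (library, position)
  // ordering; unmapped reads fall back to read name at position 0, so no
  // "ZZZ" contig-name prefix is needed.
  def sortKey(readMapped: Boolean,
              library: String,
              position: Long,
              readName: String): (Boolean, String, Long) =
    if (readMapped) (true, library, position)
    else (false, readName, 0L)
}

Sorting by this tuple uses the standard Ordering for (Boolean, String, Long), so mapped and unmapped reads end up in separate, internally ordered ranges.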

@massie
Member

massie commented Mar 24, 2015

As I understand it, the Spark Partitioner uses the hashCode() method for assigning partitions.

In Avro, the hashCode() method is in GenericData.java and reads...

/** Compute a hash code according to a schema, consistent with {@link
   * #compare(Object,Object,Schema)}. */
  public int hashCode(Object o, Schema s) {
    if (o == null) return 0;                      // incomplete datum
    int hashCode = 1;
    switch (s.getType()) {
    case RECORD:
      for (Field f : s.getFields()) {
        if (f.order() == Field.Order.IGNORE)
          continue;
        hashCode = hashCodeAdd(hashCode,
                               getField(o, f.name(), f.pos()), f.schema());
      }
      return hashCode;
    case ARRAY:
      Collection<?> a = (Collection<?>)o;
      Schema elementType = s.getElementType();
      for (Object e : a)
        hashCode = hashCodeAdd(hashCode, e, elementType);
      return hashCode;
    case UNION:
      return hashCode(o, s.getTypes().get(resolveUnion(s, o)));
    case ENUM:
      return s.getEnumOrdinal(o.toString());
    case NULL:
      return 0;
    case STRING:
      return (o instanceof Utf8 ? o : new Utf8(o.toString())).hashCode();
    default:
      return o.hashCode();
    }
  }

If we just use an AlignmentRecord with all fields NULL except the reference, position, and map flag, then we should be OK. The Avro record also supports Comparable for sorting.
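As a rough sketch of that partitioning idea (a plain case class stands in for the stripped-down Avro key record, and the field names are illustrative, not checked against the actual schema):

object PartitionKeySketch {
  import org.apache.spark.HashPartitioner
  import org.apache.spark.rdd.RDD

  // Stand-in key carrying only the fields that matter for grouping; the
  // proposal above would instead use an AlignmentRecord with every other
  // field left null, relying on Avro's schema-driven hashCode().
  case class PositionKey(referenceName: String, start: Long, readMapped: Boolean)

  // Spark's HashPartitioner calls hashCode() on the key of each pair to pick
  // one of numPartitions partitions. The value type here is just a
  // placeholder for the read record.
  def partitionReads(reads: RDD[(PositionKey, String)],
                     numPartitions: Int): RDD[(PositionKey, String)] =
    reads.partitionBy(new HashPartitioner(numPartitions))
}

A case class key gives the same value-based, consistent hashCode() that the nulled-out Avro record would provide through its schema.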

@fnothaft
Member Author

Rebased. Ping for review/merge of this and #623.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/660/
Test PASSed.

massie added a commit that referenced this pull request on Mar 31, 2015:
Conceptual fix for duplicate marking and sorting stragglers

massie merged commit f9f9caf into bigdatagenomics:master on Mar 31, 2015
@massie
Member

massie commented Mar 31, 2015

Thanks, Frank!
