
Conceptual fix for duplicate marking and sorting stragglers #624

Merged 3 commits into bigdatagenomics:master on Mar 31, 2015

Conversation

fnothaft
Member

Resolves #574, depends on #623. In this pull request, we associate unmapped reads with a reference position that is derived from either their read name or their read sequence during the sort and duplicate marking phases. Specifically, sort uses ZZZ<read_name> and duplicate marking uses <read_sequence> as the contig name. Unmapped reads are placed at position 0 on the contig. Conceptually, this change should also enable duplicate marking on unmapped reads. I haven't tested this yet, but will spin a cluster up sometime soon.
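As a minimal sketch of this keying scheme (a plain case class stands in for ADAM's ReferencePosition; the helper names are illustrative, not the actual implementation):

object UnmappedKeySketch {
  // Illustrative stand-in for ADAM's ReferencePosition.
  case class RefPos(referenceName: String, pos: Long)

  // Sort phase: unmapped reads get a synthetic contig named "ZZZ" + read name,
  // pinned to position 0, so they sort after reads on real contigs.
  def sortKey(readMapped: Boolean,
              referenceName: String,
              start: Long,
              readName: String): RefPos =
    if (readMapped) RefPos(referenceName, start)
    else RefPos("ZZZ" + readName, 0L)

  // Duplicate marking phase: unmapped reads are keyed by their read sequence
  // instead, again at position 0, so identical sequences group together.
  def duplicateKey(readMapped: Boolean,
                   referenceName: String,
                   start: Long,
                   sequence: String): RefPos =
    if (readMapped) RefPos(referenceName, start)
    else RefPos(sequence, 0L)
}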

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/643/
Test PASSed.

@hannes-ucsc
Contributor

Getting this when running against NA12878.hiseq.wgs.bwa.raw.bam on 98 slaves:

java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at org.bdgenomics.adam.models.ReferenceRegion.<init>(ReferenceRegion.scala:127)
    at org.bdgenomics.adam.models.ReferencePosition.<init>(ReferencePosition.scala:86)
    at org.bdgenomics.adam.models.ReferencePosition$.apply(ReferencePosition.scala:82)
    at org.bdgenomics.adam.rich.RichAlignmentRecord.fivePrimeReferencePosition(RichAlignmentRecord.scala:134)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.org$bdgenomics$adam$models$ReferencePositionPair$$anonfun$$getPos$1(ReferencePositionPair.scala:38)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:45)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1$$anonfun$apply$2.apply(ReferencePositionPair.scala:45)
    at scala.Option.map(Option.scala:145)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:45)
    at org.bdgenomics.adam.models.ReferencePositionPair$$anonfun$apply$1.apply(ReferencePositionPair.scala:30)
    at org.apache.spark.rdd.Timer.time(Timer.scala:57)
    at org.bdgenomics.adam.models.ReferencePositionPair$.apply(ReferencePositionPair.scala:30)
    at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$apply$1.apply(MarkDuplicates.scala:79)
    at org.bdgenomics.adam.rdd.read.MarkDuplicates$$anonfun$apply$1.apply(MarkDuplicates.scala:79)
    at org.apache.spark.rdd.RDD$$anonfun$keyBy$1.apply(RDD.scala:1227)
    at org.apache.spark.rdd.RDD$$anonfun$keyBy$1.apply(RDD.scala:1227)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

@fnothaft
Member Author

@hannes-ucsc looking now.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/646/
Test PASSed.

@hannes-ucsc
Contributor

Confirmed working on CCLE exome.

10 r3.xlarge, s.d.p 40

Stage 2 task duration: average 40.75, stdev 5.74, median 41

@fnothaft
Member Author

@hannes-ucsc Excellent! Any chance you'd know what the variance is without the patch?

@hannes-ucsc
Contributor

Stage 2 task duration in min:

             this PR          0.16.0
Median       6                5.6
Average      5.9875           5.67
StdDev       0.20778132       0.2523733498
Variance     0.04317307692    0.06369230769

@hannes-ucsc
Contributor

I don't think this is conclusive. I mainly ran this to catch bugs. Will run WGS next week.

BTW: I used this BAM: https://browser.cghub.ucsc.edu/details/ebdb53ae-6386-4bc4-90b1-4f249ff9fcdf/

@fnothaft
Member Author

Yeah, this doesn't appear to be demonstrating stragglers. Let's wait for a WGS run.

@hannes-ucsc
Contributor

I'm happy to report that the WGS run finished successfully. The per-stage running times in minutes were 32, 51, and 34, respectively. All tasks in the last stage (the one that always had the two stragglers) finished within a minute of each other.

So this is great news since it marks the first successful WGS run for us. Can I just say one thing: the "ZZZ" smells bad to me. Is there a way that instead of sorting by the tuple ( library, position ) we can sort by the triple ( is_mapped, if is_mapped library else read_name, if is_mapped position else 0 )?

@fnothaft
Member Author

@hannes-ucsc w00t! Great to hear. There is a similar straggler problem in INDEL realignment; I'll have a PR for that coming up (possibly by the end of the week, not sure).

Can I just say one thing: the "ZZZ" smells bad to me. Is there a way that instead of sorting by the tuple ( library, position ) we can sort by the triple ( is_mapped, if is_mapped library else read_name, if is_mapped position else 0 )?

I agree that the ZZZ is a smell; I think your approach is reasonable. Let me see if there's a clean way to refactor it.
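A rough sketch of that key (field names are illustrative, not the actual AlignmentRecord accessors):

object SortKeySketch {
  // Proposed composite sort key: mapped reads keep their (library, position)
  // ordering; unmapped reads fall back to read name at position 0, so no
  // "ZZZ" contig-name prefix is needed.
  def sortKey(readMapped: Boolean,
              library: String,
              position: Long,
              readName: String): (Boolean, String, Long) =
    if (readMapped) (true, library, position)
    else (false, readName, 0L)
}

Sorting by this tuple uses the standard Ordering for (Boolean, String, Long), so mapped and unmapped reads end up in separate, internally ordered ranges.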

@massie
Member

massie commented Mar 24, 2015

As I understand it, the Spark Partitioner uses the hashCode() method for assigning partitions.

In Avro, the hashCode() method is in GenericData.java and reads...

/** Compute a hash code according to a schema, consistent with {@link
   * #compare(Object,Object,Schema)}. */
  public int hashCode(Object o, Schema s) {
    if (o == null) return 0;                      // incomplete datum
    int hashCode = 1;
    switch (s.getType()) {
    case RECORD:
      for (Field f : s.getFields()) {
        if (f.order() == Field.Order.IGNORE)
          continue;
        hashCode = hashCodeAdd(hashCode,
                               getField(o, f.name(), f.pos()), f.schema());
      }
      return hashCode;
    case ARRAY:
      Collection<?> a = (Collection<?>)o;
      Schema elementType = s.getElementType();
      for (Object e : a)
        hashCode = hashCodeAdd(hashCode, e, elementType);
      return hashCode;
    case UNION:
      return hashCode(o, s.getTypes().get(resolveUnion(s, o)));
    case ENUM:
      return s.getEnumOrdinal(o.toString());
    case NULL:
      return 0;
    case STRING:
      return (o instanceof Utf8 ? o : new Utf8(o.toString())).hashCode();
    default:
      return o.hashCode();
    }
  }

If we just use an AlignmentRecord with all fields NULL except the reference, position, and map flag, then we should be OK. The Avro record also supports Comparable for sorting.
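As a rough sketch of that partitioning idea (a plain case class stands in for the stripped-down Avro key record, and the field names are illustrative, not checked against the actual schema):

object PartitionKeySketch {
  import org.apache.spark.HashPartitioner
  import org.apache.spark.rdd.RDD

  // Stand-in key carrying only the fields that matter for grouping; the
  // proposal above would instead use an AlignmentRecord with every other
  // field left null, relying on Avro's schema-driven hashCode().
  case class PositionKey(referenceName: String, start: Long, readMapped: Boolean)

  // Spark's HashPartitioner calls hashCode() on the key of each pair to pick
  // one of numPartitions partitions. The value type here is just a
  // placeholder for the read record.
  def partitionReads(reads: RDD[(PositionKey, String)],
                     numPartitions: Int): RDD[(PositionKey, String)] =
    reads.partitionBy(new HashPartitioner(numPartitions))
}

A case class key gives the same value-based, consistent hashCode() that the nulled-out Avro record would provide through its schema.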

@fnothaft
Member Author

Rebased. Ping for review/merge of this and #623.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/660/
Test PASSed.

massie added a commit that referenced this pull request on Mar 31, 2015:
Conceptual fix for duplicate marking and sorting stragglers

massie merged commit f9f9caf into bigdatagenomics:master on Mar 31, 2015
@massie
Member

massie commented Mar 31, 2015

Thanks, Frank!
