Conceptual fix for duplicate marking and sorting stragglers #624
Test PASSed. |
Getting this when running against NA12878.hiseq.wgs.bwa.raw.bam on 98 slaves:
|
@hannes-ucsc looking now. |
1268644 to d0891e2
Test PASSed. |
Confirmed working on a CCLE exome: 10 r3.xlarge instances, spark.default.parallelism set to 40.
|
@hannes-ucsc Excellent! Any chance you'd know what the variance is without the patch? |
this PR, stage 2 task duration in min:
0.16.0, stage 2 task duration in min: |
I don't think this is conclusive. I mainly ran this to catch bugs. Will run WGS next week. BTW: I used this BAM: https://browser.cghub.ucsc.edu/details/ebdb53ae-6386-4bc4-90b1-4f249ff9fcdf/ |
Yeah, this doesn't appear to be demonstrating stragglers. Let's wait for a WGS run. |
I'm happy to report that the WGS run finished successfully. The per-stage running times in minutes were 32, 51, and 34, respectively. All tasks in the last stage (the one that always had the two stragglers) finished within a minute of each other. This is great news, since it marks our first successful WGS run. Can I just say one thing: the "ZZZ" smells bad to me. Is there a way that instead of sorting by the tuple |
@hannes-ucsc w00t! Great to hear. There is a similar straggler problem in INDEL realignment; I'll have a PR for that upcoming (possibly by the end of the week? not sure).
I agree that the |
As I understand it, the Spark In Avro, the
    /** Compute a hash code according to a schema, consistent with {@link
     * #compare(Object,Object,Schema)}. */
    public int hashCode(Object o, Schema s) {
      if (o == null) return 0;                      // incomplete datum
      int hashCode = 1;
      switch (s.getType()) {
      case RECORD:
        for (Field f : s.getFields()) {
          if (f.order() == Field.Order.IGNORE)
            continue;
          hashCode = hashCodeAdd(hashCode,
                                 getField(o, f.name(), f.pos()), f.schema());
        }
        return hashCode;
      case ARRAY:
        Collection<?> a = (Collection<?>)o;
        Schema elementType = s.getElementType();
        for (Object e : a)
          hashCode = hashCodeAdd(hashCode, e, elementType);
        return hashCode;
      case UNION:
        return hashCode(o, s.getTypes().get(resolveUnion(s, o)));
      case ENUM:
        return s.getEnumOrdinal(o.toString());
      case NULL:
        return 0;
      case STRING:
        return (o instanceof Utf8 ? o : new Utf8(o.toString())).hashCode();
      default:
        return o.hashCode();
      }
    }
If we just use an |
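For context on why key hash codes matter in this thread, here is a minimal sketch. It assumes keys are assigned to partitions by taking the key's hash code modulo the partition count, with a non-negative result, which is the scheme Spark's HashPartitioner uses. The class name, method names, and read names are hypothetical and not taken from ADAM. The sketch only illustrates that when every unmapped read shares one key they all land in a single partition, whereas per-read keys are spread over several.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionSkewSketch {
    // Map a key to a partition via hashCode modulo numPartitions.
    // The adjustment keeps the result non-negative for negative hash codes.
    public static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Count how many keys fall into each partition.
    public static int[] histogram(List<String> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int reads = 1000;
        int partitions = 8;
        List<String> sharedKey = new ArrayList<>();
        List<String> perReadKey = new ArrayList<>();
        for (int i = 0; i < reads; i++) {
            sharedKey.add("unmapped");       // assumption: one key for all unmapped reads
            perReadKey.add("ZZZread" + i);   // assumption: made-up read names
        }
        // One bucket holds everything in the first case; the second is spread out.
        System.out.println(Arrays.toString(histogram(sharedKey, partitions)));
        System.out.println(Arrays.toString(histogram(perReadKey, partitions)));
    }
}
```

The single overloaded partition in the first histogram corresponds to the kind of straggler task discussed earlier in the thread, under the stated assumption about how keys map to partitions.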
…Region. Consolidates ordered classes down into unordered classes.
…hAlignmentRecord.
d0891e2 to 6c9ab29
Rebased. Ping for review/merge of this and #623. |
Test PASSed. |
Conceptual fix for duplicate marking and sorting stragglers
Thanks, Frank! |
Resolves #574, depends on #623. In this pull request, we associate unmapped reads with a reference position that is derived from either their read name or their read sequence during the sort and duplicate marking phases. Specifically, sort uses ZZZ<read_name> and duplicate marking uses <read_sequence> as the contig name. Unmapped reads are placed at position 0 on the contig. Conceptually, this change should also enable duplicate marking on unmapped reads. I haven't tested this yet, but will spin a cluster up sometime soon.
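The key scheme described above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the code in this pull request: the `Position` class, the method names, and the sample reads are hypothetical, and ADAM's actual record and position types are not shown. It only demonstrates the two derived keys (contig `ZZZ<read_name>` for sorting, contig `<read_sequence>` for duplicate marking, both at position 0) and why keying on the sequence lets unmapped reads with identical sequences be grouped together.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class UnmappedKeySketch {
    // Hypothetical (contig name, position) pair standing in for a reference position.
    public static final class Position {
        public final String contig;
        public final long pos;

        public Position(String contig, long pos) {
            this.contig = contig;
            this.pos = pos;
        }

        @Override public boolean equals(Object other) {
            if (!(other instanceof Position)) return false;
            Position p = (Position) other;
            return contig.equals(p.contig) && pos == p.pos;
        }

        @Override public int hashCode() { return Objects.hash(contig, pos); }

        @Override public String toString() { return contig + ":" + pos; }
    }

    // Sort key for an unmapped read: contig "ZZZ<read_name>", position 0.
    public static Position sortKey(String readName) {
        return new Position("ZZZ" + readName, 0L);
    }

    // Duplicate-marking key for an unmapped read: contig "<read_sequence>", position 0.
    public static Position duplicateKey(String readSequence) {
        return new Position(readSequence, 0L);
    }

    // Group read names by duplicate key; each input row is {name, sequence}.
    public static Map<Position, List<String>> groupBySequence(String[][] reads) {
        Map<Position, List<String>> groups = new LinkedHashMap<>();
        for (String[] read : reads) {
            groups.computeIfAbsent(duplicateKey(read[1]), k -> new ArrayList<>()).add(read[0]);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[][] reads = {            // made-up unmapped reads
            {"r1", "ACGT"}, {"r2", "ACGT"}, {"r3", "TTGA"}
        };
        for (String[] read : reads) {
            System.out.println(read[0] + " sorts at " + sortKey(read[0]));
        }
        System.out.println(groupBySequence(reads));
    }
}
```

In this sketch each read name yields its own sort key, while reads that share a sequence share a duplicate-marking key and so end up in the same group.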