Ensure splits are sorted by path name #200

tomwhite · 2018-06-14T10:28:23Z

Untested fix for #199. @lbergelson are you able to try this out?

coveralls · 2018-06-14T10:32:02Z

Coverage increased (+0.02%) to 63.753% when pulling 08bf388 on tomwhite:always_sort_splits_by_name into 9f974cf on HadoopGenomics:master.

lbergelson · 2018-06-18T17:17:06Z

@tomwhite Would it be possible for you to publish a snapshot of this? I'm not sure how to publish Hadoop-BAM snapshots, and I was only able to reproduce the original problem on Travis.

lbergelson · 2018-06-18T17:22:06Z

src/main/java/org/seqdoop/hadoop_bam/AnySAMInputFormat.java

@@ -229,18 +230,21 @@ else if (split instanceof FileVirtualSplit)
 		final List<InputSplit> origSplits =
 				BAMInputFormat.removeIndexFiles(super.getSplits(job));

+		final List<InputSplit> sortedSplits = new ArrayList<>(origSplits);
+		sortedSplits.sort(Comparator.comparing(split -> ((FileSplit) split).getPath()));


Is this cast problematic? I ran into the problem that some splits are FileVirtualSplits, which are not a subtype of FileSplit, and there isn't any common supertype that offers getPath. Maybe FileVirtualSplit should extend FileSplit?

tomwhite · 2018-06-19

AnySAMInputFormat extends Hadoop's FileInputFormat, which returns FileSplit objects and never FileVirtualSplit. So the cast is safe at this point in the code.
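For readers following the cast discussion, here is a minimal sketch of the same sort-by-path idea with an explicit type check in front of the cast. It does not use Hadoop: `InputSplit` and `FileSplit` below are simplified, hypothetical stand-ins for the Hadoop classes (the path is a plain `String` here), and `sortByPath` and `SortSplitsSketch` are illustrative names that are not part of this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortSplitsSketch {
    // Hypothetical stand-in for Hadoop's InputSplit (assumption, not the real class).
    interface InputSplit {}

    // Hypothetical stand-in for Hadoop's FileSplit; the path is a String for simplicity.
    static final class FileSplit implements InputSplit {
        private final String path;
        private final long start;

        FileSplit(String path, long start) {
            this.path = path;
            this.start = start;
        }

        String getPath() { return path; }
        long getStart() { return start; }
    }

    // Returns a sorted copy of the splits, ordered by path.
    // Fails fast with a clear message if any split is not a FileSplit,
    // instead of surfacing a ClassCastException from inside the comparator.
    static List<InputSplit> sortByPath(List<? extends InputSplit> splits) {
        for (InputSplit split : splits) {
            if (!(split instanceof FileSplit)) {
                throw new IllegalArgumentException(
                        "Expected FileSplit but got " + split.getClass().getName());
            }
        }
        List<InputSplit> sorted = new ArrayList<>(splits);
        // List.sort is stable, so splits of the same file keep their relative order.
        sorted.sort(Comparator.comparing(split -> ((FileSplit) split).getPath()));
        return sorted;
    }

    public static void main(String[] args) {
        List<InputSplit> splits = Arrays.asList(
                new FileSplit("part-b.bam", 0L),
                new FileSplit("part-a.bam", 0L));
        for (InputSplit split : sortByPath(splits)) {
            System.out.println(((FileSplit) split).getPath());
        }
    }
}
```

The up-front check is optional if, as stated above, the superclass only ever returns FileSplit objects; it only makes that assumption visible should it ever stop holding.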

tomwhite · 2018-06-19T14:23:23Z

@lbergelson Snapshots are only published from master, not for PRs, and I don't know a simple way to publish a snapshot.

One way to work around this is to use a Maven system dependency (just for testing, of course). Check in the Hadoop-BAM JAR that you've built locally, then change the dependency in GATK to reference it by path; see an example here: https://github.com/broadinstitute/gatk/compare/tw_squark#diff-c197962302397baf3a4cc36463dce5ea.

magicDGS · 2018-06-19T14:38:03Z

Another way is to use a snapshot built with jitpack.io - https://jitpack.io/#HadoopGenomics/Hadoop-BAM

Ensure splits are sorted by path name

08bf388
