ADAM-599, 905: Move to bdg-formats:0.7.0 and migrate metadata #906

fnothaft · 2015-12-27T22:09:14Z

Resolves #599 and #905:

Moves to bdg-formats:0.7.0, where the recordGroup metadata fields have been eliminated from the AlignmentRecord schema.
Adds code so that loadAlignments always returns Sequence and RecordGroup dictionaries.
Supports the loading/storage of Sequence/RecordGroup dictionaries by writing them to disk as Avro files using the Contig and RecordGroupMetadata records from bdg-formats.
Cleans up Scaladoc/Javadoc related to the above changes.
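
As a rough illustration of the change described above, here is a minimal sketch using plain Scala collections instead of Spark RDDs or the real bdg-formats classes. The names `Read`, `RecordGroup`, `RecordGroupDictionary`, and `sampleOf` are hypothetical stand-ins, not the ADAM API. The point is only that each read carries a record group name, while the group's metadata is stored once in a dictionary loaded alongside the reads.

```scala
object MetadataSketch {
  // Hypothetical stand-in for a record group metadata record.
  final case class RecordGroup(name: String, sample: String, library: Option[String])

  // Hypothetical stand-in for an alignment record after the migration:
  // it carries only the record group name, not the group's metadata.
  final case class Read(readName: String, recordGroupName: String)

  // Dictionary holding one metadata entry per record group.
  final case class RecordGroupDictionary(groups: Seq[RecordGroup]) {
    private val byName: Map[String, RecordGroup] =
      groups.map(g => g.name -> g).toMap

    def lookup(name: String): Option[RecordGroup] = byName.get(name)
  }

  // Resolve a read's sample through the dictionary rather than the read itself.
  def sampleOf(read: Read, dict: RecordGroupDictionary): Option[String] =
    dict.lookup(read.recordGroupName).map(_.sample)

  // Illustrative data only.
  val dictionary: RecordGroupDictionary = RecordGroupDictionary(Seq(
    RecordGroup("rg1", "sampleA", Some("libX")),
    RecordGroup("rg2", "sampleB", None)
  ))

  val reads: Seq[Read] = Seq(Read("read1", "rg1"), Read("read2", "rg2"))
}
```

With this layout, metadata that used to be repeated on every read is looked up on demand, for example `MetadataSketch.sampleOf(MetadataSketch.reads.head, MetadataSketch.dictionary)`.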

AmplabJenkins · 2015-12-27T22:27:43Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1034/
Test PASSed.

AmplabJenkins · 2015-12-29T00:18:49Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1035/
Test PASSed.

heuermh · 2015-12-29T18:09:02Z

adam-cli/src/main/scala/org/bdgenomics/adam/cli/Adam2Fastq.scala

@@ -70,7 +70,7 @@ class Adam2Fastq(val args: Adam2FastqArgs) extends BDGSparkCommand[Adam2FastqArg
      else
        None

-    var reads: RDD[AlignmentRecord] = sc.loadAlignments(args.inputPath, projection = projectionOpt)
+    var reads: RDD[AlignmentRecord] = sc.loadAlignments(args.inputPath, projection = projectionOpt)._1


I would rather this primary API entry point didn't return a tuple. How about a new load method that returns the tuple, and keep this one returning the RDD?
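
A minimal sketch of the two-entry-point shape suggested in this comment, using plain Scala collections in place of RDDs. The method name `loadAlignmentsWithMetadata` and the simplified dictionary types are assumptions made for illustration; they are not taken from the ADAM codebase.

```scala
object LoadApiSketch {
  // Simplified, hypothetical stand-ins for the real record and dictionary types.
  final case class Read(name: String)
  final case class SequenceDictionary(contigNames: Seq[String])
  final case class RecordGroupDictionary(groupNames: Seq[String])

  // Hypothetical loader that returns the reads together with both dictionaries.
  // The path is ignored here; a real loader would read from it.
  def loadAlignmentsWithMetadata(
    path: String): (Seq[Read], SequenceDictionary, RecordGroupDictionary) =
    (Seq(Read("read1"), Read("read2")),
      SequenceDictionary(Seq("chr1")),
      RecordGroupDictionary(Seq("rg1")))

  // The primary entry point keeps its original shape, so callers that only
  // need reads do not have to unpack a tuple with `._1`.
  def loadAlignments(path: String): Seq[Read] =
    loadAlignmentsWithMetadata(path)._1
}
```

Callers that need the dictionaries opt in through the second method, and existing call sites such as the one in Adam2Fastq would stay unchanged.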

AmplabJenkins · 2015-12-29T21:50:12Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1038/
Test PASSed.

laserson · 2016-01-07T23:34:52Z

I may be a little out of context, since I've been offline for a month. I tend to be on the side of doing less magic for the user. Will returning such a tuple turn out to be a hassle if a user does some more exotic things other than just vanilla genome sequencing? Will there be situations where the additional non-read objects will be dummies? I would also support avoiding a tuple and replacing with a named type in case we go down that path.
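
One way to picture the named-type alternative raised here is the sketch below. It uses plain Scala collections, and every name in it (`AlignedReads`, `toReads`, `countReads`) is hypothetical rather than the type that was eventually merged. It also assumes that, when no record group information exists, the dictionary is simply empty rather than filled with dummy entries.

```scala
import scala.language.implicitConversions

object NamedResultSketch {
  // Simplified, hypothetical stand-ins for the real record and dictionary types.
  final case class Read(name: String)
  final case class SequenceDictionary(contigNames: Seq[String])
  final case class RecordGroupDictionary(groupNames: Seq[String])

  // Named result type: fields are self-describing, unlike _1 / _2 / _3.
  final case class AlignedReads(
    reads: Seq[Read],
    sequences: SequenceDictionary,
    recordGroups: RecordGroupDictionary)

  object AlignedReads {
    // Optional convenience so code that expects Seq[Read] keeps compiling.
    implicit def toReads(aligned: AlignedReads): Seq[Read] = aligned.reads
  }

  // Illustrative loader; the record group dictionary is empty, not a dummy.
  def load(): AlignedReads = AlignedReads(
    Seq(Read("read1"), Read("read2")),
    SequenceDictionary(Seq("chr1")),
    RecordGroupDictionary(Seq.empty))

  // A pre-existing style of function that only knows about reads.
  def countReads(reads: Seq[Read]): Int = reads.size

  // The implicit conversion is applied here.
  val readCount: Int = countReads(load())
}
```

The implicit conversion is a trade-off: it keeps old call sites working, but it is also the kind of "magic" this comment cautions against, so an explicit `.reads` accessor may be preferable.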

fnothaft · 2016-01-11T19:27:44Z

OK, rebased and added code to address the tuple method signature in a new commit (999f48f).

AmplabJenkins · 2016-01-11T19:30:04Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1045/

Build result: FAILURE

GitHub pull request #906 of commit 999f48f automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/906/merge^{commit} # timeout=10
 > git branch -a --contains 5561a72212c3b57bf6e1ac28ce69c5aa4c16f1c7 # timeout=10
 > git rev-parse remotes/origin/pr/906/merge^{commit} # timeout=10
Checking out Revision 5561a72212c3b57bf6e1ac28ce69c5aa4c16f1c7 (origin/pr/906/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5561a72212c3b57bf6e1ac28ce69c5aa4c16f1c7
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb » 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

heuermh · 2016-01-11T19:40:59Z

@fnothaft thank you for the update! I don't see the source for GenomicRDD in the diff, did I miss that?

fnothaft · 2016-01-12T00:44:18Z

Added the missing file and squashed down to two commits (one per issue).

AmplabJenkins · 2016-01-12T01:04:31Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1046/
Test PASSed.

heuermh · 2016-01-12T05:57:11Z

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala

+   *
+   * @param filePath Path to the file on disk.
+   *
+   * @return Returns a Tuple3 containing (an RDD of reads, the sequence


Tuple3 → an aligned read RDD (a tuple of ...

heuermh · 2016-01-12T17:58:46Z

Even though this makes a lot of changes, some of which might be binary-incompatible (when it comes to the implicit stuff with scala I'm not too sure), my downstream projects still work.

+1 from me after minor doc fixes mentioned above.

Resolves bigdatagenomics#599. Since we have added the RecordGroupMetadata fields in bdg-formats:0.7.0, we can read/write our metadata as separate Avro files. We process these files when loading/writing the Parquet files where the alignment data is stored. This allows us to eliminate the bulky metadata that we currently store in each AlignmentRecord, while keeping the Sequence and RecordGroup dictionaries that we need.

fnothaft · 2016-01-12T20:18:38Z

@heuermh fixed the doc issues.

AmplabJenkins · 2016-01-12T20:38:21Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1049/
Test PASSed.

heuermh · 2016-01-12T20:40:29Z

Thank you, @fnothaft!

Resolves #934. If the library name for a read group is not set, we will use a null string during the groupBy. This is equivalent to our pre-#906 implementation. Additionally, this commit adds logging that prints a warning message for the user if there are read groups whose library ID is not set.
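
The behaviour described in this commit message can be sketched roughly as follows, using plain Scala collections rather than Spark. All names here (`RecordGroup`, `Read`, `groupByLibrary`) are hypothetical, and the warning is written to standard error only as a stand-in for whatever logger the real code uses.

```scala
object LibraryGroupingSketch {
  // Simplified, hypothetical stand-ins for the real record types.
  final case class RecordGroup(name: String, library: Option[String])
  final case class Read(name: String, recordGroupName: String)

  // Group reads by library name. Reads whose record group has no library
  // set (or whose record group is unknown) are grouped under a null key.
  def groupByLibrary(
    reads: Seq[Read],
    groups: Map[String, RecordGroup]): Map[String, Seq[Read]] = {
    // Warn once, listing every record group without a library ID.
    val unset = groups.values.filter(_.library.isEmpty).map(_.name).toSeq.sorted
    if (unset.nonEmpty) {
      Console.err.println(
        s"WARN: library ID is not set for read groups: ${unset.mkString(", ")}")
    }
    reads.groupBy(r => groups.get(r.recordGroupName).flatMap(_.library).orNull)
  }
}
```

Using a null key keeps all reads without a library in a single group instead of failing the groupBy, which is the behaviour the message describes as equivalent to the pre-#906 implementation.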

fnothaft force-pushed the eliminate-metadata branch from ddebbd8 to 3ff9c1e Compare December 29, 2015 00:00

heuermh reviewed Dec 29, 2015

This was referenced Dec 29, 2015

Spark/ADAM Pipeline BD2KGenomics/toil-scripts#72

Merged

Single file save from #733, rebased #901

Merged

fnothaft force-pushed the eliminate-metadata branch from 3ff9c1e to eeb83ff Compare December 29, 2015 21:15

heuermh mentioned this pull request Dec 29, 2015

Various small fixes #907

Closed

fnothaft added this to the 0.19.0 milestone Jan 7, 2016

fnothaft mentioned this pull request Jan 7, 2016

Load/store sequence dictionaries alongside Genotype RDDs #909

Closed

[ADAM-905] Move to bdg-formats 0.7.0. Resolves bigdatagenomics#905.

951e68d

fnothaft force-pushed the eliminate-metadata branch from eeb83ff to 999f48f Compare January 11, 2016 19:27

fnothaft force-pushed the eliminate-metadata branch from 999f48f to 5480e4c Compare January 12, 2016 00:43

heuermh reviewed Jan 12, 2016

fnothaft force-pushed the eliminate-metadata branch from 5480e4c to 493dd2b Compare January 12, 2016 20:18

heuermh added a commit that referenced this pull request Jan 12, 2016

Merge pull request #906 from fnothaft/eliminate-metadata

4415b04

ADAM-599, 905: Move to bdg-formats:0.7.0 and migrate metadata

heuermh merged commit 4415b04 into bigdatagenomics:master Jan 12, 2016

fnothaft mentioned this pull request Jan 15, 2016

Add back limit_projection on Transform #920

Closed

fnothaft mentioned this pull request Jan 23, 2016

Future of schemas in bdg-formats #925

Closed

fnothaft mentioned this pull request Feb 9, 2016

MarkDuplicates fails if library name is not set #934

Closed

fnothaft mentioned this pull request Feb 9, 2016

[ADAM-934] Properly handle unset library name during duplicate marking #935

Closed

heuermh mentioned this pull request Mar 31, 2016

Explore if SeqDict data can be factored out more aggressively #983

Closed

fnothaft mentioned this pull request Jul 6, 2016

Normalize AlignmentRecord.recordGroup* fields onto a separate record type #828

Closed

heuermh mentioned this pull request Jul 22, 2016

Support Hive-style partitioning #651

Closed
