Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADAM-599, 905: Move to bdg-formats:0.7.0 and migrate metadata #906

Merged
merged 2 commits into from
Jan 12, 2016

Conversation

fnothaft
Copy link
Member

Resolves #599 and #905:

  • Moves to bdg-formats:0.7.0, where the recordGroup metadata fields have been eliminated from the AlignmentRecord schema.
  • Adds code so that loadAlignments always returns Sequence and RecordGroup dictionaries.
  • Supports the loading/storage of Sequence/RecordGroup dictionaries by writing them to disk as Avro files using the Contig and RecordGroupMetadata records from bdg-formats.
  • Scala/javadoc cleanup related to the above changes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1034/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1035/
Test PASSed.

@@ -70,7 +70,7 @@ class Adam2Fastq(val args: Adam2FastqArgs) extends BDGSparkCommand[Adam2FastqArg
else
None

var reads: RDD[AlignmentRecord] = sc.loadAlignments(args.inputPath, projection = projectionOpt)
var reads: RDD[AlignmentRecord] = sc.loadAlignments(args.inputPath, projection = projectionOpt)._1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather this primary API entry point didn't return a tuple. How about a new load method that returns the tuple, and keep this one returning the RDD?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1038/
Test PASSed.

@laserson
Copy link
Contributor

laserson commented Jan 7, 2016

I may be a little out of context, since I've been offline for a month. I tend to be on the side of doing less magic for the user. Will returning such a tuple turn out to be a hassle if a user does some more exotic things other than just vanilla genome sequencing? Will there be situations where the additional non-read objects will be dummies? I would also support avoiding a tuple and replacing with a named type in case we go down that path.

@fnothaft
Copy link
Member Author

OK, rebased and added code to address the tuple method signature in a new commit (999f48f).

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1045/

Build result: FAILURE

GitHub pull request #906 of commit 999f48f automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/906/merge^{commit} # timeout=10 > git branch -a --contains 5561a72212c3b57bf6e1ac28ce69c5aa4c16f1c7 # timeout=10 > git rev-parse remotes/origin/pr/906/merge^{commit} # timeout=10Checking out Revision 5561a72212c3b57bf6e1ac28ce69c5aa4c16f1c7 (origin/pr/906/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 5561a72212c3b57bf6e1ac28ce69c5aa4c16f1c7First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh
Copy link
Member

heuermh commented Jan 11, 2016

@fnothaft thank you for the update! I don't see the source for GenomicRDD in the diff, did I miss that?

@fnothaft
Copy link
Member Author

Added the missing file and squashed down to two commits (one per issue).

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1046/
Test PASSed.

*
* @param filePath Path to the file on disk.
*
* @return Returns a Tuple3 containing (an RDD of reads, the sequence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tuple3 → an aligned read RDD (a tuple of ...

@heuermh
Copy link
Member

heuermh commented Jan 12, 2016

Even though this makes a lot of changes, some of which might be binary-incompatible (when it comes to the implicit stuff with scala I'm not too sure), my downstream projects still work.

+1 from me after minor doc fixes mentioned above.

Resolves bigdatagenomics#599. Since we have added the RecordGroupMetadata fields in
bdg-formats:0.7.0, we can read/write our metadata as separate Avro files. We
process these files when loading/writing the Parquet files where the alignment
data is stored. This allows us to both eliminate the bulky metadata that we are
currently storing in the AlignmentRecord, while maintaining the Sequence and
RecordGroup dictionaries that we need to keep around.
@fnothaft
Copy link
Member Author

@heuermh fixed the doc issues.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1049/
Test PASSed.

heuermh added a commit that referenced this pull request Jan 12, 2016
ADAM-599, 905: Move to bdg-formats:0.7.0 and migrate metadata
@heuermh heuermh merged commit 4415b04 into bigdatagenomics:master Jan 12, 2016
@heuermh
Copy link
Member

heuermh commented Jan 12, 2016

Thank you, @fnothaft!

fnothaft added a commit to fnothaft/adam that referenced this pull request Feb 9, 2016
Resolves bigdatagenomics#934. If the library name for a read group is not set, we will use a
null string during the groupBy. This is equivalent to our pre-bigdatagenomics#906
implementation. Additionally, this commit adds logging that prints a warning
message for the user if there are read groups whose library ID is not set.
heuermh pushed a commit that referenced this pull request Feb 13, 2016
Resolves #934. If the library name for a read group is not set, we will use a
null string during the groupBy. This is equivalent to our pre-#906
implementation. Additionally, this commit adds logging that prints a warning
message for the user if there are read groups whose library ID is not set.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants