ADAM-599, 905: Move to bdg-formats:0.7.0 and migrate metadata #906
Conversation
Test PASSed.
Force-pushed from ddebbd8 to 3ff9c1e.
Test PASSed.
@@ -70,7 +70,7 @@ class Adam2Fastq(val args: Adam2FastqArgs) extends BDGSparkCommand[Adam2FastqArgs]
       else
         None

-    var reads: RDD[AlignmentRecord] = sc.loadAlignments(args.inputPath, projection = projectionOpt)
+    var reads: RDD[AlignmentRecord] = sc.loadAlignments(args.inputPath, projection = projectionOpt)._1
I would rather this primary API entry point didn't return a tuple. How about a new load method that returns the tuple, and keep this one returning the RDD?
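As a rough illustration of this suggestion, a thin wrapper like the sketch below could keep the existing RDD-only entry point while exposing the metadata dictionaries through a second method. The trait and the name `loadAlignmentsWithMetadata` are hypothetical, not ADAM's actual API.

```scala
import org.apache.avro.Schema
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.models.{ RecordGroupDictionary, SequenceDictionary }
import org.bdgenomics.formats.avro.AlignmentRecord

trait AlignmentLoading {

  // Richer method that also returns the metadata dictionaries read from the
  // Avro sidecar files (name is illustrative only).
  def loadAlignmentsWithMetadata(
    filePath: String,
    projection: Option[Schema] = None): (RDD[AlignmentRecord], SequenceDictionary, RecordGroupDictionary)

  // The existing-style entry point keeps its RDD-only signature by delegating
  // and dropping the dictionaries.
  def loadAlignments(
    filePath: String,
    projection: Option[Schema] = None): RDD[AlignmentRecord] =
    loadAlignmentsWithMetadata(filePath, projection)._1
}
```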
Force-pushed from 3ff9c1e to eeb83ff.
Test PASSed.
I may be a little out of context, since I've been offline for a month. I tend to be on the side of doing less magic for the user. Will returning such a tuple turn out to be a hassle if a user does more exotic things than just vanilla genome sequencing? Will there be situations where the additional non-read objects are dummies? I would also support avoiding a tuple and replacing it with a named type if we go down that path.
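For the "named type instead of a tuple" idea, the result could be wrapped in a small case class along the lines of the sketch below. The `AlignmentLoadResult` name is invented for illustration and is not part of ADAM.

```scala
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.models.{ RecordGroupDictionary, SequenceDictionary }
import org.bdgenomics.formats.avro.AlignmentRecord

// Named result type: callers that only care about the reads use `.reads`,
// while the dictionaries stay available under descriptive field names.
case class AlignmentLoadResult(
  reads: RDD[AlignmentRecord],
  sequences: SequenceDictionary,
  recordGroups: RecordGroupDictionary)
```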
Force-pushed from eeb83ff to 999f48f.
OK, rebased and added code to address the tuple method signature in a new commit (999f48f).
Test FAILed. Build result: FAILURE. GitHub pull request #906 of commit 999f48f automatically merged; checked out revision 5561a72212c3b57bf6e1ac28ce69c5aa4c16f1c7 (origin/pr/906/merge). Triggered ADAM-prb for 2.6.0, 2.11, 1.4.1, centos and 2.6.0, 2.10, 1.4.1, centos. Touchstone configurations resulted in FAILURE, so aborting. Test FAILed.
@fnothaft thank you for the update! I don't see the source for …
Force-pushed from 999f48f to 5480e4c.
Added the missing file and squashed down to two commits (one per issue).
Test PASSed.
 *
 * @param filePath Path to the file on disk.
 *
 * @return Returns a Tuple3 containing (an RDD of reads, the sequence
Tuple3 → an aligned read RDD (a tuple of ...
Even though this makes a lot of changes, some of which might be binary-incompatible (when it comes to the implicit stuff with Scala I'm not too sure), my downstream projects still work. +1 from me after the minor doc fixes mentioned above.
Resolves bigdatagenomics#599. Since we have added the RecordGroupMetadata fields in bdg-formats:0.7.0, we can read/write our metadata as separate Avro files. We process these files when loading/writing the Parquet files where the alignment data is stored. This allows us to eliminate the bulky metadata that we are currently storing in each AlignmentRecord, while maintaining the Sequence and RecordGroup dictionaries that we need to keep around.
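A rough sketch of the sidecar-metadata idea in that commit message, using plain Avro writer APIs. The helper name and the `_rgdict.avro` file name are illustrative assumptions, not necessarily what ADAM writes, and a plain `java.io.File` is used to keep the sketch short (ADAM would presumably go through the Hadoop FileSystem API next to the Parquet data).

```scala
import java.io.File

import org.apache.avro.file.DataFileWriter
import org.apache.avro.specific.SpecificDatumWriter
import org.bdgenomics.formats.avro.RecordGroupMetadata

object MetadataSidecar {

  // Write the record group metadata as a small Avro container file alongside
  // the directory that holds the alignment Parquet data.
  def writeRecordGroupSidecar(outputDir: String, recordGroups: Seq[RecordGroupMetadata]): Unit = {
    val datumWriter = new SpecificDatumWriter[RecordGroupMetadata](classOf[RecordGroupMetadata])
    val fileWriter = new DataFileWriter[RecordGroupMetadata](datumWriter)
    fileWriter.create(RecordGroupMetadata.SCHEMA$, new File(outputDir, "_rgdict.avro"))
    try {
      recordGroups.foreach(fileWriter.append)
    } finally {
      fileWriter.close()
    }
  }
}
```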
Force-pushed from 5480e4c to 493dd2b.
@heuermh fixed the doc issues.
Test PASSed.
ADAM-599, 905: Move to bdg-formats:0.7.0 and migrate metadata
Thank you, @fnothaft!
Resolves bigdatagenomics#934. If the library name for a read group is not set, we will use a null string during the groupBy. This is equivalent to our pre-bigdatagenomics#906 implementation. Additionally, this commit adds logging that prints a warning message for the user if there are read groups whose library ID is not set.
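A minimal sketch of that fallback-and-warn behaviour, assuming a simple record-group-name-to-library map is available on the driver. The helper and the map are illustrative only and do not mirror ADAM's internal code.

```scala
import org.apache.spark.rdd.RDD
import org.bdgenomics.formats.avro.AlignmentRecord
import org.slf4j.LoggerFactory

object LibraryGrouping {

  private val log = LoggerFactory.getLogger(getClass)

  // Group reads by library name, falling back to a null key for read groups
  // whose library is not set, and warn the user once when that happens.
  def groupByLibrary(
    reads: RDD[AlignmentRecord],
    libraryByRecordGroup: Map[String, Option[String]]): RDD[(String, Iterable[AlignmentRecord])] = {

    if (libraryByRecordGroup.values.exists(_.isEmpty)) {
      log.warn("Found read groups with no library name set; their reads will be grouped under a null library.")
    }

    reads.groupBy { r =>
      val rgName = Option(r.getRecordGroupName).map(_.toString)
      // Reads with no record group, or whose read group has no library,
      // share a single null key, matching the pre-#906 behaviour.
      rgName.flatMap(libraryByRecordGroup.get).flatten.orNull
    }
  }
}
```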
Resolves #599 and #905:
- Removes the embedded metadata from the AlignmentRecord schema.
- loadAlignments always returns Sequence and RecordGroup dictionaries (see the usage sketch below).
- Uses the Contig and RecordGroupMetadata records from bdg-formats.
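A hedged usage sketch of what downstream code might look like with the Tuple3 shape described in the scaladoc above. The `ADAMContext._` import and the exact `loadAlignments` signature are assumptions about ADAM at the time of this PR and may differ from the final API.

```scala
import org.apache.spark.SparkContext
import org.bdgenomics.adam.rdd.ADAMContext._

object LoadExample {

  // Destructure the Tuple3: the read RDD plus the two metadata dictionaries
  // reconstructed from the Avro sidecar files.
  def countReads(sc: SparkContext, inputPath: String): Long = {
    val (reads, sequences, recordGroups) = sc.loadAlignments(inputPath)
    println(s"Sequence dictionary: $sequences")
    println(s"Record groups: $recordGroups")
    reads.count()
  }
}
```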