Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1192] Correctly handle other whitespace in FASTA description. #1198

Merged
merged 1 commit into from
Oct 8, 2016

Conversation

fnothaft
Copy link
Member

@fnothaft fnothaft commented Oct 6, 2016

Resolves #1192. More broadly handles whitespace in the FASTA description, and also disregards FASTA metadata which is inlined.

Testing this on the cluster as we speak.

@@ -30,8 +30,8 @@ class Fasta2ADAMSuite extends ADAMFunSuite {
val contigFragments = sc.loadParquetContigFragments(convertPath)
assert(contigFragments.rdd.count() === 26)
val first = contigFragments.rdd.first()
assert(first.getContig.getContigName === "gi|224384749|gb|CM000682.1|")
assert(first.getDescription === "Homo sapiens chromosome 20, GRCh37 primary reference assembly")
assert(first.getContig.getContigName === null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand, the string before the first space is typically the sequence name in FASTA format. You might want to compare to the various open-bio projects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, gi is the Entrez ID, and gb is the GenBank ID. I assure you that they are not the sequence name.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1523/

Build result: FAILURE

GitHub pull request #1198 of commit 0c98c3c automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1198/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 73f1a1cb0d8e8dee23d6ebf6b35bcb9b8881060b # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1198/merge^{commit} # timeout=10Checking out Revision 73f1a1cb0d8e8dee23d6ebf6b35bcb9b8881060b (origin/pr/1198/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 73f1a1cb0d8e8dee23d6ebf6b35bcb9b8881060bFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Resolves bigdatagenomics#1192. More broadly handles whitespace in the FASTA description, and
also disregards FASTA metadata which is inlined.
@fnothaft
Copy link
Member Author

fnothaft commented Oct 6, 2016

Whoops! Forgot to commit the test resource. Just pushed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1524/
Test PASSed.

@fnothaft
Copy link
Member Author

fnothaft commented Oct 7, 2016

With this, recomputing the MD tags worked on the cluster with GRCh38.

Copy link
Member

@heuermh heuermh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fully convinced of the changes here, but if they properly support the GRCh38 MD tag use case, then I'm +1.

@heuermh heuermh merged commit d70e68b into bigdatagenomics:master Oct 8, 2016
@heuermh
Copy link
Member

heuermh commented Oct 8, 2016

Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants