Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements to SAM reading and processing #2280

Merged
merged 2 commits into from
Nov 5, 2020

Conversation

benraha
Copy link
Contributor

@benraha benraha commented Nov 4, 2020

These changes significantly affect the performance of processing SAM/BAM files (x60 in my test cases).

The changes:

  1. StringBuilder usage instead of many String.format, and using native Scala mkString.
  2. While using the object SAMSequenceDictionary from the Samtools package, we pass a Scala implementation of an iterator, which, unfortunately, takes O(n) time to check its length.

Keep up the great work!

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@heuermh
Copy link
Member

heuermh commented Nov 4, 2020

Jenkins, test this please

@heuermh
Copy link
Member

heuermh commented Nov 4, 2020

Jenkins, add to whitelist

@heuermh
Copy link
Member

heuermh commented Nov 4, 2020

Hello @benraha!

Thank you for this pull request, that sounds like an impressive gain!

Please give me a heads up when you believe this is ready for review and merge.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/3129/

Build result: FAILURE

[...truncated 5 lines...]Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > git rev-parse origin/pr/2280/merge^{commit} # timeout=10 > git branch -a -v --no-abbrev --contains 61a1585 # timeout=10Checking out Revision 61a1585 (origin/pr/2280/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 61a1585bacd0628cb688f5810739383657c2d6b5First time build. Skipping changelog.Triggering ADAM-prb ? 2.7.5,2.12,3.0.1,ubuntuTriggering ADAM-prb ? 2.7.5,2.11,2.4.7,ubuntuTriggering ADAM-prb ? 2.7.5,2.12,2.4.7,ubuntuADAM-prb ? 2.7.5,2.12,3.0.1,ubuntu completed with result SUCCESSADAM-prb ? 2.7.5,2.11,2.4.7,ubuntu completed with result SUCCESSADAM-prb ? 2.7.5,2.12,2.4.7,ubuntu completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@benraha
Copy link
Contributor Author

benraha commented Nov 5, 2020

Jenkins, test this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/3130/
Test PASSed.

@benraha
Copy link
Contributor Author

benraha commented Nov 5, 2020

@heuermh Hi!

I'm ready to review and merge this one. Let me know if you prefer two separate PR since I'm touching two not related parts of the code.

@@ -184,7 +184,7 @@ class SequenceDictionary(val records: Vector[SequenceRecord]) extends Serializab
* @return Returns a SAM formatted sequence dictionary.
*/
def toSAMSequenceDictionary: SAMSequenceDictionary = {
new SAMSequenceDictionary(records.iterator.map(_.toSAMSequenceRecord).toList)
new SAMSequenceDictionary(records.map(_.toSAMSequenceRecord).asJava)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch, did you have profiling results leading you here, or were you able to eyeball it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Profiler, of course!

@@ -36,16 +36,20 @@ case class Attribute(tag: String, tagType: TagType.Value, value: Any) {
val byteSequenceTypes = Array(TagType.NumericByteSequence, TagType.NumericUnsignedByteSequence)
val intSequenceTypes = Array(TagType.NumericIntSequence, TagType.NumericUnsignedIntSequence)
val shortSequenceTypes = Array(TagType.NumericShortSequence, TagType.NumericUnsignedShortSequence)
val sb = new StringBuilder
sb.append(tag)
sb.append(":")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could be all one line for code style purposes, but I don't mind either way

@heuermh
Copy link
Member

heuermh commented Nov 5, 2020

Let me know if you prefer two separate PR since I'm touching two not related parts of the code.

I am ok with that, will squash and merge the commits. Thank you again, @benraha!

@heuermh heuermh merged commit 61c4809 into bigdatagenomics:master Nov 5, 2020
@heuermh heuermh added this to the 0.33.0 milestone Nov 5, 2020
@benraha
Copy link
Contributor Author

benraha commented Nov 5, 2020

Thanks! My pleasure!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants