Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve SW performance by replacing functional reductions with imperative ones #965

Conversation

noamBarkai
Copy link
Contributor

while running Smith-Waterman on large data sets over Spark we noticed "hot spots" in Scala's for-loops, hence the change below replaces the elegant functional calculations with the far less-elegant but slightly more efficient while-loops.
We've seen a ~30% improvement in performance with this change.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@noamBarkai noamBarkai force-pushed the feature/improve-sw-performance branch 2 times, most recently from 43ae08f to baba084 Compare March 2, 2016 10:10
@heuermh
Copy link
Member

heuermh commented Mar 3, 2016

Jenkins, test this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1102/

Build result: FAILURE

GitHub pull request #965 of commit baba084 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/965/merge^{commit} # timeout=10 > git branch -a --contains 8cc5c6d # timeout=10 > git rev-parse remotes/origin/pr/965/merge^{commit} # timeout=10Checking out Revision 8cc5c6d (origin/pr/965/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 8cc5c6d1508118846a1ddc4f4dd3decbc17a7994First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member

fnothaft commented Mar 3, 2016

Thanks @noamBarkai! LGTM! I'll look at the failing tests a bit later. Always amusing to run into while being much more efficient than for in Scala.

@heuermh
Copy link
Member

heuermh commented Mar 13, 2016

Jenkins, retest this please

@heuermh
Copy link
Member

heuermh commented Mar 13, 2016

I saw a 40% speed up using this simple example on 36000 comparisons over 1024 executors.

Curious, are you using sw for biological sequences? I ask because our current implementation does not have support for alphabets or substitution matrices.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1103/

Build result: FAILURE

GitHub pull request #965 of commit baba084 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/965/merge^{commit} # timeout=10 > git branch -a --contains 8cc5c6d # timeout=10 > git rev-parse remotes/origin/pr/965/merge^{commit} # timeout=10Checking out Revision 8cc5c6d (origin/pr/965/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 8cc5c6d1508118846a1ddc4f4dd3decbc17a7994First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@noamBarkai
Copy link
Contributor Author

@heuermh - glad to hear you've reproduced the performance improvement. 👍
re your question - yes, we are using it for biological sequences. for the purposes we're using it for, alphabets and substitution matrices are of less concern.

@heuermh
Copy link
Member

heuermh commented Mar 22, 2016

@fnothaft @massie is this just failing in Jenkins because of the code formatter? I'm not sure what I'm seeing wrong in the console.

@massie
Copy link
Member

massie commented Mar 22, 2016

The error is

++ git status --porcelain
+ test -n ' M adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala'
+ echo 'Please run '\''./scripts/format-source'\'''
Please run './scripts/format-source'
+ exit 1
Build step 'Execute shell' marked build as failure
Recording test results
Finished: FAILURE

which, as you say, looks like an issue with the source formatting of SmithWaterman.scala. It's odd that it would only show up in a single build of the matrix.

Check that your git repo is clean and then run ./scripts/format-source and see if any files are modified.

@heuermh
Copy link
Member

heuermh commented Mar 22, 2016

The 2.10/1.5.2/2.6.0 configuration is the first build to run, if that fails the whole build stops.

$ ./scripts/format-source
$ git diff .
diff --git a/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala b/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala
index d4e3930..83179a9 100644
--- a/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala
+++ b/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala
@@ -47,25 +47,25 @@ abstract class SmithWaterman(xSequence: String, ySequence: String) extends Seria
    * @param matrix Matrix to score.
    * @return Tuple of (i, j) coordinates.
    */
-    private[smithwaterman] final def maxCoordinates(matrix: Array[Array[Double]]): (Int, Int) = {
-      var xMax = 0
-      var yMax = 0
-      var max = Double.MinValue
-      var x = 0
-      while (x < matrix.length) {
-        var y = 0
-        while (y < matrix(x).length) {
-          if (matrix(x)(y) >= max) {
-            max = matrix(x)(y)
-            xMax = x
-            yMax = y
-          }
-          y += 1
+  private[smithwaterman] final def maxCoordinates(matrix: Array[Array[Double]]): (Int, Int) = {
+    var xMax = 0
+    var yMax = 0
+    var max = Double.MinValue
+    var x = 0
+    while (x < matrix.length) {
+      var y = 0
+      while (y < matrix(x).length) {
+        if (matrix(x)(y) >= max) {
+          max = matrix(x)(y)
+          xMax = x
+          yMax = y
         }
-        x += 1
+        y += 1
       }
-      (yMax, xMax)
+      x += 1
     }
+    (yMax, xMax)
+  }

@noamBarkai could you run scripts/format-source and then update this pull request? Thank you.

@noamBarkai noamBarkai force-pushed the feature/improve-sw-performance branch from baba084 to eb8f6d4 Compare March 22, 2016 18:12
@noamBarkai
Copy link
Contributor Author

sorry about that, done.
I amended this additional change to the previous commit and force-pushed so as not to "litter" the commit log, but the formatting change is there.

@heuermh
Copy link
Member

heuermh commented Mar 22, 2016

Perfect, thank you. Will merge once Mr Jenkins gives his approval.

@heuermh
Copy link
Member

heuermh commented Mar 22, 2016

Jenkins, retest this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1111/
Test PASSed.

@heuermh
Copy link
Member

heuermh commented Mar 22, 2016

Thanks for the contribution! Merged in commit a505bb9

@heuermh heuermh closed this Mar 22, 2016
@noamBarkai
Copy link
Contributor Author

cool! thanks.

On Tue, Mar 22, 2016 at 9:00 PM, Michael L Heuer notifications@github.com
wrote:

Thanks for the contribution! Merged in commit a505bb9
a505bb9


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#965 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants