-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-4797] Replace breezeSquaredDistance #3643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi, it looks like this may be faster for dense vectors but not for sparse. SparseVector.toArray will create a dense vector, making it much slower if the vector is very sparse. You will probably need separate cases for DenseVector and SparseVector. |
|
Thanks. I add the consideration for different cases of SparseVector and DenseVector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.intersect does not know that the indices are sorted, so it may be much faster to compute the intersection explicitly. That is done in BLAS.dot; I would recommend following that pattern. That will keep you from having 3 separate iterations over the indices.
In general, I wonder if these methods would be much faster if you iterated over counters & used while loops, rather than built-in iterator methods like intersect/diff/foreach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I will modify this PR to follow the pattern used in BLAS.dot.
Seems that while-loop has better performance advantage than foreach. I add a commit to replace foreach with while-loop.
|
Thanks for the update! I added some comments. One more about public APIs: The new method for vectorSquaredDistance is public, which requires some careful thought about API design. I'd recommend either: |
|
Thanks for that. I add new commit to make the methods private now. |
|
Hi, intersect, diff and foreach are all replaced with while-loop in the new commit to follow BLAS.dot pattern. Please see if there is any problem. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this loop logic correct? What if v1 runs out of indices, but v2 has some left. You'll need a check for this after the loop. But a better way to write it might be to have the outer loop check kv1 < nnzv1 || kv2 < nnzv2 and handle 1 non-zero index per iteration of the loop. (That would also let you have only 1 line updating squaredDistance
|
@viirya Thanks for the updates! I made some inline comments, one of them major. Please let me know when to check again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Modified in later commit.
|
@jkbradley Thanks. The codes are modified for your comments. The test is also expanded to test the case of the major comment you mentioned. Please check it again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to check (kv1 < nnzv1 && kv2 < nnzv2) here since it will always be true (because of the previous 2 if statements).
|
Just added more comments, including 1 small bug |
|
A high-level comment: We recently merged couple PRs that optimize linear algebra performance because breeze is slow at certain operations. But I always think it would be nice if we can optimize breeze directly unless we want to provide a full-functional linear algebra package inside mllib. If this optimization could easily go into breeze, I would suggest that direction. |
|
Thanks @mengxr. Not quite familiar with breeze. But as I roughly go through distance metric implementations of breeze. They are following the same pattern that employs zip operation. This way can address distance metric problem in a consistent approach. But the efficiency might be not its main consideration. That seems to be matching with the goal of breeze project to be generic, clean, and powerful without sacrificing (much) efficiency. To adopt this optimization in breeze will inevitably contradict breeze's principle and design. As I can see, this optimization may not easily go into breeze. That is my rough idea. |
|
Test build #552 has finished for PR 3643 at commit
|
|
It looks like there are bugs which show up in the pyspark tests. Have you run them locally? |
|
I have not run pyspark tests. Fixed in update. |
|
Please test it again. |
|
Test build #553 has finished for PR 3643 at commit
|
|
@jkbradley Is there any problem you concern? Is this ready to merge? Thanks. |
|
@viirya We recently added some linear algebra utility functions to |
|
@mengxr The implementation is renamed and moved to |
|
add to whitelist |
|
test this please |
|
I ran some quick tests with random sparsity patterns. Averaged over 1000 iterations, it's definitely faster:
|
|
@viirya Thanks for the updates! LGTM pending Jenkins |
|
Test build #24964 has finished for PR 3643 at commit
|
|
Merged into master. Thanks! (minor TODO: Though |
|
Thanks. I will add the unit tests in a later PR soon. |
Related to #3643. Follow the previous suggestion to add unit test for `sqdist` in `VectorsSuite`. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3869 from viirya/sqdist_test and squashes the following commits: fb743da [Liang-Chi Hsieh] Modified for comment and fix bug. 90a08f3 [Liang-Chi Hsieh] Modified for comment. 39a3ca6 [Liang-Chi Hsieh] Take care of special case. b789f42 [Liang-Chi Hsieh] More proper unit test with random sparsity pattern. c36be68 [Liang-Chi Hsieh] Add unit test for sqdist.
This PR replaces slow breezeSquaredDistance.