Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize the cosine_similarity_top_k function performance #8151

Merged

Conversation

misrasaurabh1
Copy link
Contributor

@misrasaurabh1 misrasaurabh1 commented Jul 23, 2023

Optimizing important numerical code and making it run faster.

Performance went up by 1.48x (148%). Runtime went down from 138715us to 56020us

Optimization explanation:

The cosine_similarity_top_k function is where we made the most significant optimizations.
Instead of sorting the entire score_array which needs considering all elements, np.argpartition is utilized to find the top_k largest scores indices, this operation has a time complexity of O(n), higher performance than sorting. Remember, np.argpartition doesn't guarantee the order of the values. So we need to use argsort() to get the indices that would sort our top-k values after partitioning, which is much more efficient because it only sorts the top-K elements, not the entire array. Then to get the row and column indices of sorted top_k scores in the original score array, we use np.unravel_index. This operation is more efficient and cleaner than a list comprehension.

The code has been tested for correctness by running the following snippet on both the original function and the optimized function and averaged over 5 times.

def test_cosine_similarity_top_k_large_matrices():
    X = np.random.rand(1000, 1000)
    Y = np.random.rand(1000, 1000)
    top_k = 100
    score_threshold = 0.5
    gc.disable()
    counter = time.perf_counter_ns()
    return_value = cosine_similarity_top_k(X, Y, top_k, score_threshold)
    duration = time.perf_counter_ns() - counter
    gc.enable()

@hwaking @hwchase17 @jerwelborn

Unit tests pass, I also generated more regression tests which all passed.

@misrasaurabh1
Copy link
Contributor Author

The mypy isn't able to correctly detect the return type, the types should still be compatible. How should I fix the mypy warming?

Manually verified that the types work
@vercel
Copy link

vercel bot commented Jul 24, 2023

@misrasaurabh1 is attempting to deploy a commit to the LangChain Team on Vercel.

A member of the Team first needs to authorize it.

@vercel
Copy link

vercel bot commented Jul 24, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Jul 24, 2023 6:04pm

scores = score_array.flatten()[top_idxs].tolist()
return ret_idxs, scores
score_array[score_array < score_threshold] = 0
top_k = min(top_k or len(score_array), np.count_nonzero(score_array))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will np.count_nonzero be strictly less than len(score_array)? in which case latter isn't needed?

Copy link
Contributor Author

@misrasaurabh1 misrasaurabh1 Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that top_k can also be None, so the or len(score_array) is such that the other np.count_nonzero(score_array) always wins min. A None in min leads to an exception.
When top_k is not None, the len is just ignored.

@dosubot dosubot bot added Ɑ: embeddings Related to text embedding models module 🤖:improvement Medium size change to existing code to handle new use-cases labels Jul 24, 2023
@baskaryan
Copy link
Collaborator

thanks @misrasaurabh1!

@baskaryan baskaryan merged commit db9d5b2 into langchain-ai:master Jul 27, 2023
hinthornw pushed a commit that referenced this pull request Jul 27, 2023
Optimizing important numerical code and making it run faster.

Performance went up by 1.48x (148%). Runtime went down from 138715us to
56020us

Optimization explanation:

The `cosine_similarity_top_k` function is where we made the most
significant optimizations.
Instead of sorting the entire score_array which needs considering all
elements, `np.argpartition` is utilized to find the top_k largest scores
indices, this operation has a time complexity of O(n), higher
performance than sorting. Remember, `np.argpartition` doesn't guarantee
the order of the values. So we need to use argsort() to get the indices
that would sort our top-k values after partitioning, which is much more
efficient because it only sorts the top-K elements, not the entire
array. Then to get the row and column indices of sorted top_k scores in
the original score array, we use `np.unravel_index`. This operation is
more efficient and cleaner than a list comprehension.

The code has been tested for correctness by running the following
snippet on both the original function and the optimized function and
averaged over 5 times.
```
def test_cosine_similarity_top_k_large_matrices():
    X = np.random.rand(1000, 1000)
    Y = np.random.rand(1000, 1000)
    top_k = 100
    score_threshold = 0.5
    gc.disable()
    counter = time.perf_counter_ns()
    return_value = cosine_similarity_top_k(X, Y, top_k, score_threshold)
    duration = time.perf_counter_ns() - counter
    gc.enable()
```

@hwaking @hwchase17 @jerwelborn 

Unit tests pass, I also generated more regression tests which all
passed.
hinthornw pushed a commit that referenced this pull request Jul 27, 2023
Optimizing important numerical code and making it run faster.

Performance went up by 1.48x (148%). Runtime went down from 138715us to
56020us

Optimization explanation:

The `cosine_similarity_top_k` function is where we made the most
significant optimizations.
Instead of sorting the entire score_array which needs considering all
elements, `np.argpartition` is utilized to find the top_k largest scores
indices, this operation has a time complexity of O(n), higher
performance than sorting. Remember, `np.argpartition` doesn't guarantee
the order of the values. So we need to use argsort() to get the indices
that would sort our top-k values after partitioning, which is much more
efficient because it only sorts the top-K elements, not the entire
array. Then to get the row and column indices of sorted top_k scores in
the original score array, we use `np.unravel_index`. This operation is
more efficient and cleaner than a list comprehension.

The code has been tested for correctness by running the following
snippet on both the original function and the optimized function and
averaged over 5 times.
```
def test_cosine_similarity_top_k_large_matrices():
    X = np.random.rand(1000, 1000)
    Y = np.random.rand(1000, 1000)
    top_k = 100
    score_threshold = 0.5
    gc.disable()
    counter = time.perf_counter_ns()
    return_value = cosine_similarity_top_k(X, Y, top_k, score_threshold)
    duration = time.perf_counter_ns() - counter
    gc.enable()
```

@hwaking @hwchase17 @jerwelborn 

Unit tests pass, I also generated more regression tests which all
passed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ɑ: embeddings Related to text embedding models module 🤖:improvement Medium size change to existing code to handle new use-cases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants