Optimize the cosine_similarity_top_k function performance #8151

misrasaurabh1 · 2023-07-23T19:34:35Z

This PR optimizes important numerical code to make it run faster.

Runtime went down from 138715us to 56020us, which is about 2.48x as fast (a 148% speedup).

Optimization explanation:

The `cosine_similarity_top_k` function is where we made the most significant optimizations.
Instead of sorting the entire score_array, which costs O(n log n) over all elements, `np.argpartition` is used to find the indices of the top_k largest scores. This runs in O(n) time, which is faster than a full sort. Note that `np.argpartition` doesn't guarantee the order of the selected values, so we call argsort() on just the top-k values after partitioning. This is much cheaper because it sorts only k elements, not the entire array. Finally, to recover the row and column indices of the sorted top_k scores in the original score array, we use `np.unravel_index`, which is more efficient and cleaner than a list comprehension.
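The three steps described above can be sketched on a tiny array. This is a minimal illustration, not the PR's exact code: the helper name `top_k_indices` is hypothetical, and it assumes `1 <= top_k <= score_array.size`.

```python
import numpy as np


def top_k_indices(score_array: np.ndarray, top_k: int):
    """Return (row, col) pairs and scores of the top_k largest entries, best first.

    Hypothetical helper for illustration; assumes 1 <= top_k <= score_array.size.
    """
    flat = score_array.ravel()
    # Partition so the top_k largest values occupy the last top_k slots (unordered).
    top_idxs = np.argpartition(flat, -top_k)[-top_k:]
    # Sort only those top_k candidates, then reverse for descending order.
    top_idxs = top_idxs[np.argsort(flat[top_idxs])][::-1]
    # Map flat positions back to (row, col) coordinates.
    rows, cols = np.unravel_index(top_idxs, score_array.shape)
    pairs = list(zip(rows.tolist(), cols.tolist()))
    return pairs, flat[top_idxs].tolist()


scores = np.array([[0.1, 0.9, 0.3],
                   [0.8, 0.2, 0.7]])
pairs, values = top_k_indices(scores, top_k=3)
print(pairs)   # [(0, 1), (1, 0), (1, 2)]
print(values)  # [0.9, 0.8, 0.7]
```

Only the `top_k` selected entries are ever sorted, so the sorting cost depends on k rather than on the size of the score matrix.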

The code was checked for correctness and benchmarked by running the following snippet on both the original function and the optimized function, with timings averaged over 5 runs.

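import gc
import time

import numpy as np
from langchain.utils.math import cosine_similarity_top_k

def test_cosine_similarity_top_k_large_matrices():
    X = np.random.rand(1000, 1000)
    Y = np.random.rand(1000, 1000)
    top_k = 100
    score_threshold = 0.5
    gc.disable()  # keep garbage collection out of the timed region
    counter = time.perf_counter_ns()
    return_value = cosine_similarity_top_k(X, Y, top_k, score_threshold)
    duration = time.perf_counter_ns() - counter  # elapsed time in nanoseconds
    gc.enable()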
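For readers who want to reproduce the comparison without the library, here is a rough sketch of the same idea: check that a partition-based selection agrees with a full sort, then average timings over 5 runs. The functions `full_sort_top_k`, `partition_top_k`, and `mean_runtime_ns` are hypothetical stand-ins, not the original or optimized `cosine_similarity_top_k`.

```python
import gc
import time

import numpy as np


def full_sort_top_k(flat: np.ndarray, k: int) -> np.ndarray:
    # Reference approach: sort everything, keep the k largest, best first.
    return np.argsort(flat)[::-1][:k]


def partition_top_k(flat: np.ndarray, k: int) -> np.ndarray:
    # Candidate approach: partition first, then sort only the k survivors.
    idxs = np.argpartition(flat, -k)[-k:]
    return idxs[np.argsort(flat[idxs])][::-1]


def mean_runtime_ns(fn, *args, repeats: int = 5) -> float:
    """Average wall-clock time of fn(*args) over `repeats` runs, with GC paused."""
    durations = []
    for _ in range(repeats):
        gc.disable()
        try:
            start = time.perf_counter_ns()
            fn(*args)
            durations.append(time.perf_counter_ns() - start)
        finally:
            gc.enable()  # re-enable even if fn raises
    return sum(durations) / len(durations)


rng = np.random.default_rng(0)
flat = rng.random(100_000)
k = 100

# Correctness: both approaches must select the same scores in the same order.
same = np.array_equal(flat[full_sort_top_k(flat, k)], flat[partition_top_k(flat, k)])
print(same)
print(mean_runtime_ns(full_sort_top_k, flat, k), mean_runtime_ns(partition_top_k, flat, k))
```

Comparing the selected scores rather than the raw indices keeps the check valid even if two entries happen to tie. Absolute timings will vary by machine, so none are quoted here.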

@hwaking @hwchase17 @jerwelborn

Unit tests pass. I also generated more regression tests, which all passed.

misrasaurabh1 · 2023-07-23T23:05:48Z

mypy isn't able to correctly detect the return type, though the types should still be compatible. How should I fix the mypy warning?

Manually verified that the types work
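One possible alternative to silencing the checker, sketched under assumptions: if the declared return type is a list of `(row, col)` integer pairs, the index pairs can be built from plain Python ints and narrowed with `typing.cast`. The helper `index_pairs` and its signature are hypothetical and are not part of the PR.

```python
from typing import List, Tuple, cast

import numpy as np


def index_pairs(flat_idxs: np.ndarray, shape: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Hypothetical helper: convert flat indices to (row, col) pairs of plain ints."""
    rows, cols = np.unravel_index(flat_idxs, shape)
    # tolist() converts NumPy integers to built-in ints before zipping.
    pairs = list(zip(rows.tolist(), cols.tolist()))
    # cast() has no runtime effect; it only tells the type checker the intended type.
    return cast(List[Tuple[int, int]], pairs)


print(index_pairs(np.array([1, 3]), (2, 3)))  # [(0, 1), (1, 0)]
```

A cast keeps the annotation explicit at the one line where inference falls short, whereas an ignore comment suppresses every error reported on that line.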

baskaryan · 2023-07-24T18:16:04Z

libs/langchain/langchain/utils/math.py

-    scores = score_array.flatten()[top_idxs].tolist()
-    return ret_idxs, scores
+    score_array[score_array < score_threshold] = 0
+    top_k = min(top_k or len(score_array), np.count_nonzero(score_array))


Will np.count_nonzero be strictly less than len(score_array)? In which case the latter isn't needed?

misrasaurabh1 · Jul 24, 2023

The problem is that top_k can also be None. The `or len(score_array)` fallback is there so that min() gets a number in that case, and the other argument, np.count_nonzero(score_array), wins the min. Passing None to min raises an exception.
When top_k is not None, the len is simply ignored.
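The effect of the fallback can be seen in isolation. This is a small sketch built around the line from the diff; the wrapper name `effective_top_k` is hypothetical.

```python
import numpy as np


def effective_top_k(score_array: np.ndarray, top_k=None) -> int:
    # `top_k or len(score_array)` swaps None for an integer before min() sees it.
    return min(top_k or len(score_array), np.count_nonzero(score_array))


scores = np.array([[0.0, 0.9, 0.0],
                   [0.8, 0.0, 0.0]])  # entries below the threshold already zeroed

print(effective_top_k(scores, top_k=5))     # 2: only two nonzero scores remain
print(effective_top_k(scores, top_k=None))  # 2: len(scores) == 2 stands in for None

try:
    min(None, np.count_nonzero(scores))
except TypeError as exc:
    print("min() rejects None:", exc)
```

Two details are worth keeping in mind when reading that line: `len` of a 2-D array counts rows rather than elements, and `top_k=0` is also falsy, so it takes the same fallback as None.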

baskaryan · 2023-07-27T01:02:19Z

thanks @misrasaurabh1!

Optimize the cosine_similarity_top_k function performance

e22ca86

Add mypy type:ignore annotation.

6190c8a

Manually verified that the types work

baskaryan reviewed Jul 24, 2023

dosubot bot added the labels "Ɑ: embeddings" (Related to text embedding models module) and "🤖:improvement" (Medium size change to existing code to handle new use-cases) on Jul 24, 2023

baskaryan merged commit db9d5b2 into langchain-ai:master Jul 27, 2023
