Bulk vectorising for bulk search #373

pandu-k · 2023-03-07T05:54:22Z

What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
What is the current behavior? (You can also link to an open issue here)
What is the new behavior (if this is a feature change)?
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
Have unit tests been run against this PR? (Has there also been any additional testing?)
Related Python client changes (link commit/PR here)
Related documentation changes (link commit/PR here)
Other information:
Please check if the PR fulfills these requirements

The commit message follows our guidelines
Tests for the changes have been added (for bug fixes/features)
Docs have been added / updated (for bug fixes / features)

pandu-k · 2023-03-07T05:54:41Z

bulk search tests passed
unit tests: https://github.com/marqo-ai/marqo/actions/runs/4351112004

pandu-k · 2023-03-07T06:52:26Z

to do: max batch size

Jeadie · 2023-03-07T23:08:16Z

src/marqo/tensor_search/tensor_search.py

@@ -1325,18 +1324,30 @@ def _lexical_search(
    return {'hits': res_list}


-def construct_vector_input_batches(query: Union[str, Dict], index_info) -> Union[List[str], List[List[str]]]:
+def construct_vector_input_batches(query: Union[str, Dict], index_info) -> Tuple[List[str], List[str]]:


Perhaps this would be better as Dict[ModalityType, List[str]]. It would be easier to read, and it would make adding other modalities simpler.

Yep, that would make sense. But it might be a while before we add another modality, so we can cross that bridge when we get to it.
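
For reference, a minimal sketch of what the suggested Dict[ModalityType, List[str]] return shape could look like. This is not the code in the PR: ModalityType's members, split_by_modality, and looks_like_image are hypothetical names, and the image check is a simplified stand-in for the real one.

```python
from enum import Enum
from typing import Dict, List, Union


class ModalityType(str, Enum):
    # Hypothetical enum: the name comes from the review comment, the members are assumed.
    TEXT = "text"
    IMAGE = "image"


def looks_like_image(content: str) -> bool:
    # Simplified stand-in for the real image check (assumption: image queries
    # are http(s) URLs ending in a common image extension).
    lowered = content.lower()
    return lowered.startswith(("http://", "https://")) and lowered.endswith(
        (".jpg", ".jpeg", ".png")
    )


def split_by_modality(
    query: Union[str, Dict[str, float]], treat_urls_as_images: bool
) -> Dict[ModalityType, List[str]]:
    # A plain string is a single piece of content; a dict maps content to weight.
    contents = [query] if isinstance(query, str) else list(query.keys())
    # Every modality gets a (possibly empty) batch, so callers never need a KeyError check.
    batches: Dict[ModalityType, List[str]] = {m: [] for m in ModalityType}
    for content in contents:
        is_image = treat_urls_as_images and looks_like_image(content)
        batches[ModalityType.IMAGE if is_image else ModalityType.TEXT].append(content)
    return batches
```

With this shape, adding a modality would mean adding an enum member and a classification rule, rather than widening a positional tuple.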

Jeadie · 2023-03-07T23:43:38Z

src/marqo/tensor_search/tensor_search.py

+                 content
+                ) for content, weight in ordered_queries
+            ]
+            # TODO how do we ensure order?


Yeah, I think a big question in going from qidx_to_job: Dict[Qidx, VectorisedJobPointer] to Dict[Qidx, List[VectorisedJobPointer]] is how we maintain the order of content.

I can't see how this affects the order of content, as the important thing is that each VectorisedJobPointer points to the correct locations. See how pointers are added here.

What are the situations where you think this could fail? I can add some extra tests if you have ideas.
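
To make the ordering question concrete, here is a rough sketch of the pointer idea, under assumptions. JobPointer is a hypothetical stand-in for VectorisedJobPointer (its real fields are not shown in this thread), add_to_job and resolve are made-up helpers, and the "vectors" are just labelled strings.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JobPointer:
    # Hypothetical stand-in for VectorisedJobPointer: a half-open slice
    # [start, end) into the content list of one batched job.
    job_key: str
    start: int
    end: int


def add_to_job(jobs: Dict[str, List[str]], job_key: str, content: List[str]) -> JobPointer:
    # Append content to the shared batch and record where it landed.
    batch = jobs.setdefault(job_key, [])
    start = len(batch)
    batch.extend(content)
    return JobPointer(job_key, start, start + len(content))


def resolve(pointers: List[JobPointer], vectors_by_job: Dict[str, List[str]]) -> List[str]:
    # Collect one query's vectors by following its pointers in list order.
    out: List[str] = []
    for pointer in pointers:
        out.extend(vectors_by_job[pointer.job_key][pointer.start:pointer.end])
    return out


jobs: Dict[str, List[str]] = {}
qidx_to_jobs = {
    0: [add_to_job(jobs, "text", ["a dog"]), add_to_job(jobs, "image", ["dog.png"])],
    1: [add_to_job(jobs, "text", ["a cat", "a bird"])],
}
# Fake vectorisation: each batch is processed once, position by position.
vectors_by_job = {key: [f"vec({c})" for c in batch] for key, batch in jobs.items()}
```

In this sketch each query gets back exactly its own content, because a slice stays valid as long as batches are only ever appended to. Within a single query, though, the result follows the order of the pointer list. If text and image content are interleaved in the original query, that original interleaving is not recoverable from the pointers alone, which may be the case worth testing.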

Jeadie · 2023-03-08T00:37:31Z

src/marqo/tensor_search/tensor_search.py

@@ -1702,8 +1800,8 @@ def _vector_text_search(
    else:  # is dict:
        ordered_queries = list(query.items())
        if index_info.index_settings[NsField.index_defaults][NsField.treat_urls_and_pointers_as_images]:
-            text_queries = [k for k, _ in ordered_queries if _is_image(k)]


It seems this has always been a bug. What effect does this have? Is it just a naming switch?

I think this had no real effect on execution, as the point of this part of the code was just to separate image and text queries from each other. Just a naming switch, as you said. But it caused a problem once the code was refactored into bulk search (there were issues when checking the jobs' content-type).
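
A small illustration of the naming switch, as a sketch only. The _is_image below is a simplified stand-in for the real helper, and the "intended" lines are an assumption about the fix, since only the removed line appears in the diff above.

```python
def _is_image(content: str) -> bool:
    # Simplified stand-in (assumption): treat anything with an image extension as an image.
    return content.lower().endswith((".jpg", ".jpeg", ".png"))


ordered_queries = [("a dog", 1.0), ("https://example.com/cat.png", 0.5)]

# The removed line: the list is called text_queries but is filtered with _is_image,
# so it actually holds the image content.
mislabelled_text_queries = [k for k, _ in ordered_queries if _is_image(k)]

# Presumed intent: each name matches the predicate used to build it.
text_queries = [k for k, _ in ordered_queries if not _is_image(k)]
image_queries = [k for k, _ in ordered_queries if _is_image(k)]
```

As long as the two lists are only used to keep the groups apart, the swap is harmless. It starts to matter as soon as something downstream trusts the label, such as a check on a job's content type.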

pandu-k added 6 commits March 7, 2023 12:23

added notes/skeleton funcs

14e4435

Added bulk vectorising module

5c9b308

Logic is added for splitting content by type

6c9017e

created get_query_vectors_from_jobs()

b153aa1

fixed bug checking content_type

8bdcb9e

made bulk search tests pass

4331b75

pandu-k temporarily deployed to marqo-test-suite March 7, 2023 05:55 — with GitHub Actions Inactive

pandu-k changed the title ~~Pandu/bulk serach vectorising~~ Bulk vectorising for bulk search Mar 7, 2023

pandu-k added 3 commits March 7, 2023 16:57

doc string cleanup

1d0cc70

deleted bulk vectorise.py (not needed now)

88df439

deleted comment

a7400f4

Jeadie reviewed Mar 7, 2023

Jeadie reviewed Mar 8, 2023

pandu-k requested a review from Jeadie March 8, 2023 02:41

Jeadie approved these changes Mar 9, 2023

Merge branch 'jack/bulk_msearch' into pandu/bulk_serach_vectorising

b2fce42

Jeadie merged commit 0bcbd7d into jack/bulk_msearch Mar 9, 2023

Jeadie deleted the pandu/bulk_serach_vectorising branch March 9, 2023 02:18
