Bulk vectorising for bulk search #373

Merged
merged 10 commits into from
Mar 9, 2023

Conversation

pandu-k
Collaborator

@pandu-k pandu-k commented Mar 7, 2023

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

  • What is the current behavior? (You can also link to an open issue here)

  • What is the new behavior (if this is a feature change)?

  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

  • Have unit tests been run against this PR? (Has there also been any additional testing?)

  • Related Python client changes (link commit/PR here)

  • Related documentation changes (link commit/PR here)

  • Other information:

  • Please check if the PR fulfills these requirements

  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes/features)
  • Docs have been added / updated (for bug fixes / features)

@pandu-k
Collaborator Author

pandu-k commented Mar 7, 2023

bulk search tests passed
unit tests: https://github.com/marqo-ai/marqo/actions/runs/4351112004

@pandu-k pandu-k temporarily deployed to marqo-test-suite March 7, 2023 05:55 — with GitHub Actions Inactive
@pandu-k pandu-k changed the title Pandu/bulk serach vectorising Bulk vectorising for bulk search Mar 7, 2023
@pandu-k
Collaborator Author

pandu-k commented Mar 7, 2023

to do: max batch size

@@ -1325,18 +1324,30 @@ def _lexical_search(
return {'hits': res_list}


-def construct_vector_input_batches(query: Union[str, Dict], index_info) -> Union[List[str], List[List[str]]]:
+def construct_vector_input_batches(query: Union[str, Dict], index_info) -> Tuple[List[str], List[str]]:
Contributor

Perhaps this would be better as Dict[ModalityType, List[str]]. It'd be easier to read, and it would make adding other modalities simpler.

Collaborator Author

Yep that would make sense. But it might be a while before we add another modality, and we could cross that bridge when we get to it.
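
For reference, the suggested return shape could look roughly like the sketch below. This is not the code in this PR: the `ModalityType` members, the `looks_like_image` helper (a stand-in for the repository's `_is_image`), and the `group_query_content` function are all hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Dict, List, Union


class ModalityType(str, Enum):
    # Hypothetical members; the review comment only names the type.
    text = "text"
    image = "image"


def looks_like_image(content: str) -> bool:
    """Stand-in for the repository's `_is_image` (assumption: a suffix check is enough here)."""
    return content.lower().endswith((".png", ".jpg", ".jpeg"))


def group_query_content(
    query: Union[str, Dict[str, float]], treat_urls_as_images: bool
) -> Dict[ModalityType, List[str]]:
    """Group query content by modality, keeping the original order within each group."""
    contents = [query] if isinstance(query, str) else list(query.keys())
    batches: Dict[ModalityType, List[str]] = {m: [] for m in ModalityType}
    for content in contents:
        is_image = treat_urls_as_images and looks_like_image(content)
        batches[ModalityType.image if is_image else ModalityType.text].append(content)
    return batches
```

With this shape a caller reads `batches[ModalityType.image]` instead of remembering which tuple position holds images, and a new modality only needs a new enum member.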

content
) for content, weight in ordered_queries
]
# TODO how do we ensure order?
Contributor

Yeah, I think a big question in going from qidx_to_job: Dict[Qidx, VectorisedJobPointer] to Dict[Qidx, List[VectorisedJobPointer]] is how we maintain the order of content.

Collaborator Author

I can't think of how this affects maintaining order of content, as the important thing is that VectorisedJobPointer points to the correct locations. See how pointers are added here.

What are the situations where you think this could fail? I can add some extra tests if you have ideas.
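
To make the ordering question concrete, here is a minimal sketch of the pointer idea under simplified assumptions. `JobPointer` is a hypothetical stand-in for `VectorisedJobPointer` (assumed here to be a half-open slice into one job's content list), and `add_to_job` and `resolve` are illustrative helpers, not functions from this PR.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JobPointer:
    # Hypothetical stand-in for VectorisedJobPointer:
    # a half-open slice [start, end) into one job's content list.
    job_key: str
    start: int
    end: int


def add_to_job(jobs: Dict[str, List[str]], job_key: str, content: List[str]) -> JobPointer:
    """Append content to a job's batch and return a pointer to where it landed."""
    batch = jobs.setdefault(job_key, [])
    start = len(batch)
    batch.extend(content)
    return JobPointer(job_key, start, len(batch))


def resolve(pointers: List[JobPointer], results: Dict[str, List[str]]) -> List[str]:
    """Collect one query's results by following its pointers in order."""
    out: List[str] = []
    for pointer in pointers:
        out.extend(results[pointer.job_key][pointer.start:pointer.end])
    return out


# Two queries share the same "text" job; each keeps its own pointers.
jobs: Dict[str, List[str]] = {}
q0 = [add_to_job(jobs, "text", ["a", "b"]), add_to_job(jobs, "image", ["x.png"])]
q1 = [add_to_job(jobs, "text", ["c"])]

# Fake "vectorisation": any function that returns one result per input, in input order.
results = {key: [item.upper() for item in batch] for key, batch in jobs.items()}
print(resolve(q0, results))
print(resolve(q1, results))
```

In this sketch each query gets back exactly its own content as long as the vectoriser returns one result per input in input order. One case that may be worth a test: results come back grouped by pointer, so if a query interleaves text and image entries, any per-entry weights have to be reordered the same way.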

@@ -1702,8 +1800,8 @@ def _vector_text_search(
else: # is dict:
ordered_queries = list(query.items())
if index_info.index_settings[NsField.index_defaults][NsField.treat_urls_and_pointers_as_images]:
text_queries = [k for k, _ in ordered_queries if _is_image(k)]
Contributor

It seems this has always been a bug. What effect does this have? Is it just a naming switch?

Collaborator Author

I think this had no real effect on the execution, as the point of this part of the code was just to separate image and text queries from each other. Just a naming switch, as you said. But this caused an issue when it was refactored into bulk search (there were issues when checking the jobs' content-type).
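
As an illustration of why the swapped names only matter once a content type is checked, here is a small sketch. `looks_like_image` stands in for `_is_image`, and `split_by_content_type` and `check_job` are hypothetical helpers, not the bulk search code itself.

```python
from typing import Dict, List

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")


def looks_like_image(content: str) -> bool:
    """Stand-in for `_is_image` (assumption: a suffix check is enough for the example)."""
    return content.lower().endswith(IMAGE_SUFFIXES)


def split_by_content_type(query: Dict[str, float]) -> Dict[str, List[str]]:
    """Separate image and text queries, with each list named for what it holds."""
    ordered = list(query)
    return {
        "image": [k for k in ordered if looks_like_image(k)],
        "text": [k for k in ordered if not looks_like_image(k)],
    }


def check_job(content_type: str, content: List[str]) -> None:
    """Raise if a job's declared content type disagrees with its content."""
    for item in content:
        actual = "image" if looks_like_image(item) else "text"
        if actual != content_type:
            raise ValueError(f"{item!r} is {actual}, but the job is declared as {content_type}")
```

If the two lists are merely partitioned and then vectorised, swapping their names is harmless. As soon as something like `check_job("text", image_list)` runs, the mislabelled list fails the check.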

@pandu-k pandu-k requested a review from Jeadie March 8, 2023 02:41
@Jeadie Jeadie merged commit 0bcbd7d into jack/bulk_msearch Mar 9, 2023
@Jeadie Jeadie deleted the pandu/bulk_serach_vectorising branch March 9, 2023 02:18
2 participants