Bulk vectorising for bulk search #373
bulk search tests passed
to do: max batch size
@@ -1325,18 +1324,30 @@ def _lexical_search(
     return {'hits': res_list}

-def construct_vector_input_batches(query: Union[str, Dict], index_info) -> Union[List[str], List[List[str]]]:
+def construct_vector_input_batches(query: Union[str, Dict], index_info) -> Tuple[List[str], List[str]]:
Perhaps this could be better as Dict[ModalityType, List[str]]. It'd be easier to read, and simpler to extend if we add other modalities.
Yep that would make sense. But it might be a while before we add another modality, and we could cross that bridge when we get to it.
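For reference, a minimal sketch of the Dict[ModalityType, List[str]] shape suggested above. `ModalityType`, `looks_like_image`, and `split_by_modality` are hypothetical names used only for illustration, not code from this PR, and the extension check is just a stand-in for the project's real image detection.

```python
from enum import Enum
from typing import Dict, List, Union


class ModalityType(str, Enum):
    # Hypothetical enum; the PR does not define this type.
    text = "text"
    image = "image"


def looks_like_image(content: str) -> bool:
    # Stand-in for the project's image check (assumption: extension-based).
    return content.lower().endswith((".jpg", ".jpeg", ".png"))


def split_by_modality(query: Union[str, Dict[str, float]]) -> Dict[ModalityType, List[str]]:
    # A plain string is treated as a single query; a dict maps content to weight.
    contents = [query] if isinstance(query, str) else list(query.keys())
    batches: Dict[ModalityType, List[str]] = {m: [] for m in ModalityType}
    for content in contents:
        modality = ModalityType.image if looks_like_image(content) else ModalityType.text
        batches[modality].append(content)
    return batches
```

With this shape, adding a modality would mean adding an enum member and a detection branch, while callers keep indexing by key instead of relying on tuple position.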
content
) for content, weight in ordered_queries
]
# TODO how do we ensure order?
Yeah, I think a big question in going from qidx_to_job: Dict[Qidx, VectorisedJobPointer] to Dict[Qidx, List[VectorisedJobPointer]] is how we maintain the order of content.
I can't think of how this affects maintaining order of content, as the important thing is that VectorisedJobPointer points to the correct locations. See how pointers are added here.
What are the situations where you think this could fail? I can make some extra tests if you have ideas.
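To make the ordering question concrete, here is a rough sketch of the idea under discussion, not the PR's implementation. `JobPointer` is a hypothetical stand-in for VectorisedJobPointer (its fields are assumptions), and "vectorising" is faked with string tagging so positions are easy to inspect.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JobPointer:
    # Hypothetical stand-in for VectorisedJobPointer: a slice into one job's content.
    job_key: str
    start_idx: int
    end_idx: int  # exclusive


def add_to_job(jobs: Dict[str, List[str]], job_key: str, content: List[str]) -> JobPointer:
    # Append content to the job and record exactly where it landed.
    batch = jobs.setdefault(job_key, [])
    start = len(batch)
    batch.extend(content)
    return JobPointer(job_key, start, len(batch))


def resolve(pointers: List[JobPointer], job_vectors: Dict[str, List[str]]) -> List[str]:
    # Pointers are read in list order, so the output follows the order they were added.
    out: List[str] = []
    for p in pointers:
        out.extend(job_vectors[p.job_key][p.start_idx:p.end_idx])
    return out


jobs: Dict[str, List[str]] = {}
qidx_to_jobs: Dict[int, List[JobPointer]] = {
    0: [add_to_job(jobs, "text", ["a dog"]), add_to_job(jobs, "image", ["cat.png"])],
    1: [add_to_job(jobs, "image", ["bird.png"]), add_to_job(jobs, "text", ["a fish"])],
}
# Pretend "vectorising" just tags each item, so positions are easy to check.
vectors = {key: [f"vec({c})" for c in batch] for key, batch in jobs.items()}
```

In this sketch, order survives because one pointer is appended per piece of content, in query order. If all of a query's text were grouped under a single pointer ahead of its images, interleaved content (text, image, text) would come back reordered, which may be a case worth an extra test.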
@@ -1702,8 +1800,8 @@ def _vector_text_search(
else: # is dict:
    ordered_queries = list(query.items())
    if index_info.index_settings[NsField.index_defaults][NsField.treat_urls_and_pointers_as_images]:
        text_queries = [k for k, _ in ordered_queries if _is_image(k)]
It seems this has always been a bug. What effect does this have, just a naming switch?
I think this had no real effect on the execution, as the point of this part of the code was just to separate image and text queries from each other. Just a naming switch, as you said. But this caused an issue when it was refactored into bulk search (there were issues when checking for the jobs' content-type)
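A small illustration of why the swapped names only started to matter once jobs carried a content type. This assumes the line shown in the hunk is the pre-fix version; `is_image` and `check_job` are hypothetical helpers, not functions from the codebase.

```python
from typing import List, Tuple


def is_image(content: str) -> bool:
    # Stand-in for _is_image (assumption: extension-based check).
    return content.lower().endswith((".jpg", ".jpeg", ".png"))


def check_job(content_type: str, contents: List[str]) -> bool:
    # A job is consistent when every item matches its declared content type.
    expect_image = content_type == "image"
    return all(is_image(c) == expect_image for c in contents)


ordered_queries: List[Tuple[str, float]] = [("a dog", 1.0), ("cat.png", 0.5)]

# Swapped naming, as in the line under review: images end up under the text name.
swapped_text = [k for k, _ in ordered_queries if is_image(k)]

# Corrected naming: each list holds what its name says.
text_queries = [k for k, _ in ordered_queries if not is_image(k)]
image_queries = [k for k, _ in ordered_queries if is_image(k)]
```

As long as the two lists are only kept apart, the swap is harmless; a check like `check_job("text", swapped_text)` fails as soon as the name is used to decide the content type.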
What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
What is the current behavior? (You can also link to an open issue here)
What is the new behavior (if this is a feature change)?
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
Have unit tests been run against this PR? (Has there also been any additional testing?)
Related Python client changes (link commit/PR here)
Related documentation changes (link commit/PR here)
Other information:
Please check if the PR fulfills these requirements