Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor tensor_search::search to use shared functions from tensor_search::bulk_search #469

Merged
merged 13 commits into from
May 22, 2023

Conversation

Jeadie
Copy link
Contributor

@Jeadie Jeadie commented May 10, 2023

Changeset

  • Bulk search now supports
    • Context vectors
    • Score modifiers
  • Standard search is now refactored to use modular functions used in bulk search.

@Jeadie Jeadie marked this pull request as draft May 10, 2023 01:08
index_info.get_vector_properties().keys()
))

# Validation for offset (pagination is single field) if offset not provided, validation is not run.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving into here so it can be used by both search/ bulk_search

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps be moved to a separate validation function (may be out-of-scope for this refactor)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, but this is just a move from within the script.



def construct_msearch_body_elements(vector_properties_to_search: List[str], offset: int, filter_string: str, index_info: IndexInfo, result_count: int, query_vector: List[float], attributes_to_retrieve: List[str], index_name: str, contextualised_filter: str) -> List[Dict[str, Any]]:
def construct_msearch_body_elements(searchableAttributes: List[str], offset: int, filter_string: str, index_info: IndexInfo, result_count: int, query_vector: List[float], attributes_to_retrieve: List[str], index_name: str, score_modifiers: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change lets:

  1. search use construct_msearch_body_elements
  2. Bulk search to use score modifiers

@@ -1419,14 +1406,17 @@ def bulk_msearch(config: Config, body: List[Dict]) -> List[Dict]:
except KeyError as e:
# KeyError indicates we have received a non-successful result
try:
if "index.max_result_window" in response["responses"][0]["error"]["root_cause"][0]["reason"]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just makes it cleaner, and adds
elif root_cause_type == 'query_shard_exception' and root_cause_reason.startswith("Failed to parse query") to replicate what is in search.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to clean this up 👍

merged_vector = np.mean(weighted_vectors, axis=0)

custom_tensors = q.get_context_tensor()
if custom_tensors is not None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

custom tensor support for bulk_search (and search to use this method).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we call this var context_tensors to be consistent?

@@ -1647,6 +1649,25 @@ def create_empty_query_response(queries: List[BulkSearchQueryEntity]) -> List[Di
)
)

def run_vectorise_pipeline(config: Config, queries: List[BulkSearchQueryEntity], selected_device: Union[Device, str]) -> Dict[Qidx, List[float]]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This abstracts the query -> vector pipeline so we can use it for (bulk) search

@Jeadie
Copy link
Contributor Author

Jeadie commented May 10, 2023

@Jeadie Jeadie marked this pull request as ready for review May 10, 2023 01:31
@Jeadie
Copy link
Contributor Author

Jeadie commented May 12, 2023

src/marqo/tensor_search/models/api_models.py Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 main points:

  1. Make pydantic models immutable in 99.9%, unless there's a good reason not too
  2. Lots of unit tests, including edge cases, for all logic

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including object methods. Even super trivial unit tests help Marqo more stable

src/marqo/tensor_search/tensor_search.py Show resolved Hide resolved
Copy link
Contributor

@vicilliar vicilliar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! just added some questions

src/marqo/tensor_search/tensor_search.py Show resolved Hide resolved
src/marqo/tensor_search/tensor_search.py Show resolved Hide resolved
@Jeadie Jeadie temporarily deployed to marqo-test-suite May 16, 2023 01:12 — with GitHub Actions Inactive
src/marqo/tensor_search/tensor_search.py Show resolved Hide resolved
index_info.get_vector_properties().keys()
))

# Validation for offset (pagination is single field) if offset not provided, validation is not run.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps be moved to a separate validation function (may be out-of-scope for this refactor)

@@ -1419,14 +1406,17 @@ def bulk_msearch(config: Config, body: List[Dict]) -> List[Dict]:
except KeyError as e:
# KeyError indicates we have received a non-successful result
try:
if "index.max_result_window" in response["responses"][0]["error"]["root_cause"][0]["reason"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to clean this up 👍

merged_vector = np.mean(weighted_vectors, axis=0)

custom_tensors = q.get_context_tensor()
if custom_tensors is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we call this var context_tensors to be consistent?

merged_vector = np.mean(weighted_vectors, axis=0)
except ValueError as e:
raise errors.InvalidArgError(f"The provided vectors are not in the same dimension of the index."
f"This causes the error when we do `numpy.mean()` over all the vectors.\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's leave out numpy as this is a Marqo implementation detail (just mean is fine).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a copy paste. I want to minimise refactors during a code reshuffle.

Copy link
Collaborator

@pandu-k pandu-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests are looking good

@Jeadie Jeadie temporarily deployed to marqo-test-suite May 19, 2023 04:08 — with GitHub Actions Inactive
)
vector_jobs_tuple: Tuple[Dict[Qidx, List[VectorisedJobPointer]], Dict[JHash, VectorisedJobs]] = create_vector_jobs(queries, config, selected_device)

print(vector_jobs_tuple, create_vector_jobs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm this print statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhh, I ctrl-F ed for prints in my git diff :(

self.job_to_vectors: Dict[JHash, Dict[str, List[float]]] = {
123: {"a test query": [0.5, 0.5]},
456: {"another": [0.5, 0.6]},
789: {"red herring": [0.6, 0.5]}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a multimodal combination field?

Copy link
Collaborator

@pandu-k pandu-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cheers

@pandu-k pandu-k temporarily deployed to marqo-test-suite May 22, 2023 00:26 — with GitHub Actions Inactive
@Jeadie Jeadie merged commit c707686 into mainline May 22, 2023
@Jeadie Jeadie deleted the jack/simplify-search branch May 22, 2023 03:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants