Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dupe IDs are handled when use_existing_tensors=True #390

Merged
merged 8 commits into from
Mar 16, 2023

Conversation

vicilliar
Copy link
Contributor

@vicilliar vicilliar commented Mar 14, 2023

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    2 bug fixes

  • What is the current behavior? (You can also link to an open issue here)

  1. When use_existing_tensors=True and docs are added with duplicate IDs, a MarqoWebError with no status code is thrown.
  2. When a doc with no chunks is replaced with use_existing_tensors=True, a KeyError occurs because it looks for '_source['__chunks']`
  • What is the new behavior (if this is a feature change)?
  1. When use_existing_tensors=True and docs are added with duplicate IDs, docs are added normally. The last doc in the list with the same ID is the one that gets kept.
  2. Chunkless docs simply return empty lists of chunks
  3. MarqoWebError has a status code of 500 by default now.
  4. Unit tests have been added to test duplicate IDs with and without use_existing_tensors and the chunkless docs bug
  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
    No

  • Have unit tests been run against this PR? (Has there also been any additional testing?)
    Yes

  • Please check if the PR fulfills these requirements

  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes/features)
  • Docs have been added / updated (for bug fixes / features)

@vicilliar
Copy link
Contributor Author

@vicilliar
Copy link
Contributor Author

vicilliar commented Mar 14, 2023

@vicilliar vicilliar temporarily deployed to marqo-test-suite March 14, 2023 18:48 — with GitHub Actions Inactive
src/marqo/tensor_search/tensor_search.py Show resolved Hide resolved
@@ -85,6 +86,100 @@ def test_use_existing_tensors_non_existing(self):
document_id="123", show_vectors=True)
self.assertEqual(use_existing_tensors_doc, regular_doc)

tensor_search.delete_index(config=self.config, index_name=self.index_name_1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you delete the index, then you don't really overwrite the previous document

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@vicilliar
Copy link
Contributor Author

vicilliar commented Mar 15, 2023

Reports on bug found by vitus "chunks" key error:

            {"_id": ["bad", "id"], "field_1": "zzz"},
            {"_id": "proper id 2", "field_1": 90}], 2)

error happens when adding this doc repeatedly

This only happens when use_existing_tensors is True after that document ALREADY EXISTS. then it can't find chunks?

# Combine the 2 query results (loop through each doc id)
        combined_result = []
        for doc_id in document_ids:
            # There should always be 2 results per doc.
            result_list = [doc for doc in res["docs"] if doc["_id"] == doc_id]
            if len(result_list) == 0:
                continue
            if len(result_list) not in (2, 0):
                raise errors.InternalError(f"Internal error fetching old documents. "
                                           f"There are {len(result_list)} results for doc id {doc_id}.")

            for result in result_list:
                if result["found"]:
                    doc_in_results = True
>                   if result["_source"]["__chunks"] == []:
E                   KeyError: '__chunks'

src/marqo/tensor_search/tensor_search.py:930: KeyError

@vicilliar
Copy link
Contributor Author

vicilliar commented Mar 15, 2023

things we know:

  1. Does not matter if original doc was placed there with update or replace.
  2. It breaks when field content is int, but not str. It makes the source completely empty.

res_chunks

  '_index': 'my-test-index-1',
  '_primary_term': 1,
  '_seq_no': 0,
  '_source': {},
  '_version': 1,
  'found': True},

res_data

 {'_id': 'proper id',
  '_index': 'my-test-index-1',
  '_primary_term': 1,
  '_seq_no': 0,
  '_source': {'__chunks': [], 'field_1': 5678},
  '_version': 1,
  'found': True}

question: why does res_chunks have no chunks? in what situation would _source be empty?

@vicilliar
Copy link
Contributor Author

Another situation:
"""
Error happens with

  • available_product_codes: List and the one to remove is source_image_url
  • despite treat_urls_as_pointers
  • regardless of model used
    """

@vicilliar
Copy link
Contributor Author

Diagnosis (3/15/23)

The error

_get_documents_for_upsert
    if result["_source"]["__chunks"] == []:
KeyError: '__chunks'

Happens when the following conditions occur

  1. A document is created with no tensor fields, therefore it has no _source["__chunks"]
  2. Add documents is called using use_existing_tensors with the id of the chunkless doc.

A chunkless doc is one of the following:

  1. Doc without a string/list/anything else that gets tensorized as a field.
  • int, bool do not get tensorized
  • using non_tensor_fields on all fields to be tensorized will do this
  • Interestingly, lists should not get tensorized, but they do not have the same problem as int and bool

Solution:
If a doc is chunkless, one of the 2 results from _get_documents_for_upsert that was supposed to have chunks will have empty source like this: '_source': {}

# chunkless result
[{'_id': 'proper id 2',
  '_index': 'my-test-index-1',
  '_primary_term': 1,
  '_seq_no': 0,
  '_source': {},
  '_version': 1,
  'found': True},
  
 {'_id': 'proper id 2',
  '_index': 'my-test-index-1',
  '_primary_term': 1,
  '_seq_no': 0,
  '_source': {'__chunks': [], 'field_2': 123},
  '_version': 1,
  'found': True}]

If empty source is found, set that as res_chunks.

@vicilliar vicilliar temporarily deployed to marqo-test-suite March 15, 2023 18:15 — with GitHub Actions Inactive
@vicilliar vicilliar requested a review from pandu-k March 15, 2023 19:14
src/marqo/tensor_search/tensor_search.py Show resolved Hide resolved
@@ -940,12 +943,14 @@ def _get_documents_for_upsert(
dummy_res = result
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: may be more appropriate to call this something like not_found_res

src/marqo/tensor_search/tensor_search.py Show resolved Hide resolved
@pandu-k pandu-k temporarily deployed to marqo-test-suite March 16, 2023 01:47 — with GitHub Actions Inactive
@pandu-k
Copy link
Collaborator

pandu-k commented Mar 16, 2023

@pandu-k pandu-k temporarily deployed to marqo-test-suite March 16, 2023 02:57 — with GitHub Actions Inactive
@pandu-k pandu-k merged commit fb24235 into mainline Mar 16, 2023
@pandu-k pandu-k deleted the joshua/dupe-id-fix branch March 16, 2023 04:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants