Dupe IDs are handled when use_existing_tensors=True #390

vicilliar · 2023-03-14T18:45:02Z

What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
2 bug fixes
What is the current behavior? (You can also link to an open issue here)

When use_existing_tensors=True and docs are added with duplicate IDs, a MarqoWebError with no status code is thrown.
When a doc with no chunks is replaced with use_existing_tensors=True, a KeyError occurs because the code looks up `_source['__chunks']`, which does not exist for that doc.

What is the new behavior (if this is a feature change)?

When use_existing_tensors=True and docs are added with duplicate IDs, docs are added normally. The last doc in the list with the same ID is the one that gets kept.
Chunkless docs simply return empty lists of chunks
MarqoWebError has a status code of 500 by default now.
Unit tests have been added to test duplicate IDs with and without use_existing_tensors and the chunkless docs bug
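
The "last doc with the same ID wins" behavior can be illustrated with a small standalone sketch. This is not the code from this PR: `dedupe_keep_last` is a hypothetical helper name, and the assumption that only string IDs take part in deduplication (so invalid IDs are left for later validation) is mine.

```python
from typing import Any, Dict, List


def dedupe_keep_last(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop earlier docs that share a string `_id` with a later doc in the batch.

    Hypothetical helper for illustration only; not part of Marqo.
    """
    seen_ids = set()
    kept_reversed = []
    # Walk backwards so the first time an ID is seen, it is its last occurrence.
    for doc in reversed(docs):
        doc_id = doc.get("_id")
        # Assumption: only string IDs are deduplicated. Docs with a missing or
        # invalid ID (e.g. a list) pass through untouched.
        if isinstance(doc_id, str):
            if doc_id in seen_ids:
                continue
            seen_ids.add(doc_id)
        kept_reversed.append(doc)
    # Restore the original relative order of the surviving docs.
    return list(reversed(kept_reversed))


batch = [
    {"_id": "1", "title": "first"},
    {"_id": "2", "title": "other"},
    {"_id": "1", "title": "second"},
]
deduped = dedupe_keep_last(batch)
```

Here `deduped` keeps the doc with `_id` "2" and only the later of the two docs with `_id` "1".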

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
No
Have unit tests been run against this PR? (Has there also been any additional testing?)
Yes
Please check if the PR fulfills these requirements

The commit message follows our guidelines
Tests for the changes have been added (for bug fixes/features)
Docs have been added / updated (for bug fixes / features)

This reverts commit e5c4237.

vicilliar · 2023-03-14T18:45:59Z

Run passed: 268 https://github.com/marqo-ai/marqo/actions/runs/4418255521/jobs/7745155005#logs

vicilliar · 2023-03-14T18:47:39Z

monitor: run 270 https://github.com/marqo-ai/marqo/actions/runs/4419170567

src/marqo/tensor_search/tensor_search.py

pandu-k · 2023-03-14T22:34:39Z

tests/tensor_search/test_add_documents_use_existing_tensors.py

@@ -85,6 +86,100 @@ def test_use_existing_tensors_non_existing(self):
            document_id="123", show_vectors=True)
        self.assertEqual(use_existing_tensors_doc, regular_doc)

+        tensor_search.delete_index(config=self.config, index_name=self.index_name_1)


if you delete the index, then you don't really overwrite the previous document

vicilliar · 2023-03-15T08:14:05Z

Report on the "__chunks" KeyError bug found by vitus:

            {"_id": ["bad", "id"], "field_1": "zzz"},
            {"_id": "proper id 2", "field_1": 90}], 2)

The error happens when this doc is added repeatedly.

This only happens when use_existing_tensors is True and that document ALREADY EXISTS. Then it can't find chunks?

# Combine the 2 query results (loop through each doc id)
        combined_result = []
        for doc_id in document_ids:
            # There should always be 2 results per doc.
            result_list = [doc for doc in res["docs"] if doc["_id"] == doc_id]
            if len(result_list) == 0:
                continue
            if len(result_list) not in (2, 0):
                raise errors.InternalError(f"Internal error fetching old documents. "
                                           f"There are {len(result_list)} results for doc id {doc_id}.")

            for result in result_list:
                if result["found"]:
                    doc_in_results = True
>                   if result["_source"]["__chunks"] == []:
E                   KeyError: '__chunks'

src/marqo/tensor_search/tensor_search.py:930: KeyError

vicilliar · 2023-03-15T10:28:33Z

things we know:

Does not matter if original doc was placed there with update or replace.
It breaks when the field content is an int, but not when it is a str. With an int, the _source of the chunks result comes back completely empty.

res_chunks

  '_index': 'my-test-index-1',
  '_primary_term': 1,
  '_seq_no': 0,
  '_source': {},
  '_version': 1,
  'found': True},

res_data

 {'_id': 'proper id',
  '_index': 'my-test-index-1',
  '_primary_term': 1,
  '_seq_no': 0,
  '_source': {'__chunks': [], 'field_1': 5678},
  '_version': 1,
  'found': True}

question: why does res_chunks have no chunks? in what situation would _source be empty?
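
The failing lookup can be reproduced with plain dictionaries shaped like the two results above. This is a minimal sketch: the `_id` on `res_chunks` is assumed to match `res_data` (it is cut off in the dump), and both helper functions are hypothetical names, not Marqo functions.

```python
# Shapes copied from the two results above (metadata fields omitted).
# Assumption: res_chunks has the same '_id' as res_data.
res_chunks = {"_id": "proper id", "_source": {}, "found": True}
res_data = {
    "_id": "proper id",
    "_source": {"__chunks": [], "field_1": 5678},
    "found": True,
}


def has_no_chunks_strict(result):
    # Mirrors the failing line: indexing raises KeyError when '__chunks' is absent.
    return result["_source"]["__chunks"] == []


def has_no_chunks_tolerant(result):
    # Treats a missing '__chunks' key the same as an empty list.
    return result["_source"].get("__chunks", []) == []


try:
    has_no_chunks_strict(res_chunks)
    strict_error = None
except KeyError as exc:
    strict_error = exc
```

The strict lookup fails only on the result whose `_source` is empty; the tolerant lookup handles both results.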

vicilliar · 2023-03-15T11:05:36Z

Another situation:
"""
Error happens with

available_product_codes: List and the one to remove is source_image_url
despite treat_urls_as_pointers
regardless of model used
"""

vicilliar · 2023-03-15T11:56:48Z

Diagnosis (3/15/23)

The error

_get_documents_for_upsert
    if result["_source"]["__chunks"] == []:
KeyError: '__chunks'

Happens when the following conditions occur

A document is created with no tensor fields, therefore it has no _source["__chunks"]
Add documents is called using use_existing_tensors with the id of the chunkless doc.

A chunkless doc is one of the following:

Doc without a string/list/anything else that gets tensorized as a field.

int, bool do not get tensorized
using non_tensor_fields on all fields to be tensorized will do this
Interestingly, lists should not get tensorized, but they do not have the same problem as int and bool

Solution:
If a doc is chunkless, one of the 2 results from _get_documents_for_upsert that was supposed to have chunks will have empty source like this: '_source': {}

# chunkless result
[{'_id': 'proper id 2',
  '_index': 'my-test-index-1',
  '_primary_term': 1,
  '_seq_no': 0,
  '_source': {},
  '_version': 1,
  'found': True},
  
 {'_id': 'proper id 2',
  '_index': 'my-test-index-1',
  '_primary_term': 1,
  '_seq_no': 0,
  '_source': {'__chunks': [], 'field_2': 123},
  '_version': 1,
  'found': True}]

If a result with an empty _source is found, set that result as res_chunks.
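
A minimal sketch of that rule, using the two results shown above as input. `split_upsert_results` and `chunks_of` are hypothetical names, and the rule that an empty `__chunks` list marks the data result is inferred from the dumps in this thread, not taken from `_get_documents_for_upsert` itself.

```python
from typing import Any, Dict, List, Optional, Tuple

Result = Dict[str, Any]


def split_upsert_results(
    results: List[Result],
) -> Tuple[Optional[Result], Optional[Result]]:
    """Sort the two results for one doc ID into (res_chunks, res_data)."""
    res_chunks = None
    res_data = None
    for result in results:
        source = result["_source"]
        if source == {}:
            # Chunkless doc: the chunks result comes back with an empty source.
            res_chunks = result
        elif source.get("__chunks") == []:
            # Assumption: the data result carries an empty '__chunks' list.
            res_data = result
        else:
            res_chunks = result
    return res_chunks, res_data


def chunks_of(res_chunks: Result) -> list:
    # Chunkless docs simply yield an empty list of chunks.
    return res_chunks["_source"].get("__chunks", [])


results = [
    {"_id": "proper id 2", "_source": {}, "found": True},
    {"_id": "proper id 2", "_source": {"__chunks": [], "field_2": 123}, "found": True},
]
res_chunks, res_data = split_upsert_results(results)
existing_chunks = chunks_of(res_chunks)
```

With this input the empty-source result becomes `res_chunks`, the other becomes `res_data`, and `existing_chunks` is an empty list instead of a KeyError.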

merging mainline

src/marqo/tensor_search/tensor_search.py

pandu-k · 2023-03-16T00:51:08Z

src/marqo/tensor_search/tensor_search.py

@@ -940,12 +943,14 @@ def _get_documents_for_upsert(
                dummy_res = result


nit: may be more appropriate to call this something like not_found_res

src/marqo/tensor_search/tensor_search.py

pandu-k · 2023-03-16T01:48:37Z

Running unit tests: https://github.com/marqo-ai/marqo/actions/runs/4432692586

vicilliar added 5 commits March 14, 2023 19:42

added dupe id fix

c84f744

fixed reference to overwritten

e5c4237

reverted marqoweberror code to None

76089d2

Revert "fixed reference to overwritten"

07fe79b

This reverts commit e5c4237.

returning status code to 500 and adding overwritten_doc

5d8178d

vicilliar temporarily deployed to marqo-test-suite March 14, 2023 18:48 — with GitHub Actions Inactive

pandu-k reviewed Mar 14, 2023

pandu-k approved these changes Mar 15, 2023

vicilliar added 2 commits March 16, 2023 01:35

bug fix for chunkless docs

40fa30d

Merge branch 'mainline' into joshua/dupe-id-fix

91ac710

merging mainline

vicilliar had a problem deploying to marqo-test-suite March 15, 2023 17:41 — with GitHub Actions Failure

vicilliar temporarily deployed to marqo-test-suite March 15, 2023 18:15 — with GitHub Actions Inactive

vicilliar had a problem deploying to marqo-test-suite March 15, 2023 18:15 — with GitHub Actions Failure

vicilliar requested a review from pandu-k March 15, 2023 19:14

pandu-k reviewed Mar 16, 2023

pandu-k temporarily deployed to marqo-test-suite March 16, 2023 01:47 — with GitHub Actions Inactive

Merge branch 'mainline' into joshua/dupe-id-fix

2581858

pandu-k approved these changes Mar 16, 2023

pandu-k temporarily deployed to marqo-test-suite March 16, 2023 02:57 — with GitHub Actions Inactive

pandu-k merged commit fb24235 into mainline Mar 16, 2023

pandu-k deleted the joshua/dupe-id-fix branch March 16, 2023 04:16
