write_labels crashes for ElasticsearchDocumentStore for labels with long context #2621

Closed
bogdankostic opened this issue Jun 1, 2022 · 3 comments · Fixed by #3346
Labels: Contributions wanted!, topic:document_store, topic:elasticsearch, type:bug

Comments

@bogdankostic
Contributor

Describe the bug
write_labels currently crashes with a BulkIndexError when we try to write labels whose context is longer than 32766 bytes. I suspect this is caused by the mapping we apply in _create_label_index: we probably end up implicitly using type keyword instead of text for the content field of the label's document field.

Error message

<class 'elasticsearch.helpers.errors.BulkIndexError'>, BulkIndexError('25 document(s) failed to index.', [{'index': {'_index': 'label', '_type': '_doc', '_id': '824075a5-9c6c-48af-b41c-d4e10d1d01d7', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document contains at least one immense term in field="document" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms...
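The limit in the error message is measured in bytes of the UTF-8 encoding, not in characters. As a minimal sketch (the helper name `exceeds_term_limit` is made up for illustration and is not part of Haystack or Elasticsearch), a context can be checked against the limit like this:

```python
# Maximum UTF-8 length of a single indexed term, as quoted in the error message.
MAX_TERM_BYTES = 32766


def exceeds_term_limit(text: str, limit: int = MAX_TERM_BYTES) -> bool:
    """Return True if the UTF-8 encoding of `text` is longer than `limit` bytes."""
    return len(text.encode("utf-8")) > limit


if __name__ == "__main__":
    print(exceeds_term_limit("a" * 32766))  # exactly at the limit
    print(exceeds_term_limit("a" * 32767))  # one byte over the limit
    print(exceeds_term_limit("é" * 16384))  # 16384 characters, but 2 bytes each
```

Text with many non-ASCII characters reaches the limit with far fewer characters than plain ASCII text.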

To Reproduce

from haystack.document_stores import ElasticsearchDocumentStore
from haystack.utils import fetch_archive_from_http, launch_es

# Download evaluation data
doc_dir = "data/"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

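# Start a local Elasticsearch instance (launch_es uses Docker) and connect to it
launch_es()
document_store = ElasticsearchDocumentStore()

# Loading the eval data writes the labels, which triggers the BulkIndexError
document_store.add_eval_data(filename="data/nq_dev_subset_v2.json")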
ZanSara added the "Contributions wanted!" label on Jul 19, 2022
@Winterflower

I'm happy to take a look at this tomorrow to see if I can debug it further.

@anakin87
Member

anakin87 commented Sep 7, 2022

@bogdankostic Your intuition points in the right direction.

"mappings": {
    "properties": {
        "query": {"type": "text"},
        "answer": {"type": "flattened"},  # light-weight but less search options than full object
        "document": {"type": "flattened"},

document is mapped with the flattened field type. From the Elasticsearch documentation:

Given an object, the flattened mapping will parse out its leaf values and index them into one field as keywords.

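Under the hood, the document content is therefore treated as a keyword: it is not analyzed or split into tokens, so the whole content becomes a single immense term, which exceeds the 32766-byte maximum term length quoted in the error message.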
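To make the "leaf values indexed as keywords" behaviour concrete, here is a rough model in plain Python. It is only an illustration of the idea, not Elasticsearch's actual implementation, and the helper names and the sample document are hypothetical:

```python
from typing import Any, Iterator, List, Tuple

MAX_TERM_BYTES = 32766  # limit quoted in the error message


def iter_leaf_values(obj: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_path, value_as_string) for every leaf of a nested dict/list."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from iter_leaf_values(value, f"{path}.{key}" if path else key)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_leaf_values(item, path)
    elif obj is not None:
        yield path, str(obj)


def oversized_leaves(obj: Any, limit: int = MAX_TERM_BYTES) -> List[str]:
    """Return the paths of leaves whose UTF-8 encoding is longer than `limit` bytes."""
    return [p for p, v in iter_leaf_values(obj) if len(v.encode("utf-8")) > limit]


# Hypothetical label document; the field names are illustrative only.
document = {"id": "doc-1", "content": "x" * 40000, "meta": {"name": "example"}}
print(oversized_leaves(document))  # → ['content']
```

In this model, every leaf is kept whole as one term, so a single long content string is enough to exceed the limit regardless of how small the other leaves are.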

Possible solutions

  • use a regular object for document (and for answer, since this field will have the same problem).
    This would be a simple and effective fix (I ran some quick tests). However, flattened was probably adopted to prevent performance issues (see the motivation for the flattened data type)...
  • use the ignore_above mapping parameter to skip strings longer than this limit (they are neither indexed nor stored). This doesn't seem like a good solution.
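The two options above could be sketched as mapping variants like this. This is a sketch only: it covers just the three fields quoted in the excerpt, the helper names are hypothetical, and the default of 8191 is an assumption based on `ignore_above` counting characters while the term limit counts bytes (a UTF-8 character can take up to 4 bytes, and 32766 / 4 ≈ 8191).

```python
import copy

# Only the three fields quoted in the excerpt above; the real mapping has more.
FLATTENED_MAPPING = {
    "mappings": {
        "properties": {
            "query": {"type": "text"},
            "answer": {"type": "flattened"},
            "document": {"type": "flattened"},
        }
    }
}


def with_object_fields(mapping: dict) -> dict:
    """Option 1: map `answer` and `document` as regular objects."""
    new = copy.deepcopy(mapping)
    for field in ("answer", "document"):
        new["mappings"]["properties"][field] = {"type": "object"}
    return new


def with_ignore_above(mapping: dict, limit: int = 8191) -> dict:
    """Option 2: keep `flattened`, but skip leaf values longer than `limit` characters."""
    new = copy.deepcopy(mapping)
    for field in ("answer", "document"):
        new["mappings"]["properties"][field] = {"type": "flattened", "ignore_above": limit}
    return new


object_mapping = with_object_fields(FLATTENED_MAPPING)
ignore_above_mapping = with_ignore_above(FLATTENED_MAPPING)
```

Both helpers return a copy, so the original mapping stays available for comparison. Note that with option 2 the long contexts would no longer crash the bulk request, but they also would not be searchable through this field.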

@masci @Winterflower any thoughts on this?
If we can agree on an acceptable solution, I can take on this issue.

@anakin87
Member

Related

In OpensearchDocumentStore we use nested instead of flattened, since flattened is not supported (see also #1609):

mapping = {
    "mappings": {
        "properties": {
            "query": {"type": "text"},
            # In Elasticsearch we use type:flattened, but this is not supported in OpenSearch
            "answer": {"type": "nested"},
            "document": {"type": "nested"},

To solve the current bug, it may be reasonable to use nested in ElasticsearchDocumentStore as well. nested is similar to object, but

allows arrays of objects to be indexed in a way that they can be queried independently of each other.

WDYT?
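For discussion, a minimal sketch of what switching to nested could look like. The function names are hypothetical, only the three fields quoted above are included, and this is not the actual Haystack implementation. One consequence worth keeping in mind is that fields of type nested have to be searched through a `nested` query, so any existing queries on document or answer would need to be wrapped accordingly:

```python
def build_label_mapping(container_type: str = "nested") -> dict:
    """Build the quoted part of the label mapping with a configurable type for `answer`/`document`."""
    if container_type not in {"flattened", "object", "nested"}:
        raise ValueError(f"unsupported field type: {container_type!r}")
    return {
        "mappings": {
            "properties": {
                "query": {"type": "text"},
                "answer": {"type": container_type},
                "document": {"type": container_type},
            }
        }
    }


def nested_query(path: str, inner_query: dict) -> dict:
    """Wrap `inner_query` in a `nested` query for the nested field at `path`."""
    return {"query": {"nested": {"path": path, "query": inner_query}}}


mapping = build_label_mapping("nested")
# `document.content` is the field named in the issue description; the search text is made up.
query = nested_query("document", {"match": {"document.content": "example text"}})
```

Keeping the type as a parameter would make it easy to compare nested against object in quick tests before settling on one.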


4 participants