Indexing is failing on many structure and meta data element #4724

henning-gerhardt · 2021-10-07T13:07:08Z

Indexing a process with many structure elements (> 450) and metadata elements (> 2960) fails with:

[ERROR] 2021-10-07 11:58:09,510 [I/O dispatcher 7] org.kitodo.data.elasticsearch.index.ResponseListener - failure in bulk execution:
[1714]: index [kitodo_process], type [_doc], id [367280], message [ElasticsearchException[Elasticsearch exception [type=mapper_parsing_exception, reason=The number of nested documents has exceeded the allowed limit of [10000]. This limit can be set by changing the [index.mapping.nested_objects.limit] index level setting.]]]

The process in question is already available at https://digital.slub-dresden.de/id1685679609; the issue occurred while re-indexing the process data.


markusweigelt · 2021-10-07T14:21:41Z

@henning-gerhardt @Kathrin-Huber

The parameter "index.mapping.nested_objects.limit" seems to have been added in Elasticsearch 7.0.
https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html#limit-number-nested-json-objects

The question is whether the default is too low for our purposes. Depending on the available server resources, it could be increased. Even at a value of, for example, 30000, the limit still protects against memory errors in a powerful environment. It is not without reason that this is a parameter! ;)

Nevertheless, we have to look at where we can optimize the source code in order to avoid memory errors.

As a quick solution, I would recommend adjusting the parameter if there are enough resources. If there are known optimizations, we should create an issue to improve indexing.

matthias-ronge · 2021-10-08T07:06:31Z

From the Elasticsearch documentation on the nested field type:

> The nested type is a specialised version of the object data type that allows arrays of objects to be indexed in a way that they can be queried independently of each other.

So, if we don’t query the objects independently of each other, we don’t need a nested type here. If I understand the manual correctly, this type internally creates one index object per entry of the nested structure. This makes sense in cases where you are interested not only in whether a token is found within a record, but also in where it is found. For example: when indexing the full OCR text of a book, you want to know from the search result on which page of the book a token was found. In this case, the nested type must be structured so that there is one nested object per page.

As far as I know, we do not use such information in our context, so this isn’t necessary at the moment. However, changing the nested field type in the index will remove this possibility for us in the future, and the possibility might have been intended. I cannot comment on that, since I do not know of any documentation of the index profile.
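The difference between the plain object type and the nested type can be illustrated with a small simulation. This is only a sketch in Python, not Elasticsearch or Kitodo code: the field name `metadata` and the sample values are made up for illustration. It mimics the documented behavior that arrays of plain objects are flattened into multi-valued fields, so a query can match a combination of values that never occurred together in one object.

```python
def flatten_object_array(field, objects):
    """Mimic how a plain `object` field stores an array of objects:
    each sub-field becomes one multi-valued field, so the link between
    values of the same source object is lost."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_flat(flat, field, criteria):
    """Flattened form: each criterion only has to be satisfied by
    some value of its field, not by the same source object."""
    return all(
        value in flat.get(f"{field}.{key}", [])
        for key, value in criteria.items()
    )


def matches_nested(objects, criteria):
    """Nested form: one single object has to satisfy all criteria."""
    return any(
        all(obj.get(key) == value for key, value in criteria.items())
        for obj in objects
    )


# Hypothetical metadata entries of one process (illustrative values only).
metadata = [
    {"name": "TitleDocMain", "value": "Atlas"},
    {"name": "Author", "value": "Smith"},
]
flat = flatten_object_array("metadata", metadata)

# A combination that does not exist in any single entry.
query = {"name": "Author", "value": "Atlas"}
print(matches_flat(flat, "metadata", query))  # True (false positive)
print(matches_nested(metadata, query))        # False
```

If searches never need to tie a metadata name to its value within the same entry, the flattened form is sufficient and no nested documents are created.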

henning-gerhardt · 2021-10-08T07:12:47Z

@markusweigelt As far as I understand this parameter, it influences the behavior on the server side and not on the client side. I would suggest making it configurable through the kitodo_config.properties file, including a good explanation of the parameter and of when it should be changed and when not. The remaining question for me is: how big is the influence of this parameter if we must change it from 10,000 to 30,000 or more? How many more resources (RAM, disk space, ...) are needed?
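A minimal sketch of what reading such a setting could look like, written in Python for brevity. The property name `elasticsearch.index.nestedObjectsLimit` is hypothetical and is not an existing Kitodo.Production key; the fallback of 10000 is the limit named in the error message above.

```python
# Default taken from the error message in this issue.
DEFAULT_NESTED_OBJECTS_LIMIT = 10000

# Hypothetical property name, NOT an existing Kitodo.Production key.
LIMIT_KEY = "elasticsearch.index.nestedObjectsLimit"


def read_nested_objects_limit(properties_text):
    """Return the configured limit, or the default if the key is absent."""
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comment lines
        key, separator, value = line.partition("=")
        if separator and key.strip() == LIMIT_KEY:
            limit = int(value.strip())
            if limit <= 0:
                raise ValueError(f"{LIMIT_KEY} must be positive, got {limit}")
            return limit
    return DEFAULT_NESTED_OBJECTS_LIMIT


example = """
# Raise only if the Elasticsearch server has enough memory.
elasticsearch.index.nestedObjectsLimit = 30000
"""
print(read_nested_objects_limit(example))
```

Falling back to the default when the key is missing keeps existing installations unchanged unless an administrator opts in.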

matthias-ronge · 2021-10-08T07:32:45Z

As I understand it, the parameter is set in the Elasticsearch configuration file. If so, Production could just call a sudo script that edits the configuration file and restarts Elasticsearch. But do we need that?

matthias-ronge · 2021-10-08T07:39:36Z

> How many more resources (RAM, disk space, ...) are needed?

It's hard to say in general. As you can see, a separate index entry is created for each structure element × each metadata entry. For the example document, more than 10,000 index records are created, which is why the error occurs. I think it is possible to increase the parameter a bit for now, but the error points to an improper use of the search engine.
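To find out in advance which processes would exceed the limit, the nested objects of a document could be counted before it is sent to the index. The following Python sketch rests on assumptions: it identifies nested fields by field name only (a real mapping uses full paths), and the field name `metadata` and the sample document are invented for illustration, not taken from the Kitodo mapping.

```python
def count_nested_objects(value, nested_fields, inside_nested=False):
    """Count the JSON objects that sit directly under fields treated as
    `nested`. Simplification: fields are matched by name, not by path."""
    if isinstance(value, list):
        return sum(
            count_nested_objects(item, nested_fields, inside_nested)
            for item in value
        )
    if isinstance(value, dict):
        count = 1 if inside_nested else 0
        for key, child in value.items():
            count += count_nested_objects(
                child, nested_fields, key in nested_fields
            )
        return count
    return 0  # scalars never count as nested objects


def exceeds_limit(document, nested_fields, limit=10000):
    """True if indexing this document would run into the limit."""
    return count_nested_objects(document, nested_fields) > limit


# Invented sample document with one level of metadata inside metadata.
document = {
    "id": 1,
    "metadata": [
        {"name": "a", "metadata": [{"name": "b"}, {"name": "c"}]},
        {"name": "d"},
    ],
}
print(count_nested_objects(document, {"metadata"}))  # 4
print(exceeds_limit(document, {"metadata"}, limit=3))  # True
```

Such a check could be used to log the affected process IDs instead of letting the whole bulk request report a failure.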

markusweigelt · 2021-10-08T09:17:26Z

> [...] The remaining question for me is: how big is the influence of this parameter if we must change it from 10,000 to 30,000 or more? How many more resources (RAM, disk space, ...) are needed?

I think many factors (RAM, disk space, incoming data volume) influence the behavior here. If we want to know exactly, we would have to use Elasticsearch in conjunction with Kibana, Grafana, or a similar tool, provided that is possible in the free version of Elasticsearch. Then we could change the parameter and monitor its influence.

I think that the default of this parameter is based on the minimum requirements of Elasticsearch. If we put these in relation to our available resources, we could in theory raise the value accordingly. I cannot currently find out why this parameter has a default value of 10000 or how this value was determined. It may also be too low in general.

henning-gerhardt · 2022-01-05T13:49:42Z

Setting the parameter "index.mapping.nested_objects.limit" to 30000 with a command like

curl -XPUT "<es-host>:9200/kitodo_process/_settings" -H 'Content-Type: application/json' -d' { "index.mapping.nested_objects.limit" : 30000 }'

temporarily solved the issue, until the Elasticsearch index gets destroyed.

This value must be set after the mapping has been created in Elasticsearch but before the indexing of processes starts; otherwise everything has to be redone. The parameter should be set from within the application instead of by running a curl command on the Elasticsearch server.
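As a rough illustration of setting the value programmatically, the curl call above can be expressed with the Python standard library. This is a sketch only: Kitodo.Production itself is a Java application, and the function names and the base URL `http://localhost:9200` are assumptions made for this example. The request is built separately from sending it, so the step can be placed between creating the mapping and starting the indexing.

```python
import json
import urllib.request

SETTING = "index.mapping.nested_objects.limit"


def build_limit_request(es_url, index, limit):
    """Build the same PUT <es_url>/<index>/_settings call as the curl command."""
    body = json.dumps({SETTING: limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_limit(es_url, index, limit):
    """Send the request; this needs a reachable Elasticsearch instance."""
    request = build_limit_request(es_url, index, limit)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Assumed base URL; replace it with the actual Elasticsearch host.
request = build_limit_request("http://localhost:9200", "kitodo_process", 30000)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `apply_limit` right after the mapping is created would also restore the setting automatically whenever the index is rebuilt.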

henning-gerhardt added 3.x bug labels Oct 7, 2021

solth removed the 3.x label Jul 7, 2022

andre-hohmann mentioned this issue Feb 17, 2023

Consolidation of indexing in Kitodo.Production #5546

Open

matthias-ronge added the search search, filter label Feb 27, 2023

solth added this to Kitodo.Production - Hibernate Search and search index Jul 11, 2023

solth assigned matthias-ronge Jul 21, 2023

solth moved this to Todo in Kitodo.Production - Hibernate Search and search index Jul 21, 2023
