Indexing is failing on many structure and meta data element #4724

Open
henning-gerhardt opened this issue Oct 7, 2021 · 7 comments
Labels
bug search search, filter

@henning-gerhardt
Collaborator

Indexing a process with a lot of structure (> 450) and meta data elements (> 2960) fails with

[ERROR] 2021-10-07 11:58:09,510 [I/O dispatcher 7] org.kitodo.data.elasticsearch.index.ResponseListener - failure in bulk execution:
[1714]: index [kitodo_process], type [_doc], id [367280], message [ElasticsearchException[Elasticsearch exception [type=mapper_parsing_exception, reason=The number of nested documents has exceeded the allowed limit of [10000]. This limit can be set by changing the [index.mapping.nested_objects.limit] index level setting.]]]

The affected process is already published at https://digital.slub-dresden.de/id1685679609; the error occurred while re-indexing its process data.

@markusweigelt
Collaborator

markusweigelt commented Oct 7, 2021

@henning-gerhardt @Kathrin-Huber

The "index.mapping.nested_objects.limit" setting seems to have been introduced in Elasticsearch 7.0:
https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html#limit-number-nested-json-objects

The question is whether the default is too low for our purposes. Depending on the available server resources, it could be increased. Even a higher value such as 30000 would still protect against out-of-memory errors on a powerful machine. There is a reason this is a configurable setting! ;)

Nevertheless, we should look at where the source code can be optimized to avoid memory errors.

As a quick fix, I would recommend raising the limit if enough resources are available. If optimizations are already known, we should create an issue to improve indexing.

@matthias-ronge
Collaborator

matthias-ronge commented Oct 8, 2021

From the Elasticsearch documentation on the nested field type:

The nested type is a specialised version of the object data type that allows arrays of objects to be indexed in a way that they can be queried independently of each other.

So if we do not need to query the objects independently of each other, we do not need a nested type here. If I understand the manual correctly, this type internally creates one index document per entry of the nested structure. This makes sense where you care not only whether a token occurs in a record, but also where. For example, when indexing the full OCR text of a book, you want the search result to tell you on which page a token was found. In that case, the nested type must be modeled so that there is one nested object per page.

To my knowledge, we do not use such information in our context, so this is not necessary at the moment. However, removing the nested field type from the index would take this option away from us in the future, and the option might have been intended. I cannot comment on that, since I do not know of any documentation of the index profile.
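To make the per-entry cost concrete, here is a minimal Python sketch that counts how many objects in a record sit under fields mapped as nested, which is the number that "index.mapping.nested_objects.limit" caps. The field names `structure` and `metadata` and the helper `count_nested_objects` are hypothetical; they are not taken from the actual Kitodo mapping.

```python
def count_nested_objects(doc, nested_paths, path=""):
    """Count the JSON objects stored under fields mapped as `nested`.

    Each such object is indexed as a separate hidden document, and
    index.mapping.nested_objects.limit caps their number per record.
    `nested_paths` is a set of dotted field paths (an assumption about
    how the mapping is described, for illustration only).
    """
    count = 0
    if isinstance(doc, dict):
        for key, value in doc.items():
            child_path = f"{path}.{key}" if path else key
            # Treat a single object like a one-element array.
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict):
                    if child_path in nested_paths:
                        count += 1
                    count += count_nested_objects(item, nested_paths, child_path)
    return count


# Hypothetical record: two structure elements with three metadata entries in total.
record = {
    "title": "example",
    "structure": [
        {"label": "chapter 1", "metadata": [{"key": "a"}, {"key": "b"}]},
        {"label": "chapter 2", "metadata": [{"key": "c"}]},
    ],
}

both_nested = count_nested_objects(record, {"structure", "structure.metadata"})
only_structure = count_nested_objects(record, {"structure"})
```

With both fields mapped as nested the sketch counts five objects; with only `structure` nested it counts two. Fields mapped as plain `object` do not contribute to the limit at all.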

@henning-gerhardt
Collaborator Author

@markusweigelt As far as I understand it, this parameter affects behavior on the server side, not on the client side. I would suggest making it configurable through the kitodo_config.properties file, with a good explanation of what it does and when it should or should not be changed. The remaining question for me is: how big is the impact of this parameter if we have to raise it from 10,000 to 30,000 or more? How many more resources (RAM, disk space, ...) are needed?
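As a rough illustration of what such a configurable value could look like, the following Python sketch reads the limit from properties-style text and falls back to the Elasticsearch default of 10000 named in the error message. The key `elasticsearch.nestedObjectsLimit` and the function `read_nested_objects_limit` are invented for this example and do not exist in kitodo_config.properties; the parsing is deliberately simplified (first match wins, no line continuations).

```python
DEFAULT_NESTED_OBJECTS_LIMIT = 10000  # default reported in the error message


def read_nested_objects_limit(properties_text,
                              key="elasticsearch.nestedObjectsLimit"):
    """Return the configured limit, or the default if the key is absent.

    The key name is hypothetical. Only simple `name = value` lines are
    handled; comment lines start with `#` or `!`.
    """
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            limit = int(value.strip())
            if limit <= 0:
                raise ValueError(f"{key} must be a positive integer")
            return limit
    return DEFAULT_NESTED_OBJECTS_LIMIT


example_config = """
# Raise only if the Elasticsearch server has enough resources.
elasticsearch.nestedObjectsLimit = 30000
"""
configured_limit = read_nested_objects_limit(example_config)
```

Falling back to the server default when the key is missing means existing installations keep their current behavior unless an administrator opts in.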

@matthias-ronge
Collaborator

As I understand it, the parameter lives in the Elasticsearch configuration file. If so, Production could simply call a sudo script that edits the configuration file and restarts Elasticsearch. But do we need that?

@matthias-ronge
Collaborator

How many more resources (RAM, disk space, ...) are needed?

It's hard to say in general. As you can see, a separate index entry is created for each structure element × each metadata entry, so the example document produces more than 10,000 index records, which is why the error occurs. I think raising the limit somewhat is fine for now, but hitting it indicates that we are not using the search engine properly.

@markusweigelt
Collaborator

markusweigelt commented Oct 8, 2021

[...] The remaining question for me is: how big is the impact of this parameter if we have to raise it from 10,000 to 30,000 or more? How many more resources (RAM, disk space, ...) are needed?

I think there are many factors at play here (RAM, disk space, volume of incoming data) that influence the behavior. To know exactly, we would have to run Elasticsearch together with Kibana, Grafana or similar, provided that is possible with the free version of Elasticsearch. Then we could change the parameter and monitor the effect.

I assume the default is based on Elasticsearch's minimum requirements. If we scale it in proportion to our available resources, we could raise it up to a corresponding maximum. So far I have not been able to find out why the default is 10000 or how that value was determined. It may also simply be too low in general.

@henning-gerhardt
Collaborator Author

Setting "index.mapping.nested_objects.limit" to 30000 with a command like

curl -X PUT "<es-host>:9200/kitodo_process/_settings" -H 'Content-Type: application/json' -d '{ "index.mapping.nested_objects.limit": 30000 }'

solved the issue temporarily, until the Elasticsearch index is destroyed.

The value must be set after the mapping has been created in Elasticsearch but before process indexing starts; otherwise everything has to be redone. The parameter should be set by the application itself rather than by running a curl command against the Elasticsearch server.
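One possible way to have the application apply the limit itself is to send it together with the mapping when the index is created, so it is restored whenever the index is rebuilt. The Python sketch below only builds such a create-index request and does not send it. The function `build_create_index_request`, the host URL and the empty mapping are assumptions for illustration and do not reflect how Kitodo actually creates its index.

```python
import json
import urllib.request


def build_create_index_request(host, index, mappings,
                               nested_objects_limit=10000):
    """Build (but do not send) a PUT request that creates an index with
    the nested-objects limit set alongside its mapping.

    Hypothetical helper: names and defaults are illustrative only.
    """
    body = {
        "settings": {
            "index.mapping.nested_objects.limit": nested_objects_limit,
        },
        "mappings": mappings,
    }
    return urllib.request.Request(
        url=f"{host}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder host and mapping; replace with real values before sending.
request = build_create_index_request(
    "http://localhost:9200", "kitodo_process", {"properties": {}}, 30000
)
payload = json.loads(request.data)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without an Elasticsearch server.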
