Indexing of type TASK is extremely slow #5131

Open · matthias-ronge opened this issue May 9, 2022 · 5 comments

matthias-ronge (Collaborator) commented May 9, 2022

On a Linux server, indexing tasks takes roughly 1 minute per 280 tasks (1 h 47 min for 29,930 tasks). Meanwhile, the system is about 60% idle.

Screencast

Although the log level for org.kitodo is set to TRACE, there is no information in the log file. Monitoring the HTTP traffic with

    tcpflow -p -c -i lo port 9200

shows roughly one indexing request per minute. It looks like a timeout is involved that shouldn’t be there …

Screencast

Indexing was started with the Start indexing for all button on the Tasks row.

Additional observations: the number of indexed entries and the progress bar for “Whole index” didn’t update, and the indexing buttons didn’t turn blue again after indexing of the tasks was complete:

(Screenshot: indexing progress)

Goal: Speed up indexing of TASK.

henning-gerhardt (Collaborator) commented
Is the application running behind a reverse proxy such as Apache HTTP Server or nginx? If so, does the proxy forward WebSocket connections to the running Tomcat instance? The indexing page uses a WebSocket to push the current indexing progress to the browser of the client that started the indexing. If the WebSocket requests and data are not forwarded, you only see indexing progress when you manually refresh the indexing page.

Just some hints from a larger installation that may speed up (or slow down) the indexing (some of these would need their own issue):

  • Adjust the values of elasticsearch.batch and elasticsearch.indexLimit up or down, depending on how many rows must be indexed (see the configuration sketch after this list)
  • The value of elasticsearch.indexLimit should be a multiple of elasticsearch.batch, or both values should be the same (this avoids issues during indexing)
  • If you start indexing after the application has been running for a long time and/or many processes have been created, restart the application before indexing: some values are initialized at application start but are not updated while the application runs. These stale values cause problems during later indexing and consume a lot of resources, even if you don’t index everything.
  • If you have a lot of data to index, do it in single steps. If you use the “indexing all” feature, you must keep the indexing page open the whole time and cannot close it, because switching to the next indexing type won’t start otherwise. In that case you must restart the application and index the remaining types one by one.
  • After starting the indexing, refresh the indexing page in your browser. This disables the automatic refresh of the indexing progress, but it reduces the load on Elasticsearch, since about 100 requests per second are sent to Elasticsearch to fetch the current indexing progress.
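
A minimal sketch of what such a tuning change could look like, assuming these properties live in kitodo_config.properties; the concrete values are placeholders and must be adapted to the data volume:

```properties
# Placeholder values; tune to your data volume.
# elasticsearch.indexLimit should be a multiple of elasticsearch.batch
# (or equal to it) to avoid issues during indexing.
elasticsearch.batch=500
elasticsearch.indexLimit=5000
```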

matthias-ronge (Collaborator, Author) commented
This is the common setup with Apache/mod_jk. Yes, I did index row by row, and the issue only showed up when indexing Tasks; everything else was fast, even Processes took only a short time. (At least, I didn’t notice it being slow; but I can test it again in the next days.)

> about 100 requests per second are sent to Elasticsearch to fetch the current indexing progress

I don’t know whether this is still true; I didn’t observe these requests in the HTTP traffic, only roughly one request per second, see the animation above. It looks like this request is checking whether the index has been created. (That could also be cached, by the way.)
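
For reference, checking whether an index exists is a single cheap HEAD request against Elasticsearch, which is also why such a check would be easy to cache. The index name kitodo below is an assumption:

```sh
# Returns HTTP 200 if the index exists, 404 otherwise.
# "kitodo" is a hypothetical index name.
curl -I http://localhost:9200/kitodo
```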

henning-gerhardt (Collaborator) commented
> This is the common setup with Apache/mod_jk.

Apache with mod_jk for forwarding requests to Tomcat worked well until WebSockets came into use. mod_jk cannot forward WebSockets, and it looks to me as if development of mod_jk has stalled. I also had issues with mod_jk when using HTTP/2, so I switched to mod_proxy and its sub-modules for forwarding all requests to Tomcat.
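
A minimal sketch of what such a mod_proxy based setup with WebSocket forwarding could look like, assuming mod_proxy_http and mod_proxy_wstunnel are enabled; the context path /kitodo, the WebSocket endpoint path, and the Tomcat port 8080 are assumptions:

```apache
# Forward WebSocket upgrade requests first (endpoint path is an assumption),
# then everything else over plain HTTP.
ProxyPass        "/kitodo/ws" "ws://localhost:8080/kitodo/ws"
ProxyPassReverse "/kitodo/ws" "ws://localhost:8080/kitodo/ws"
ProxyPass        "/kitodo"    "http://localhost:8080/kitodo"
ProxyPassReverse "/kitodo"    "http://localhost:8080/kitodo"
```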

> about 100 requests per second are sent to Elasticsearch to fetch the current indexing progress

> I don’t know whether this is still true; I didn’t observe these requests in the HTTP traffic, only roughly one request per second, see the animation above. It looks like this request is checking whether the index has been created. (That could also be cached, by the way.)

I only see the number of requests to Elasticsearch because I use a similar setup in front of Elasticsearch as on our application server. Maybe you no longer see these requests because you refreshed the indexing page beforehand. Or it depends on your Apache/mod_jk setup, which is “blocking” the WebSockets.

matthias-ronge (Collaborator, Author) commented
The incomplete refresh of the indexing page is definitely also something we should look into, but the primary concern in this issue is that indexing the tasks is so slow. That is: Why is only one indexing request per minute sent for tasks? Who is waiting there, for what, and why?

henning-gerhardt (Collaborator) commented
On our system, or in its logs, I only see bulk requests that contain the content and the number of entries to be indexed (in our case, 2,500 entries per bulk request). It takes some time to retrieve the data from the database (this could be faster, but it is okay), a lot of time to convert the database objects into JSON objects (which indeed needs a speed improvement, if that is possible), and a short time to send the data to the Elasticsearch instance (this is done in an asynchronous request).
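
For illustration only, and not Kitodo’s actual code: a minimal Java sketch of such batched, asynchronous bulk indexing with the Elasticsearch high-level REST client. The index name, the batch size, the Task stand-in class, and the trivial JSON conversion are all assumptions:

```java
import java.util.List;

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

class BulkIndexingSketch {

    // Stand-in for the database object; the real entity is far richer.
    static class Task {
        final long id;
        final String title;

        Task(long id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    // Hypothetical batch size, mirroring the elasticsearch.batch setting.
    private static final int BATCH_SIZE = 2_500;

    static void indexTasks(RestHighLevelClient client, List<Task> tasks) {
        BulkRequest bulk = new BulkRequest();
        for (Task task : tasks) {
            // Converting the database object to JSON is the expensive step
            // described above; here it is trivially cheap.
            String json = "{\"title\":\"" + task.title + "\"}";
            bulk.add(new IndexRequest("kitodo_task") // index name is an assumption
                    .id(String.valueOf(task.id))
                    .source(json, XContentType.JSON));
            // Send one bulk request per batch instead of one request per task.
            if (bulk.numberOfActions() >= BATCH_SIZE) {
                sendAsync(client, bulk);
                bulk = new BulkRequest();
            }
        }
        if (bulk.numberOfActions() > 0) {
            sendAsync(client, bulk);
        }
    }

    // The HTTP round trip is asynchronous, so it does not block the loop
    // that reads and converts the next batch.
    private static void sendAsync(RestHighLevelClient client, BulkRequest bulk) {
        client.bulkAsync(bulk, RequestOptions.DEFAULT, new ActionListener<BulkResponse>() {
            @Override
            public void onResponse(BulkResponse response) {
                // Inspect response.hasFailures() in real code.
            }

            @Override
            public void onFailure(Exception e) {
                e.printStackTrace();
            }
        });
    }
}
```

The point of the sketch is the shape of the pipeline: the per-entry JSON conversion dominates the cost, while the bulk request itself is sent asynchronously and does not block preparation of the next batch.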

My hope is still that we can achieve better indexing performance by switching to hibernate-search, but I haven’t heard anything in the last couple of months about how far that project has come; as far as I know, it is funded by the development fund.
