Indexing of type TASK is extremely slow #5131

Open · matthias-ronge opened this issue May 9, 2022 · 5 comments

matthias-ronge (Collaborator) commented May 9, 2022

On a Linux server, indexing tasks takes roughly 1 minute per 280 tasks (1 h 47 min for 29,930 tasks). Meanwhile, the system is about 60% idle.

Screencast

Although the log level for org.kitodo is set to TRACE, there is no information in the log file. Monitoring the HTTP traffic with

    tcpflow -p -c -i lo port 9200

shows roughly one indexing request per minute. It looks like a timeout is involved that shouldn’t be there …

Screencast

Indexing was started with the Start indexing for all button on the Tasks row.

Additional observations: the number of indexed entries and the progress bar for “Whole index” didn’t update, and the indexing buttons didn’t turn blue again after indexing of the tasks was complete:

(Screenshot: indexing progress)

Goal: Speed up indexing of TASK.

henning-gerhardt (Collaborator) commented
Is the application running behind a reverse proxy such as Apache HTTP Server or nginx? If so, does the proxy forward WebSocket connections to the running Tomcat instance? The indexing page uses a WebSocket to push the current indexing progress to the browser of the client that started the indexing. If the WebSocket requests and data are not forwarded, you only see indexing progress when you manually refresh the indexing page.

Just some hints from a larger installation that may speed up (or slow down) the indexing (some of these would need their own issue):

  • Adjust the values of elasticsearch.batch and elasticsearch.indexLimit up or down, depending on how many rows must be indexed (see the configuration sketch after this list)
  • The value of elasticsearch.indexLimit should be a multiple of elasticsearch.batch, or both values should be the same (this avoids issues during indexing)
  • If you start indexing after the application has been running for a long time and/or many processes have been created, restart the application before indexing: some values are initialized at application start but are not updated while the application runs. These stale values cause problems during later indexing and consume a lot of resources, even if you don’t index everything.
  • If you have a lot of data to index, do it in single steps. If you use the “indexing all” feature, you must keep the indexing page open the whole time and cannot close it, because switching to the next indexing type won’t start otherwise. In that case you must restart the application and index the remaining types one by one.
  • After starting the indexing, refresh the indexing page in your browser. This disables the automatic refresh of the indexing progress, but it reduces the load on Elasticsearch, since about 100 requests per second are sent to Elasticsearch to fetch the current indexing progress.
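
A minimal sketch of what such a tuning change could look like, assuming these properties live in kitodo_config.properties; the concrete values are placeholders and must be adapted to the data volume:

```properties
# Placeholder values; tune to your data volume.
# elasticsearch.indexLimit should be a multiple of elasticsearch.batch
# (or equal to it) to avoid issues during indexing.
elasticsearch.batch=500
elasticsearch.indexLimit=5000
```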

matthias-ronge (Collaborator, Author) commented
This is the common setup with Apache/mod_jk. Yes, I did index row by row, and the issue only showed up when indexing Tasks; everything else was fast, even Processes took only a short time. (At least, I didn’t notice it being slow; but I can test it again in the next days.)

> about 100 requests per second are sent to Elasticsearch to fetch the current indexing progress

I don’t know whether this is still true; I didn’t observe these requests in the HTTP traffic, only roughly one request per second, see the animation above. It looks like this request is checking whether the index has been created. (That could also be cached, by the way.)
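
For reference, checking whether an index exists is a single cheap HEAD request against Elasticsearch, which is also why such a check would be easy to cache. The index name kitodo below is an assumption:

```sh
# Returns HTTP 200 if the index exists, 404 otherwise.
# "kitodo" is a hypothetical index name.
curl -I http://localhost:9200/kitodo
```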

henning-gerhardt (Collaborator) commented
> This is the common setup with Apache/mod_jk.

Apache with mod_jk for forwarding requests to Tomcat worked well until WebSockets came into use. mod_jk cannot forward WebSockets, and it looks to me as if development of mod_jk has stalled. I also had issues with mod_jk when using HTTP/2, so I switched to mod_proxy and its sub-modules for forwarding all requests to Tomcat.
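
A minimal sketch of what such a mod_proxy based setup with WebSocket forwarding could look like, assuming mod_proxy_http and mod_proxy_wstunnel are enabled; the context path /kitodo, the WebSocket endpoint path, and the Tomcat port 8080 are assumptions:

```apache
# Forward WebSocket upgrade requests first (endpoint path is an assumption),
# then everything else over plain HTTP.
ProxyPass        "/kitodo/ws" "ws://localhost:8080/kitodo/ws"
ProxyPassReverse "/kitodo/ws" "ws://localhost:8080/kitodo/ws"
ProxyPass        "/kitodo"    "http://localhost:8080/kitodo"
ProxyPassReverse "/kitodo"    "http://localhost:8080/kitodo"
```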

> about 100 requests per second are sent to Elasticsearch to fetch the current indexing progress

> I don’t know whether this is still true; I didn’t observe these requests in the HTTP traffic, only roughly one request per second, see the animation above. It looks like this request is checking whether the index has been created. (That could also be cached, by the way.)

I only see the number of requests to Elasticsearch because I use a similar setup in front of Elasticsearch as on our application server. Maybe you no longer see these requests because you refreshed the indexing page beforehand. Or it depends on your Apache/mod_jk setup, which is “blocking” the WebSockets.

matthias-ronge (Collaborator, Author) commented
The incomplete refresh of the indexing page is definitely also something we should look into, but the primary concern in this issue is that indexing the tasks is so slow. That is: Why is only one indexing request per minute sent for tasks? Who is waiting there, for what, and why?

henning-gerhardt (Collaborator) commented
On our system, or in its logs, I only see bulk requests that contain the content and the number of entries to be indexed (in our case, 2,500 entries per bulk request). It takes some time to retrieve the data from the database (this could be faster, but it is okay), a lot of time to convert the database objects into JSON objects (which indeed needs a speed improvement, if that is possible), and a short time to send the data to the Elasticsearch instance (this is done in an asynchronous request).
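
For illustration only, and not Kitodo’s actual code: a minimal Java sketch of such batched, asynchronous bulk indexing with the Elasticsearch high-level REST client. The index name, the batch size, the Task stand-in class, and the trivial JSON conversion are all assumptions:

```java
import java.util.List;

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

class BulkIndexingSketch {

    // Stand-in for the database object; the real entity is far richer.
    static class Task {
        final long id;
        final String title;

        Task(long id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    // Hypothetical batch size, mirroring the elasticsearch.batch setting.
    private static final int BATCH_SIZE = 2_500;

    static void indexTasks(RestHighLevelClient client, List<Task> tasks) {
        BulkRequest bulk = new BulkRequest();
        for (Task task : tasks) {
            // Converting the database object to JSON is the expensive step
            // described above; here it is trivially cheap.
            String json = "{\"title\":\"" + task.title + "\"}";
            bulk.add(new IndexRequest("kitodo_task") // index name is an assumption
                    .id(String.valueOf(task.id))
                    .source(json, XContentType.JSON));
            // Send one bulk request per batch instead of one request per task.
            if (bulk.numberOfActions() >= BATCH_SIZE) {
                sendAsync(client, bulk);
                bulk = new BulkRequest();
            }
        }
        if (bulk.numberOfActions() > 0) {
            sendAsync(client, bulk);
        }
    }

    // The HTTP round trip is asynchronous, so it does not block the loop
    // that reads and converts the next batch.
    private static void sendAsync(RestHighLevelClient client, BulkRequest bulk) {
        client.bulkAsync(bulk, RequestOptions.DEFAULT, new ActionListener<BulkResponse>() {
            @Override
            public void onResponse(BulkResponse response) {
                // Inspect response.hasFailures() in real code.
            }

            @Override
            public void onFailure(Exception e) {
                e.printStackTrace();
            }
        });
    }
}
```

The point of the sketch is the shape of the pipeline: the per-entry JSON conversion dominates the cost, while the bulk request itself is sent asynchronously and does not block preparation of the next batch.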

My hope is still that we can achieve better indexing performance by switching to hibernate-search, but I haven’t heard anything in the last couple of months about how far that project has come; as far as I know, it is funded by the development fund.
