How to deal with mysql connection timeouts for long OCR jobs? #54

Open
sistason opened this issue Apr 17, 2023 · 1 comment
@sistason

A PDF with a few hundred pages broke my indexing: after all pages had been OCR'ed, the run (via fulltextsearch:document:index or via full-index) crashed with:

  [PDOException (HY000)]
  SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
 

Exception trace:
  at /var/www/html/3rdparty/doctrine/dbal/src/Driver/PDO/Statement.php:92
 PDOStatement->execute() at /var/www/html/3rdparty/doctrine/dbal/src/Driver/PDO/Statement.php:92
 Doctrine\DBAL\Driver\PDO\Statement->execute() at /var/www/html/3rdparty/doctrine/dbal/src/Connection.php:1059
 Doctrine\DBAL\Connection->executeQuery() at /var/www/html/lib/private/DB/Connection.php:261
 OC\DB\Connection->executeQuery() at /var/www/html/3rdparty/doctrine/dbal/src/Query/QueryBuilder.php:345
 Doctrine\DBAL\Query\QueryBuilder->execute() at /var/www/html/lib/private/DB/QueryBuilder/QueryBuilder.php:281
 OC\DB\QueryBuilder\QueryBuilder->execute() at /var/www/html/lib/private/Comments/Manager.php:419
 OC\Comments\Manager->getForObject() at /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:820
 OCA\Files_FullTextSearch\Service\FilesService->updateCommentsFromFile() at /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:812
 OCA\Files_FullTextSearch\Service\FilesService->updateContentFromFile() at /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:741
 OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocumentFromFile() at /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:657
 OCA\Files_FullTextSearch\Service\FilesService->generateDocumentFromIndex() at /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:705
 OCA\Files_FullTextSearch\Service\FilesService->updateDocument() at /var/www/html/apps/files_fulltextsearch/lib/Provider/FilesProvider.php:314
 OCA\Files_FullTextSearch\Provider\FilesProvider->updateDocument() at /var/www/html/apps/fulltextsearch/lib/Command/DocumentIndex.php:112

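So updateDocument seems to run into a MySQL connection timeout: the connection apparently sits idle during the main loop over the PDF pages, and the next query (the comments lookup in the trace above) then fails. If I limit the number of PDF pages, I can OCR 20 pages, but at 100 pages it times out. I presume the connection timeout is around 5 or 10 minutes.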

So how do I deal with this problem?

  • I figure I could increase the MySQL connection timeout in the Nextcloud settings, but I'd rather not, since this could negatively affect many other apps and core, especially as OCR would need a connection timeout of around 2 hours for a 1000-page PDF...
  • Ideally the "main loop" in TesseractService.php:#L278 could ping the database connection, but as I don't see a database connection anywhere there, I presume it is handled in the general occ code. So can the app even touch it?
  • I don't want to limit my whole FTS to fewer than 20 PDF pages. That limit would also depend on --psm and on the general load of the server, and would lead to random errors when indexing. I have a few hundred users working on policymaking that involves big PDFs, so ideally it would not be necessary to limit PDF pages at all...
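
To make the ping idea from the second bullet concrete, here is a minimal sketch in Python, not the app's actual PHP code. Everything in it is hypothetical and for illustration only: `FakeConnection` simulates a server that drops idle connections, `ReconnectingConnection` reopens the connection once when a query fails that way, and `index_document` with its `ping_every` parameter stands in for the OCR loop. The 60-second page time and 300-second idle timeout are made-up numbers.

```python
import time


class ServerGoneAway(Exception):
    """Stands in for MySQL error 2006 ("server has gone away")."""


class FakeConnection:
    """Hypothetical connection that the server drops after idle_timeout seconds of silence."""

    def __init__(self, idle_timeout, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.last_used = clock()

    def execute(self, query):
        now = self.clock()
        if now - self.last_used > self.idle_timeout:
            raise ServerGoneAway("2006 MySQL server has gone away")
        self.last_used = now
        return "ok: " + query


class ReconnectingConnection:
    """Runs queries on a connection and reconnects once if the server dropped it."""

    def __init__(self, connect):
        self.connect = connect      # factory that opens a fresh connection
        self.conn = connect()
        self.reconnects = 0

    def execute(self, query):
        try:
            return self.conn.execute(query)
        except ServerGoneAway:
            # The old connection is dead; open a new one and retry once.
            self.conn = self.connect()
            self.reconnects += 1
            return self.conn.execute(query)


def index_document(db, pages, ocr_page, ping_every=10):
    """OCR every page, sending a cheap keepalive query every ping_every pages."""
    texts = []
    for count, page in enumerate(pages, start=1):
        texts.append(ocr_page(page))
        if count % ping_every == 0:
            db.execute("SELECT 1")  # keepalive so the idle timer restarts
    # First "real" query after the long OCR loop (placeholder statement).
    db.execute("SELECT comments FOR document")
    return texts


# Demo with a simulated clock: each page "takes" 60 s, the server drops after 300 s idle.
elapsed = [0.0]


def slow_ocr(page):
    elapsed[0] += 60.0
    return "text of page %d" % page


def connect():
    return FakeConnection(idle_timeout=300.0, clock=lambda: elapsed[0])


db = ReconnectingConnection(connect)
texts = index_document(db, range(100), slow_ocr, ping_every=3)
print(len(texts), db.reconnects)
```

Either half would probably be enough on its own: the keepalive stops the connection from going idle, while the reconnect-and-retry recovers after it has already been dropped. Retrying like this is only safe outside a transaction, since a dropped connection loses any uncommitted work.
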
@sistason changed the title from "Deal with mysql connection timeouts for long OCR jobs" to "How to deal with mysql connection timeouts for long OCR jobs?" on Apr 17, 2023
@ShinjiLE

Maybe #60 could help. It should kill tesseract if it takes too long.
