Processor parallel pages: switch from multithreading to multiprocessing #23

Merged
merged 11 commits into new-processor-api on Oct 30, 2024

Conversation

bertsky
Owner

@bertsky commented on Oct 19, 2024

This replaces the concurrent.futures.ThreadPoolExecutor used to manage page worker tasks with a loky.ProcessPoolExecutor serving the same purpose.

Reasons for this:

  • multithreading can only exploit the potential of concurrent I/O or networking; due to the Python GIL, it is not (and cannot be) truly parallel for CPU-bound work
  • multiprocessing also avoids problems with processors using libraries that are not thread-safe (think Tesseract)
  • the Python stdlib's implementation in concurrent.futures has long been buggy, especially the memory leak and the shutdown deadlock. These bugs have been haunting us on Python 3.8, and the fixes will not be backported to anything before 3.10. While searching for external backports, I came across loky, which turns out to be the actual origin of all the recent robustness improvements in the stdlib (they merely cherry-picked from their standalone implementation). Since it is still available as an external library, supports 3.8, and is well maintained, I switched to it as a dependency (see the sketch after this list).
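For illustration, here is a minimal sketch of what the swap amounts to at the call site. This is not the actual diff: _square is a toy stand-in for the per-page task, and loky's top-level ProcessPoolExecutor export is assumed.

```python
from loky import ProcessPoolExecutor  # assumed top-level export

def _square(n):
    # toy stand-in for a per-page task
    return n * n

if __name__ == '__main__':
    # same concurrent.futures API as the stdlib executor it replaces,
    # but with the robustness fixes (memory leak, shutdown deadlock)
    # already in place
    with ProcessPoolExecutor(max_workers=4) as executor:
        print(list(executor.map(_square, range(8))))
```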

I spent a lot of time getting error handling and shutdown to really work well under the multiprocessing regime. I also learned that, when doing multiprocessing, it is crucial to avoid spawning any threads before forking.
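A hypothetical guard expressing that rule (not part of the PR) could look like this:

```python
import threading

def assert_fork_safe():
    # hypothetical helper: forking while other threads are alive can
    # deadlock the children, so refuse to proceed if any were started
    extra = threading.active_count() - 1
    if extra > 0:
        raise RuntimeError(
            'refusing to fork: %d extra thread(s) already running' % extra)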

Forking (rather than spawning fresh child processes) is necessary because our API needs to run Processor.process_page_file on the worker, where the latter is defined by subclasses: due to unpicklable objects in Processor, it is not possible to serialise self when sending it to the children, and it would not be efficient either. By forking, we can share the processor instance via a global variable, and send tasks (containing the ClientSideOcrdFile objects) and results via queues with minimal overhead.
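A minimal sketch of that fork-and-share pattern, assuming loky's ProcessPoolExecutor accepts a multiprocessing context argument; process_workspace and ToyProcessor are simplified stand-ins, not the actual implementation:

```python
import multiprocessing as mp
from loky import ProcessPoolExecutor

_processor = None  # set in the parent, inherited by the forked children

def _page_worker(input_file):
    # runs in the child: the processor instance arrived via fork, so the
    # unpicklable self never crosses a queue; only the task argument and
    # the (picklable) result do
    return _processor.process_page_file(input_file)

def process_workspace(processor, input_files, max_workers=4):
    global _processor
    _processor = processor
    # request the fork start method instead of loky's default
    # spawn-like 'loky' method
    with ProcessPoolExecutor(max_workers=max_workers,
                             context=mp.get_context('fork')) as executor:
        return list(executor.map(_page_worker, input_files))

if __name__ == '__main__':
    class ToyProcessor:
        # toy stand-in for the real Processor (which holds unpicklable state)
        def process_page_file(self, input_file):
            return 'processed ' + input_file
    print(process_workspace(ToyProcessor(), ['PHYS_0001', 'PHYS_0002']))
```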

EDIT: Note that fork means this will not work on Windows, and macOS might be problematic (see the stdlib docs). AFAICS the issue with forking is precisely that one absolutely must avoid having multiple threads before the fork (which on macOS apparently cannot be guaranteed, because some system libraries start threads of their own).
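A hypothetical startup check mirroring that caveat:

```python
import sys
import warnings

def check_fork_support():
    # hypothetical guard: fork() is unavailable on Windows and fragile
    # on macOS (system libraries may already have started threads)
    if sys.platform == 'win32':
        raise RuntimeError('page-parallel processing requires fork(), '
                           'which is unavailable on Windows')
    if sys.platform == 'darwin':
        warnings.warn('fork() on macOS may be unsafe if system libraries '
                      'have already spawned threads')
```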

@bertsky
Owner Author

bertsky commented Oct 19, 2024

Lest I forget: forking also helps with shared loggers. (I initially thought we would need to synchronise the loggers, but that turns out to be unnecessary.)

One more concern: using a module-level function _page_worker in lieu of Processor.process_page_file might make it harder for subclasses (processors) to implement their own .process_workspace logic while borrowing from the base class. If we want them to benefit from error handling and parallelization, they currently have to stick with .process_page_pcgts or a custom .process_page_file.

So to alleviate that, we could split up ._process_workspace_run (see the sketch after the alternatives below) into

  • .process_workspace_submit_tasks (for the first loop) and
  • .process_workspace_finish_tasks (for the second loop).

Or into

  • .process_workspace_submit_task_files (for the first loop body) and
  • .process_workspace_handle_task_result (for the second loop body).
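To make the first variant concrete, here is a hedged sketch; the method bodies are illustrative only, _page_worker is the module-level worker from above, and self.logger is assumed to exist on Processor:

```python
class Processor:
    # (existing attributes such as self.logger assumed)

    def process_workspace_submit_tasks(self, executor, input_files):
        # first loop: submit one task per page file
        return {executor.submit(_page_worker, input_file): input_file
                for input_file in input_files}

    def process_workspace_finish_tasks(self, tasks):
        # second loop: collect results with per-page error handling
        for future, input_file in tasks.items():
            try:
                yield future.result()
            except Exception as err:
                self.logger.error('page task for %s failed: %s',
                                  input_file, err)
```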

@bertsky merged commit 7932a6a into new-processor-api on Oct 30, 2024