Processor parallel pages: switch from multithreading to multiprocessing #23

Merged
merged 11 commits into new-processor-api on Oct 30, 2024

Conversation

bertsky
Owner

@bertsky commented on Oct 19, 2024

This replaces the concurrent.futures.ThreadPoolExecutor used to manage page worker tasks with a loky.ProcessPoolExecutor serving the same purpose.

Reasons for this:

  • multithreading can only exploit the potential of concurrent I/O or networking; due to the Python GIL, it is not (and cannot be) truly parallel for CPU-bound work
  • multiprocessing also avoids problems with processors using libraries that are not thread-safe (think Tesseract)
  • the Python stdlib's implementation in concurrent.futures has long been buggy, especially the memory leak and the shutdown deadlock. These bugs have been haunting us on Python 3.8, and the fixes will not be backported to anything before 3.10. While searching for external backports, I came across loky, which turns out to be the actual origin of all the recent robustness improvements in the stdlib (they merely cherry-picked from their standalone implementation). Since it is still available as an external library, supports 3.8, and is well maintained, I switched to it as a dependency (see the sketch after this list).
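For illustration, here is a minimal sketch of what the swap amounts to at the call site. This is not the actual diff: _square is a toy stand-in for the per-page task, and loky's top-level ProcessPoolExecutor export is assumed.

```python
from loky import ProcessPoolExecutor  # assumed top-level export

def _square(n):
    # toy stand-in for a per-page task
    return n * n

if __name__ == '__main__':
    # same concurrent.futures API as the stdlib executor it replaces,
    # but with the robustness fixes (memory leak, shutdown deadlock)
    # already in place
    with ProcessPoolExecutor(max_workers=4) as executor:
        print(list(executor.map(_square, range(8))))
```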

I spent a lot of time getting error handling and shutdown to really work well under the multiprocessing regime. I also learned that, when doing multiprocessing, it is crucial to avoid spawning any threads before forking.
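A hypothetical guard expressing that rule (not part of the PR) could look like this:

```python
import threading

def assert_fork_safe():
    # hypothetical helper: forking while other threads are alive can
    # deadlock the children, so refuse to proceed if any were started
    extra = threading.active_count() - 1
    if extra > 0:
        raise RuntimeError(
            'refusing to fork: %d extra thread(s) already running' % extra)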

Forking (rather than spawning fresh child processes) is necessary because our API needs to run Processor.process_page_file on the worker, where the latter is defined by subclasses: due to unpicklable objects in Processor, it is not possible to serialise self when sending it to the children, and it would not be efficient either. By forking, we can share the processor instance via a global variable, and send tasks (containing the ClientSideOcrdFile objects) and results via queues with minimal overhead.
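A minimal sketch of that fork-and-share pattern, assuming loky's ProcessPoolExecutor accepts a multiprocessing context argument; process_workspace and ToyProcessor are simplified stand-ins, not the actual implementation:

```python
import multiprocessing as mp
from loky import ProcessPoolExecutor

_processor = None  # set in the parent, inherited by the forked children

def _page_worker(input_file):
    # runs in the child: the processor instance arrived via fork, so the
    # unpicklable self never crosses a queue; only the task argument and
    # the (picklable) result do
    return _processor.process_page_file(input_file)

def process_workspace(processor, input_files, max_workers=4):
    global _processor
    _processor = processor
    # request the fork start method instead of loky's default
    # spawn-like 'loky' method
    with ProcessPoolExecutor(max_workers=max_workers,
                             context=mp.get_context('fork')) as executor:
        return list(executor.map(_page_worker, input_files))

if __name__ == '__main__':
    class ToyProcessor:
        # toy stand-in for the real Processor (which holds unpicklable state)
        def process_page_file(self, input_file):
            return 'processed ' + input_file
    print(process_workspace(ToyProcessor(), ['PHYS_0001', 'PHYS_0002']))
```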

EDIT: Note that fork means this will not work on Windows, and macOS might be problematic (see the stdlib docs). AFAICS the issue with forking is precisely that one absolutely must avoid having multiple threads before the fork (which on macOS apparently cannot be guaranteed, because some system libraries start threads of their own).
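A hypothetical startup check mirroring that caveat:

```python
import sys
import warnings

def check_fork_support():
    # hypothetical guard: fork() is unavailable on Windows and fragile
    # on macOS (system libraries may already have started threads)
    if sys.platform == 'win32':
        raise RuntimeError('page-parallel processing requires fork(), '
                           'which is unavailable on Windows')
    if sys.platform == 'darwin':
        warnings.warn('fork() on macOS may be unsafe if system libraries '
                      'have already spawned threads')
```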

@bertsky
Owner Author

bertsky commented Oct 19, 2024

Lest I forget: forking also helps with shared loggers. (I initially thought we would need to synchronise the loggers, but that turns out to be unnecessary.)

One more concern: using a module-level function _page_worker in lieu of Processor.process_page_file might make it harder for subclasses (processors) to implement their own .process_workspace logic while borrowing from the base class. If we want them to benefit from error handling and parallelization, they currently have to stick with .process_page_pcgts or a custom .process_page_file.

So to alleviate that, we could split up ._process_workspace_run (see the sketch after the alternatives below) into

  • .process_workspace_submit_tasks (for the first loop) and
  • .process_workspace_finish_tasks (for the second loop).

Or into

  • .process_workspace_submit_task_files (for the first loop body) and
  • .process_workspace_handle_task_result (for the second loop body).
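To make the first variant concrete, here is a hedged sketch; the method bodies are illustrative only, _page_worker is the module-level worker from above, and self.logger is assumed to exist on Processor:

```python
class Processor:
    # (existing attributes such as self.logger assumed)

    def process_workspace_submit_tasks(self, executor, input_files):
        # first loop: submit one task per page file
        return {executor.submit(_page_worker, input_file): input_file
                for input_file in input_files}

    def process_workspace_finish_tasks(self, tasks):
        # second loop: collect results with per-page error handling
        for future, input_file in tasks.items():
            try:
                yield future.result()
            except Exception as err:
                self.logger.error('page task for %s failed: %s',
                                  input_file, err)
```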

@bertsky merged commit 7932a6a into new-processor-api on Oct 30, 2024