Processor parallel pages: switch from multithreading to multiprocessing #23
…tent with file config and prevent imported libraries from initing logging first), but disable propagation for ocrd loggers (to avoid duplication)
…r itself (rather than future query)
Lest I forget: forking also helps for shared loggers. (I initially thought we would need to synchronize loggers, but that turns out to be unnecessary.) One more concern: using a module-level function So to alleviate that, we could split up
Or into
…ny failures already (rate will be too low)
This changes the `concurrent.futures.ThreadPoolExecutor` for management of page worker tasks to a `loky.ProcessPoolExecutor` for the same purpose.

Reasons for this:
`concurrent.futures` has long been buggy, esp. the memory leak and the `shutdown` deadlock. This has been haunting us in Python 3.8, but the fixes will not be backported to versions before 3.10. In search of external backports I came across loky, which turns out to be the actual origin of all the recent robustness improvements in stdlib (they merely cherry-picked from its standalone implementation). Since it is still available as an external library, supports 3.8 and is well maintained, I switched to it as a dependency.

I spent a lot of time trying to make error handling and shutdown really work well under the multiprocessing regime. Also, I learned that it is crucial when doing multiprocessing to avoid spawning any threads before forking.
Forking (rather than spawning fresh child processes) is necessary, because our API needs to run `Processor.process_page_file` on the worker, where the latter is defined by subclasses: it is not possible (due to unpicklable objects in `Processor`) to serialise `self` when sending it to children, and it would not be feasible either. By forking, we can share the processor instance via a global variable, and send tasks (containing the `ClientSideOcrdFile` objects) and results via queues with minimal overhead.

EDIT: Note that `fork` means that it will not work on Windows, and macOS might be problematic (see stdlib docs). AFAICS the issue with forking is precisely that one absolutely must avoid having multiple threads before the fork (which apparently is not guaranteed on macOS because of some system libraries).
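
To illustrate the fork-plus-global pattern described above, here is a minimal sketch. It is not the code from this PR: it uses plain stdlib `multiprocessing` with the `fork` start method instead of loky, and `DemoProcessor`, `_worker` and `run_pages` are hypothetical names made up for the example. The lock attribute only serves to make the instance unpicklable, as a stand-in for whatever a real `Processor` holds.

```python
import multiprocessing as mp
import threading


class DemoProcessor:
    """Hypothetical stand-in for a Processor subclass (not the OCR-D class)."""

    def __init__(self, suffix):
        self.suffix = suffix
        # A lock cannot be pickled, so this instance cannot be sent to a child.
        self._lock = threading.Lock()

    def process_page_file(self, page_id):
        with self._lock:
            return f"{page_id}{self.suffix}"


# Module-level global: forked children inherit it without any serialisation.
_processor = None


def _worker(tasks, results):
    # Runs in the child; only page IDs and results cross the queues.
    for page_id in iter(tasks.get, None):  # None is the stop sentinel
        try:
            results.put((page_id, _processor.process_page_file(page_id), None))
        except Exception as exc:
            results.put((page_id, None, repr(exc)))


def run_pages(processor, page_ids, workers=2):
    global _processor
    page_ids = list(page_ids)
    _processor = processor  # must be set before the children are forked
    ctx = mp.get_context("fork")  # raises ValueError on Windows
    tasks, results = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=_worker, args=(tasks, results))
             for _ in range(workers)]
    # Start the children before the first put(): a queue starts its feeder
    # thread on first use, and no thread should exist at fork time.
    for proc in procs:
        proc.start()
    for page_id in page_ids:
        tasks.put(page_id)
    for _ in procs:
        tasks.put(None)
    outcome = {}
    for _ in page_ids:
        page_id, result, error = results.get()
        outcome[page_id] = result if error is None else error
    for proc in procs:
        proc.join()
    return outcome
```

Calling `run_pages(DemoProcessor("-done"), ["p1", "p2", "p3"])` returns a dict mapping each page ID to its result. The same call with the `spawn` start method would not work, because the global is not inherited and the instance cannot be pickled.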