change API to get page-level parallelization everywhere #322
Comments
No experience, let alone solutions. Sorry.
Portability isn't essential here though, because I don't foresee the need to support non-POSIX filesystems. No experience doing this with Python, but the principle is simple, and https://github.com/dmfrey/FileLock/blob/master/filelock/filelock.py (the accepted answer you linked to) looks like a reasonable solution.
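For concreteness, a minimal sketch of what guarding METS access could look like with the `filelock` package (same idea as the linked recipe); the lock path and timeout are placeholders, not part of any OCR-D API:

```python
# Minimal sketch only: serialize METS access through an advisory lock file
# next to the METS itself. Path and timeout are illustrative placeholders.
from filelock import FileLock, Timeout

mets_lock = FileLock("data/mets.xml.lock", timeout=30)

try:
    with mets_lock:
        # read the METS, add the new file entries, write it back
        pass
except Timeout:
    # another process held the lock too long: retry, back off, or fail loudly
    raise
```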
I'd like to bring in another idea: Maybe file locking is the wrong way to think of page-level parallelization altogether. We have dependencies between successive processors in a workflow, so it does not make much sense to think of running those in parallel (without going into elaborate synchronization mechanisms like semaphores for all pages), except for a very narrow set of circumstances where workflows contain parallel branches. On the other hand, what really can be run independently (at least for most processors and workflows) are the pages within one processing step. Even now, a processor implementation can do that internally, e.g. by using a process pool over the pages. And there's the rub really: in the current design, each processor implements its own loop over the pages, so core has no handle for parallelizing (or recovering from failures of) individual pages. So for 3.0, what we should aim at is moving that per-page loop, and with it page-level parallelization and error handling, out of the individual processors and into core.
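To make that concrete, here is a rough sketch (all names hypothetical, not the actual core API) of what it means for core to own the page loop: processors only implement per-page logic, and core can then parallelize and catch failures per page:

```python
# Hypothetical sketch of the idea, not the real ocrd API: core drives the page
# loop, so parallelization and per-page error recovery live in one place.
from concurrent.futures import ProcessPoolExecutor

class PageProcessor:
    """A processor only implements per-page logic under this (invented) API."""
    def process_page(self, page_id):
        raise NotImplementedError

def run_pages(processor, page_ids, max_workers=4):
    """Core-side loop: run all pages, collect results and per-page failures."""
    results, failures = {}, {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(processor.process_page, p): p for p in page_ids}
        for future, page_id in futures.items():
            try:
                results[page_id] = future.result()
            except Exception as exc:
                failures[page_id] = exc   # one bad page no longer kills the run
    return results, failures
```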
As a side note: One could of course always argue that workspace-parallel execution is much less complicated and ensures maximum throughput without extra design/implementation effort and synchronization overhead. But this does not cover the use-case of a single (large) workspace, or the use-case of minimum latency. I guess this proposal should go through the spec first?
It's a good proposal, while the
Not really. The workspace-processor approach in core is just one implementation, and while it will be some work to refactor the processors, they will still follow the same CLI; the processor code just gets rid of a lot of loops and inconsistent implementations across processors. From what I can see, this change should not result in any changed behavior for users. I will think about it some more though; quite possibly I'm missing something here.
One thing we have to take into account is processors that run on multiple input file groups. For those, a strict page-by-page API is less straightforward. There might even be processors which read multi-page files (like COCO segmentation, TIFF or PDF), for example https://github.com/OCR-D/ocrd_segment/blob/master/ocrd_segment/import_coco_segmentation.py. See OCR-D/spec#142
My thoughts about parallelization:
It is desirable to support this kind of parallel processing, because it allows us to get a much faster OCR result for a given book. But we should not forget that it is also possible to parallelize on the book level, and that already works, of course. As long as there is still a huge number of books waiting for OCR, that kind of parallelization might be sufficient for many users. Nevertheless, thinking about future improvements is reasonable, and the thoughts discussed here could also trigger actions starting now. If a processor is to process several pages in parallel, it is important to know the required resources. This is something which the maintainers of each processor know best. So the processor description (JSON) could be extended by information like required CPU cores, maximum RAM, number of GPUs and maybe more, which would allow the best use of the available resources. Some of those values might depend on image size or other factors, so they will only be rough estimates, but even that can help. That additional information is also useful for parallelization on the book level, which would otherwise have to guess or work by trial and error.
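For example (field names invented for illustration, not part of the current ocrd-tool.json schema), such an extension to a processor's tool description might look like:

```json
{
  "executable": "ocrd-some-recognizer",
  "steps": ["recognition/text-recognition"],
  "resources_needed": {
    "cpu_cores": 1,
    "ram_mb": 2048,
    "gpus": 1,
    "gpu_ram_mb": 10240
  }
}
```

Even rough numbers like these would let a workflow engine decide how many instances to schedule on a given machine.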
A central problem for page-level parallelization seems to be updating the METS file. Locking that file is only one of the possible solutions. Ideally, the individual page results can be added to the METS file in arbitrary order: writing them in order could mean that processes for higher page numbers have to wait until lower page numbers are finished, so RAM would have to be large enough for the running plus the waiting processors. If the processing of a single page did not write directly to the METS file, but delegated that to a central software instance, that central instance could manage orderly updates of the METS file without any locking.
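A rough sketch of that idea (all names and the message format invented here): page workers hand their results to a queue, and a single writer is the only process that ever touches the METS, so no locking is needed:

```python
# Illustrative only: a single writer process owns the METS file, page workers
# just report their results; arrival order does not matter.
from multiprocessing import Process, Queue

def mets_writer(queue, mets_path):
    while True:
        item = queue.get()
        if item is None:                 # sentinel: all pages are done
            break
        page_id, result_path = item
        # here: add the file entry for result_path under page_id and save METS

def page_worker(queue, page_id):
    result_path = f"OCR-D-OUT/{page_id}.xml"   # stand-in for the actual output
    queue.put((page_id, result_path))

if __name__ == "__main__":
    q = Queue()
    writer = Process(target=mets_writer, args=(q, "data/mets.xml"))
    writer.start()
    workers = [Process(target=page_worker, args=(q, p))
               for p in ("PHYS_0001", "PHYS_0002", "PHYS_0003")]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    q.put(None)                          # tell the writer to finish up
    writer.join()
```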
Or some kind of database, perhaps sqlite3, with conversion from/to METS as the first/final step.
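That could be as small as this (schema invented for illustration): collect per-page results in SQLite while processing, and fold them back into the METS as one final, serial step:

```python
# Illustrative only: SQLite as the intermediate store for page results.
import sqlite3

con = sqlite3.connect("workspace-results.db")
con.execute("""CREATE TABLE IF NOT EXISTS results
               (page_id TEXT, file_grp TEXT, local_path TEXT)""")

# each page job adds its own row; SQLite does the locking between writers
con.execute("INSERT INTO results VALUES (?, ?, ?)",
            ("PHYS_0001", "OCR-D-OUT", "OCR-D-OUT/PHYS_0001.xml"))
con.commit()

# final step: read everything back in page order and write it into the METS
for page_id, file_grp, path in con.execute(
        "SELECT page_id, file_grp, local_path FROM results ORDER BY page_id"):
    pass   # add the corresponding file entry to the METS here
con.close()
```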
Yes, that's what the above proposal is about. If we change the API for modules from processing the whole workspace to processing single pages, core itself can take over the role of that central instance.
What do you mean by that? You still cannot process multiple workflows on the same document, because we would need a METS locking/synchronization mechanism for that. And you can already run the same workflow on multiple documents with workflow-configuration now.
Of course, when running page-parallel, we must offer the user ways to say specifically what resources (not) to allocate. And the more intelligence we invest in automatic flow optimization, the less the user has to experiment with the settings.
I don't think we should use environment variables for that (at least not exclusively). See here for various mechanisms we could use for global options like that.
That's a hard problem indeed. Even estimating the (usually non-linear) dependency between input size and resource allocation is a daunting task for implementors. And it also depends a lot on the parameters set by the workflow (e.g. model files or search depth). That's why I think realistically we should be agnostic about resource usage and let the user or workflow engine do the guesswork in real time. (But if there's one piece of information we do need implementors to specify statically, it's whether or not the processor tries to make use of a GPU.)
That's not true if core itself provides the parallelization (as outlined above): the METS would then only ever be updated from one place anyway. This problem only arises if you want to run different workflows concurrently on the same data.
@bertsky: I meant a completely separate parallel workflow (on other data), if it does not make sense to assign all free resources to one single workflow (e.g. if most processors do not use more than 8 CPUs and you have 32 of them).
In that case everything @stweil and I said about book/document-parallel processing with ocrd-make (or plain |
BTW, this is not necessarily true for bashlib based processors. Currently these have to bake their own kind of shell loop over the pages.
Today I kept an eye on the resources used by the "workflow for slower processors" (https://ocr-d.de/en/workflows). I was surprised that there already exist several processors which create lots (> 10) of threads. They definitely occupy more than a single CPU core. Especially processors with TensorFlow seem to use all available cores (but very inefficiently). That makes efficient parallelization on the book level currently difficult or even impossible. Ideally it would be possible to force each processor to use only a single core. Then we could process one book per core.
Yes, that kind of behaviour is very undesirable. It should at least be possible to disable automatic multiprocessing/multithreading when running in parallel. This is especially problematic when running on clusters with explicit resource allocation: you'll get oversubscription problems when e.g. Python processors simply spawn as many workers as there are cores. We've had this already with other libraries like OpenMP, OpenBLAS and numpy. For TensorFlow I am afraid that (again) we are in a very uncomfortable situation: for one, there is no simple environment variable to control CPU allocation – it must be done in code. Next, there have been issues with that, IIUC related to the TF1 vs TF2 migration. But that ultimately means that we would have to introduce an environment variable to control this in every TF-enabled processor implementation...
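For illustration, the in-code controls in TF2 look roughly like this; `OCRD_MAX_CPUS` is an invented convention that every TF-enabled processor would have to honour itself, and the calls must run before TensorFlow executes anything:

```python
# Sketch: cap TensorFlow's internal threading so that parallelism can be
# managed at the workflow level. OCRD_MAX_CPUS is our own invented convention.
import os
import tensorflow as tf

max_cpus = int(os.environ.get("OCRD_MAX_CPUS", "1"))
tf.config.threading.set_intra_op_parallelism_threads(max_cpus)
tf.config.threading.set_inter_op_parallelism_threads(max_cpus)
```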
... or new parameters for the processors which use TF. I think that's also necessary in environments with one or more GPUs: parallel runs need to know which CPU core(s) and which GPU(s) to use. That sounds like a really challenging task, and optimized usage of all resources cannot be achieved with simple environment variables or parameters unless each processor uses only a single CPU core.
Yes. My point was: changes to the modules require maximal effort and are difficult to enforce/maintain uniformly.
GPU resources can be fully controlled via the CUDA_VISIBLE_DEVICES environment variable.
I think the only useful and general restriction we can request modules to provide is running 1 CPU core and 0/1 CUDA device when needed (via envvar / param / cfg) – instead of automatic resource maximization. This would at least allow some parallelism at the workflow level. The runtime will have to guess optimal multi-scalarity (without over/under-allocation) from that anyway, because modules themselves cannot be expected to give reliable estimates under all circumstances. And as long as enough documents are run in parallel, the differences in resource utilisation among the various processors in a workflow (vertically) will even out (horizontally) anyway.
From the view of the Heidelberg digitization department (final check), it would be good to process a single book as fast as possible.
@jbarth-ubhd, tearing out the pages of the book to split it into many is not a solution? To be serious: technically it is fine to split a book into many books. That means you could make several METS files, each covering part of the pages, and process those in parallel.
Does anybody have experience with OCR servers which allow many parallel threads? Maybe something like an AMD EPYC with 64 CPU cores / 128 threads, or even better a server with dual processors and 256 threads? We only have a small Intel machine with 16 cores / 32 threads ...
This was my suggestion back in the 2000s when I saw the large cutting machines in the bookbinding department :-). Yes, it could be done this way in parallel. I think sbb_textline will be the bottleneck if it needs 1 GPU and 10 GB of GPU RAM per instance. We have a server with 2x Xeon 6150 (36 cores, 72 threads, 128 GB RAM), but unfortunately no GPU.
Getting back onto the original topic of changing the process API to support catching page-level processor failures or employing page-level parallelism in core: besides the concerns raised already about multi-fileGrp input and multi-page input processors, and about bashlib processors in that regard, I want to point out another aspect we forgot that will likely complicate our porting efforts among processor implementations. The current API does not make it easy to set up the processor state (resolving file paths, loading large files or initializing large memory objects, setting up configuration and informing about all of that): the constructor is not only used before actual processing, but also for things like dumping the tool JSON or showing help, so implementations have to improvise that distinction themselves. Under the new API, however, that distinction will become obligatory. EDIT Perhaps the new API should include this directly, with a new abstract method for that purpose.
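A minimal sketch of that separation (all names, including `setup`, are placeholders only):

```python
# Hypothetical sketch: keep the constructor cheap (it is also invoked for
# introspection such as dumping the tool JSON), and let core call an explicit
# one-time hook before the first page is processed.
class PageProcessor:
    def __init__(self, parameter=None):
        self.parameter = parameter or {}   # configuration only, nothing heavy
        self.model = None

    def setup(self):
        """Called once by core before any page: resolve paths, load models."""
        self.model = object()              # stand-in for an expensive resource

    def process_page(self, page_id):
        assert self.model is not None, "setup() must run before process_page()"
        # per-page work goes here
```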
I wrote a summary of the two aspects above, parallelisation and error recovery, plus the additional aspect of overhead, with a proposal for an overall solution (workflow server and processor server).
In #278 we already had the case where the METS got broken due to Ctrl+C in a single process. We discussed parallelization but agreed to defer that topic. Now this is about providing a locking mechanism that allows parallel processing.
Currently it is impossible to run in parallel over a workspace, simply because it is too dangerous: the result of one actor could be lost due to another without any indication of it. This hampers makefilization and similar efforts.
IIUC, getting FS locking in a portable way is tricky, if you want to get it right.
Does anyone have experience or solutions?