Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ocrd-tesserocr with caching for Processing Server not possible #190

Closed
joschrew opened this issue Mar 13, 2023 · 5 comments · Fixed by #191
Closed

Use ocrd-tesserocr with caching for Processing Server not possible #190

joschrew opened this issue Mar 13, 2023 · 5 comments · Fixed by #191

Comments

@joschrew
Copy link
Contributor

joschrew commented Mar 13, 2023

In the processing server pull request we want to use caching. Therefore an instance of a processor without a workspace is created. Later when the user requests a processor-run we want to set the workspace into that cache-instance.

The problem seems to be that ocrd-tesserocr processors delegates the processing internally to the processor TesserocrRecognize`:

def __init__(self, *args, **kwargs):
kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]
kwargs['version'] = OCRD_TOOL['version']
super().__init__(*args, **kwargs)
if hasattr(self, 'workspace'):
recognize_kwargs = {**kwargs}
recognize_kwargs.pop('dump_json', None)
recognize_kwargs.pop('dump_module_dir', None)
recognize_kwargs.pop('show_help', None)
recognize_kwargs.pop('show_version', None)
recognize_kwargs['parameter'] = self.parameter
recognize_kwargs['parameter']['overwrite_segments'] = self.parameter['overwrite_regions']
del recognize_kwargs['parameter']['overwrite_regions']
recognize_kwargs['parameter']['segmentation_level'] = "region"
recognize_kwargs['parameter']['textequiv_level'] = "region"
recognize_kwargs['parameter']['block_polygons'] = self.parameter['crop_polygons']
del recognize_kwargs['parameter']['crop_polygons']
self.recognizer = TesserocrRecognize(self.workspace, **recognize_kwargs)
self.recognizer.logger = getLogger('processor.TesserocrSegmentRegion')

The workspace is transferred to this class on creation. If the workspace is (re-)set afterwards it does not get delegated to the created instance of TesserocrRecognize

@bertsky
Copy link
Collaborator

bertsky commented Mar 13, 2023

Indeed, this pattern is a problem.

if hasattr(self, 'workspace'):

We could easily rewrite this by testing for output_file_grp instead of workspace AFAICS. (It's just there to differentiate between the other, non-processing CLI contexts, like dump and help.)

@joschrew
Copy link
Contributor Author

joschrew commented Mar 13, 2023

The only solution I see is making workspace a property in the processor-base-class. And then override the setter in ocrd_tesserorc and transfer the newly set workspace to the wrapped instance of TesserocrSegmentRegion as well. But I don't know if that can be done in ocr-d-core without problems.

@bertsky
Copy link
Collaborator

bertsky commented Mar 13, 2023

The only solution I see is making workspace a property in the processor-base-class. And then override the setter in ocrd_tesserorc and transfer the newly set workspace to the wrapped instance of TesserocrSegmentRegion as well.

I don't understand how that would help. The kwarg transfer needs to happen in the constructor already.

Within the new paradigm of the changed processor API, differentiating between processing and non-processing context becomes painless. (The former can happen in setup, including all the parameter kwarg transfers.)

But for the moment, as I said, exchanging workspace for output_file_grp test should be safe. (And of course, passing *args indiscriminately, not self.workspace explicitly.)

@joschrew
Copy link
Contributor Author

joschrew commented Mar 13, 2023

I don't understand how that would help. The kwarg transfer needs to happen in the constructor already.

That's the problem I am trying to describe. At the constructor we don't have the workspace yet because we are creating a cached instance. We want to set the workspace later after the constructor has finished. But when we do that (setting the workspace) the wrapped instance of TesserocrRecognize does not get updated with the "new" workspace.

With the code snipped (ocrd_tesserocr/ocrd_tesserocr/segment_region.py) I only wanted to show that the processor wraps an instance of TesserocrRecognize which does the real work. The if hasattr(self, 'workspace') is not a problem when the workspace is set to None with the first positional argument.

@bertsky
Copy link
Collaborator

bertsky commented Mar 13, 2023

But when we do that (setting the workspace) the wrapped instance of TesserocrRecognize does not get updated with the "new" workspace.

Ah, sry, I didn't see it! Yes, since we don't override the constructor, but construct and then wrap a separate processor object, many patterns don't work. So I would say this trick is bad practice and should be ruled out in the future.

IIRC I used that wrap pattern instead of overriding the constructor because I did not want the user to see the processor appear like ocrd-tesserocr-recognize (with all its complicated parameters). But the downsides are obviously too great.

I'll make a PR based on the simpler pattern.

The only solution I see is making workspace a property in the processor-base-class. And then override the setter in ocrd_tesserorc and transfer the newly set workspace to the wrapped instance of TesserocrSegmentRegion as well.

Yes, that would work, and maybe it's also the best pattern in general going forward for all processors. (The setter could do many other things, like syncing/closing log files, releasing locks, syncing the METS cache or METS server.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants