
Port to v3 #44 (Open)

kba wants to merge 30 commits into master
Conversation

@kba (Member) commented Aug 11, 2024

No description provided.

@kba requested review from MehmedGIT and bertsky, August 11, 2024 12:55
@bertsky (Collaborator) left a comment:
Perfect!

@bertsky (Collaborator) left a comment:

Now needs an update.

Review threads (outdated, resolved): ocrd_kraken/binarize.py, ocrd_kraken/recognize.py (two threads), ocrd_kraken/segment.py
@bertsky self-requested a review, August 13, 2024 22:25
Review thread (outdated, resolved): ocrd_kraken/binarize.py
Comment on lines +12 to +54
CONFIGS = ['', 'pageparallel', 'metscache', 'pageparallel+metscache']

@pytest.fixture(params=CONFIGS)
def workspace(tmpdir, pytestconfig, request):
    # returns a factory that sets up a test workspace for a given METS path,
    # parametrized over runtime configurations (METS caching and/or
    # page-parallel processing via METS server)
    def _make_workspace(workspace_path):
        initLogging()
        if pytestconfig.getoption('verbose') > 0:
            setOverrideLogLevel('DEBUG')
        with pushd_popd(tmpdir):
            directory = str(tmpdir)
            resolver = Resolver()
            workspace = resolver.workspace_from_url(workspace_path, dst_dir=directory, download=True)
            # abort (and thus fail the test) whenever a page fails
            config.OCRD_MISSING_OUTPUT = "ABORT"
            if 'metscache' in request.param:
                config.OCRD_METS_CACHING = True
                print("enabled METS caching")
            if 'pageparallel' in request.param:
                config.OCRD_MAX_PARALLEL_PAGES = 4
                print("enabled page-parallel processing")
                # page-parallel processing needs a METS server to
                # synchronize METS access across page workers
                def _start_mets_server(*args, **kwargs):
                    print("running with METS server")
                    server = OcrdMetsServer(*args, **kwargs)
                    server.startup()
                process = Process(target=_start_mets_server,
                                  kwargs={'workspace': workspace, 'url': 'mets.sock'})
                process.start()
                sleep(1)  # give the server time to start up
                workspace = Workspace(resolver, directory, mets_server_url='mets.sock')
                yield {'workspace': workspace, 'mets_server_url': 'mets.sock'}
                process.terminate()
            else:
                yield {'workspace': workspace}
            config.reset_defaults()
    return _make_workspace


@pytest.fixture
def workspace_manifesto(workspace):
    yield from workspace(assets.path_to('communist_manifesto/data/mets.xml'))

@pytest.fixture
def workspace_aufklaerung(workspace):
    yield from workspace(assets.path_to('kant_aufklaerung_1784/data/mets.xml'))
A collaborator commented:

BTW, this could be a template for all processor tests. Testing w/ and w/o METS Server is important IMO.

We can easily add more configuration scenarios there.
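For instance (a hypothetical extension, not part of this PR), a scenario that skips failing pages instead of aborting could look like this, assuming config.OCRD_MISSING_OUTPUT also accepts "SKIP":

CONFIGS = ['', 'pageparallel', 'metscache', 'pageparallel+metscache',
           'skipfailures']

            # hypothetical extra branch inside _make_workspace,
            # next to the existing ones:
            if 'skipfailures' in request.param:
                config.OCRD_MISSING_OUTPUT = "SKIP"
                print("enabled skipping of pages with missing output")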

Comment on lines +14 to +36
def test_recognize(workspace_aufklaerung):
    # some models (like default en) require binarized images
    run_processor(KrakenBinarize,
                  input_file_grp="OCR-D-GT-PAGE",
                  output_file_grp="OCR-D-GT-PAGE-BIN",
                  **workspace_aufklaerung,
    )
    run_processor(KrakenRecognize,
                  # re-use layout, overwrite text:
                  input_file_grp="OCR-D-GT-PAGE-BIN",
                  output_file_grp="OCR-D-OCR-KRAKEN",
                  parameter={'overwrite_text': True},
                  **workspace_aufklaerung,
    )
    ws = workspace_aufklaerung['workspace']
    ws.save_mets()
    assert os.path.isdir(os.path.join(ws.directory, 'OCR-D-OCR-KRAKEN'))
    results = ws.find_files(file_grp='OCR-D-OCR-KRAKEN', mimetype=MIMETYPE_PAGE)
    result0 = next(results, False)
    assert result0, "found no output PAGE file"
    result0 = page_from_file(result0)
    text0 = result0.etree.xpath('//page:Glyph/page:TextEquiv/page:Unicode', namespaces=NAMESPACES)
    assert len(text0) > 0, "found no glyph text in output PAGE file"
A collaborator commented:

And here is the consumer part.

@@ -68,7 +68,7 @@ docker:

 # Run test
 test: tests/assets
-	$(PYTHON) -m pytest tests $(PYTEST_ARGS)
+	$(PYTHON) -m pytest tests --durations=0 $(PYTEST_ARGS)
A collaborator commented:

And with this we get to see what difference in performance these settings make:

93.35s call     tests/test_recognize.py::test_recognize[pageparallel+metscache]
92.28s call     tests/test_recognize.py::test_recognize[pageparallel]
76.19s call     tests/test_recognize.py::test_recognize[]
74.83s call     tests/test_recognize.py::test_recognize[metscache]
55.92s call     tests/test_segment.py::test_run_blla[metscache]
55.11s call     tests/test_segment.py::test_run_blla[]
48.43s call     tests/test_segment.py::test_run_blla[pageparallel+metscache]
41.80s call     tests/test_segment.py::test_run_blla[pageparallel]

(In this case, it was only 2 pages, so at most 2 of the 4 parallel workers could be busy, and the scaling factor is not so great.)

kba and others added 4 commits January 9, 2025 15:10
- during `setup`, instead of loading models in the processor
  directly, instantiate and spawn a singleton predictor subprocess
  with the given parameters (after resolving the model path name),
  communicating via shared (task and result) queues to synchronize
  processor and predictor processes (see the sketch after this list);
  the predictor will then load models in its own address space
- at runtime, the processor merely calls the predictor with the
  respective arguments for that page, which translates into
  - putting the arguments on the task queue
  - getting the results from the result queue, blocking
- at runtime, the predictor loops into:
  - receiving inputs from the task queue, blocking
  - calling `predict` on them
  - putting outputs on the result queue
- in the predictor, tasks and results are identified via page id,
  so results get retrieved for their respective task only,
  implemented via shared dict to synchronize forked processor workers
- during `shutdown`, tell the predictor to shut down as well
  (terminating the subprocess);
  the predictor will then exit its loop and close the queues
- abstract from kraken.pageseg, kraken.blla, and kraken.rpred
  differences in initialization phase and inference phase via
  shared `common.KrakenPredictor` class, override specifics in
  - `recognize.KrakenRecognizePredictor`:
    - during `setup`, after loading the model, submit a special "task"
      to query the model's `one_channel_mode` attribute
    - at runtime, translate the model into a `defaultdict` for `mm_rpred`,
      but picklable to be compatible with mp.Queue; for the same reason,
      exhaust the result generator immediately
  - `segment.KrakenSegmentPredictor`: during `setup`, map the given
    parameters and inputs to kwargs as applicable by either `pageseg.segment`
    or `blla.segment`
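
A minimal sketch of this queue-based pattern, under stated assumptions: the names (QueuePredictor, _predictor_loop, loader) are illustrative rather than the actual common.KrakenPredictor API, and a simple requeue loop stands in for the shared result dict mentioned above:

import multiprocessing as mp

def _predictor_loop(loader, params, taskq, resultq):
    # predictor subprocess: load the model once in this address space,
    # then serve tasks until the shutdown sentinel arrives
    model = loader(**params)
    while True:
        task = taskq.get()                   # blocking receive
        if task is None:                     # shutdown sentinel
            break
        page_id, inputs = task
        resultq.put((page_id, model(inputs)))

class QueuePredictor:
    def __init__(self, loader, **params):
        ctx = mp.get_context('spawn')
        self.taskq = ctx.Queue()             # processor -> predictor
        self.resultq = ctx.Queue()           # predictor -> processor
        self.proc = ctx.Process(
            target=_predictor_loop,
            args=(loader, params, self.taskq, self.resultq))

    def setup(self):
        self.proc.start()                    # model loads in the subprocess

    def __call__(self, page_id, inputs):
        # enqueue the task, then block until the result for *this* page
        # arrives; the real implementation keys results via a shared dict
        # so forked page workers only pick up their own results
        self.taskq.put((page_id, inputs))
        while True:
            rid, result = self.resultq.get()
            if rid == page_id:
                return result
            self.resultq.put((rid, result))  # another page's result: requeue

    def shutdown(self):
        self.taskq.put(None)                 # predictor exits its loop
        self.proc.join()

Usage would be along the lines of predictor = QueuePredictor(my_loader); predictor.setup(); result = predictor(page_id, page_image); predictor.shutdown(), where my_loader must be a picklable (module-level) callable, since the spawn context pickles the process target and its arguments.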