Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make_file_id: no page_id number extraction #744

Merged
merged 5 commits into from
Dec 8, 2021
Merged

Conversation

bertsky
Copy link
Collaborator

@bertsky bertsky commented Nov 12, 2021

In make_file_id, if the input file's ID does not contain the input fileGrp, then do not attempt to extract the numerical part of the pageId (which might still clash).

But before fallback to purely numerical ID, additionally check if the input file's ID does already contain the pageId: in that case,
only append the output fileGrp to that ID (because it is sufficiently unique already).

(Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works. On the other hand, often the file IDs from non-OCRD data contain the pageId directly, in which case it's better to stick to that convention.)

In make_file_id, if the input file's ID does not
contain the input fileGrp, then do not attempt 
to extract the numerical part of the pageId
(which might still clash); but before fallback to
purely numerical ID, additionally check if the ID
does already contain the pageId: in that case,
only append the output fileGrp to that ID.
@bertsky bertsky force-pushed the make-file-id-no-num-page-id branch from 7ba0454 to f96d3fd Compare November 12, 2021 18:27
(no numerical pageId extraction any more)
@bertsky
Copy link
Collaborator Author

bertsky commented Nov 24, 2021

Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works.

To keep that option alive, it would probably also work to just strip any non-numerical content. Let me know if this is preferable (with lower priority, right before fallback), and I'll add that change.

Copy link
Member

@kba kba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works.

To keep that option alive, it would probably also work to just strip any non-numerical content. Let me know if this is preferable (with lower priority, right before fallback), and I'll add that change.

IIUC this seems unnecessary. Do you have an example of an ID where this might be pertinent?

ocrd_utils/ocrd_utils/str.py Outdated Show resolved Hide resolved
@@ -289,7 +289,7 @@ def test_make_file_id_570(self):
def test_make_file_id_605(self):
"""https://github.com/OCR-D/core/pull/605"""
mets = OcrdMets.empty_mets()
f = mets.add_file('1:!GRP', ID='FOO_0001', pageId='phys0001')
f = mets.add_file('2:!GRP', ID='FOO_0001', pageId='phys0001')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably disallow filegroups starting with a number because the resulting ID might lead to an invalid xsd:ID because they mustn't start with a number.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if #746 kicks in, such tests should all fail...

Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
@bertsky
Copy link
Collaborator Author

bertsky commented Dec 6, 2021

Do you have an example of an ID where this might be pertinent?

I recently saw it in Konzilsprotokolle GT, e.g. 1796-97_00000024_img, and Boernerianus, e.g. img100-10.

@kba kba mentioned this pull request Dec 7, 2021
@kba kba merged commit fe99986 into master Dec 8, 2021
@kba kba deleted the make-file-id-no-num-page-id branch December 8, 2021 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants