-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make_file_id: no page_id number extraction #744
Conversation
In make_file_id, if the input file's ID does not contain the input fileGrp, then do not attempt to extract the numerical part of the pageId (which might still clash); but before fallback to purely numerical ID, additionally check if the ID does already contain the pageId: in that case, only append the output fileGrp to that ID.
7ba0454
to
f96d3fd
Compare
(no numerical pageId extraction any more)
To keep that option alive, it would probably also work to just strip any non-numerical content. Let me know if this is preferable (with lower priority, right before fallback), and I'll add that change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works.
To keep that option alive, it would probably also work to just strip any non-numerical content. Let me know if this is preferable (with lower priority, right before fallback), and I'll add that change.
IIUC this seems unnecessary. Do you have an example of an ID where this might be pertinent?
@@ -289,7 +289,7 @@ def test_make_file_id_570(self): | |||
def test_make_file_id_605(self): | |||
"""https://github.com/OCR-D/core/pull/605""" | |||
mets = OcrdMets.empty_mets() | |||
f = mets.add_file('1:!GRP', ID='FOO_0001', pageId='phys0001') | |||
f = mets.add_file('2:!GRP', ID='FOO_0001', pageId='phys0001') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably disallow filegroups starting with a number because the resulting ID might lead to an invalid xsd:ID
because they mustn't start with a number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if #746 kicks in, such tests should all fail...
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
I recently saw it in Konzilsprotokolle GT, e.g. |
In
make_file_id
, if the input file's ID does not contain the input fileGrp, then do not attempt to extract the numerical part of the pageId (which might still clash).But before fallback to purely numerical ID, additionally check if the input file's ID does already contain the pageId: in that case,
only append the output fileGrp to that ID (because it is sufficiently unique already).
(Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works. On the other hand, often the file IDs from non-OCRD data contain the pageId directly, in which case it's better to stick to that convention.)