[Various Importers] Inconsistencies in iiif links and coordinates in code and JSONs #117
Comments
Small update regarding SWA: Notes and discussions from Impresso I were found and contain the following information:
Unfortunately, the information at our disposal does not mention that the links do not work. We can reach another JSON manifest by adding ".tif" to the end of the current links. |
Thanks a lot for the thorough review of this confusing situation. Here are a few comments and suggestions:
|
Thank you for the feedback and information:
|
News from SWA (translated from German): Dear Simon, dear Elias, I wanted to earn the badge (at least halfway) :-) In general, the Impresso project is only on my radar at the moment. We are currently rebuilding our IIIF infrastructure and would also like to introduce persistent identifiers (arks) to ensure better persistence. We are also constantly considering whether there are ways to publish the data differently (as there are always |
Thank you Simon for this update and Martin's contact. I also agree there is no issue in changing to ark identifiers, especially before SWA data is on the interface. Also good to know which suffix is necessary to make the current links work (in particular that SWA needs |
If the IIIF servers have different URLs, we should maybe add a property to the data that specifies how to access info or images. There are probably only a few ways. |
Yes there are a limited number of ways, but all the necessary information is already present in the canonical data, so it should not be necessary to add a new property. I just updated the rebuilder's approach to constructing the image iiif links so that small inconsistencies don't break it, and to change the suffix used if necessary. |
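To illustrate the kind of tolerant link construction described in this comment, here is a minimal sketch. It is not the rebuilder's actual code: the function names and the regular expression are assumptions made for illustration, and the only suffixes handled are those mentioned in this issue (`info.json`, `full/full/0/manifest.json`, and image requests of the form `{region}/{size}/{rotation}/default.jpg`).

```python
import re

# Trailing parts that may already be present on a stored link:
# "/info.json", or "/{region}/{size}/{rotation}/{quality}.{format}"
# ("manifest" is included because of the BNF "full/full/0/manifest.json" case).
_SUFFIX_RE = re.compile(
    r"/(?:info\.json"
    r"|[^/]+/[^/]+/[^/]+/(?:default|color|gray|bitonal|manifest)\.\w+)$"
)


def iiif_base_uri(link: str) -> str:
    """Return the link without any known trailing suffix (hypothetical helper)."""
    return _SUFFIX_RE.sub("", link.strip().rstrip("/"))


def iiif_info_link(link: str) -> str:
    """Build the image information link from any stored variant."""
    return iiif_base_uri(link) + "/info.json"


def iiif_image_link(link: str, region: str = "full", size: str = "full") -> str:
    """Build an image request link; region can be an "x,y,w,h" string."""
    return f"{iiif_base_uri(link)}/{region}/{size}/0/default.jpg"
```

Links that carry no known suffix, such as the impresso proxy links, are returned unchanged by `iiif_base_uri`, so the same helpers can be applied to every provider before deciding which suffix to append.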
Many thanks for this very detailed and useful investigation :) It's good to have a fresh set of eyes on it, and a good time to refactor some of the code that has evolved organically; back in the day, no institution had a IIIF endpoint, and then it happened more and more, leading to code changes and a not-so-clean situation that now needs to be updated. About the
|
My last comment would be whether adding a IIIF major version to our data would allow us to predict the relevant URL for retrieving image information. I think the canonical format should be self-contained (meaning not relying on external DBs), given that the interface is only one way we present our data. APIs are another. |
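As a rough sketch of what a stored major version could buy us: in the IIIF Image API the image information document is `{base}/info.json` in both major versions, while the size keyword for the full-size image is `full` in version 2 and `max` in version 3. The function name and the idea of passing the version as an integer are assumptions for illustration, not an existing property of the canonical format.

```python
def iiif_urls(base_uri: str, major_version: int) -> dict:
    """Derive the info.json and full-page image URLs from a base URI.

    Hypothetical helper: `major_version` stands for a value that would have
    to be stored alongside the base URI in the canonical data.
    """
    if major_version not in (2, 3):
        raise ValueError(f"unsupported IIIF Image API version: {major_version}")
    base = base_uri.rstrip("/")
    # Version 3 dropped the "full" size keyword in favour of "max".
    full_size = "full" if major_version == 2 else "max"
    return {
        "info": f"{base}/info.json",
        "image": f"{base}/full/{full_size}/0/default.jpg",
    }
```

With this, a consumer only needs the base URI and the version to build both links, without looking anything up in an external database.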
Following our meeting discussing this, here were the decisions that were taken regarding iiif URIs: In the Issue Canonical Schema:
In the Page Canonical Schema:
Documentation: Create a "IIIF Phonebook" documenting all IIIF endpoints/links etc. for the various providers.
Side-Note: Since multiple additions and modifications will be made to almost all existing canonical data, this may be a good opportunity to actually replace correctly the
Cons:
|
Some conclusions regarding IIIF information:
Issue level:
Page level
Content item level
Collection level
Version of the IIIF API
A few info on frontend
|
Really sorry @piconti, I did not refresh my page and did not see your summary before commenting, so mine now kind of duplicates yours. But better too much information than not enough (!) |
No problem!
|
To add my 5 cents: I would rather use image_base_uri consistently. |
A google sheet document was created summarizing all problems that are currently in the data, including regarding this issue, as well as the fixing approach for each. After discussion, it was decided that the mentioned properties
Upcoming new data ingestions (BCUL, BL, ONB, KB) will have both properties. |
@piconti Just to be sure: will every page JSON have a valid IIIF URI in the end (meaning that, e.g. for reOCRing, we only need access to the rebuilt pages and do not have to look things up in MySQL)? |
Currently, no IIIF links are in the rebuilt data at all (I should have specified that the rebuilt data will be modified when applicable, here in the cases of reingestion). However, yes, all pages (in canonical format) will have a valid IIIF URI, either in the original
Is there an existing approach to reOCRing or other uses for which you would need the page iiif URI? |
Oops, I meant canonical pages, sorry for the confusion. |
This issue is directly linked to issues #104 and #105, but a first investigation of the current situation shows that multiple importers are inconsistent in where and how they create the iiif links and coordinates. As a result, a general issue covering all importers seemed more appropriate. Sorry in advance, it is very long, but I felt that having the detailed situation written down somewhere would help to get a broader view.
Context
The two previous issues #104 and #105 highlighted that some importers generated invalid Issue JSONs for image content items. In particular the iiif-links were not correct, and depending on the importer, iiif-links or coordinates were misplaced in the content-item object in the output JSON.
Current state of the importer code and JSON files
1. BNF
- `iiif_link` contains the link to the JPG image (with suffix `{coords}/full/0/default.jpg` instead of `info.json`).
- `iiif_link` is placed inside the content item's metadata (`ci['m']['iiif_link']`).
- `c` is not present at all in the content item.
- Example file: `s3://canonical-data/excelsior/issues/excelsior-1910-issues.jsonl.bz2`.
- `iiif` maps to a "full" manifest, using `IIIF_MANIFEST_SUFFIX = "full/full/0/manifest.json"` (which yields different results compared to using `IIIF_MANIFEST_SUFFIX = "info.json"`).
- Example from `s3://canonical-data/excelsior/pages/excelsior-1910/excelsior-1910-11-16-a-pages.jsonl.bz2`: `"iiif": "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f1/full/full/0/manifest.json"` (compared to `"https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46000007/f1/info.json"`).

2. BNF-EN
- `iiif_link` contains the link to the JPG image (with suffix `{coords}/full/0/default.jpg` instead of `info.json`).
- `iiif_link` is placed outside the content item's metadata (`ci['iiif_link']`).
- `c` is placed inside the content item's metadata (`ci['m']['c']`).
- Example file: `s3://canonical-data/jdpl/issues/jdpl-1814-issues.jsonl.bz2`.
- `iiif` maps to the JPG image of the full page, using `IIIF_SUFFIX = "full/full/0/default.jpg"`.
- Example from `s3://canonical-data/jdpl/pages/jdpl-1814/jdpl-1814-05-27-a-pages.jsonl.bz2`, for the page corresponding to the above example: `"iiif": "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/full/full/0/default.jpg"` (compared to `"https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4209172/f4/info.json"`).

3. RERO
- `iiif_link` contains the link to the JPG image (with suffix `{coords}/full/0/default.jpg` instead of `info.json`).
- `iiif_link` is placed outside the content item's metadata (`ci['iiif_link']`).
- `c` is placed inside the content item's metadata (`ci['m']['c']`).
- Example file: `s3://playground-pauline/VHT/issues/VHT-1939-issues.jsonl.bz2` (recently generated canonical data using master branch).
- The files in `s3://canonical-data` don't have the same object structure, and no code performing a patch or correcting this issue was found. The differences are the following:
  - `iiif_link` contains the link to the manifest (with suffix `info.json`), and is placed inside the content item's metadata (`ci['m']['iiif_link']`).
  - `c` is placed outside the content item's metadata (`ci['c']`).
  - Example file: `s3://canonical-data/VHT/issues/VHT-1939-issues.jsonl.bz2`.
- `iiif` maps to a manifest, using the page's `id` as suffix to the impresso URL endpoint.
- Example from `s3://canonical-data/VHT/pages/VHT-1939/VHT-1939-01-06-a-pages.jsonl.bz2`, for the page corresponding to the above example: `"iiif": "https://impresso-project.ch/api/proxy/iiif/VHT-1939-01-06-a-p0001"` (this is unchanged in the recently generated data).

4. BNL/Lux
- `iiif_link` contains the link to the JSON manifest (with suffix `info.json`).
- `iiif_link` is placed inside the content item's metadata (`ci['m']['iiif_link']`).
- `c` is placed outside the content item's metadata (`ci['c']`).
- Example file: `s3://canonical-data/tageblatt/issues/tageblatt-1913-issues.jsonl.bz2`.
- `iiif` maps to the page's JSON manifest. It should also be corrected (issue "change in BNL's ARK-based URLs" #103).
- Example from `s3://canonical-data/VHT/pages/VHT-1939/VHT-1939-01-06-a-pages.jsonl.bz2`, for the page corresponding to the above example: `"iiif": "https://iiif.eluxemburgensia.lu/iiif/2/ark:%2f70795%2ft8mg9c%2fpages%2f3/info.json"` (corrected version: `"https://iiif.eluxemburgensia.lu/image/iiif/2/ark:70795%2ft8mg9c%2fpages%2f3/info.json"`).

5. SWA
- `iiif` uses the filename as suffix, using `IIIF_ENDPOINT_URL = "https://ub-sipi.ub.unibas.ch/impresso"`, but it does not seem to work.
- Example from `s3://canonical-data/arbeigeber/pages/arbeitgeber-1907-01-05-a-pages.jsonl.bz2`: `"iiif": "https://ub-sipi.ub.unibas.ch/impresso/BAU_1_000059110_1907_0001"`.

6. Olive
- `c` is placed outside the content item's metadata (`ci['c']`).
- `iiif` maps to a manifest, using the page's `id` as suffix to the impresso URL endpoint. Same as for RERO.

7. TETML (FedGaz)
- `iiif` maps to a manifest, using the page's `id` as suffix to the impresso URL endpoint. Same as for RERO.

Summary
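| Importer | `iiif_link` placement | `c` placement |
| --- | --- | --- |
| BNF | `ci['m']['iiif_link']` | not present |
| BNF-EN | `ci['iiif_link']` | `ci['m']['c']` |
| RERO (master branch) | `ci['iiif_link']` | `ci['m']['c']` |
| RERO (`s3://canonical-data`) | `ci['m']['iiif_link']` | `ci['c']` |
| BNL/Lux | `ci['m']['iiif_link']` | `ci['c']` |
| Olive | (not stated) | `ci['c']` |

What is expected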
In addition, this issue suggests the coordinates should be inside the metadata, and a test was created to enforce it for RERO.
As for the Page iiif links, I have not found which module uses them, so I don't know whether they should map to the manifest or to the image of the full page.
Proposed Approach
(feedback welcome)
- Ideally, no canonical data that works without issues is re-ingested, unless really necessary.
- The code of all relevant importers is aligned with the BNL situation, as it is the one expected in the downstream tasks. Additionally, RERO's canonical data inside the bucket `s3://canonical-data` already matches it, which avoids substantial re-running.
- `s3://canonical-data` follows the same structure.
- A function can be implemented in the rebuilder to accommodate or check for some of the cases here, so as to prevent errors.
- Only BNF and BNF-EN are re-ingested to correct the various issues (in particular with the page iiif link) and unify the importers. Since they are new to this release and represent much less data, this should be more doable.
- SWA is patched to fix the page links.
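The accommodating function mentioned in the approach could look roughly like the sketch below. It is only an illustration under assumptions: the function names are hypothetical, and the target layout is the BNL one described above (`iiif_link` inside `ci['m']`, `c` directly under `ci`). The keys `m`, `c` and `iiif_link` are the ones observed in the importer outputs.

```python
from typing import Optional


def get_iiif_link(ci: dict) -> Optional[str]:
    """Find iiif_link whether it sits in ci['m'] or directly in ci."""
    return ci.get("m", {}).get("iiif_link") or ci.get("iiif_link")


def get_coords(ci: dict) -> Optional[list]:
    """Find the coordinates whether they sit in ci['c'] or ci['m']['c']."""
    if "c" in ci:
        return ci["c"]
    return ci.get("m", {}).get("c")


def normalize_image_ci(ci: dict) -> dict:
    """Return a copy of an image content item in the BNL-style layout."""
    # Copy everything except the two properties that may be misplaced.
    out = {k: v for k, v in ci.items() if k not in ("iiif_link", "c", "m")}
    meta = {k: v for k, v in ci.get("m", {}).items() if k != "c"}
    link = get_iiif_link(ci)
    if link is not None:
        meta["iiif_link"] = link
    out["m"] = meta
    coords = get_coords(ci)
    if coords is not None:
        out["c"] = coords  # stays absent when the importer never wrote it (BNF case)
    return out
```

For a BNF-EN or RERO-style item such as `{"m": {"id": "x", "c": [1, 2, 3, 4]}, "iiif_link": "u"}`, the result has `iiif_link` under `m` and `c` at the top level; a BNF-style item without coordinates is returned without a `c` key, so the missing value can still be detected downstream.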
A few questions remain: @mromanello @e-maud
s3://canonical-data
)?