[FEATURE] Support Mirador inline search #17

Open
alxp opened this issue Jan 10, 2023 · 23 comments
Labels
enhancement New feature or request

Comments

@alxp
Contributor

alxp commented Jan 10, 2023

Overview of feature request

Mirador 3 includes an internal search function, but it requires a query endpoint. We need to determine the best way to add this to Islandora.

The UT Scarborough Islandora developers have implemented a fork of Islandora Mirador with annotation support. We should be able to pull from the annotations branch of their repo:

https://github.com/digitalutsc/islandora_mirador/tree/annotations

What kind of user is the feature intended for?
(Example user roles: Collections Manager, Developer, Systems Administrator, or User)

End user

What inspired the request?

Ongoing discussion of features needed for paged content.

What existing behavior do you want changed?

Extracted text action may be modified to make associating it with a given media easier.

Any brand new behavior do you want to add to Islandora?

An endpoint that serves search queries in a form compatible with Mirador's inline search.

Any related open or closed issues to this feature request?

@alxp alxp added the enhancement New feature or request label Jan 10, 2023
@alxp alxp mentioned this issue Jan 10, 2023
@kstapelfeldt
Member

We are ready to report back some progress, as we have an implementation of in-text search and highlighting using Simple Annotation Server on one of our production servers. You can see an example of what we have working by clicking here: https://memory.digital.utsc.utoronto.ca/61220/utsc11543?q=student - the active result is yellow in the viewer, and changes as you click through the list. Logged in users who are administrators can edit the text and save it back to the simple annotation store.

The connection with the Drupal node is preserved through a custom field that matches the annotation ID and the node ID. We have a Java converter that transforms Google Vision’s output JSON into the format required for IIIF search, and we assume that this pattern could be followed for things like hOCR. I’m attaching our rough diagram of the workflow in case it’s of interest. We’re happy to answer questions, and Kyle has updated the demo implementation:

https://github.com/digitalutsc/islandora_lite_docs/wiki/Mirador-Search-and-Annotations-(Prototype
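
To make the conversion step described above more concrete, here is a minimal Python sketch (not the Java converter itself, which is not shown in this thread). It assumes a Google Vision style word entry with a `description` and a `boundingPoly.vertices` list, and emits an Open Annotation of the kind a IIIF search response carries. The function names, the canvas URI and the annotation ID are hypothetical.

```python
from typing import Dict, List


def vertices_to_xywh(vertices: List[Dict[str, int]]) -> str:
    """Collapse a bounding polygon into an "x,y,w,h" string."""
    # Assumption: a coordinate may be missing when its value is 0,
    # so fall back to 0 instead of raising KeyError.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    x, y = min(xs), min(ys)
    return f"{x},{y},{max(xs) - x},{max(ys) - y}"


def word_to_annotation(word: dict, canvas_uri: str, anno_id: str) -> dict:
    """Build one Open Annotation that targets a region of a canvas."""
    xywh = vertices_to_xywh(word["boundingPoly"]["vertices"])
    return {
        "@id": anno_id,  # hypothetical identifier scheme
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {"@type": "cnt:ContentAsText", "chars": word["description"]},
        "on": f"{canvas_uri}#xywh={xywh}",
    }


# Made-up example word, shaped like an OCR word with four corner points.
EXAMPLE_WORD = {
    "description": "student",
    "boundingPoly": {
        "vertices": [
            {"x": 10, "y": 20},
            {"x": 110, "y": 20},
            {"x": 110, "y": 50},
            {"x": 10, "y": 50},
        ]
    },
}

EXAMPLE_ANNOTATION = word_to_annotation(
    EXAMPLE_WORD,
    "https://example.org/canvas/1",      # hypothetical canvas URI
    "https://example.org/annotation/1",  # hypothetical annotation ID
)
```

The same shape would apply to hOCR input; only the parsing of the bounding box would change.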

@kstapelfeldt
Member

kstapelfeldt commented Jan 11, 2023

[Image: Screen Shot 2023-01-11 at 1 48 59 PM] @kylehuynh205 @Natkeeran

@alxp
Contributor Author

alxp commented Jan 25, 2023

@wgilling @patdunlavey I mentioned in the committers call last week that there's a more generic plugin for highlighting hOCR in Solr that might be better than a secondary endpoint search in Mirador.

https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/

Looks like it would also require a second Solr index alongside Drupal but it would be a more unified Islandora experience and the demo does look slick. And it handles things like phrases across column / page boundaries which is really cool.
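
For orientation, a request against a Solr field handled by that plugin might be assembled as in the sketch below. As far as I know `hl.ocr.fl` is the plugin's parameter for naming the OCR field(s) to highlight, alongside the standard `hl=true`; the core URL, the field name `ocr_text` and the function name are hypothetical.

```python
from urllib.parse import urlencode


def build_ocr_highlight_query(solr_core_url: str, term: str, ocr_field: str) -> str:
    """Return a Solr select URL that asks for OCR highlighting on one field."""
    params = {
        "q": f"{ocr_field}:{term}",  # search inside the OCR field
        "hl": "true",                # enable highlighting
        "hl.ocr.fl": ocr_field,      # OCR field(s) the plugin should highlight
    }
    return f"{solr_core_url}/select?{urlencode(params)}"


# Hypothetical local core and field name.
EXAMPLE_QUERY_URL = build_ocr_highlight_query(
    "http://localhost:8983/solr/ISLANDORA", "student", "ocr_text"
)
```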

@patdunlavey
Contributor

@alxp funny, I just got to that solr-ocrhighlighting project by other means.

I've been reviewing the SimpleAnnotationServer approach, in particular utsc's wiki page, and from what I can see the SAS is simply a cache of pre-generated annotations which can be searched to return a list of matching annotations. The process for generating the annotations and loading them into that cache is extremely specific to utsc's workflow, and involves a lot of manual steps per OCR'd page. The primary attraction seems to be that very little custom code was needed to get it working. I'm sure the manual steps could be largely mechanized, but now you're writing code - and we have to solve for a much more general case.

I've been looking at how to get the extracted hOCR indexed by search_api_solr. The search_api_attachments module should help with this, but it doesn't follow reverse entity references. This issue suggests a possible workaround, though it seems awfully hinky. Would we need to custom code something just to make the contents of the extracted OCR available to index in solr?

I'm not sure how we would instruct solr to index the hOCR currently being generated from islandora_text_extraction. Presumably that's where the solr-ocrhighlighting idea could come in. It looks like Archipelago uses this library in defining an ocr solr field type. It's not clear to me where/how the library comes to be instantiated on the solr image, since it's just using a standard solr docker image.

What indicates to you that it would need to utilize a second solr index?

@DiegoPino

@patdunlavey @alxp to put some context around @patdunlavey's statements here.

Archipelago has been using the plugin that Johannes from the Bavarian State Library and his team developed for almost 3 years already. We worked with their team and our own folks to architect this deeply into our system, testing it and expanding the idea into a more complex and generic way of handling the many-to-one needs that come from different sources of OCR/hOCR. Repositories in our Archipelago community have already hit over 700K documents with real-time capabilities.

But to allow this to really work, the Solr side of things @patdunlavey is pointing to is just one of the factors. Drupal is terrible at handling that number of entities, so we created a whole ecosystem around an entity-less custom Search API Data Source https://github.com/esmero/strawberryfield/blob/1.1.0/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php

that generates a different type of Solr document (same index, other indexes, multiple indexes, etc.) just for this case, connected to Drupal data via a native data type that simulates what an entity would do:

https://github.com/esmero/strawberryfield/blob/1.1.0/src/TypedData/StrawberryfieldFlavorDataDefinition.php

And this is just the start. We tap into the Search API queries (modifying them at the solarium level) to allow highlights to work, disabling the native highlighter too (both cannot co-exist), and have a TON of event subscribers that track the update/removal needs of these types of documents, plus a hierarchical backend processing plugin system to extract OCR https://github.com/esmero/strawberry_runners/tree/0.5.0 amongst many other types of data (e.g. WACZ full text and URLs, XML, simple text) that go into Strawberry Flavors (that is how we name this special thing). If you dig deeper you will see much more integration, like aggregated fields that harvest from Solr.

Full text search is driven by custom Controllers, and in our recent code we do front-end/back-end matching of our Dynamic IIIF Manifests too, to allow IIIF Search API capabilities. Annotations are handled separately (the plugin that is mentioned deals with MiniOCR or ALTO only) and are already embedded in each ADO (Archipelago Digital Object) JSON; we have also done joint work with the Annotorious team to enable that.

In other words, this is a totally different architecture and also implies tons of code to make it work. If you decide to go this way, and decide to use code from our system, I would appreciate it if you kept attribution (what we have is not test code and examples, it's production code many institutions are using); researching and developing this into a production-ready system in our community was a big communal effort. I would also encourage you to test Archipelago to get an idea of what is implied. Thanks a lot.

@patdunlavey
Contributor

I spoke with @ajstanley who advocated for the importance of ocr'd data being human-editable. If we accept that premise, then it may argue in favor of storing the hOCR data in a drupal field on the media. Then a field widget could, in theory, be designed for people to edit the hOCR data. I can imagine dropping in a tool like this.

@patdunlavey
Contributor

patdunlavey commented Mar 24, 2023

Some updates to what I've been doing on this.

I added a field on the "Original File" media for storing the raw hocr text ("field_editable_hocr_text"), and added this field to the generate_hocr_extracted_text action so that the hocr output goes both to the file field and this text field:
image
(Note that this results in hocr text with embedded "<br />" tags, which is something I need to fix. For now, I just fix the malformed xml manually.)
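
Until the field stops adding them, the manual cleanup could be scripted. The following is only a sketch of one way to do it, assuming the stray tags are plain `<br>` / `<br />` elements and that hOCR itself never uses them; the function name is hypothetical.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def strip_br_tags(hocr: str) -> str:
    """Remove line-break tags that the text field added to the raw hOCR."""
    return BR_TAG.sub("", hocr)


# Made-up fragment of hOCR as it might come back from the text field.
EXAMPLE_DIRTY = (
    "<span class='ocr_line' title='bbox 1702 1729 2035 1754'><br />\n"
    "<span class='ocrx_word' title='bbox 1830 1730 1860 1747'>Bix</span><br />\n"
    "</span>"
)
EXAMPLE_CLEAN = strip_br_tags(EXAMPLE_DIRTY)
```

The line breaks themselves are kept, so the cleaned text should match what the hOCR generator originally wrote.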

I have solr ocrHighlighting working on my local. This took some work which I won't go into every detail on at the moment. The main points:

  • In our Makefile, we get the ocrhighlighting library like this:
	docker-compose exec -T solr with-contenv bash -lc "rm -rf /opt/solr/server/solr/ISLANDORA /opt/solr/server/solr/contrib/ocrhighlighting/lib/solr-ocrhighlighting.jar"
	docker-compose exec -T drupal with-contenv bash -lc "for_all_sites create_solr_core_with_default_config"
	curl -k -L https://github.com/dbmdz/solr-ocrhighlighting/releases/download/0.7.2/solr-ocrhighlighting-0.7.2.jar > data/solr-ocrhighlighting.jar
	docker-compose exec -T solr with-contenv bash -lc "mkdir -p /opt/solr/server/solr/contrib/ocrhighlighting/lib"
	docker cp data/solr-ocrhighlighting.jar $$(docker-compose ps -q solr):/opt/solr/server/solr/contrib/ocrhighlighting/lib/solr-ocrhighlighting.jar 
	docker-compose exec -T solr with-contenv bash -lc "chown -R solr:solr /opt/solr/server/solr/contrib/ocrhighlighting"
  • We created modified versions of the solrconfig.xml and schema.xml files, which we load into solr using a similar technique in our Makefile. I'm attaching those:
    solrconfig-schema.zip
  • There was some troubleshooting with solr versions too that I've lost track of (it's in our docker-compose.yml).
  • I added a search api solr field definition for the hocr field:
    search_api_solr.solr_field_type.text_ocr_und_7_0_0.zip
  • I defined a field in the search api index that indexes this:
  field_editable_hocr_text:
    label: 'HOCR Text'
    datasource_id: 'entity:node'
    property_path: 'search_api_reverse_entity_references_media__field_media_of:field_editable_hocr_text'
    type: 'solr_text_custom:ocr_highlight'

Phew!

With this all in place and working, I can see this in my test solr query result:

  "ocrHighlighting":{
    "4o0hnj-default_solr_index-entity:node/99:en":{
      "tcocr_highlightm_X3b_en_field_editable_hocr_text":{
        "snippets":[{
            "text":"bands, gave <em>Bix</em> his first big job,",
            "score":62655.01,
            "pages":[{
                "id":"page_1",
                "width":2352,
                "height":2810}],
            "regions":[{
                "ulx":1702,
                "uly":1729,
                "lrx":2035,
                "lry":1754,
                "text":"bands, gave <em>Bix</em> his first big job,",
                "pageIdx":0}],
            "highlights":[[{
                  "ulx":128,
                  "uly":1,
                  "lrx":158,
                  "lry":18,
                  "text":"Bix",
                  "parentRegionIdx":0}]]}],
        "numTotal":10}}},

My next step is to write a controller to perform a search and return a list of IIIF annotations! Then, in theory, we should be able to plug that link into our IIIF manifest.
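
As a rough sketch of the coordinate handling such a controller would need (in Python here, not the eventual Drupal controller): in the sample response above, each highlight carries a `parentRegionIdx` and its `ulx`/`uly` values are much smaller than the region's, which suggests highlight coordinates are relative to the parent region. Under that assumption, absolute page coordinates are the region offset plus the highlight offset. The function names and the canvas URI are hypothetical.

```python
def highlight_to_xywh(snippet: dict, highlight: dict) -> str:
    """Convert one highlight box to an absolute "x,y,w,h" string."""
    # Assumption: highlight coordinates are relative to their parent region.
    region = snippet["regions"][highlight["parentRegionIdx"]]
    x = region["ulx"] + highlight["ulx"]
    y = region["uly"] + highlight["uly"]
    w = highlight["lrx"] - highlight["ulx"]
    h = highlight["lry"] - highlight["uly"]
    return f"{x},{y},{w},{h}"


def snippet_targets(snippet: dict, canvas_uri: str) -> list:
    """Return one canvas fragment URI per highlighted box in a snippet."""
    targets = []
    for group in snippet["highlights"]:  # one group per matched term/phrase
        for highlight in group:
            xywh = highlight_to_xywh(snippet, highlight)
            targets.append(f"{canvas_uri}#xywh={xywh}")
    return targets


# The snippet from the query result above, trimmed to the fields used here.
SAMPLE_SNIPPET = {
    "text": "bands, gave <em>Bix</em> his first big job,",
    "pages": [{"id": "page_1", "width": 2352, "height": 2810}],
    "regions": [
        {"ulx": 1702, "uly": 1729, "lrx": 2035, "lry": 1754, "pageIdx": 0}
    ],
    "highlights": [
        [{"ulx": 128, "uly": 1, "lrx": 158, "lry": 18,
          "text": "Bix", "parentRegionIdx": 0}]
    ],
}

# Hypothetical canvas URI for the page.
SAMPLE_TARGETS = snippet_targets(SAMPLE_SNIPPET, "https://example.org/canvas/page_1")
```

For the sample, this yields a 30 by 17 pixel box at (1830, 1730) on the page.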

@patdunlavey
Contributor

@ajstanley might it make sense to create a custom hocr field type/widget/formatter? For starters, it could solve the embedded "<br />" problem. My thinking is that initially we would just provide a plain text widget, and then later add in an hOCR editor like this one: https://github.com/GeReV/hocr-editor-ts. Is this a project I could entice you to take on?

Would this be part of the islandora_mirador module? In any case, at the very least, we would need a PR against islandora_text_extraction to enable using our new field type here.

@ajstanley
Contributor

@patdunlavey I can absolutely take that widget on.
I've got https://github.com/GeReV/hocr-editor-ts working in a demo environment (it needed a LOT of updating to compile) but is not really useful in its current incarnation.

@patdunlavey
Contributor

@ajstanley Great to hear that you got the hocr editor working, if only after a fashion. Do you think it can be made useful, or do you think you need to look elsewhere?

@ajstanley
Contributor

@patdunlavey I think it's a non-starter. You can see my working version here, but it's going to need a whole lot of deep-tissue massaging to be useful.
This app builds the hOCR, but we're doing that already. The editing seems really clunky.

@patdunlavey
Contributor

Do you see any path forward on this? I'd say that the minimum viable (initial) product is a field type that can hold the output of hOCR. We could just modify islandora_text_extraction to permit writing to a plain text field, but I'm thinking that having a special hOCR field type would make other aspects of the overall project easier.

@ajstanley
Contributor

The text field that's there already allows for correcting text, we could make the hOCR human-editable as well, but that would be an onerous undertaking for the unfortunate grad student who was saddled with it.

If we can start with having the hOCR viewable, but pull it from a field on the media rather than from a saved file, we can add editing functionality later.

Baby steps...

@patdunlavey
Contributor

As I described previously, I added a field to the media type to store the hOCR. This is in addition to the file field that @alxp's hocr text overlay work uses. I added it as a long text field for two reasons: 1. a text field is what search api can index using a reverse entity reference from the node to the field on the media, and 2. "text_long" is what islandora_text_extraction dictates here (thus my suggestion that we make a PR to change that). The "text_long" field would be fine with me if it didn't insist on inserting <br /> for new lines in the saved text.

Having a field type that islandora_mirador defines, my thinking goes, would permit us to not have to design hacky logic to determine what media field contains our hocr, though we could also avoid that by just making the source field name configurable in the islandora_mirador config page, which may be best in any case. So maybe I'm getting talked out of the need for a special field type, at least as long as the wysiwyg hocr editing tool is out of scope.

@patdunlavey
Contributor

patdunlavey commented Apr 5, 2023

I have a not-yet-fully-tested version of a IIIF Search API endpoint working. It generates AnnotationLists when given a node id of the page or paged content node and a search term.
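
For readers unfamiliar with the response format, the sketch below shows the general shape of an AnnotationList as I understand it from the IIIF Content Search API 1.0. It is written in Python for brevity and is not the code in the fork; the function name, the URL pattern and the `#hit-N` annotation IDs are hypothetical.

```python
def build_annotation_list(search_url: str, hits: list) -> dict:
    """Wrap (text, canvas target) pairs in a IIIF AnnotationList."""
    resources = []
    for index, (text, target) in enumerate(hits):
        resources.append({
            "@id": f"{search_url}#hit-{index}",  # hypothetical ID scheme
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {"@type": "cnt:ContentAsText", "chars": text},
            "on": target,  # canvas URI with an #xywh= fragment
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": search_url,
        "@type": "sc:AnnotationList",
        "resources": resources,
    }


# Hypothetical endpoint URL for node 99 and the search term "Bix".
EXAMPLE_LIST = build_annotation_list(
    "https://example.org/node/99/search?q=Bix",
    [("Bix", "https://example.org/canvas/page_1#xywh=1830,1730,30,17")],
)
```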

Here's my fork of islandora_mirador. There's a lot of setup involved, which I tried to fully document in the README.

The primary missing piece that I'm aware of is the part of \Drupal\islandora_iiif\Plugin\views\style\IIIFManifest::render that needs to provide the search block in which our search endpoint will be instantiated.
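
To illustrate what that block would look like in the generated manifest, here is a hedged Python sketch of adding a Content Search 1.0 service entry to a Presentation 2 manifest. The context and profile URIs are the ones I recall from the IIIF Content Search API 1.0; the function name and the example URLs are hypothetical, and the real change would be made in PHP inside the views style plugin.

```python
import copy

SEARCH_CONTEXT = "http://iiif.io/api/search/1/context.json"
SEARCH_PROFILE = "http://iiif.io/api/search/1/search"


def add_search_service(manifest: dict, search_url: str) -> dict:
    """Return a copy of the manifest that advertises a search service."""
    updated = copy.deepcopy(manifest)
    existing = updated.get("service", [])
    if isinstance(existing, dict):
        # A manifest may carry a single service object instead of a list.
        existing = [existing]
    existing.append({
        "@context": SEARCH_CONTEXT,
        "@id": search_url,
        "profile": SEARCH_PROFILE,
    })
    updated["service"] = existing
    return updated


# Hypothetical manifest stub and search endpoint URL.
EXAMPLE_MANIFEST = {
    "@id": "https://example.org/node/99/manifest",
    "@type": "sc:Manifest",
}
EXAMPLE_WITH_SEARCH = add_search_service(
    EXAMPLE_MANIFEST, "https://example.org/node/99/search"
)
```

A viewer that supports content search looks for a service with this profile and sends its `q` parameter to the `@id` URL.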

@alxp @ajstanley @dmer @Islandora/committers

@adam-vessey
Contributor

adam-vessey commented Apr 5, 2023

If we're implementing a IIIF Content Search API endpoint, does it really belong as a part of Mirador (or rather, islandora_mirador), specifically? Seems more like a IIIF thing, no? Like, might belong more so in the islandora_iiif module proper? Or some other associated module?

Looking at the comparison, there's a few things to highlight:

@seth-shaw-asu
Member

If we're implementing a IIIF Content Search API endpoint, does it really belong as a part of Mirador (or rather, islandora_mirador), specifically? Seems more like a IIIF thing, no? Like, might belong more so in the islandora_iiif module proper? Or some other associated module?

I second the idea of pushing these changes into islandora_iiif. The only bit that looks mirador specific is the mirador config form.

@mjordan
Contributor

mjordan commented Apr 5, 2023

I agree. I think the IIIF Content Search API has uses outside of Mirador and should be implemented separately.

@DiegoPino

DiegoPino commented Apr 6, 2023

Hey @patdunlavey and the @Islandora/committers here. I'm raising a red flag 🟥

OSS does not mean copy and paste without attribution. You all know this. The shared work here is heavily "based" (from the devops to the implementation) on our own research, tested code and production implementations (years old already), not to mention that, Pat, you are part of the Archipelago community and you had the chance to test it, use it in production and even get 1:1 with us about how it works. Not even the variable names were changed, nor many of my inline comments.

e.g patdunlavey@9d9269c is more than heavily based on https://github.com/esmero/strawberryfield/blob/573ffa44a369ad68c59a92b2746258c2671ef13f/src/Controller/StrawberryfieldFlavorDatasourceSearchController.php#L189

And this 2.x...patdunlavey:islandora_mirador:ISSUE-17-inline-search#diff-74bee3a8afb13b6345660264b05398e816ce69bff9b8d3d26a45682f92bb8c44R91-R110 (except for the "mysterious why" comment) is 1:1 to
https://github.com/esmero/strawberryfield/blob/573ffa44a369ad68c59a92b2746258c2671ef13f/strawberryfield.module#L333-L374

But on your side of things, you should also check on these things, the fact that there are comments in https://github.com/esmero/strawberryfield/blob/1.1.0/strawberryfield.module#L333-L374 replacing mine about this not being understood

 * It's a mystery as to why it should be necessary to alter the solarium query in order to add the
 * highlight parameters. We should be able to add them inside the search_api query build using the
 * `solr_param_` method to inject solarium parameters: https://git.drupalcode.org/project/search_api_solr/-/blob/4.x/src/Plugin/search_api/backend/SearchApiSolrBackend.php#L1605-1610
 * However that is not working, and so we're reduced to this ugly alter hook.

Means you (the we in that comment) are not copying the why, and the lack of research (and why I wrote it like that) implies you might be even copying bugs.

And I can keep going.

What really bothers me here is the idea of "I did the research and found out" instead of being clear of where this comes from. If this was a contribution coming from an individual not representing an institution I would be less concerned and even let a few of these pass, but this is not.

Attribution in GPL (we are V3) is not optional and this ethically also affects basically every person involved on our side, @giancarlobi worked on this, @alliomeria worked on this. Community members doing heavy work on testing, reading docs, re-testing, indexing, implementing, refining. It is a community issue of obscuring efforts. Not cool, people.

I want to hear your reactions please.

@giancarlobi

Thanks @DiegoPino for this post, with which I fully agree, mainly for its defense of what a truly open-source community is. Dear Islandora friends, this is the reason why I abandoned you (I know, not a big loss for you): your current idea of community is far away from the one I met in Arcidosso in 2013. Good luck.

@patdunlavey
Contributor

Morning everybody. I'm just catching up on this. Well I do seem to have stepped in it, big time! In my meager defense, the code I've been working on - which to be clear, I gratefully acknowledge that I used Archipelago code and its underlying research as my starting point - is very much a WIP and at this point, barely a proof of concept for part of the functionality that this issue proposes. I'm nowhere near proposing a PR (the code in question is currently on a task branch of my fork of this islandora_mirador module). It is and has been my intention, if this does result in my creating a PR, to run it by @DiegoPino first to get what I hope will be his blessing and, in any case, ensure that he and others are properly credited. I sincerely apologize for the bad feeling that my premature sharing of this code has generated and I promise to be more mindful going forward.

@patdunlavey
Contributor

Thank you @adam-vessey @seth-shaw-asu and @mjordan for your comments. Skipping for the moment your very helpful code-review comments (though I want to clarify that code review is premature at this point), it seems clear that there is a consensus that it makes most sense to re-target this issue and solution (assuming we find one we are all happy with) to the islandora_iiif module. Is that an accurate reading? @alxp do you agree with that?

@alxp
Contributor Author

alxp commented Apr 7, 2023

Hi @patdunlavey , I definitely agree that anything primarily relating to IIIF and not specific to Mirador should be in islandora_iiif.

That module currently lives in the main Islandora module; it might be helpful to pull it out like we did with islandora_mirador, but that is not necessary.


9 participants