3. Parsing Tika Tesseract Output Inside of Solr via StatelessScriptUpdateProcessorFactory

Eric Pugh edited this page Nov 6, 2019 · 2 revisions

In our first two spikes, we ended up doing a lot of the processing work outside of Solr. It gave me a chance to polish my PowerShell skills, which was cool, and gave me a nice appreciation for having a scripting language that works on both Windows and Unix systems!

However, sometimes you want the search engine to be a black box. I take documents and put them into the black box, and now they are searchable. We added some additional complexity by moving to a parent/child relationship for the PDFs, because each page ended up being its own document in Solr. This meant another parsing script that dumped more intermediate-format documents.

What if we could do everything inside of Solr? What if we could take the output from Tika with the Tesseract generated OCR content, and then convert that to a set of parent/child documents that are indexed into Solr?

Time for one of my favorite Get Out of Jail Free cards from Solr: the awkwardly named StatelessScriptUpdateProcessorFactory, which lets us put all that parsing logic into a script that runs inside of Solr. I've used this a couple of times in the past, but would it work with the extraction code?

We started by setting up a custom extract endpoint, but this time included an update.chain parameter:

  <requestHandler name="/update/speeches"
                  class="solr.extraction.ExtractingRequestHandler">
    <str name="parseContext.config">parseContext.xml</str>
    <lst name="defaults">
      <str name="uprefix">attr_</str>
      <str name="multipartUploadLimitInKB">20480</str> <!-- Limit to 20 MB PDF -->
      <str name="update.chain">process-speech-from-extracted-text</str>
    </lst>
  </requestHandler>

The update.chain parameter is what lets us override the normal execution flow and inject the call to our scripting step:

  <updateRequestProcessorChain name="process-speech-from-extracted-text">
    <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <arr name="script">
        <str name="script">process-speech.js</str>
      </arr>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
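
For orientation, here is a minimal sketch of what the skeleton of a script like process-speech.js might look like. It is not the actual script: it follows the shape of the sample update script that ships with Solr, where Solr calls processAdd for each added document and supplies a logger global. The log message text is made up for illustration.

```javascript
// Sketch of an update script skeleton (not the real process-speech.js).
// `logger` is a global that Solr provides to the script at runtime.

function processAdd(cmd) {
  var doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
  var id = doc.getFieldValue("id");
  logger.info("process-speech#processAdd: id=" + id);
  // Parsing of the extracted text would go here.
}

// The remaining hooks are intentionally empty in this sketch.
function processDelete(cmd) {}
function processMergeIndexes(cmd) {}
function processCommit(cmd) {}
function processRollback(cmd) {}
function finish() {}
```

The empty functions are there because Solr looks up a function by name for each kind of update command, so it seems safest to define all of them even when only processAdd does any work.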

This then lets us call a custom JavaScript script to deal with the text (though we could use other languages, like Ruby).

I'll let you go through the process-speech.js script yourself. The big thing is that the extract handler, for some odd reason, gives us not XML content, but the XML content stripped of its wrapping < and > tags! So we can't use the XML parsing logic that we used previously. Instead, we do lots of string splitting!
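
To give a flavor of the string-splitting approach, here is a small sketch. It is not taken from process-speech.js: the splitPages helper and the idea of a single page marker are assumptions for illustration, and the marker string itself is a placeholder, so inspect your own extraction output to find what actually separates pages.

```javascript
// Split tag-stripped extraction output into per-page chunks.
// `pageMarker` is a hypothetical delimiter passed in by the caller.
function splitPages(text, pageMarker) {
  var chunks = String(text).split(pageMarker);
  var pages = [];
  for (var i = 0; i < chunks.length; i++) {
    // Collapse runs of whitespace and drop empty chunks.
    var chunk = chunks[i].replace(/\s+/g, " ").trim();
    if (chunk.length > 0) {
      pages.push(chunk);
    }
  }
  return pages;
}

// Example with a made-up marker:
// splitPages("PAGE one  two PAGE three", "PAGE") returns ["one two", "three"]
```

The sketch sticks to plain loops and var declarations so that it should run on older JavaScript engines as well.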

Other things to note:

  • We can invoke any Java method we want by prepending Packages. to the fully qualified class name, as in this example of Base64 encoding: logger.info("Here comes some base 64: " + Packages.org.apache.solr.common.util.Base64.byteArrayToBase64(id.getBytes()));

  • Shockingly, we can create a Solr input document: var childDoc = new Packages.org.apache.solr.common.SolrInputDocument(); and then add it to our parent document just by calling the Java method on the object: doc.addChildDocument(childDoc);

  • The process-speech.js script is parsed at startup, so if you have syntax errors or incompatible JavaScript, you will get an error. While not positive, I believe the script runs on the JavaScript engine bundled with the JVM (Nashorn on Java 8 through 14, Rhino on older releases). Nashorn targets ECMAScript 5.1 with only partial ES6 support, so stay with the simplest version of JavaScript you can.

  • It's nice that you can deploy this via your ZooKeeper script, and it would be interesting to see what other use cases there might be for this.
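
The Packages. and child document notes above can be sketched together as follows. This is an illustration rather than the code from process-speech.js: the addPageChildren helper, the "content" field name, and the "_page_N" id scheme are all assumptions, and the snippet relies on the Packages global that the Java-hosted script engine provides inside Solr.

```javascript
// Attach one child document per page to the parent document.
// Field names ("id", "content") and the id scheme are assumptions.
function addPageChildren(doc, pages) {
  var parentId = doc.getFieldValue("id");
  for (var i = 0; i < pages.length; i++) {
    // Create a Java SolrInputDocument from inside the script.
    var childDoc = new Packages.org.apache.solr.common.SolrInputDocument();
    childDoc.setField("id", parentId + "_page_" + (i + 1));
    childDoc.setField("content", pages[i]);
    doc.addChildDocument(childDoc);
  }
  return pages.length;
}
```

Inside processAdd this might be called as addPageChildren(cmd.solrDoc, pages), where pages is whatever array of page texts the string splitting produced.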