Add Japanese Vertical Support Branch for Tesseract and Ocrmypdf OCR #2505

Merged
merged 18 commits into from
Apr 16, 2024

tenpai-git
Contributor

@tenpai-git tenpai-git commented Feb 21, 2024

Hi @eikek - this is to resolve #2445 and add Japanese vertical support identified for use with the tesseract and ocrmypdf dependencies on Docspell. I went with JpnVert for the name.

I'm new to Elm and Scala, but I was able to install them, run sbt fix locally, and perform most of the steps described in the add language documentation. Since JpnVert/vertical Japanese is not its own "language", this is a bit of a special case. If there are any outstanding issues, please let me know and I'll do my best to resolve them.

I have a couple of questions before merging:

  1. Because the base Japanese language (horizontally) is already included, I added some comments where I felt there were redundancies and instead tried to reference the existing code for tests and dates. Similarly, Solr only mentions Japanese vertical search once, where one of the special options is not available in vertical search. This leads me to believe the existing Japanese search covers both Japanese (Horizontal) and Japanese (Vertical). Should the Solr/test section still be duplicated for other reasons? I can update if needed.

  2. The main point of this commit is to have Tesseract select jpn_vert for the {{lang}} variable if JpnVert is selected in the menu. However, we also need to add -c preserve_interword_spaces=1 for vertical languages. Where is the best place to add this conditional? I think users will want it on by default for any vertical language.

    # To convert image files to PDF files, tesseract is used. This
    # also extracts the text in one go.
    tesseract = {
      command = {
        program = "tesseract"
        args = [
          "{{infile}}",
          "out",
          "-l",
          "{{lang}}",
          "-c",
          "preserve_interword_spaces=1",
          "pdf",
          "txt"
        ]

I am also concerned about a possible conflict with ocrmypdf: for vertical text it still needs jpn in {{lang}}, not jpn_vert. And if we're making a conditional for this, we might as well include "--output-type", "pdf" too.

    # The `--skip-text` option is necessary to not fail on "text" pdfs
    # (where ocr is not necessary). In this case, the pdf will be
    # converted to PDF/A.
    ocrmypdf = {
      enabled = true
      command = {
        program = "ocrmypdf"
        args = [
          "-l", "{{lang}}",
          "--skip-text",
          "--deskew",
          "--output-type", "pdf",
          "-j", "1",
          "{{infile}}",
          "{{outfile}}"
        ]

Also, how would you like to handle the way {{lang}} is passed and how the ISO codes are referenced?

I will commit additional documentation updates to the current docs once I know how you want to go about this. And thank you again for your work on this. The work here will be a good precursor to vertical support for all languages that use it, and it will provide good defaults.

(・∀・)b

Edit: This is in addition to the upgrade to PDFBox v3.0 which is already merged.

@eikek
Owner

eikek commented Feb 23, 2024

Thank you very much! I need a bit of time to look at it, as I'm currently too busy :/

@tenpai-git
Contributor Author

tenpai-git commented Feb 24, 2024

Understood @eikek - thanks for taking a look at this.

To make things a little simpler, the two things I most need you to look at are:

  • How to include "preserve_interword_spaces=1" for tesseract; and
  • How {{lang}} for ocrmypdf can default to normal Japanese.

I think the Pull Request will work well once those two points are addressed, and others can easily build on it for Chinese Vertical and Korean Vertical.

For the end user, I think it's important that it's a separate language option so they can choose the setting on Docspell Share/mobile devices.

Owner

@eikek eikek left a comment

Thanks for your efforts! To your questions:

  1. I think this is a good starting point and I guess it's good to
    go and we can always improve further. I don't know how SOLR handles
    Japanese vertical or horizontal :) So if you think it is ok, then
    we should go with this.

  2. Adding -c preserve_interword_spaces=1 is a bit tricky. I have to
    think about how to add more options depending on language.
    I'm fine adding --output-type pdf by default, if that then
    creates a PDF/A file. I think the default is pdfa, which is the
    recommended default. Then if users want it by default, it's not
    hard to change in their configurations. So I think I'm a bit
    undecided. :-)

Let me know if you want to merge this - since it doesn't interfere
with other things, I can merge if you want.

modules/webapp/src/main/elm/Messages/Data/Language.elm Outdated Show resolved Hide resolved
@eikek
Owner

eikek commented Mar 3, 2024

Ah wait, I think I misunderstood the --output-type thing: You mean to only include it for jpn_vert - right? Sure that is fine, of course.

@tenpai-git
Contributor Author

tenpai-git commented Mar 3, 2024

> Thanks for your efforts! To your questions:
>
> 1. I think this is a good starting point and I guess it's good to
>    go and we can always improve further. I don't know how SOLR handles
>    Japanese vertical or horizontal :) So if you think it is ok, then
>    we should go with this.
>
> 2. Adding `-c preserve_interword_spaces=1` is a bit tricky. I have to
>    think about how to add more options depending on language.
>    I'm fine adding `--output-type pdf` by default, if that then
>    creates a PDF/A file. I think the default is pdfa, which is the
>    recommended default. Then if users want it by default, it's not
>    hard to change in their configurations. So I think I'm a bit
>    undecided. :-)
>
> Let me know if you want to merge this - since it doesn't interfere with other things, I can merge if you want.

Before merging, I am just concerned about the {{lang}} variable.

Tesseract has a dedicated Japanese vertical package jpn_vert. However, ocrmypdf does not have a dedicated vertical package and uses jpn.

If they are using the same variable from the ISO codes, then ocrmypdf should fail, because jpn_vert is not a language option it can select, unless I'm misunderstanding how it selects the package. They both use {{lang}}, but in this case the values should be different.

-c preserve_interword_spaces=1 and --output-type pdf can be added later, and I can keep testing the options with a nightly build. We can merge without them for now, but I think it would be wise to add (Beta) to the entry in the language menu to temper user expectations. It's nice to get the kanji out of a scan even if there are spaces.

-c preserve_interword_spaces=1 will really be important for readability of any CJK (Chinese Japanese Korean) vertical scan in the future, though. It's extremely frustrating to have to go back and remove it for Japanese Vertical documents before scanning but not Japanese (Horizontal) or English documents. I think the best solution would be to be able to create options in the config to override special settings for specific languages. We could then just add language specific setting recommendations to the docs.

However, while it is maybe not good practice, just having -c preserve_interword_spaces=1 as a hardcoded default for vertical languages on this setting for Tesseract would also be fine I think.

It seems to me that, one way or another, the Tesseract config for {{lang}} and -c preserve_interword_spaces=1 will need to be specific to JpnVert.

Quickest solution: if _vert can be appended to {{lang}} for JpnVert, and -c preserve_interword_spaces=1 can be added to the Tesseract config only for JpnVert, I can change this pull request to use the Japanese ISO codes. I think that might also solve it (including the ocrmypdf issue).
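As a minimal sketch of that "quickest solution", the per-tool arguments could be derived from one base ISO code plus a vertical flag. The function name `ocr_lang_args` and its return shape are hypothetical and are not Docspell code; the only facts taken from this thread are that Tesseract uses `jpn_vert` with `-c preserve_interword_spaces=1`, while ocrmypdf uses plain `jpn`.

```python
def ocr_lang_args(iso_code: str, vertical: bool) -> dict[str, list[str]]:
    """Build language arguments for both tools from one base ISO code.

    Hypothetical helper for illustration only; not part of Docspell.
    """
    # Tesseract has a dedicated vertical model named "<code>_vert".
    tesseract = ["-l", f"{iso_code}_vert" if vertical else iso_code]
    if vertical:
        # Requested in this thread for vertical scans only.
        tesseract += ["-c", "preserve_interword_spaces=1"]

    # Per this thread, ocrmypdf is always given the base code.
    ocrmypdf = ["-l", iso_code]

    return {"tesseract": tesseract, "ocrmypdf": ocrmypdf}


if __name__ == "__main__":
    print(ocr_lang_args("jpn", vertical=True))
    print(ocr_lang_args("jpn", vertical=False))
```

Keeping the base code as the single source of truth means the menu entry only has to carry a flag, and neither tool ever receives a language name it does not know.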

> Ah wait, I think I misunderstood the --output-type thing: You mean to only include it for jpn_vert - right? Sure that is fine, of course.

Yes, this was my intention. I think there is some kind of common encoding issue when processing PDF/A with Japanese. Most Japanese office computers still use Shift-JIS and not UTF-8, or it might be something in the Japanese version of Adobe products. Whatever the reason, processing with --output-type pdf seems to work out better for me. Perhaps it could be the default for both Japanese and JpnVert. I think --output-type pdf is the lowest-priority item; I can just add it to my custom global configuration, since I run a lot of PDFs like this. That may not be the case for everyone, but I think it has come up in the issues a couple of times now.

@eikek
Owner

eikek commented Mar 3, 2024

Ok, I see. So this wouldn't quite work, because either tesseract or ocrmypdf would be misconfigured. I know that the commands in the config file are rather rigid and cannot be adjusted based on the arguments already known. I think there should be a general way to override things. I was thinking of something like this:

    tesseract = {
      arg-mappings = {
        "mylang" = {
          value = "{{lang}}"
          # first match is applied
          mappings = [
            {
              matches = "jpn_vert"
              args = [ "-l", "jpn_vert", "-c", "preserve_interword_spaces=1" ]
            },
            {
              matches = ".*"
              args = [ "-l", "{{lang}}" ]
            }
          ]
        }
      }
      command = {
        program = "tesseract"
        args = [
          "{{infile}}",
          "{{mylang}}",
          "out",
          "pdf",
          "txt"
        ]
        timeout = "5 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }

Docspell would first resolve all curly-brace variables for the current command. Then it would find the first mapping where value matches the regex in matches, and finally replace the corresponding key in the args array with that mapping's args. This is quite complex :) but it would allow a lot of custom fiddling with the external tools, like doing things based on combinations (lang + encoding perhaps) - not sure if something like this is really needed.
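The three steps described above can be sketched in a few lines of Python. This is an illustration of the proposal only, not the code that Docspell ended up with: the names `substitute` and `resolve_args` are hypothetical, the config is modelled as plain dictionaries, and treating `matches` as a full-string regex match is an assumption.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def substitute(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} with its value; unknown names are left untouched."""
    return PLACEHOLDER.sub(
        lambda m: variables.get(m.group(1), m.group(0)), template
    )


def resolve_args(args: list[str], arg_mappings: dict, variables: dict[str, str]) -> list[str]:
    """Expand mapping keys such as {{mylang}} into the args of the first matching mapping."""
    expansions: dict[str, list[str]] = {}
    for key, spec in arg_mappings.items():
        value = substitute(spec["value"], variables)
        for mapping in spec["mappings"]:
            # First match wins (assumption: the regex must match the whole value).
            if re.fullmatch(mapping["matches"], value):
                expansions[key] = [substitute(a, variables) for a in mapping["args"]]
                break

    resolved: list[str] = []
    for arg in args:
        whole = PLACEHOLDER.fullmatch(arg)
        if whole and whole.group(1) in expansions:
            # A mapping key is replaced by several arguments.
            resolved.extend(expansions[whole.group(1)])
        else:
            resolved.append(substitute(arg, variables))
    return resolved


# Same shape as the config proposal above, written as Python data.
TESSERACT_MAPPINGS = {
    "mylang": {
        "value": "{{lang}}",
        "mappings": [
            {"matches": "jpn_vert",
             "args": ["-l", "jpn_vert", "-c", "preserve_interword_spaces=1"]},
            {"matches": ".*", "args": ["-l", "{{lang}}"]},
        ],
    }
}
TESSERACT_ARGS = ["{{infile}}", "{{mylang}}", "out", "pdf", "txt"]

if __name__ == "__main__":
    print(resolve_args(TESSERACT_ARGS, TESSERACT_MAPPINGS,
                       {"infile": "scan.png", "lang": "jpn_vert"}))
```

Because the catch-all `.*` entry comes last, every language other than `jpn_vert` resolves to the plain `-l {{lang}}` pair, so existing setups would behave as before.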

@tenpai-git
Contributor Author

Actually I think it's perfect.

I would be happy to write documentation for this config override and can write a new pull request to support Chinese Simplified Vertical, Chinese Traditional Vertical, Korean Vertical, and Japanese Vertical if you can add that.

@eikek
Owner

eikek commented Mar 3, 2024

Cool! Let me do this change and then you can rebase this pr onto it. I hope to find some time for this the next couple of days. It shouldn't take long.

> I would be happy to write documentation for this config override and can write a new pull request to support Chinese Simplified Vertical, Chinese Traditional Vertical, Korean Vertical, and Japanese Vertical if you can add that.

Oh that would be very much appreciated!

@tenpai-git
Contributor Author

tenpai-git commented Mar 4, 2024

Sounds good! Full CJK support will be really good. I'll look out for the change and rebase once it's ready.

Should we add the (4) exceptions (Chinese Traditional Vertical/Chinese Simplified Vertical/Korean Vertical/Japanese Vertical) into the default configuration for those languages? Or do you want to keep the default config clean and just keep the example in the documentation?

From the perspective of a user, the vertical language selection is virtually useless without the tesseract -c preserve_interword_spaces=1 option, so it's a matter of whether you want it to "just work" out of the box or keep a totally clean configuration for the user.

Either way, comments can separate out the vertical options to keep things relatively clean. Just tell me whether you want the vertical tesseract options only in the docs or added to the default configuration in the actual files.

@tenpai-git
Contributor Author

tenpai-git commented Mar 4, 2024

I did some more testing and I just want to make sure this works for you.

So the goal is to have Tesseract select jpn_vert and apply -c preserve_interword_spaces=1 but also have ocrmypdf select jpn (the other languages all seem to follow this convention thankfully).

So if the language is jpn_vert...

    tesseract = {
      arg-mappings = {
        "mylang" = {
          value = "{{lang}}" # {{lang}} is jpn_vert
          # first match is applied
          mappings = [
            {
              matches = "jpn_vert" # it matches here
              args = [ "-l", "jpn_vert", "-c", "preserve_interword_spaces=1" ] # it applies here
            },
            {
              matches = ".*"
              args = [ "-l", "{{lang}}" ] # anything else matches here as usual
            }
          ]
        }
      }
      command = {
        program = "tesseract"
        args = [
          "{{infile}}",
          "{{mylang}}", # the matched args are inserted here
          "out",
          "pdf",
          "txt"
        ]
        timeout = "5 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }

But what about ocrmypdf? It will see jpn_vert and fail because there is no jpn_vert package, only jpn.

Does the functionality also need to be added for ocrmypdf? Conceptually under ocrmypdf,

          mappings = [
            {
              matches = "jpn_vert" # it matches here
              args = [ "-l", "jpn" ] # falls back to jpn despite {{lang}}, because there is no vertical package
            },

And is {{lang}} referenced anywhere else?

So this solves the problem of adding -c preserve_interword_spaces=1, but not the problem of ocrmypdf being misconfigured, unless ocrmypdf has the same ability to override the mapping.

@eikek
Owner

eikek commented Mar 4, 2024

Yes, the idea is that this kind of configuration is available to all commands in the config file - I just used tesseract as an example. You would need to configure ocrmypdf in a similar way, as you described. Other commands that use {{lang}} would need to be identified, but I think it's only these two. We will double check that once it's working for jpn_vert.

Edit: For the default configuration, I think the "works out of the box" is good. Obviously, it must not interfere with existing stuff (which it doesn't) and users can always overwrite the entire thing in their config files. In a very complex case, it is also possible to write a wrapper script and let this be called from docspell.

@tenpai-git
Contributor Author

Awesome @eikek - I think that covers all the bases then.

@eikek
Owner

eikek commented Mar 8, 2024

Hello @tenpai-git finally I could merge something. #2536 is now in master and you could try it out. The commit message contains a bit of info - hope that helps (basically what is already written here). Curious if it is enough to make jpn_vert work.

@tenpai-git
Contributor Author

You got it @eikek, will install within the next week or so and prepare a big CJK vertical pull request if it's all good.

@tenpai-git
Contributor Author

Hey @eikek, I can't seem to build the Debian snapshots for testing. I might be missing a dependency. It seems my system can't find tailwindcss, even though I ran npm install tailwindcss in the specified directory and it created node_modules for it.

Any idea on how I can get this to compile? Here's the full log.

└──╼ $sudo sbt
[info] welcome to sbt 1.9.9 (Debian Java 17.0.10)
[info] loading settings for project docspell-build from build.sbt,plugins.sbt ...
[info] loading project definition from /home/yang/git/docspell/project
[info] loading settings for project root from build.sbt,version.sbt ...
[info] resolving key references (37335 settings) ...
[info] set current project to docspell-root (in build file:/home/yang/git/docspell/)
[info] sbt server started at local:///root/.sbt/1.0/server/ede34b51599767ebe6b5/sock
[info] started sbt server
sbt:docspell-root> make-pkg
[success] Total time: 3 s, completed Mar 19, 2024, 12:12:54 AM
[info] Defining webapp / elmCompileMode
[info] The new value will be used by webapp / Compile / resourceGenerators
[info] Reapplying settings...
[info] set current project to docspell-root (in build file:/home/yang/git/docspell/)
[info] Defining webapp / stylesMode
[info] The new value will be used by webapp / stylesBuild
[info] Reapplying settings...
[info] set current project to docspell-root (in build file:/home/yang/git/docspell/)
[info] Writing file /home/yang/git/docspell/modules/joexapi/target/scala-2.13/src_managed/main/docspell/joexapi/model/VersionInfo.scala
[info] Writing file /home/yang/git/docspell/modules/joexapi/target/scala-2.13/src_managed/main/docspell/joexapi/model/JobList.scala
[info] Writing file /home/yang/git/docspell/modules/joexapi/target/scala-2.13/src_managed/main/docspell/joexapi/model/AddonSupport.scala
[info] Writing file /home/yang/git/docspell/modules/joexapi/target/scala-2.13/src_managed/main/docspell/joexapi/model/Job.scala
[info] Writing file /home/yang/git/docspell/modules/joexapi/target/scala-2.13/src_managed/main/docspell/joexapi/model/JobAndLog.scala
[info] Writing file /home/yang/git/docspell/modules/joexapi/target/scala-2.13/src_managed/main/docspell/joexapi/model/JobLogEvent.scala
[info] Writing file /home/yang/git/docspell/modules/joexapi/target/scala-2.13/src_managed/main/docspell/joexapi/model/BasicResult.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/OptionalText.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/VersionInfo.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/SecondFactor.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NameCount.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/FolderList.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/OptionalId.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/CheckFileResult.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ImapSettingsList.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/StringValue.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/IdResult.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/AddonRunExistingItem.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ShareList.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/PeriodicQuerySettings.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ShareVerifyResult.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/SourceList.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/CalEventCheckResult.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/FolderStats.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/CollectiveSettings.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/OptionalText.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemInsights.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/VersionInfo.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NotificationGotify.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/SecondFactor.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/StringList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NameCount.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/EmptyTrashSetting.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/FolderList.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/OptionalId.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/EmailSettings.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/CheckFileResult.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/AttachmentMeta.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ImapSettingsList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemsAndRef.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/StringValue.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/DeleteUserData.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/IdResult.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/SearchStats.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/AddonRunExistingItem.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/JobQueueState.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ShareList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/SentMail.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/InviteResult.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/PeriodicQuerySettings.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/FileRepositoryCloneRequest.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ShareVerifyResult.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemLightGroup.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Equipment.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/SourceList.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/CalEventCheckResult.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/FolderStats.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/CustomField.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/CollectiveSettings.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemsAndDirection.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemInsights.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/JobDetail.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NotificationGotify.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Address.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/StringList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/AttachmentLight.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/EmptyTrashSetting.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/CalEventCheck.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/EmailSettings.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/SentMails.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/AttachmentMeta.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/SourceTagIn.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemsAndRef.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NotificationMail.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/DeleteUserData.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/DownloadAllSummary.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/SearchStats.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemsAndName.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/JobQueueState.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Collective.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/SentMail.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemProposals.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/InviteResult.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemsAndDate.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/FileRepositoryCloneRequest.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/PeriodicDueItemsSettings.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NameCloud.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemLightGroup.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Contact.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Equipment.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ScanMailboxSettings.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/CustomField.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ResetPasswordResult.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemsAndDirection.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/SourceAndTags.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/TagList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/JobDetail.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NotificationChannelTestResult.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Address.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/AttachmentLight.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ClassifierSetting.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/CalEventCheck.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/MoveAttachment.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/SentMails.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/SourceTagIn.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/AddonList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NotificationMail.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/FileIntegrityCheckRequest.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/DownloadAllSummary.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemsAndName.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemDetail.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Collective.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/BasicItem.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemProposals.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemsAndRefs.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemsAndDate.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/OtpConfirm.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/PeriodicDueItemsSettings.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NameCloud.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NotificationSampleEventReq.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Contact.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/FolderDetail.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/IdName.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ScanMailboxSettings.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ResetPasswordResult.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/UserPass.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/SourceAndTags.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Registration.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/TagList.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NotificationChannelTestResult.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NotificationHttp.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ClassifierSetting.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NewCustomField.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/MoveAttachment.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/JobPriority.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/AddonList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ScanMailboxSettingsList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/FileIntegrityCheckRequest.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/OptionalDate.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemDetail.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/DownloadAllRequest.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/BasicItem.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ReferenceList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemsAndRefs.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/OtpConfirm.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Person.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NotificationSampleEventReq.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/EmailSettingsList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/FolderDetail.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Source.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/IdName.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/AttachmentSource.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/UserPass.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Registration.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Addon.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NotificationHttp.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Label.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NewCustomField.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/AddonRunConfig.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/JobPriority.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ScanMailboxSettingsList.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/OptionalDate.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/DownloadAllRequest.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Tag.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ReferenceList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Attachment.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Person.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/UserList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/EmailSettingsList.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Source.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/SimpleMail.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/AttachmentSource.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/HighlightEntry.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Addon.scal
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/FieldStats.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Label.scal
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/EquipmentList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/AddonRunConfig.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NotificationMatrix.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Tag.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemFieldValue.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Attachment.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/UserList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemsAndFieldValue.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/SimpleMail.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/HighlightEntry.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/User.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/FieldStats.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/JobLogEvent.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/EquipmentList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ResetPassword.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NotificationMatrix.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemFieldValue.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ShareData.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemLightList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemsAndFieldValue.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/IdList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/User.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/BasicResult.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/JobLogEvent.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/SimpleShareMail.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ResetPassword.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/TagCount.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ShareData.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/IdRefStats.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemLightList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/CustomFieldList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/IdList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/AddonRegister.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/BasicResult.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ContactList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/SimpleShareMail.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/TagCount.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/Organization.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/IdRefStats.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/BookmarkedQuery.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/CustomFieldList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/AddonRef.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/AddonRegister.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ContactList.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/CustomFieldValue.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/DirectionValue.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/Organization.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/OtpResult.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/BookmarkedQuery.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemLinkData.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/AddonRef.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/PersonList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/CustomFieldValue.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/FolderItem.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/DirectionValue.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemQuery.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/OtpResult.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemLinkData.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/PersonList.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/FolderItem.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NewFolder.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemQuery.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/OtpState.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NewFolder.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/OtpState.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/AuthResult.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/AddonRunConfigList.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/TagCloud.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/AuthResult.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemLight.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/AddonRunConfigList.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ShareDetail.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NotificationChannelRef.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/TagCloud.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ShareSecret.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ItemUploadMeta.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemLight.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/ImapSettings.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ShareDetail.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/NotificationHook.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NotificationChannelRef.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/PasswordChange.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ShareSecret.elm
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/OrganizationList.scala
[info] Writing file /home/yang/git/docspell/modules/restapi/target/scala-2.13/src_managed/main/docspell/restapi/model/GenInvite.scala
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ItemUploadMeta.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/ImapSettings.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/NotificationHook.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/PasswordChange.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/OrganizationList.elm
[info] Writing file /home/yang/git/docspell/modules/webapp/target/elm-src/Api/Model/GenInvite.elm
[success] Total time: 1 s, completed Mar 19, 2024, 12:13:16 AM
[info] compiling 6 Scala sources to /home/yang/git/docspell/modules/totp/target/scala-2.13/classes ...
[info] compiling 8 Scala sources to /home/yang/git/docspell/modules/logging/api/target/scala-2.13/classes ...
[info] compiling 17 Scala sources to /home/yang/git/docspell/modules/query/jvm/target/scala-2.13/classes ...
[info] compiling 3 Scala sources to /home/yang/git/docspell/modules/jsonminiq/target/scala-2.13/classes ...
[info] Copy webjar resources from 1 files/directories.
[info] Compiling css stylesheets…
[info] Running tailwindcss --input /home/yang/git/docspell/modules/webapp/src/main/styles/index.css -o /home/yang/git/docspell/modules/webapp/target/scala-2.13/resource_managed/main/META-INF/resources/webjars/docspell-webapp/0.42.0-SNAPSHOT/css/styles.css --minify
[info] Compile elm files ...
[info] Running elm make --optimize --output /home/yang/git/docspell/modules/webapp/target/scala-2.13/resource_managed/main/META-INF/resources/webjars/docspell-webapp/0.42.0-SNAPSHOT/docspell-app.js /home/yang/git/docspell/modules/webapp/src/main/elm/Main.el
[info] compiling 18 Scala sources to /home/yang/git/docspell/modules/query/js/target/scala-2.13/classes ...
[info] Compiling ...
[info] Compiling (1)
[info] Compiling (2)
[info] Compiling (3)
[info] Compiling (4)
[info] Compiling (5)
[info] Compiling (6)
[info] Compiling (7)
[info] Compiling (8)
[info] Compiling (9)
[info] Compiling (10)
[info] Compiling (11)
[info] Compiling (12)
[info] Compiling (13)
[info] Compiling (14)
[info] Compiling (15)
[info] Compiling (16)
[info] Compiling (17)
[info] Compiling (18)
[info] Compiling (19)
[info] Compiling (20)
[info] Compiling (21)
[info] Compiling (22)
[info] Compiling (23)
[info] Compiling (24)
[info] Compiling (25)
[info] Compiling (26)
[info] Compiling (27)
[info] Compiling (28)
[info] Compiling (29)
[info] Compiling (30)
[info] Compiling (31)
[info] Compiling (32)
[info] Compiling (33)
[info] Compiling (34)
[info] Compiling (35)
[info] Compiling (36)
[info] Compiling (37)
[info] Compiling (38)
[info] Compiling (39)
[info] Compiling (40)
[info] Compiling (41)
[info] Compiling (42)
[info] Compiling (43)
[info] Compiling (44)
[info] Compiling (45)
[info] Compiling (46)
[info] Compiling (47)
[info] Compiling (48)
[info] Compiling (49)
[info] Compiling (50)
[info] Compiling (51)
[info] Compiling (52)
[info] Compiling (53)
[info] Compiling (54)
[info] Compiling (55)
[info] Compiling (56)
[info] Compiling (57)
[info] Compiling (58)
[info] Compiling (59)
[info] Compiling (60)
[info] Compiling (61)
[info] Compiling (62)
[info] Compiling (63)
[info] Compiling (64)
[info] Compiling (65)
[info] Compiling (66)
[info] Compiling (67)
[info] Compiling (68)
[info] Compiling (69)
[info] Compiling (70)
[info] Compiling (71)
[info] Compiling (72)
[info] Compiling (73)
[info] Compiling (74)
[info] Compiling (75)
[info] Compiling (76)
[info] Compiling (77)
[info] Compiling (78)
[info] Compiling (79)
[info] Compiling (80)
[info] Compiling (81)
[info] Compiling (82)
[info] Compiling (83)
[info] Compiling (84)
[info] Compiling (85)
[info] Compiling (86)
[info] Compiling (87)
[info] Compiling (88)
[info] Compiling (89)
[info] Compiling (90)
[info] Compiling (91)
[info] Compiling (92)
[info] Compiling (93)
[info] Compiling (94)
[info] Compiling (95)
[info] Compiling (96)
[info] Compiling (97)
[info] Compiling (98)
[info] Compiling (99)
[info] Compiling (100)
[info] Compiling (101)
[info] Compiling (102)
[info] Compiling (103)
[info] Compiling (104)
[info] Compiling (105)
[info] Compiling (106)
[info] Compiling (107)
[info] Compiling (108)
[info] Compiling (109)
[info] Compiling (110)
[info] Compiling (111)
[info] Compiling (112)
[info] Compiling (113)
[info] Compiling (114)
[info] Compiling (115)
[info] Compiling (116)
[info] Compiling (117)
[info] Compiling (118)
[info] Compiling (119)
[info] Compiling (120)
[info] Compiling (121)
[info] Compiling (122)
[info] Compiling (123)
[info] Compiling (124)
[info] Success! Compiled 124 modules.
[info] 
[info]     Main ───> /home/yang/git/docspell/modules/webapp/target/scala-2.13/resource_managed/main/META-INF/resources/webjars/docspell-webapp/0.42.0-SNAPSHOT/docspell-app.js
[info] NerModels: Filtering artifacts...
[info] compiling 2 Scala sources to /home/yang/git/docspell/modules/totp/target/scala-2.13/test-classes ...
[info] compiling 4 Scala sources to /home/yang/git/docspell/modules/jsonminiq/target/scala-2.13/test-classes ...
[info] compiling 6 Scala sources to /home/yang/git/docspell/modules/logging/scribe/target/scala-2.13/classes ...
[info] compiling 3 Scala sources to /home/yang/git/docspell/modules/logging/api/target/scala-2.13/test-classes ...
[info] compiling 97 Scala sources to /home/yang/git/docspell/modules/common/target/scala-2.13/classes ...
[info] compiling 12 Scala sources to /home/yang/git/docspell/modules/query/jvm/target/scala-2.13/test-classes ...
[info] Full optimizing /home/yang/git/docspell/modules/query/js/target/scala-2.13/docspell-query-opt
[info] compiling 12 Scala sources to /home/yang/git/docspell/modules/query/js/target/scala-2.13/test-classes ...
[info] Copy webjar resources from 1 files/directories.
[info] compiling 1 Scala source to /home/yang/git/docspell/modules/logging/scribe/target/scala-2.13/test-classes ...
[info] Closure: 0 error(s), 0 warning(s)
[info] Produced query js file: /home/yang/git/docspell/modules/query/js/target/scala-2.13/docspell-query-opt.js
[info] Copy webjar resources from 1 files/directories.
[info] compiling 5 Scala sources to /home/yang/git/docspell/modules/fts-client/target/scala-2.13/classes ...
[info] compiling 10 Scala sources to /home/yang/git/docspell/modules/oidc/target/scala-2.13/classes ...
[info] compiling 11 Scala sources to /home/yang/git/docspell/modules/common/target/scala-2.13/test-classes ...
[info] compiling 4 Scala sources and 2 Java sources to /home/yang/git/docspell/modules/files/target/scala-2.13/classes ...
[info] compiling 6 Scala sources to /home/yang/git/docspell/modules/pubsub/api/target/scala-2.13/classes ...
[info] compiling 21 Scala sources to /home/yang/git/docspell/modules/analysis/target/scala-2.13/classes ...
[info] compiling 14 Scala sources to /home/yang/git/docspell/modules/notification/api/target/scala-2.13/classes ...
[info] compiling 20 Scala sources and 19 Java sources to /home/yang/git/docspell/modules/extract/target/scala-2.13/classes ...
[info] compiling 7 Scala sources to /home/yang/git/docspell/modules/files/target/scala-2.13/test-classes ...
[info] compiling 26 Scala sources to /home/yang/git/docspell/modules/addonlib/target/scala-2.13/classes ...
[info] compiling 18 Scala sources to /home/yang/git/docspell/modules/convert/target/scala-2.13/classes ...
[info] /home/yang/git/docspell/modules/extract/src/main/java/org/apache/tika/parser/odf/OpenDocumentMetaParser.java: Some input files use or override a deprecated API.
[info] /home/yang/git/docspell/modules/extract/src/main/java/org/apache/tika/parser/odf/OpenDocumentMetaParser.java: Recompile with -Xlint:deprecation for details.
[info] compiling 13 Scala sources to /home/yang/git/docspell/modules/fts-solr/target/scala-2.13/classes ...
[info] compiling 1 Scala source to /home/yang/git/docspell/modules/oidc/target/scala-2.13/test-classes ...
[info] compiling 25 Scala sources to /home/yang/git/docspell/modules/scheduler/api/target/scala-2.13/classes ...
[info] compiling 7 Scala sources to /home/yang/git/docspell/modules/extract/target/scala-2.13/test-classes ...
[info] compiling 7 Scala sources to /home/yang/git/docspell/modules/analysis/target/scala-2.13/test-classes ...
[info] compiling 4 Scala sources to /home/yang/git/docspell/modules/convert/target/scala-2.13/test-classes ...
[info] compiling 1 Scala source to /home/yang/git/docspell/modules/scheduler/api/target/scala-2.13/test-classes ...
[info] compiling 7 Scala sources to /home/yang/git/docspell/modules/addonlib/target/scala-2.13/test-classes ...
[info] compiling 133 Scala sources to /home/yang/git/docspell/modules/restapi/target/scala-2.13/classes ...
[info] compiling 171 Scala sources to /home/yang/git/docspell/modules/store/target/scala-2.13/classes ...
[info] compiling 8 Scala sources to /home/yang/git/docspell/modules/joexapi/target/scala-2.13/classes ...
[info] compiling 2 Scala sources to /home/yang/git/docspell/modules/pubsub/naive/target/scala-2.13/classes ...
[info] compiling 21 Scala sources to /home/yang/git/docspell/modules/notification/impl/target/scala-2.13/classes ...
[info] compiling 63 Scala sources to /home/yang/git/docspell/modules/backend/target/scala-2.13/classes ...
[info] compiling 13 Scala sources to /home/yang/git/docspell/modules/store/target/scala-2.13/test-classes ...
[info] compiling 10 Scala sources to /home/yang/git/docspell/modules/fts-psql/target/scala-2.13/classes ...
[info] compiling 18 Scala sources to /home/yang/git/docspell/modules/scheduler/impl/target/scala-2.13/classes ...
[info] compiling 1 Scala source to /home/yang/git/docspell/modules/notification/impl/target/scala-2.13/test-classes ...
[info] compiling 4 Scala sources to /home/yang/git/docspell/modules/pubsub/naive/target/scala-2.13/test-classes ...
[info] compiling 3 Scala sources to /home/yang/git/docspell/modules/fts-psql/target/scala-2.13/test-classes ...
[info] compiling 6 Scala sources to /home/yang/git/docspell/modules/config/target/scala-2.13/classes ...
[info] compiling 1 Scala source to /home/yang/git/docspell/modules/scheduler/impl/target/scala-2.13/test-classes ...
[info] compiling 2 Scala sources to /home/yang/git/docspell/modules/config/target/scala-2.13/test-classes ...
[info] compiling 87 Scala sources to /home/yang/git/docspell/modules/joex/target/scala-2.13/classes ...
[info] compiling 1 Scala source to /home/yang/git/docspell/modules/backend/target/scala-2.13/test-classes ...
[info] compiling 3 Scala sources to /home/yang/git/docspell/modules/joex/target/scala-2.13/test-classes ...
[error] stack trace is suppressed; run last webapp / stylesBuild for the full output
[error] (webapp / stylesBuild) java.io.IOException: Cannot run program "tailwindcss" (in directory "/home/yang/git/docspell/modules/webapp"): error=2, No such file or directory
[error] Total time: 475 s (07:55), completed Mar 19, 2024, 12:21:10 AM
sbt:docspell-root> last webapp / stylesBuild
[info] Compiling css stylesheets…
[info] Running tailwindcss --input /home/yang/git/docspell/modules/webapp/src/main/styles/index.css -o /home/yang/git/docspell/modules/webapp/target/scala-2.13/resource_managed/main/META-INF/resources/webjars/docspell-webapp/0.42.0-SNAPSHOT/css/styles.css --minify
[error] java.io.IOException: Cannot run program "tailwindcss" (in directory "/home/yang/git/docspell/modules/webapp"): error=2, No such file or directory
[error] 	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
[error] 	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
[error] 	at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:75)
[error] 	at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:106)
[error] 	at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$anonfun$runBuffered$1(ProcessBuilderImpl.scala:154)
[error] 	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
[error] 	at scala.sys.process.ProcessLogger$$anon$1.buffer(ProcessLogger.scala:103)
[error] 	at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.runBuffered(ProcessBuilderImpl.scala:154)
[error] 	at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang(ProcessBuilderImpl.scala:120)
[error] 	at docspell.build.Cmd$.exec(Cmd.scala:33)
[error] 	at docspell.build.Cmd$.run(Cmd.scala:19)
[error] 	at docspell.build.StylesPlugin$.runTailwind(StylesPlugin.scala:102)
[error] 	at docspell.build.StylesPlugin$.$anonfun$stylesSettings$6(StylesPlugin.scala:55)
[error] 	at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] 	at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63)
[error] 	at sbt.std.Transform$$anon$4.work(Transform.scala:69)
[error] 	at sbt.Execute.$anonfun$submit$2(Execute.scala:283)
[error] 	at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24)
[error] 	at sbt.Execute.work(Execute.scala:292)
[error] 	at sbt.Execute.$anonfun$submit$1(Execute.scala:283)
[error] 	at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] 	at sbt.CompletionService$$anon$2.call(CompletionService.scala:65)
[error] 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
[error] 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[error] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[error] 	at java.base/java.lang.Thread.run(Thread.java:840)
[error] Caused by: java.io.IOException: error=2, No such file or directory
[error] 	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
[error] 	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
[error] 	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
[error] 	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
[error] 	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
[error] 	at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:75)
[error] 	at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:106)
[error] 	at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$anonfun$runBuffered$1(ProcessBuilderImpl.scala:154)
[error] 	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
[error] 	at scala.sys.process.ProcessLogger$$anon$1.buffer(ProcessLogger.scala:103)
[error] 	at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.runBuffered(ProcessBuilderImpl.scala:154)
[error] 	at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang(ProcessBuilderImpl.scala:120)
[error] 	at docspell.build.Cmd$.exec(Cmd.scala:33)
[error] 	at docspell.build.Cmd$.run(Cmd.scala:19)
[error] 	at docspell.build.StylesPlugin$.runTailwind(StylesPlugin.scala:102)
[error] 	at docspell.build.StylesPlugin$.$anonfun$stylesSettings$6(StylesPlugin.scala:55)
[error] 	at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] 	at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63)
[error] 	at sbt.std.Transform$$anon$4.work(Transform.scala:69)
[error] 	at sbt.Execute.$anonfun$submit$2(Execute.scala:283)
[error] 	at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24)
[error] 	at sbt.Execute.work(Execute.scala:292)
[error] 	at sbt.Execute.$anonfun$submit$1(Execute.scala:283)
[error] 	at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] 	at sbt.CompletionService$$anon$2.call(CompletionService.scala:65)
[error] 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
[error] 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[error] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[error] 	at java.base/java.lang.Thread.run(Thread.java:840)
[error] (webapp / stylesBuild) java.io.IOException: Cannot run program "tailwindcss" (in directory "/home/yang/git/docspell/modules/webapp"): error=2, No such file or directory
sbt:docspell-root> 

@eikek
Owner

eikek commented Mar 18, 2024

[error] java.io.IOException: Cannot run program "tailwindcss" (in directory "/home/yang/git/docspell/modules/webapp"): error=2, No such file or directory

Yeah, sorry, things changed a bit :) You now need to install the tailwindcss CLI tool, or install nix and run nix develop .#ci.
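A quick way to check this prerequisite before starting sbt is to confirm the executable is on PATH. This is a minimal sketch in Python; the helper name `missing_tools` is made up for illustration, and the only tool name taken from the error above is `tailwindcss`.

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # "tailwindcss" is the program sbt failed to start in the log above.
    missing = missing_tools(["tailwindcss"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```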

@tenpai-git
Contributor Author

tenpai-git commented Mar 20, 2024

Ah it uses tailwindcss-cli, okay thanks for that. Setting it up was easy enough. I did it without installing Nix, as I use Qubes btw :)

So good news - the override and mapping work. I am still testing different configuration options to find the best tesseract and ocrmypdf configs, and I will be ready with those soon. You might want to leave the "-l" argument in the default args and not include it in the custom mappings, just so the defaults stay consistent between updates (since the configurations won't change between updates?). But it's really solid and I am getting the results I want.

But are you ready for some potential scope increase now @eikek ? :)

The converted pdf from ocrmypdf for vertical documents still has spaces in between each kanji because, unlike tesseract, ocrmypdf has no option equivalent to -c preserve_interword_spaces=1. So an output would look like:

image

That is how it appears in Docspell's Extracted Metadata, and it copies and pastes like that out of the file. That makes the text very hard to work with, and the best solution would be for the Extracted Metadata to come out horizontal, as tesseract does with -c preserve_interword_spaces=1, because on computers almost nobody uses vertical text.

Now, you would hope that ocrmypdf supports this with the ocrmypdf --tesseract-config CFG.file - but it does not. If you load a config file with ocrmypdf with preserve_interword_spaces 1 this option will still not work on the PDF. I tested it.

This really confused me and I went through a lot of GitHub issues. However, one semi-working solution, as posted in this Chinese blog, is to use the --sidecar option, and then it will read the configuration.

ocrmypdf --force-ocr --tesseract-config CFG.file -l jpn --sidecar working_text_no_extra_spaces_file.txt input.pdf output.pdf

If you use the above, the --sidecar file output of working_text_no_extra_spaces_file.txt will have the text in a useable state, even though the output.pdf will still have the spaces built in.

So that made me think, what if we can put that into Docspell for the Extracted Metadata?

So the new docspell feature would be: if and only if a --sidecar file.txt from ocrmypdf is detected, write that to the Extracted Metadata section instead of the current default. It would also be good to install a one-line file in /etc/docspell-joex, such as vertical.cfg, containing that one setting for ocrmypdf.

Do you think that's feasible? It would make the entire feature much more useable.

As an alternative configuration/workaround in ocrmypdf I can also use the following:

ocrmypdf -l jpn_vert --tesseract-oem 0 input.pdf output.pdf

This will work with the current config, but it's not a good workaround because it really reduces recognition quality significantly and I'm not sure I want it to be default for docspell configuration.

For now I will make a note of the workaround as a potential configuration option and proceed with good defaults, but this new feature of reading a --sidecar file would be enormously useful to end users for all CJK vertical scans, I think. Also, overwriting the default Extracted Metadata this way wouldn't require any db schema changes, I would imagine. And if the user really wants the vertical text with newline spaces, they still have the converted pdf they can check directly. So all in all a good solution? Unless the extracted metadata is used elsewhere or interferes with something I don't know about.

Also how do you feel about adding a couple super common Japanese font dependencies if the license is useable?

@tenpai-git tenpai-git changed the title from "Add Japanese Vertical Support Branch for Tesseract OCR" to "Add Japanese Vertical Support Branch for Tesseract and Ocrmypdf OCR" Mar 20, 2024
@tenpai-git
Contributor Author

tenpai-git commented Mar 20, 2024

Here are my configuration defaults. I tried to keep them close to the Default Configuration, but changing {{lang}} to {{mylang}} is unavoidable.

Tesseract Japanese Vertical Mapping:

    # To convert image files to PDF files, tesseract is used. This
    # also extracts the text in one go.
    tesseract = {
      command = {
        program = "tesseract"
        ### Argument mappings for processing vertical text or other custom language variables.
        arg-mappings = {
          "mylang" = {
            value = "{{lang}}"
            mappings = [
              # Japanese Vertical
              {
                matches = "jpn_vert"
                args = [ "jpn_vert", "-c", "preserve_interword_spaces=1" ]
              },
              # Other CJK/Vertical Language Mappings to be added here.
              {
                matches = ".*"
                args = [ "{{lang}}" ]
              }
            ]
          }
        }
        ### End custom language argument mappings and configurations.
        args = [
          "{{infile}}",
          "out",
          "-l",
          "{{mylang}}", #Inserts Argument Mappings
          "pdf",
          "txt"
        ]
        timeout = "5 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }

image

Test output looking good. While it was always possible in Tesseract, the conversion from vertical to horizontal with appropriate spacing, and an end user being able to access that easily in Docspell, truly is a feat of open source accomplishment in my view.

People learning the language can easily scan letters in form mail, critical personal documents or notices, or language learning materials from phone pictures to look up kanji without any help when dealing with printed material.

Tesseract Test Output:

Wed, March 20th, 2024, 12:23: ============ Start processing tesseract_jpn_vert.png ============
Wed, March 20th, 2024, 12:23: Checking for duplicate files
Wed, March 20th, 2024, 12:23: Creating new item with 1 attachment(s)
Wed, March 20th, 2024, 12:23: Creating item finished in 16 ms
Wed, March 20th, 2024, 12:23: Not an archive: image/png
Wed, March 20th, 2024, 12:23: Converting file Some(tesseract_jpn_vert.png) (image/png) into a PDF
Wed, March 20th, 2024, 12:23: Storing input to file /tmp/docspell-convert/docspell-tesseract15426178036928886978/infile for running tesseract
Wed, March 20th, 2024, 12:23: Running external command: tesseract /tmp/docspell-convert/docspell-tesseract15426178036928886978/infile out -l jpn_vert -c preserve_interword_spaces=1 pdf txt
Wed, March 20th, 2024, 12:23: Waiting for command to terminate…
Wed, March 20th, 2024, 12:23: [tesseract (err)]: Detected 176 diacritics
Wed, March 20th, 2024, 12:23: [tesseract (err)]:
Wed, March 20th, 2024, 12:23: Conversion to pdf+txt successful. Saving file.
Wed, March 20th, 2024, 12:23: Closing process: `tesseract /tmp/docspell-convert/docspell-tesseract15426178036928886978/infile out -l jpn_vert -c preserve_interword_spaces=1 pdf txt`
Wed, March 20th, 2024, 12:23: Starting text extraction for 1 files
Wed, March 20th, 2024, 12:23: TextExtraction skipped, since text is already available.
Wed, March 20th, 2024, 12:23: Storing extracted texts …
Wed, March 20th, 2024, 12:23: Extracted text stored.
Wed, March 20th, 2024, 12:23: Add to fts index 2 records
Wed, March 20th, 2024, 12:23: Text extraction finished in 14 ms.
Wed, March 20th, 2024, 12:23: Creating preview images for 1 files…
Wed, March 20th, 2024, 12:23: Preview generated, saving to database…
Wed, March 20th, 2024, 12:23: Retrieving page count for 1 files…
Wed, March 20th, 2024, 12:23: Found number of pages: 1
Wed, March 20th, 2024, 12:23: Update attachment 8J4yE5Yo6vB-Xp5oZRB9i7z-696STXDs27b-n8ss8Jkes5Z with page count Some(1)
Wed, March 20th, 2024, 12:23: Stored page count (1).
Wed, March 20th, 2024, 12:23: Starting text analysis
Wed, March 20th, 2024, 12:23: Max text length limit disabled.
Wed, March 20th, 2024, 12:23: Storing tags: List(RAttachmentMeta(Ident(8J4yE5Yo6vB-Xp5oZRB9i7z-696STXDs27b-n8ss8Jkes5Z),None,List(),MetaProposalList(List()),None,Some(JpnVert)))
Wed, March 20th, 2024, 12:23: Classification is disabled. Check config or settings.
Wed, March 20th, 2024, 12:23: Guessing label for correspondentorg …
Wed, March 20th, 2024, 12:23: No classifier model found.
Wed, March 20th, 2024, 12:23: Guessing label for correspondentperson …
Wed, March 20th, 2024, 12:23: No classifier model found.
Wed, March 20th, 2024, 12:23: Guessing label for concernedperson …
Wed, March 20th, 2024, 12:23: No classifier model found.
Wed, March 20th, 2024, 12:23: Guessing label for concernedequip …
Wed, March 20th, 2024, 12:23: No classifier model found.
Wed, March 20th, 2024, 12:23: Text-Analysis finished in 65 ms
Wed, March 20th, 2024, 12:23: Starting find-proposal
Wed, March 20th, 2024, 12:23: Looking up classifier results: List()
Wed, March 20th, 2024, 12:23: Storing proposals
Wed, March 20th, 2024, 12:23: Storing attachment proposals: MetaProposalList(List())
Wed, March 20th, 2024, 12:23: Starting linking proposals
Wed, March 20th, 2024, 12:23: No value for CorrOrg
Wed, March 20th, 2024, 12:23: No value for CorrPerson
Wed, March 20th, 2024, 12:23: No value for ConcPerson
Wed, March 20th, 2024, 12:23: No value for ConcEquip
Wed, March 20th, 2024, 12:23: No value for DocDate
Wed, March 20th, 2024, 12:23: No value for DueDate
Wed, March 20th, 2024, 12:23: Starting setting given data
Wed, March 20th, 2024, 12:23: Set item folder: 'None'
Wed, March 20th, 2024, 12:23: Set tags from given data: List()
Wed, March 20th, 2024, 12:23: Running 0 addon tasks for trigger Set(FinalProcessItem)
Wed, March 20th, 2024, 12:23: Job execution successful

Tesseract adds -c preserve_interword_spaces=1 and uses jpn_vert for the language.
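To make the mapping semantics concrete, here is a rough sketch in Python of how such an arg-mapping could be resolved. It is not Docspell's implementation: the function names, the first-match-wins rule and the whole-string regex match are assumptions inferred from the config and the log above.

```python
import re


def resolve_mapping(lang, mappings):
    """Return the args of the first mapping whose pattern matches `lang`.

    Assumption: a pattern must match the whole value and the first hit wins.
    """
    for pattern, args in mappings:
        if re.fullmatch(pattern, lang):
            return [arg.replace("{{lang}}", lang) for arg in args]
    return [lang]  # no mapping matched: fall back to the plain value


def build_command(program, args, lang, infile, mappings):
    """Expand {{infile}} and splice the mapped args in where {{mylang}} appears."""
    command = [program]
    for arg in args:
        if arg == "{{mylang}}":
            command.extend(resolve_mapping(lang, mappings))
        else:
            command.append(arg.replace("{{infile}}", infile))
    return command


# Mirrors the tesseract mapping and args posted above.
TESSERACT_MAPPINGS = [
    ("jpn_vert", ["jpn_vert", "-c", "preserve_interword_spaces=1"]),
    (".*", ["{{lang}}"]),
]
TESSERACT_ARGS = ["{{infile}}", "out", "-l", "{{mylang}}", "pdf", "txt"]

if __name__ == "__main__":
    for lang in ("jpn_vert", "deu"):
        cmd = build_command("tesseract", TESSERACT_ARGS, lang, "infile", TESSERACT_MAPPINGS)
        print(" ".join(cmd))
```

For jpn_vert this produces the same argument order as the logged command (with a short placeholder instead of the temp file path); any other language falls through to the ".*" entry and only gets "-l" plus its own code.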

Ocrmypdf Japanese Vertical Mapping:

    # The tool ocrmypdf can be used to convert pdf files to pdf files
    # in order to add extracted text as a separate layer. This makes
    # image-only pdfs searchable and you can select and copy/paste the
    # text. It also converts pdfs into pdf/a type pdfs, which are best
    # suited for archiving. So it makes sense to use this even for
    # text-only pdfs.
    #
    # It is recommended to install ocrmypdf, but it also is optional.
    # If it is enabled but fails, the error is not fatal and the
    # processing will continue using the original pdf for extracting
    # text. You can also disable it to remove the errors from the
    # processing logs.
    #
    # The `--skip-text` option is necessary to not fail on "text" pdfs
    # (where ocr is not necessary). In this case, the pdf will be
    # converted to PDF/A.
    ocrmypdf = {
      enabled = true
      command = {
        program = "ocrmypdf"
        ### Argument mappings for processing vertical text or other custom language variables.
        arg-mappings = {
          "mylang" = {
            value = "{{lang}}"
            mappings = [
              # Japanese Vertical Mapping
              {
                matches = "jpn_vert"
                args = [ "jpn", "--pdf-renderer", "sandwich", "--output-type", "pdf" ]
              },
              # Other CJK Vertical Mappings to go here.
              {
                matches = ".*"
                args = [ "{{lang}}" ]
              }
            ]
          }
        }
        ### End custom language mappings.

        args = [
          "-l", "{{mylang}}", #Inserts custom mappings.
          "--skip-text",
          "--deskew",
          "-j", "1",
          "{{infile}}",
          "{{outfile}}"
        ]
        timeout = "25 minutes" #Increase for e-books.
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }

image
The test output is clear and can later be improved with --sidecar features mentioned earlier for better spacing.

Ocrmypdf Test Output:

Wed, March 20th, 2024, 13:07: ============ Start processing 01_script_vert.pdf ============
Wed, March 20th, 2024, 13:07: Checking for duplicate files
Wed, March 20th, 2024, 13:07: Creating new item with 1 attachment(s)
Wed, March 20th, 2024, 13:07: Creating item finished in 41 ms
Wed, March 20th, 2024, 13:07: Not an archive: application/pdf
Wed, March 20th, 2024, 13:07: Converting file Some(01_script_vert.pdf) (application/pdf) into a PDF
Wed, March 20th, 2024, 13:07: Storing input to file /tmp/docspell-convert/docspell-ocrmypdf11329622709699135761/infile for running ocrmypdf
Wed, March 20th, 2024, 13:07: Trying to read the PDF using 0 passwords
Wed, March 20th, 2024, 13:07: Running external command: ocrmypdf -l jpn --pdf-renderer sandwich --output-type pdf --redo-ocr -j 1 /tmp/docspell-convert/docspell-ocrmypdf11329622709699135761/infile /tmp/docspell-convert/docspell-ocrmypdf11329622709699135761/out.pdf
Wed, March 20th, 2024, 13:07: Waiting for command to terminate…
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 1 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 2 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 1 Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 1 Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 3 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 2 Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 4 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 3 Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 5 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 4 Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 6 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 5 Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 7 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 6 Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 8 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 7 Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 9 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 8 Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 10 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 9 Opened a file
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 11 redoing OCR
Wed, March 20th, 2024, 13:07: [ocrmypdf (err)]: 10 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 12 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 11 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 13 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 12 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 14 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 13 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 15 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 14 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 16 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 15 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 17 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 16 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 18 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 17 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 19 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 18 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 20 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 19 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 21 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 20 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 22 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 21 Opened a file
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 23 redoing OCR
Wed, March 20th, 2024, 13:08: [ocrmypdf (err)]: 22 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 24 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 23 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 25 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 24 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 26 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 25 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 27 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 26 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 28 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 27 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 29 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 28 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 30 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 29 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 31 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 30 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 32 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 31 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 33 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 32 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 34 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 33 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 35 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 34 Opened a file
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 36 redoing OCR
Wed, March 20th, 2024, 13:09: [ocrmypdf (err)]: 35 Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 37 redoing OCR
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 36 Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 38 redoing OCR
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 37 Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 39 redoing OCR
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 38 Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 40 redoing OCR
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 39 Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 41 redoing OCR
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 40 Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: 41 Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: Postprocessing...
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: Optimize ratio: 1.00 savings: 0.5%
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]: Opened a file
Wed, March 20th, 2024, 13:10: [ocrmypdf (err)]:
Wed, March 20th, 2024, 13:10: Conversion to pdf successful. Saving file.
Wed, March 20th, 2024, 13:10: Closing process: `ocrmypdf -l jpn --pdf-renderer sandwich --output-type pdf --redo-ocr -j 1 /tmp/docspell-convert/docspell-ocrmypdf11329622709699135761/infile /tmp/docspell-convert/docspell-ocrmypdf11329622709699135761/out.pdf`
Wed, March 20th, 2024, 13:10: Starting text extraction for 1 files
Wed, March 20th, 2024, 13:10: Extracting text for attachment 01_script_vert.converted
Wed, March 20th, 2024, 13:10: Trying to strip text from pdf using pdfbox.
Wed, March 20th, 2024, 13:10: Extracting text for attachment 01_script_vert.converted finished in 1095 ms
Wed, March 20th, 2024, 13:10: Storing extracted texts …
Wed, March 20th, 2024, 13:10: Extracted text stored.
Wed, March 20th, 2024, 13:10: Add to fts index 2 records
Wed, March 20th, 2024, 13:10: Text extraction finished in 1204 ms.
Wed, March 20th, 2024, 13:10: Creating preview images for 1 files…
Wed, March 20th, 2024, 13:10: Preview generated, saving to database…
Wed, March 20th, 2024, 13:10: Retrieving page count for 1 files…
Wed, March 20th, 2024, 13:10: Found number of pages: 41
Wed, March 20th, 2024, 13:10: Update attachment BxSEtFuMRTt-H6eppqfdLgH-zBrzJDoX9HG-k7gbcAGiafP with page count Some(41)
Wed, March 20th, 2024, 13:10: Stored page count (1).
Wed, March 20th, 2024, 13:10: Starting text analysis
Wed, March 20th, 2024, 13:10: Max text length limit disabled.
Wed, March 20th, 2024, 13:10: Storing tags: List(RAttachmentMeta(Ident(BxSEtFuMRTt-H6eppqfdLgH-zBrzJDoX9HG-k7gbcAGiafP),None,List(NerLabel(2024-03-20,Date,35923,35933)),MetaProposalList(List()),None,Some(JpnVert)))
Wed, March 20th, 2024, 13:10: Classification is disabled. Check config or settings.
Wed, March 20th, 2024, 13:10: Guessing label for correspondentorg …
Wed, March 20th, 2024, 13:10: No classifier model found.
Wed, March 20th, 2024, 13:10: Guessing label for correspondentperson …
Wed, March 20th, 2024, 13:10: No classifier model found.
Wed, March 20th, 2024, 13:10: Guessing label for concernedperson …
Wed, March 20th, 2024, 13:10: No classifier model found.
Wed, March 20th, 2024, 13:10: Guessing label for concernedequip …
Wed, March 20th, 2024, 13:10: No classifier model found.
Wed, March 20th, 2024, 13:10: Text-Analysis finished in 3690 ms
Wed, March 20th, 2024, 13:10: Starting find-proposal
Wed, March 20th, 2024, 13:10: Looking up classifier results: List()
Wed, March 20th, 2024, 13:10: Storing proposals
Wed, March 20th, 2024, 13:10: Storing attachment proposals: MetaProposalList(List(MetaProposal(DocDate,NonEmptyList(Candidate(IdRef(Ident(2024-03-20),2024-03-20),Set(NerLabel(2024-03-20,Date,35923,35933)),Some(0.9997217042829711))))))
Wed, March 20th, 2024, 13:10: Starting linking proposals
Wed, March 20th, 2024, 13:10: No value for CorrOrg
Wed, March 20th, 2024, 13:10: No value for CorrPerson
Wed, March 20th, 2024, 13:10: No value for ConcPerson
Wed, March 20th, 2024, 13:10: No value for ConcEquip
Wed, March 20th, 2024, 13:10: No value for DocDate
Wed, March 20th, 2024, 13:10: No value for DueDate
Wed, March 20th, 2024, 13:10: Starting setting given data
Wed, March 20th, 2024, 13:10: Set item folder: 'None'
Wed, March 20th, 2024, 13:10: Set tags from given data: List(F22AMs2ewKc-TpDk5o1YCk1-4MLPESNdzgN-FEe1kUAbY12)
Wed, March 20th, 2024, 13:10: Running 0 addon tasks for trigger Set(FinalProcessItem)
Wed, March 20th, 2024, 13:10: Job execution successful

Ocrmypdf keeps using jpn when the {{lang}} variable is jpn_vert. I was using --redo-ocr for testing this configuration, but the output looks clean. This prevents the configuration from breaking in either Tesseract or Ocrmypdf.

One thing I'm not sure about is if the "--deskew" option hurts or helps the output since I don't have a huge sample of vertical documents, but I can just note it in the documentation rather than bug you for a way to remove argument mapping from the default or set it only for every non-vertical language.

The commit works as is but I'll clean it up a little bit before merging.

Korean should be easy, but Chinese has two writing systems. I'll work on those and the documentation later for full CJK language support. Let's focus this commit on Japanese for now.

@tenpai-git
Contributor Author

tenpai-git commented Mar 20, 2024

Once we resolve iso2 we should be good on JpnVert as a base, except for implementing the configuration.

What should we do about adding the vertical mapping to the Default Configuration? I am fine to add it into the Default Configuration, but I am slightly worried someone might accidentally override it with their own configuration. Is there anything we can do about that, or should we note it in the next release description?

I looked through the files but I'm not sure how to update the Default Configuration file directly. Is it located somewhere or is it built during run time?

This Pull Request isn't useful (or will be broken) without the custom mapping configurations I posted above. I don't want to leave them only in the docs; they should ship with the release in some way so it works by default.

How can I update the Default Configuration, or how do you normally go about this @eikek ?

Owner

@eikek eikek left a comment

Thank you @tenpai-git for your ongoing efforts!

Regarding default configuration: you can look for reference.conf in the joex module. This contains the default configuration and any user-supplied one is merged with it. I think it could contain a standard mapping, like lang-def, that includes all the details for jpn-vert and a fallback to the default.
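As a rough illustration of what "merged" means for this kind of layered configuration: objects are combined key by key, while plain values and lists from the user's file replace the defaults. The sketch below uses Python dicts in place of the real config library; the function name and the sample values are only for illustration.

```python
def merge_config(defaults, user):
    """Overlay `user` on `defaults`: nested dicts merge, everything else is replaced."""
    merged = dict(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced wholesale
    return merged


if __name__ == "__main__":
    defaults = {"tesseract": {"command": {
        "program": "tesseract",
        "args": ["{{infile}}", "out", "-l", "{{mylang}}", "pdf", "txt"],
        "timeout": "5 minutes",
    }}}
    user = {"tesseract": {"command": {"timeout": "10 minutes"}}}
    print(merge_config(defaults, user))
```

Under these assumptions, a user who only overrides the timeout keeps the default args, while a user who supplies their own args list replaces the whole list and would have to include the {{mylang}} placeholder themselves.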

scope creep: haha, yes, so I think it sounds like without it, the feature is not very useful. I'm wondering if docspell can just always use this sidecar file? Or should it only be used in the case of jpn_vert?

modules/webapp/package.json Show resolved Hide resolved
website/.npmrc Outdated Show resolved Hide resolved
@tenpai-git
Contributor Author

tenpai-git commented Mar 21, 2024

Thank you @tenpai-git for your ongoing efforts!

Regarding default configuration: you can look for reference.conf in the joex module. This contains the default configuration and any user-supplied one is merged with it. I think it could contain a standard mapping, like lang-def, that includes all the details for jpn-vert and a fallback to the default.

Okay thank you for that, I will fix reference.conf in the next few days or so with the correct defaults.

scope creep: haha, yes, so I think it sounds like without it, the feature is not very useful. I'm wondering if docspell can just always use this sidecar file? Or should it only be used in the case of jpn_vert?

Well it's very useful as is for tesseract as of now, but less so for ocrmypdf since it's not formatted well as shown above.

It's a good question. I'm not sure how the Extracted Metadata is currently generated or where it comes from, so it's hard for me to answer.

--sidecar output will only be generated by ocrmypdf if the option is passed, and --sidecar will write to its own completely new text file. I would say it should always be used for Extracted Metadata IF the option is passed and the --sidecar file exists.

So you have to check if the option was used and replace Extracted Metadata with the --sidecar txt output file if it is detected. But only if ocrmypdf actually generates the --sidecar file. Otherwise Extracted Metadata should keep the existing default behavior.

If we use this logic, the language should not matter.
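The rule described above can be sketched in a few lines of Python. This is only an illustration of the decision, not Docspell code: the function name and its parameters are hypothetical, and `default_text` stands for whatever text the current extraction would produce.

```python
from pathlib import Path


def pick_extracted_text(ocr_args, sidecar_path, default_text):
    """Prefer the sidecar text when --sidecar was requested and the file exists."""
    sidecar = Path(sidecar_path)
    if "--sidecar" in ocr_args and sidecar.is_file():
        return sidecar.read_text(encoding="utf-8")
    # Existing behaviour: keep the text extracted from the converted PDF.
    return default_text
```

Because the check only looks at the arguments and the file, it behaves the same for every language, which matches the point that the language should not matter.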

I can then include --sidecar argument mapping option for Chinese/Korean/Japanese vertical language defaults and include it in documentation. I think some Middle Eastern languages may also benefit from this. I think this would make docspell uniquely useful to the users of these languages. Being able to open that extracted metadata horizontally (how we almost always work with text at a computer even if it's vertical by default), would be enormously convenient.

Output would be like: ocrmypdf --force-ocr --tesseract-config cfg.file -l jpn --sidecar extracted_metadata.txt 01_script_vert.pdf 01_script_vert.converted.pdf --output-type pdf

cfg.file also needs to be able to be read in... but if that's in /etc/docspell-joex it should be okay?

EDIT: On second thought, after more testing the result I'm getting from tests is a much lower extracted_metadata quality from using --sidecar even though it is horizontal. I'm not exactly sure why. But the vertical text is still good even though it has the annoying spaces in my tests. Let's skip this for now. I'll leave it as a commented option in the default config or documentation, maybe? I'd rather have reliable text with bad spacing than poor quality horizontal.

EDIT 2: Took a few dozen tests but I got something working with --sidecar...

Fundamentally, the following command will remove the existing metadata (which will always be newline-spaced), rasterize the document, and rescan it. The rescan is actually okay if the source document is lossless (the document might need something like ~1200 dpi and lossless compression to be scanned well).

The secret is --tesseract-pagesegmode 5, which tells Tesseract to expect a block of vertically aligned text.

ocrmypdf --force-ocr -l jpn_vert --sidecar extracted_metadata.txt vert_1_page.pdf vert_1_page.converted.pdf --tesseract-pagesegmode 5 --tesseract-config cfg.file --output-type pdf --pdf-renderer sandwich

Where cfg.file is:

preserve_interword_spaces 1
tessedit_load_sublangs jpn_vert

This produces an okay output of useable horizontal text from vertical text in --sidecar extracted_metadata.txt. 🎉 🎉

But if the existing metadata is already there and is accurate, the newly generated vert_1_page.converted.pdf will be much lower quality than the original. So even though --sidecar extracted_metadata.txt will be decent and horizontal, vert_1_page.converted.pdf will get worse. And much worse if the source is low quality.

What I kinda want to do is to run ocrmypdf twice. Once with --skip-text, to preserve the selectable text in the newly converted pdf, and once with --force-ocr to pull out a good horizontal --sidecar... so if --sidecar is detected, then process both?

It would definitely be good if the Extracted Metadata in docspell could use --sidecar extracted_metadata.txt:
ocrmypdf --force-ocr -l jpn_vert --sidecar extracted_metadata.txt vert_1_page.pdf vert_1_page.converted.pdf --tesseract-pagesegmode 5 --tesseract-config cfg.file --output-type pdf --pdf-renderer sandwich

And then the converted file itself could use (just replacing --force-ocr with --skip-text and dropping --sidecar):
ocrmypdf --skip-text -l jpn_vert vert_1_page.pdf vert_1_page.converted.pdf --tesseract-pagesegmode 5 --tesseract-config cfg.file --pdf-renderer sandwich
Which converts very quickly without much overhead on --skip-text... but it's getting complicated.
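The two-pass idea can be written down as a pair of argument lists. This is a sketch under assumptions, not something Docspell does today: the function name is hypothetical, and the separate `scratch_pdf` for the first pass is my addition so that the re-rasterized pdf does not overwrite the one worth keeping. The flags themselves are the ones used in the commands above.

```python
def two_pass_commands(infile, converted_pdf, sidecar_txt, scratch_pdf, cfg_file, lang="jpn_vert"):
    """Return (sidecar_pass, convert_pass) argument lists for ocrmypdf."""
    shared = [
        "-l", lang,
        "--tesseract-pagesegmode", "5",
        "--tesseract-config", cfg_file,
        "--pdf-renderer", "sandwich",
    ]
    # Pass 1: rasterize and re-OCR, only to obtain the horizontal sidecar text.
    sidecar_pass = ["ocrmypdf", "--force-ocr", "--sidecar", sidecar_txt,
                    "--output-type", "pdf", *shared, infile, scratch_pdf]
    # Pass 2: leave existing text layers alone; this is the pdf to store.
    convert_pass = ["ocrmypdf", "--skip-text", *shared, infile, converted_pdf]
    return sidecar_pass, convert_pass


if __name__ == "__main__":
    for cmd in two_pass_commands("vert_1_page.pdf", "vert_1_page.converted.pdf",
                                 "extracted_metadata.txt", "scratch.pdf", "cfg.file"):
        print(" ".join(cmd))
```

Keeping --sidecar out of the --skip-text pass avoids ending up with a sidecar file that only records that OCR was skipped.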

Worse, if you use --sidecar and --skip-text together, the --sidecar extracted_metadata.txt will only contain:

[OCR skipped on page(s) 1]

So we can't have it always use --sidecar with --skip-text and that case has to be handled, too.
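One way to handle that case would be to treat a sidecar that contains nothing but such placeholder lines as empty. A minimal sketch in Python; the function name is hypothetical, and the exact page-list format for multi-page documents is an assumption, since only the single-page line above was observed.

```python
import re

# Placeholder line seen above when a page is skipped. Anything between
# "page(s) " and the closing bracket is accepted, as the multi-page format
# is not confirmed here.
SKIPPED = re.compile(r"\[OCR skipped on page\(s\) [^\]]*\]")


def sidecar_has_text(sidecar_text):
    """True if the sidecar contains anything besides 'OCR skipped' placeholders."""
    remainder = SKIPPED.sub("", sidecar_text)
    return bool(remainder.strip())
```

A caller could fall back to the existing extracted text whenever this returns False.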

Maybe we just need a separate language option tickbox/button in the preview menu for --sidecar generation (🚗 emoji?) altogether, and save it for later. Maybe using the "Extracted Metadata" tab is not good for --sidecar. What we have now is not bad and a huge improvement on tesseract, and I think --sidecar needs its own option. I kinda want --sidecar as an option in the language sources menu for easy upload/processing even on the mobile applications.

Unless you want to process ocrmypdf twice and add a button or something for --sidecar, let's just skip this for this Pull Request? I think I'll leave it as a commented option in the default config or documentation, maybe?

@tenpai-git
Contributor Author

tenpai-git commented Mar 24, 2024

@eikek If I put a file in docspell/modules/joex/src/main/resources/ next to reference.conf, will that default file also be loaded into /etc/docspell/docspell-joex as well?

I would like to include a one-line file for ocrmypdf config to reference. Unfortunately it's not supported as a command line flag, but only can be read from a file. It's not a super necessary flag but it'd be nice to have (and the user could manipulate ocrmypdf with supported tesseract flags for greater versatility).

@tenpai-git
Contributor Author

tenpai-git commented Mar 24, 2024

These should be some good sane defaults, let me know if the changes to reference.conf are okay @eikek

If everything is good, might as well push all to nightly and I'll do one more test and start adding documentation.

Owner

@eikek eikek left a comment

Sorry, it took a bit. Looks great! I think there is just one tiny copy&paste thing, but not so sure actually.

Another thought I had: right now we encode the vertical text into the language. Perhaps (later!) it might be better to use a separate field for this? I know many languages don't really have a vertical mode, but in theory it could even apply to German when processing posters or such things. As it stands, we would need to create a separate "artificial" language for each one that naturally has a vertical mode (like Chinese, I think?).

I think I would still go with this for now anyways, but perhaps move it to a separate field in the future.
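To make the "separate field" idea concrete, here is a rough sketch under stated assumptions: `OcrLanguage` and `VERTICAL_MODELS` are hypothetical names, nothing like this exists in Docspell, and the set of languages with a `*_vert` model is only illustrative and may be incomplete.

```python
from dataclasses import dataclass

# Languages assumed to have a separate "*_vert" Tesseract model.
# Illustrative only; check the installed tessdata for the real list.
VERTICAL_MODELS = {"jpn", "chi_sim", "chi_tra", "kor"}


@dataclass(frozen=True)
class OcrLanguage:
    code: str              # Tesseract language code, e.g. "jpn" or "deu"
    vertical: bool = False  # text direction kept apart from the language

    def tesseract_lang(self) -> str:
        """Return the value for `-l`, using the *_vert model when one is known."""
        if self.vertical and self.code in VERTICAL_MODELS:
            return f"{self.code}_vert"
        return self.code
```

With this shape, `OcrLanguage("jpn", vertical=True)` maps to `jpn_vert`, while a vertical German poster keeps `deu` and would have to rely on other options such as the page segmentation mode.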

modules/webapp/src/main/elm/Data/Language.elm (outdated, resolved)
modules/joex/src/main/resources/reference.conf (outdated, resolved)
@tenpai-git
Contributor Author

#2505 (comment)

Any thoughts on this @eikek ?

@eikek
Owner

eikek commented Mar 28, 2024

> @eikek If I put a file in docspell/modules/joex/src/main/resources/ next to reference.conf, will that default file also be loaded into /etc/docspell/docspell-joex as well?
>
> I would like to include a one-line file for ocrmypdf config to reference. Unfortunately it's not supported as a command line flag, but only can be read from a file. It's not a super necessary flag but it'd be nice to have (and the user could manipulate ocrmypdf with supported tesseract flags for greater versatility).

Hi @tenpai-git, sorry I missed that comment! This file will not be picked up by default. It is not so simple to reference a file like this, because the command will be running in some temporary environment and the code has to set this up correctly. I'm also not quite sure if I understand what you are trying to achieve. If this is something that a user may want to configure, maybe it is better kept in the docs?

@tenpai-git
Contributor Author

> > @eikek If I put a file in docspell/modules/joex/src/main/resources/ next to reference.conf, will that default file also be loaded into /etc/docspell/docspell-joex as well?
> > I would like to include a one-line file for ocrmypdf config to reference. Unfortunately it's not supported as a command line flag, but only can be read from a file. It's not a super necessary flag but it'd be nice to have (and the user could manipulate ocrmypdf with supported tesseract flags for greater versatility).
>
> Hi @tenpai-git sorry I missed that comment! So this file will not be picked up by default. It is not so simple to reference a file like this, because the command will be running in some temporary environment and the code has to set this up correctly. I'm also not quite sure If I understand what you are trying to achieve. If this is something that a user may want to configure, maybe it is better kept in the docs?

I guess it doesn't need to be in this Pull Request, but consider it a feature request.

I believe yes, the user would like to control this because it gives ocrmypdf the ability to have tesseract config options.

See: https://ocrmypdf.readthedocs.io/en/latest/advanced.html#control-of-ocr-options

It's not just useful for vertical languages (though this is a big use case); choosing the page segmentation mode makes a big difference in output. For example, if the image is oriented a specific way, or contains mostly numbers, you would want to add these configuration variables to ocrmypdf for scanning. I just would like to include a small file to be installed in the /etc/docspell/docspell-joex directory for vertical defaults.
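For reference, a Tesseract config file is plain text with one `name value` pair per line, so the small file could be generated rather than shipped. A minimal sketch, assuming hypothetical helper names and a hypothetical file name (`ocrmypdf-config`); `preserve_interword_spaces` is the setting discussed earlier in this thread.

```python
from pathlib import Path
from typing import Dict


def render_tesseract_config(variables: Dict[str, str]) -> str:
    """Render Tesseract variables as config-file text, one `name value` pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


def write_tesseract_config(directory: Path, variables: Dict[str, str]) -> Path:
    """Write the config into `directory` and return its path (hypothetical helper)."""
    path = directory / "ocrmypdf-config"  # hypothetical file name
    path.write_text(render_tesseract_config(variables), encoding="utf-8")
    return path
```

The returned path is what would be handed to `--tesseract-config`, for example after `write_tesseract_config(Path("."), {"preserve_interword_spaces": "1"})`.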

Can the command just reference the file by a relative path like ./ocrmypdf-config, or can the file simply be included in temporary environments along with the docspell joex file?

If not that's okay for now but definitely something to work on for vertical support as an option.

@eikek
Owner

eikek commented Apr 2, 2024

Ok thanks for the explanation! I think this should definitely be a separate issue and feature request.

> I just would like to include a small file to be installed in the /etc/docspell/docspell-joex directory for vertical defaults.
>
> Can the command just relative reference the file like ./ocrmypdf-config or simply be included in temporary environments along with the docspell joex file?

No, this is not easily possible, because docspell can be run in many different ways (there may be no /etc at all). But it can be done, of course. Since the code has to set up the environment anyway, these settings can just be separate options inside the ocrmypdf command. It might also be interesting to think about using the options from the tesseract command by default in this case?
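Roughly, "the code sets up the environment" could look like the sketch below. It is only an illustration in Python, not how joex is implemented; `prepare_ocr_run` and the file name `tesseract.cfg` are hypothetical, and the settings are assumed to arrive as plain key/value options.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def prepare_ocr_run(
    infile: str, outfile: str, lang: str, tesseract_vars: Dict[str, str]
) -> Tuple[Path, List[str]]:
    """Create a temporary working directory and the ocrmypdf command that uses it.

    Hypothetical helper: the caller runs the command and removes the directory.
    """
    workdir = Path(tempfile.mkdtemp(prefix="ocr-"))
    cmd = ["ocrmypdf", "-l", lang]
    if tesseract_vars:
        cfg = workdir / "tesseract.cfg"
        cfg.write_text(
            "".join(f"{k} {v}\n" for k, v in tesseract_vars.items()),
            encoding="utf-8",
        )
        # Absolute path, so it does not matter where the process is started from.
        cmd += ["--tesseract-config", str(cfg)]
    cmd += [infile, outfile]
    return workdir, cmd
```

Because the file is written next to the job's other temporary files, nothing has to exist under /etc, and the options stay ordinary configuration values.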

@tenpai-git
Contributor Author

tenpai-git commented Apr 4, 2024

> Ok thanks for the explanation! I think this should definitely be a separate issue and feature request.
>
> > I just would like to include a small file to be installed in the /etc/docspell/docspell-joex directory for vertical defaults.
> >
> > Can the command just relative reference the file like ./ocrmypdf-config or simply be included in temporary environments along with the docspell joex file?
>
> No this is not easily possible, because docspell can be run in many different ways (there may be no /etc at all). But it can be done of course. Since the code has to setup the environment anyways, these settings can just be separate options inside the ocrmypdf command. It might be also interesting to think about using the options from the tesseract command by default in this case?

I wish it were natively supported, but the setting apparently can only be read from a file. I'll see if I can finagle something in bash to make it work without that.

Regardless though I think this is about ready to be merged. Can I add the documentation here or do you want me to open a separate PR into current-docs?

modules/webapp/src/main/elm/Messages/Data/Language.elm (outdated, resolved)
modules/webapp/package.json (resolved)
modules/webapp/package.json (resolved)
website/.npmrc (outdated, resolved)
modules/joex/src/main/resources/reference.conf (outdated, resolved)
modules/webapp/src/main/elm/Data/Language.elm (outdated, resolved)
modules/webapp/src/main/elm/Data/Language.elm (outdated, resolved)
@tenpai-git
Contributor Author

Okay @eikek I think this is good for a nightly test? We talked about a lot but I am ready to do a final test run and then I can write the documentation.

@eikek
Owner

eikek commented Apr 16, 2024

Thank you @tenpai-git for all these efforts!

@eikek eikek merged commit e731d82 into eikek:master Apr 16, 2024
5 checks passed
@tenpai-git
Contributor Author

> Thank you @tenpai-git for all these efforts!

Hey @eikek just letting you know I've been traveling for a bit and I'll get the documentation and everything ready in a couple weeks.

@eikek
Owner

eikek commented May 2, 2024

@tenpai-git don't worry about this, there is no rush at all. and: thank you!

madduck pushed a commit to madduck/docspell that referenced this pull request May 14, 2024
…ikek#2505)

* Add Japanese Vertical Support 
* Adds Japanese Vertical mappings to default configuration.
@eikek eikek added this to the Docspell 0.42.0 milestone May 27, 2024
@eikek eikek added the joex (affects the joex component) and feature labels May 30, 2024
Labels
feature, joex (affects the joex component)
Development

Successfully merging this pull request may close these issues.

Extracted text is unreadable (random glyphs) for PDFs with Japanese text
2 participants