-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Japanese Vertical Support Branch for Tesseract and Ocrmypdf OCR #2505
Add Japanese Vertical Support Branch for Tesseract and Ocrmypdf OCR #2505
Conversation
Thank you very much! I need a bit of time to look at it, as it is currently too busy :/ |
Understood @eikek - thanks for taking a look at this. To make things a little simpler, the number 1 thing I need you to look at is:
I think with that the Pull Request will work well as long as we can address that change, and others can easily build off it for Chinese Vertical and Korean Vertical. For the final user... I think it's important it's a separate language options so they can choose the setting on Docspell Share/mobile devices. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your efforts! To your questions:
-
I think this is a good starting point and I guess it's is good to
go and we can always improve further. I don't know how SOLR handles
Japanese vertical or horizontal :) So when you think it is ok, then
again we should go like this. -
Adding
-c preserve_interword_spaces=1
is a bit tricky. I have to
think how to think about adding more options depending on language.
I'm fine adding--output-type pdf
by default, if that then
creates a PDF/A file. I think the default is pdfa, which is the
recommended default. Then if users want it by default, it's not
hard to change in their configurations. So I think I'm a bit
undecided. :-)
Let me know if you want to merge this - since it doesn't interfere
with other things, I can merge if you want.
Ah wait, I think I misunderstood the |
Before merging, I am just concerned about the Tesseract has a dedicated Japanese vertical package If they are using the same variable from the ISO codes, then ocrmypdf should fail because
However, while it is maybe not good practice, just having It seems to me one way or another, for Quickest solution: If JpnVert can be appended
Yes this was my intention. I think there is some kind of common encoding issue coming up in processing PDF/A with Japanese. Most Japanese office computers still use |
Ok, I see. So this wouldn't quite work, because either tesseract or ocrmypdf would be misconfigured. I know that the commands in the configfile are rather rigid and cannot be adjusted based on the arguments already known. I think there should be something to generally override things. I was thinking something like this:
Docspell would first resolve all curly braces with variables for the current command. Then find the first mapping where |
Actually I think it's perfect. I would be happy to write documentation for this config override and can write a new pull request to support Chinese Simplified Vertical, Chinese Traditional Vertical, Korean Vertical, and Japanese Vertical if you can add that. |
Cool! Let me do this change and then you can rebase this pr onto it. I hope to find some time for this the next couple of days. It shouldn't take long.
Oh that would be very much appreciated! |
Sounds good! Full CJK Support will be really good and I'll look out for the change and rebase once they're ready. Should we add the (4) exceptions (Chinese Traditional Vertical/Chinese Simplified Vertical/Korean Vertical/Japanese Vertical) into the default configuration for those languages? Or do you want to keep the default config clean and just keep the example in the documentation? From the perspective of a user, the vertical language selection is virtually useless without the tesseract Either way the comments can separate out the vertical options for relative cleanliness but just tell me if you want the vertical tesseract options it in docs or you want me to add to the default configuration of the actual files. |
I did some testing more and I just want to make sure this works for you. So the goal is to have Tesseract select So if the language is jpn_vert...
But what about ocrmypdf? It will see jpn_vert and fail because there is no jpn_vert package, only jpn. Does the functionality also need to be added for ocrmypdf? Conceptually under ocrmypdf, mappings = [
{
matches = "jpn_vert" **# It Matches Here**
args = [ "-l", "jpn"] #It defaults to jpn despite {{lang}} because theres no vertical package.
}, And is So this solves the problem of adding in the |
Yes, the idea is that this kind of configuration is available to all commands in the config file - I just used tesseract as an example. You would need to configure ocrmypdf in a similar way, as you described. Other commands that use Edit: For the default configuration, I think the "works out of the box" is good. Obviously, it must not interfere with existing stuff (which it doesn't) and users can always overwrite the entire thing in their config files. In a very complex case, it is also possible to write a wrapper script and let this be called from docspell. |
Awesome @eikek - I think that covers all the bases then. |
Hello @tenpai-git finally I could merge something. #2536 is now in master and you could try it out. The commit message contains a bit of info - hope that helps (basically what is already written here). Curious if it is enough to make jpn_vert work. |
You got it @eikek, will install within the next week or so and prepare a big CJK vertical pull request if it's all good. |
…ical-support Merging in master branch.
Hey @eikek I can't seem to build the debian snapshots for testing. I might be missing a dependency. It seems my system can't find Any idea on how I can get this to compile? Here's the full log.
|
Yeah, sorry things changed a bit :) You need now to install tailwindcss cli tool. Or install nix and run |
Ah it uses So good news - the override and mapping works. I am still testing different configuration options for the absolute best But are you ready for some potential scope increase now @eikek ? :) The converted pdf from ocrmypdf for vertical documents still has spaces inbetween each kanji for In Docspell's Extracted Metadata and copy and paste like that out of the file. That makes the text very hard to work with, and the best solution would be for the Extracted Metadata to come out horizontal, like Now, you would hope that This really confused me and I went through a lot of Github issues. However, one semi-working solution as posted in this Chinese blog is to use the
If you use the above, the So that made me think, what if we can put that into Docspell for the Extracted Metadata? So the new docspell feature would be, if and only if Do you think that's feasible? It would make the entire feature much more useable. As an alternative configuration/workaround in ocrmypdf I can also use the following:
This will work with the current config, but it's not a good workaround because it really reduces recognition quality significantly and I'm not sure I want it to be default for docspell configuration. For now I will make a note of the workaround as a potential configuration option and proceed with good defaults, but this new feature of reading a Also how do you feel about adding a couple super common Japanese font dependencies if the license is useable? |
Once we resolve iso2 we should be good on JpnVert as a base, except for implementing the configuration. What should we do about adding the vertical mapping to the Default Configuration? I am fine to add it into the Default Configuration but I am slightly worried someone might accidentally override their configuration, if there's anything we can do about that or note it in the next release description. I looked through the files but I'm not sure how to update the Default Configuration file directly. Is it located somewhere or is it built during run time? This Pull Request isn't useful without the configuration changes (or will be broken) without the custom mapping configurations I posted above. I don't just want to leave it in the docs, it should ship with the release in some way so it works by default. How can I update the Default Configuration, or how do you normally go about this @eikek ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tenpai-git for your ongoing efforts!
Regarding default configuration: you can look for reference.conf
in the joex module. This contains the default configuraiton and any user supplied one is merged with it. I think it could contain a standard mapping, like lang-def
that includes all the details for jpn-vert and a fallback to the default.
scope creep: haha, yes so I think it sounds like without it, the feature is not very useful. I'm wondering if docspell can just always use this sidecar file? Or should it only be used in case for jpn_vert?
modules/analysis/src/test/scala/docspell/analysis/date/DateFindTest.scala
Outdated
Show resolved
Hide resolved
Okay thank you for that, I will fix
Well it's very useful as is for
EDIT 2: Took a few dozen tests but I got something working with Fundamentally, the following output will remove the existing metadata (which will always be new line spaced), rasterize the document, and rescan it. The re-scan, if using a loseless document is actually okay (the document might need like ~1200 dpi and loseless compression to be scanned well). The secret is
Where
This produces an okay output of useable horizontal text from vertical text in But if the existing metadata is already there and is accurate, What I kinda want to do is it to run It would definitely be good if the output in docspell if Extracted Metadata could use And then the converted file itself could use (just replacing Worse off, if you use
So we can't have it always use Maybe we just need a separate language option tickbox/button in the preview menu for Unless you want to process |
@eikek If I put a file in I would like to include a one-line file for |
These should be some good sane defaults, let me know if the changes to If everything is good, might as well push all to nightly and I'll do one more test and start adding documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it took a bit. Looks great! I think there is just one tiny copy&paste thing, but not so sure actually.
Another thought I head: Now we encoded the vertical text into the language. Perhaps (later!) it might be better to use a separate field for this? I know many languages don't really have a vertical mode, but in theory it could even apply to German when processing posters or such things. Now we would at least need to create a separate "artificial" language for each that naturally has a vertical mode (like Chinese I think?).
I think I would still go with this for now anyways, but perhaps move it to a separate field in the future.
Any thoughts on this @eikek ? |
Hi @tenpai-git sorry I missed that comment! So this file will not be picked up by default. It is not so simple to reference a file like this, because the command will be running in some temporary environment and the code has to set this up correctly. I'm also not quite sure If I understand what you are trying to achieve. If this is something that a user may want to configure, maybe it is better kept in the docs? |
I guess it doesn't need to be in this Pull Request, but consider it a feature request. I believe yes, the user would like to control this because it gives ocrmypdf the ability to have tesseract config options. See: https://ocrmypdf.readthedocs.io/en/latest/advanced.html#control-of-ocr-options It's not just useful for vertical languages (though this is a big use case), choosing the page segmentation mode makes a big difference in output. For example, if the image was orientated a specific way, or mostly numbers, you would want to add these configuration variables to ocrmypdf for scanning. I just would like to include a small file to be installed in the /etc/docspell/docspell-joex directory for vertical defaults. Can the command just relative reference the file like If not that's okay for now but definitely something to work on for vertical support as an option. |
…tainer. Co-Authored-By: eikek <eike.kettner@posteo.de>
…al-support' into add-japanese-vertical-support
…apanese-vertical-support
Ok thanks for the explanation! I think this should definitely be a separate issue and feature request.
No this is not easily possible, because docspell can be run in many different ways (there may be no |
I wish that it was natively supported but the flag only seems to be able to be called from a file. I'll see if I can finagle something in bash to make it work without that. Regardless though I think this is about ready to be merged. Can I add the documentation here or do you want me to open a separate PR into current-docs? |
modules/analysis/src/test/scala/docspell/analysis/date/DateFindTest.scala
Outdated
Show resolved
Hide resolved
…npai-git/docspell into add-japanese-vertical-support
…apanese-vertical-support
Okay @eikek I think this is good for a nightly test? We talked about a lot but I am ready to do a final test run and then I can write the documentation. |
Thank you @tenpai-git for all these efforts! |
Hey @eikek just letting you know I've been traveling for a bit and I'll get the documentation and everything ready in a couple weeks. |
@tenpai-git don't worry about this, there is no rush at all. and: thank you! |
…ikek#2505) * Add Japanese Vertical Support * Adds Japanese Vertical mappings to default configuration.
Hi @eikek - this is to resolve #2445 and add Japanese vertical support identified for use with the
tesseract
andocrmypdf
dependencies on Docspell. I went withJpnVert
for the name.I'm new to Elm and Scala but I was able to install them and run
sbt fix
locally and mostly perform the steps described in the add language documentation. SinceJpnVert
/vertical Japanese is not it's own "language" this is a bit of a special case. If there are any outstanding issues please let me know and I'll do my best to resolve them.I have a couple questions before merging:
Because the base Japanese language (horizontally) is already included, I added some comments where I felt there were redundancies and instead tried to reference the existing code for tests and dates. Similarly, Solr only mentions Japanese vertical search once, where one of the special options is not available in vertical search. This leads me to believe the existing Japanese search covers both Japanese (Horizontal) and Japanese (Vertical). Should the Solr/test section still be duplicated for other reasons? I can update if needed.
The main point of this commit is to have Tesseract select
jpn_vert
for the{{lang}}
variable if JpnVert is selected in the menu. However we also need to add-c preserve_interword_spaces=1
for vertical languages. Where is the best place to add this conditional? I think the user will want it default every time for any vertical language.Also I am concerned about a possible conflict with
ocrmypdf
where it still needsjpn
for vertical in {{lang}} but not jpn_vert.ocrmypdf
only needsjpn
. And if we're making a conditional for this, we might as well include"--output-type", "pdf"
anyway.Also, how would you like to handle how
{{lang}}
is called and ISO references?I will commit additional documentation updates to the current docs once I know how you want to go about this. And thank you again for your work on this. The work here will be a good pre-cursor to vertical support for all languages that use it and provide good defaults.
(・∀・)b
Edit: This is in addition to the upgrade to PDFBox v3.0 which is already merged.