Troubles with Tesseract/Leptonica presets #36

Closed
maximumspatium opened this issue Feb 10, 2015 · 13 comments

@maximumspatium

Hello,
first of all - thank you for bringing us this amazing project! We've been using JavaCPP in our music recognition engine (www.audiveris.org) since 2012 to communicate with Tesseract OCR. For this purpose I wrote an elegant interface by hand, based on JavaCPP. It has worked for us for several years until now.

For a couple of days I've tried to compile my old code with the current JavaCPP v0.10 and noticed that it doesn't compile anymore. While examining the new library code I noticed several big changes. The biggest one was the introduction of an automatic system that scans C++ headers and produces the appropriate Java interface.

I spent several days playing with the new JavaCPP library and its Tesseract presets in order to integrate it with my project. I must admit that I finally gave up on it after a while, not being able to get it to work at all.

I would very much appreciate it if someone could shed some light on the following issues I encountered with the Tesseract preset:

  1. The documentation on manual installation says "Just put all the desired JAR files somewhere in your CLASSPATH". This doesn't work out of the box: my Ubuntu 14 and NetBeans 8 setup simply refuses to load the native libraries from the supplied JARs. The native libraries have to be EXTRACTED first, otherwise the famous UnsatisfiedLinkError is thrown. Is this the normal behaviour or am I missing something? Is there a dedicated LOAD method I forgot to call?

  2. I wasn't able to get Leptonica's pixReadMemTiff function to work. It always returns "false", which indicates some internal error. Reading TIFF files does work, though. My project relies heavily on the former function because we construct images on the fly. I would have to debug the native library to be able to say more. Any suggestions on why it doesn't work?

  3. Leptonica's native library is huge (2.3 MB), but Tesseract uses just a few functions, mostly PIX-related. The rest (over 90%) is never used but is wastefully kept in memory. Is there any way to strip out unused code if Leptonica is intended to be used with Tesseract alone?

  4. Tesseract API supplies an iterator - ResultIterator - to be used for extraction of recognition results. This ResultIterator is inherited from LTRResultIterator which, in turn, is inherited from PageIterator. So it's basically possible to use this ResultIterator for accessing results at different levels like pages, words or single symbols.
    In C++, it's possible to access PageIterator's method Empty() using ResultIterator instance:

ResultIterator rit = api.GetIterator();
if (rit.Empty())
    continue;

It doesn't work in JavaCPP v0.10 because of the famous issue with the multiple inheritance in Java. From what I see in the generated code, ResultIterator extends LTRResultIterator which extends Pointer. Where is PageIterator?
It looks like I need to explicitly request a PageIterator instance like this:

ResultIterator rit = api.GetIterator();
PageIterator   pit = TessResultIteratorGetPageIterator(rit);

With the earlier versions of JavaCPP the C++ way has just worked, without any casts or extra accessors. Am I missing some important technical point here?

  5. LTRResultIterator::WordFontAttributes() uses several BOOL * parameters to return font properties (monospaced, bold, italic, etc.). My manual code uses BoolPointer for this purpose. The automatic code uses IntPointer instead, which is very inconvenient because client code has to do the type conversion. Is there any reason for this change?
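
To illustrate the expectation behind item 4: the C++ hierarchy described there is a plain single-inheritance chain, so it maps onto Java directly as long as each class extends its immediate C++ base. The sketch below uses hypothetical stand-in classes (the `Sketch` names and the integer constructor argument are made up; these are not the generated JavaCPP bindings, which wrap native pointers) to show that `Empty()` then becomes callable on the most derived type without any extra accessor.

```java
// Hypothetical stand-ins for the generated classes, for illustration only.
class PageIteratorSketch {
    private final int blockCount;

    PageIteratorSketch(int blockCount) {
        this.blockCount = blockCount;
    }

    // Plays the role of PageIterator's Empty(): true when there is nothing to iterate over.
    boolean Empty() {
        return blockCount == 0;
    }
}

class LTRResultIteratorSketch extends PageIteratorSketch {
    LTRResultIteratorSketch(int blockCount) {
        super(blockCount);
    }
}

class ResultIteratorSketch extends LTRResultIteratorSketch {
    ResultIteratorSketch(int blockCount) {
        super(blockCount);
    }
}

final class IteratorChainDemo {
    public static void main(String[] args) {
        ResultIteratorSketch rit = new ResultIteratorSketch(0);
        // Empty() is inherited through the whole chain, no cast needed.
        System.out.println(rit.Empty());
    }
}
```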

I kinda like the idea of automatic Java interface generation. It promises to save us a lot of time. I would like to use this new feature in my project. That's why I'd highly appreciate any help with getting it up and running.

Thank you in advance!
Best regards
Max P.

saudet added a commit that referenced this issue Feb 10, 2015
@saudet
Member

saudet commented Feb 10, 2015

Great to hear that you guys are using JavaCPP! Thanks for the feedback. So, let's see if we can get this working properly.

  1. If you try the BasicExample with Maven on the command line using the given pom.xml file, what happens?

  2. Could you provide a code snippet that should work so I can debug this here?

  3. It's always possible to modify the presets for Tesseract and link Leptonica statically with it, yes. But it gets complicated because to use Tesseract we need to use the header files from Leptonica, so we would basically have to expose its API through the Tesseract interface, which I feel did not make much sense... How would you see this working out better than the way it is now?

  4. Oops, fixed: 727acaf

  5. BOOL is #define BOOL int, so it's an int, thus 4 bytes. The bool type usually has only 1 byte, so we can't safely use BoolPointer or bool[]. We might end up corrupting memory. Though, it's always possible to write a small wrapper using bool[] if this bothers you too much. But I would prefer to have the original API fixed. :)

Thanks for your support! It's great to see more and more people finding this project useful.

@maximumspatium
Author

Hello Samuel,

I've finally managed to get the Tesseract preset working for me. I just want to answer your questions for the sake of completeness. Although the build system as a whole works, I've discovered several things that need to be improved.

  1. If you try the BasicExample with Maven on the command line using the given pom.xml file, what happens?

It fails with java.lang.UnsatisfiedLinkError because two native libraries that Leptonica was linked against, libpng.so.16 and libjpeg.so.62, couldn't be found. I had to compile them from source to solve this issue because my recent Ubuntu 14.04 still ships ancient versions of these libraries.
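
A quick way to see which of these files are absent before running anything is to look for them by name. The following is only a rough sketch under stated assumptions: the class name and the directory list are made up for illustration, and the real dynamic linker consults its own cache and search path, so a file missing here is a hint rather than proof.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SharedLibraryFinder {
    /** Returns the file names that exist in none of the given directories. */
    static List<String> missing(List<Path> dirs, List<String> fileNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : fileNames) {
            boolean found = false;
            for (Path dir : dirs) {
                if (Files.exists(dir.resolve(name))) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Assumed search directories; adjust them to the target distribution.
        List<Path> dirs = Arrays.asList(
                Paths.get("/usr/lib"),
                Paths.get("/usr/local/lib"),
                Paths.get("/usr/lib/x86_64-linux-gnu"));
        // The two file names reported in the UnsatisfiedLinkError above.
        List<String> names = Arrays.asList("libpng.so.16", "libjpeg.so.62");
        System.out.println("Not found: " + missing(dirs, names));
    }
}
```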

  2. Could you provide a code snippet that should work so I can debug this here?

OK, I naturally expect the following code to work because there is a variant of pixReadMemTiff that accepts ByteBuffer as first argument:

// Convert buffered image to in-memory TIFF and convert it to PIX
ByteBuffer buf = toTiffBuffer(bufferedImage);
buf.position(0);
image = pixReadMemTiff(buf, buf.capacity(), 0);

This one works as expected:

// Convert buffered image to in-memory TIFF and convert it to PIX
ByteBuffer buf = toTiffBuffer(bufferedImage);
buf.position(0);
image = pixReadMemTiff(buf.array(), buf.capacity(), 0);

  3. It's always possible to modify the presets for Tesseract and link Leptonica statically with it, yes. But it gets complicated because to use Tesseract we need to use the header files from Leptonica, so we would basically have to expose its API through the Tesseract interface, which I feel did not make much sense... How would you see this working out better than the way it is now?

Hmm, what I did manually was a PIX wrapper class accepting and passing opaque PIX object to and from Leptonica:

@Opaque
public static class PIX extends Pointer {
    static {
        Loader.load();
    }

    public static PIX readMemTiff(ByteBuffer buf, int bufsize, int n) {
        // Wrap the buffer in a BytePointer so it can be passed to native code
        BytePointer ptr = new BytePointer(buf);
        ptr.position(0);
        return pixReadMemTiff(ptr, bufsize, n);
    }

    private static native PIX pixReadMemTiff(@Cast("const l_uint8*") BytePointer buf, int bufsize, int n);
}

And yes, Leptonica's PIX was exposed as a part of Tesseract API. Is it possible to link jnitesseract to both tesseract and leptonica? This way only necessary methods will be wrapped and no dead code will be loaded into memory.

  4. Oops, fixed: 727acaf

The fix works, thanks!

  5. BOOL is #define BOOL int, so it's an int, thus 4 bytes. The bool type usually has only 1 byte, so we can't safely use BoolPointer or bool[]. We might end up corrupting memory. Though, it's always possible to write a small wrapper using bool[] if this bothers you too much. But I would prefer to have the original API fixed. :)

It does surprisingly appear as BoolPointer after the last fix 👍 ))

Finally, I just want to mention two issues that urgently need to be addressed:

Leptonica depends on several native libraries (libtiff, libgif, libpng, libjpeg, zlib, etc.) in order to work properly. These dependencies must be present on the target system before the cppbuild.sh script is executed. Moreover, the installed dependencies must match Leptonica's configuration in "src/environ.h" (#define HAVE_???, for example #define HAVE_LIBTIFF 1). Otherwise, the build may fail in subtle ways. Unfortunately, the supplied cppbuild.sh script doesn't check whether all required dependencies are installed.

It's possible to build only selected presets. I did the following:

bash cppbuild.sh -platform linux-x86 install tesseract

It failed because Tesseract depends on Leptonica but the latter wasn't present in the build directory. This one works though:

bash cppbuild.sh -platform linux-x86 install leptonica tesseract

Tesseract's cppbuild.sh needs to be modified to process Leptonica first.

It all looks like a limitation of the current build system (Bash/Maven). Are you still planning to replace it with Gradle?

Best regards
Max P.

saudet added a commit that referenced this issue Feb 14, 2015
…type of Leptonica (issue #36)

 * Add `preload` for `gif`, `jpeg`, `png`, `tiff`, and `webp` libraries in presets for Leptonica (issue #36)
@saudet
Member

saudet commented Feb 14, 2015

I've added libpng, libjpeg, etc to the list of preload, so that should fix the issue with dependencies, but ideally we should be recompiling everything from source. Anyway, with that, a build on Fedora should at least work on Ubuntu.

As for BoolPointer, both GCC and MSVC allocate memory in blocks of at least 8 bytes, so for output parameters with only one value as it is used in Leptonica it should be safe to use it on any platform instead of IntPointer, so I modified that. Although it might not work on big-endian platforms... Well, that'll be something for someone else to test out :)
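
A minimal sketch of the big-endian concern, using only java.nio and no JavaCPP: a 4-byte int holding 1 is written in each byte order, and only the first byte is read back, which is what a reader expecting a 1-byte bool would see. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BoolWidthDemo {
    /**
     * Stores a 4-byte int "BOOL" in the given byte order, then looks only at
     * byte 0, as a reader that assumes a 1-byte bool would.
     */
    static boolean firstByteAsBool(int boolValue, ByteOrder order) {
        ByteBuffer mem = ByteBuffer.allocate(4).order(order);
        mem.putInt(0, boolValue);
        return mem.get(0) != 0;
    }

    public static void main(String[] args) {
        // Little-endian: the low byte comes first, so the flag is seen.
        System.out.println(firstByteAsBool(1, ByteOrder.LITTLE_ENDIAN)); // true
        // Big-endian: the low byte comes last, so byte 0 is zero.
        System.out.println(firstByteAsBool(1, ByteOrder.BIG_ENDIAN));    // false
    }
}
```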

About NIO buffers, currently only direct ones are supported. It would be convenient to support non-direct ones as well, but they would incur additional overhead, so I just have not made a priority out of them. Besides, as you found out, we can simply create a new Pointer object from one, and everything will work as expected, albeit with additional overhead. Is this a problem for your application?

The build system is quite hackish, yes, but there is no precedent for this kind of tool. AFAIK, we're basically creating something that no one has ever attempted to do before! On any platform with languages such as Java, Python, Ruby, C#, JavaScript, etc, this is a first in history. At this point in time, Gradle seems like the most promising alternative, but someone needs to try and make it work. Would you yourself be interested in undertaking that challenge?

In any case, the build system isn't intended for end users. As long as the binary artifacts work with normal Maven builds on target platforms, whatever needs to happen for the native compilation phase with Bash, Gradle, etc should not matter. We could of course have custom build options that could, for example, create a merged Leptonica/Tesseract artifact, and I would be glad to reflect the changes in the source code, but the binary artifacts would not be uploaded to the Central Repository. Does that all make sense?

@maximumspatium
Author

Hello Samuel,

thank you for your investigation and patches. Please refer to my inline comments below.

As for BoolPointer, both GCC and MSVC allocate memory in blocks of at least 8 bytes, so for output parameters with only one value as it is used in Leptonica it should be safe to use it on any platform instead of IntPointer, so I modified that. Although it might not work on big-endian platforms... Well, that'll be something for someone else to test out :)

I didn't try your recent patch, but using BoolPointer has worked for me before. Regarding big-endian machines, I could give it a try - I still own a working iMac from 2005 equipped with a G5 PowerPC processor that I used for catching endianness-related bugs in FFmpeg and Tesseract. This test has low priority now because no one seems to use such machines at present.

About NIO buffers, currently only direct ones are supported. It would be convenient to support non-direct ones as well, but they would incur additional overhead, so I just have not made a priority out of them. Besides, as you found out, we can simply create a new Pointer object from one, and everything will work as expected, albeit with additional overhead. Is this a problem for your application?

Well, it depends on the produced overhead. We usually call pixReadMemTiff several hundred times per page. I currently don't mind but wouldn't say no to any possible improvements.

At this point in time, Gradle seems like the most promising alternative, but someone needs to try and make it work. Would you yourself be interested in undertaking that challenge?

I'm not a Gradle expert but I'm prepared to give it a try. The biggest challenge would be the compilation of native libraries. But this is something we should discuss somewhere else (e-mail, chat, Skype, whatever). Feel free to contact me at - maximumspatium at googlemail dot com.

Best regards
Max P.

@saudet
Member

saudet commented Feb 14, 2015

The overhead to use anything non-direct from JNI is pretty much always 1) memory allocation on the native heap and 2) a data copy. That is what is happening right now, but it doesn't do it automatically for non-direct NIO buffers, that's all.
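
Those two costs can be made visible on the Java side with plain java.nio. This is only a sketch of the idea, not what JavaCPP does internally, and the class and method names are made up: a non-direct buffer is turned into a direct one by allocating off-heap memory and copying the remaining bytes into it.

```java
import java.nio.ByteBuffer;

final class DirectBufferCopy {
    /**
     * Returns the buffer itself if it is already direct; otherwise returns a
     * direct copy of its remaining bytes, ready to be read from position 0.
     */
    static ByteBuffer toDirect(ByteBuffer src) {
        if (src.isDirect()) {
            return src;
        }
        // 1) allocation outside the Java heap
        ByteBuffer direct = ByteBuffer.allocateDirect(src.remaining());
        // 2) data copy; duplicate() leaves the caller's position untouched
        direct.put(src.duplicate());
        direct.flip();
        return direct;
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.wrap(new byte[] {1, 2, 3});
        ByteBuffer direct = toDirect(heap);
        System.out.println(direct.isDirect() + " " + direct.remaining());
    }
}
```

Calling this several hundred times per page means several hundred allocations and copies, which is the overhead being discussed.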

The biggest hurdle that I see in adopting something other than Bash is finding a replacement for commonly used shell commands like patch. But we can start with something simpler. For example, we could try to get this running entirely from Gradle somehow:
https://github.com/bytedeco/javacpp-presets/wiki/Create-New-Presets
Of course make is still going to require Bash, but at least the idea is that we won't be creating any new script files, so it could work without Bash on Windows in the case of libraries making use of something else like CMake, for example.

I think the mailing list would be appropriate for discussion, but sure, private messages are fine too! Thanks

BTW, if what you need urgently is a smaller JNI library, it would probably be easier to modify the existing cppbuild.sh files... Let me know what you decide, thanks!

@maximumspatium
Author

The overhead to use anything non-direct from JNI is pretty much always 1) memory allocation on the native heap and 2) a data copy. That is what is happening right now, but it doesn't do it automatically for non-direct NIO buffers, that's all.

What changes are necessary for non-direct NIO buffers to work?

if what you need urgently is a smaller JNI library, it would probably be easier to modify the existing cppbuild.sh files... Let me know what you decide, thanks!

Yes, it would be nice to have that. What's needed to be modified?

I think the mailing list would be appropriate for discussion

Does JavaCPP project have a mailing list somewhere or do you mean issue comments?

Thank you in advance!
Best regards
Max P.

@saudet
Member

saudet commented Feb 15, 2015

We'd need to add things here and there in Generator.java to perform the following operations in JNI:

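ptr = GetDirectBufferAddress(buffer);
if (ptr == NULL) {
    // Not a direct buffer: fall back to its backing Java array
    arr = buffer->array();
    ptr = Get????ArrayElements(arr);
}
...
if (arr != NULL) {
    // Hand the array elements back once the native call is done
    Release????ArrayElements(arr, ptr);
}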
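
The same decision can be mirrored from the Java side to see which path a given buffer would take. The sketch below is an illustration with made-up names, not part of Generator.java. It also shows one case the fallback cannot cover: a read-only heap buffer reports `hasArray()` as false, so `array()` is not available for it.

```java
import java.nio.ByteBuffer;

final class BufferAccessPath {
    /** Names the access path a buffer would take in the fallback logic above. */
    static String of(ByteBuffer buffer) {
        if (buffer.isDirect()) {
            return "direct address";
        }
        if (buffer.hasArray()) {
            return "backing array";
        }
        // For example, a read-only view of a heap buffer: array() would throw.
        return "none";
    }

    public static void main(String[] args) {
        System.out.println(of(ByteBuffer.allocateDirect(8)));
        System.out.println(of(ByteBuffer.wrap(new byte[8])));
        System.out.println(of(ByteBuffer.wrap(new byte[8]).asReadOnlyBuffer()));
    }
}
```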

Like I said, to modify the native library files, we need to modify the cppbuild.sh script files.

The mailing list is here: https://groups.google.com/group/javacpp-project

saudet added a commit to bytedeco/javacpp that referenced this issue Feb 18, 2015
@saudet
Member

saudet commented Feb 18, 2015

As indicated above, I've added support for non-direct NIO buffers. Could you confirm that this change works well with your application? Thanks!

@maximumspatium
Author

Hello Samuel,

As indicated above, I've added support for non-direct NIO buffers. Could you confirm that this change works well with your application? Thanks!

Thank you very much for the patch! I've recompiled both projects from source and tested with our app. The following line works now:

ByteBuffer buf = toTiffBuffer(bufferedImage);
buf.position(0);
image = pixReadMemTiff(buf, buf.capacity(), 0);

It does look much nicer to me. As to speed, I cannot notice any difference.

I have two further suggestions:

Leptonica comes with several additional programs in the prog subdirectory (regression tests, examples, etc.). They usually aren't required for the library itself. IIUC, JavaCPP doesn't use them either. The following simple patch switches off compilation of these additional programs using the --disable-programs option. This speeds up Leptonica compilation a bit and consumes less disk space.

Does it make sense to bump the JavaCPP version to, say, 0.11-SNAPSHOT? The current code is beyond the scope of the 0.10 release. Bumping the dependency version would make it easier to test new commits from a local Maven repository.

Best regards
Max

@saudet
Member

saudet commented Feb 24, 2015

If you can make that patch available as a pull request, I'll merge it right away, thanks!

One of the main issues I'm having with Leptonica though is the reflection API from the JDK slowing down to a crawl on large classes. If you figure out a way to work around that one, let me know, thanks!

As for the version number, it's because I still find this system a bit inconvenient... I plan to bump it when I start making incompatible changes, pretty soon now ;)

@maximumspatium
Author

Hello Samuel,

One of the main issues I'm having with Leptonica though is the reflection API from the JDK slowing down to a crawl on large classes. If you figure out a way to work around that one, let me know, thanks!

I'm not quite sure we both mean the same thing, but I noticed that Leptonica's JNI library takes ca. 28 minutes to build while Tesseract takes less than one minute. It's not quite clear to me why. Leptonica is written in C, so there is no notion of classes.
If you could give me some pointers or concrete examples for this "slow down" issue, I'd do some investigation.

Best regards
Maxim

@saudet
Member

saudet commented Mar 7, 2015

It's precisely because Leptonica has no notion of class that we end up putting all the functions in one big class in Java:
https://github.com/bytedeco/javacpp-presets/blob/master/leptonica/src/main/java/org/bytedeco/javacpp/lept.java
If we profile the org.bytedeco.javacpp.tools.Generator running on that, we see that 99% of the time is spent in two methods only, something like:

java.lang.reflect.Method.getParameterAnnotations()     80.78357    239,148 ms (80.8%)   239,148 ms
java.lang.reflect.AccessibleObject.getAnnotations()    18.976791    56,178 ms (19%)      56,178 ms

So, for some reason, it looks like the JDK is having a hard time querying annotations on methods when those methods are in a class with a lot of other methods, and that is what would need to be investigated...
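
As a starting point for that investigation, the two calls can be timed in isolation. This is a rough sketch with made-up names, not the Generator's own code. It uses java.lang.Math as a stand-in so that it runs without the presets on the classpath. To reproduce the slowdown, the class passed in would presumably be the generated Leptonica class linked above.

```java
import java.lang.reflect.Method;

final class AnnotationQueryTimer {
    /**
     * Queries annotations on every declared method of the given class and
     * returns {number of methods, elapsed nanoseconds}.
     */
    static long[] time(Class<?> type) {
        Method[] methods = type.getDeclaredMethods();
        long start = System.nanoTime();
        for (Method m : methods) {
            m.getParameterAnnotations();
            m.getAnnotations();
        }
        long elapsed = System.nanoTime() - start;
        return new long[] {methods.length, elapsed};
    }

    public static void main(String[] args) {
        // Stand-in class; swap in the large generated class to compare.
        long[] result = time(Math.class);
        System.out.printf("%d methods, %.3f ms%n", result[0], result[1] / 1e6);
    }
}
```

Comparing the time per method between a small class and a very large one would show whether the cost grows faster than the method count.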

@saudet
Member

saudet commented Apr 11, 2015

Most of the above has been fixed in the 0.11 release, so I'll close this issue. Thanks for reporting and testing everything! I think the only two remaining issues of interest are:

  1. How to make Tesseract depend less on Leptonica, without breaking existing functionality in the process
  2. Figure out why the JDK has such a hard time parsing the class file for Leptonica

Let us discuss these two issues, or anything else I missed, in a new thread... Thanks!
