Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Struct memory allocation is slow #299

Open
zakgof opened this issue May 13, 2019 · 9 comments
Open

Struct memory allocation is slow #299

zakgof opened this issue May 13, 2019 · 9 comments

Comments

@zakgof
Copy link

zakgof commented May 13, 2019

I run a simple benchmark calling window API's GetSystemTime using JavaCpp's built-in windows API wrappers. This code allocates a struct, calls the native API and fetches some field from the struct:

		SYSTEMTIME systemtime = new SYSTEMTIME();
		windows.GetSystemTime(systemtime);
		return systemtime.wSecond();

Profiling shows that the first line takes >90% of the overall execution time

image

I believe that there is some space for optimization here. The same thing implemented with Bridj or JNR outperforms JNI+JavaCpp just because of faster allocation, see the benchmark at https://github.com/zakgof/java-native-benchmark.

Say, with Bridj allocation takes <50% of the overall time:

image

@saudet
Copy link
Member

saudet commented May 13, 2019

Memory allocation with MSVC is known to be slow, that's not really JavaCPP's fault. JNA and BridJ don't use C++ to allocate memory. We could allocate memory the same way for JavaCPP with, for example, Pointer.malloc(), cast it to SYSTEMTIME, and that should be faster. Could you give that a try?

@zakgof
Copy link
Author

zakgof commented May 13, 2019

The below code performs indeed much faster:

		Pointer raw = Pointer.malloc(systemTimeStructLength); // precalculated as systemTimeStructLength = new SYSTEMTIME().sizeof()
		SYSTEMTIME systemtime = new SYSTEMTIME(raw);
		windows.GetSystemTime(systemtime);
		return systemtime.wSecond();

image

Now the question is, why not to generate a struct's default constructor implementation with Pointer.malloc instead ?

@saudet
Copy link
Member

saudet commented May 13, 2019

We could, but it wouldn't be C++ :) I think Win32 doesn't throw C++ exceptions though, so we can probably speed this up with a @NoException like here:
https://github.com/bytedeco/javacpp-presets/blob/master/mkl/src/main/java/org/bytedeco/mkl/presets/mkl_rt.java#L56

@saudet
Copy link
Member

saudet commented May 13, 2019

Ah, no, we already have @NoException there. One other thing to be careful about on Windows: Memory deallocation is excruciatingly slow when a lot of memory is allocated, so make sure to deallocate as fast as possible. In this case, this will deallocate right away just before return:

try (SYSTEMTIME systemtime = new SYSTEMTIME()) {
    windows.GetSystemTime(systemtime);
    return systemtime.wSecond();
}

@zakgof
Copy link
Author

zakgof commented May 13, 2019

SYSTEMTIME is a C struct (with no constructor), and I believe that library users would prefer faster implementation with C rather than a slower one with C++.

I'd suggest modifying the parser to

  • precalculate sizeof() for a struct at parsing time, and let the sizeof() return the precalculated constant (currently it checks the members map which is definitely slower).
  • if a struct(class) has no constructors defined and no parent classes, implement its Java counterpart constructor with malloc. Or, at least provide a new static method:
class SYSTEMTIME extends Pointer {

    private static final long STRUCT_SIZE = 16; // Calculated at generation time

    public static long sizeof() {
        return STRUCT_SIZE;
    }

    public static SYSTEMTIME malloc() {
        return new SYSTEMTIME(Pointer.malloc(STRUCT_SIZE));
    }
}

@saudet
Copy link
Member

saudet commented May 13, 2019

Before we start modifying everything just because MSVC allocation is slow, let's check how the try-with-resources version performs. It should work well enough.

@saudet
Copy link
Member

saudet commented May 16, 2019

Actually, no, C++ allocation isn't the bottleneck at all here. It's the deallocator registration which is slow. Pointer.malloc() doesn't register any deallocator, so that's why it's fast.

@saudet
Copy link
Member

saudet commented Jun 12, 2019

More than half the time seems to be spent by the garbage collector browsing through the doubly-linked list of phantom references. If that's the case, there might not be much we can do about this other than simply not rely on the GC at all. The JDK itself uses doubly-linked lists for its own use of phantom references:
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/ref/Cleaner.java
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/ref/PhantomCleanable.java
BTW, JDK 11 seems to be a lot better at this than JDK 8. Make sure to upgrade your JDK!

@saudet
Copy link
Member

saudet commented Dec 3, 2021

FYI, starting with JavaCPP 1.5.6, we can now skip all that overhead and get very low latency by setting the "org.bytedeco.javacpp.nopointergc" system property to "true", see tensorflow/java#313.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants