-
Notifications
You must be signed in to change notification settings - Fork 736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FFI/Jtreg_JDK21] IllegalArgumentException detected in TestNested.java on AIX #18287
Comments
Based on failing test cases are as follows:
plus the corresponding native code at https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/9e4783d1f1520d34f5261f8e94ec5d22004ade08/test/jdk/java/foreign/nested/libNested.c#L44C1-L46C60
against the captured exceptions in the description, it indicates that there might be a padding issue with
to debug the test as follows:
which means the failing jtreg test struct |
To isolate the test cases, I modified the failing test struct as follows:
and disabled the check on the returned value at https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/9e4783d1f1520d34f5261f8e94ec5d22004ade08/test/jdk/java/foreign/nested/TestNested.java#L73
plus the modified native code (to simply return the passed-in struct rather than triggering the upcall thunk) at https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/9e4783d1f1520d34f5261f8e94ec5d22004ade08/test/jdk/java/foreign/nested/libNested.c#L79 to ensure it only works for downcall.
which works good and the failing tests passed without any issue. That being said, there is no issue with downcall as long as the padding bytes are correct on AIX. However, it crashed when invoking the upcall thunk (not yet reaching the upcall dispather) when enabling the upcall in the native code:
with the following stacktrace:
Considering there is no issue on pLinux and other platforms in these tests, there should be no problem with the encoded native signature as they are shared on all supported platforms, in which case it is mostly likely to be an issue with this type of struct in the upcall thunk on AIX. |
@zl-wang, could you please help take a look at what really happened to the thunk on AIX? |
The changes reflect the actual padding bytes of the failing test structs support double on AIX. Related: #eclipse-openj9/openj9/issues/18287 Signed-off-by: ChengJin01 <jincheng@ca.ibm.com>
all of your modifications still made the struct as if it were packed (i.e. members are not aligned right). is it intended to be that way? the correct padding should look like: char, 7-byte padding, two-double-array, char (with or without padding i think would be fine). |
i guessed the struct C declaration might look like:
Is my guess right? then, 7-byte padding should be expected. the size is 32-byte i think. |
Not really. It is 3 bytes in terms of the padding detected at #18287 (comment)
|
3-byte padding is a valid choice, but only for 32-bit mode (i.e. 32-bit executable). it is not right for 64-bit mode (our JVM mode ... we don't do 32bit build after java8). |
it is an easy test: xlC -q64 your-C-test.c to see if 3-byte or 7-byte padding is applied. |
Here's the result of the test as follows:
in which I didn't notice there is any difference in padding as compared to the 32-bit mode. |
that looks like a compiler bug. we can pursue this bug separately. in any case, what did you present to the thunk generation mechanism? it should be AGGREGATE_OTHER (if I remember correctly) with 24-byte in size, then it should work. |
That' correct as the struct size for upcall is computed ahead of time in the java code before passing it to
If we change the code in https://github.com/eclipse-openj9/openj9/blob/master/runtime/vm/LayoutFFITypeHelpers.hpp for the native signature encoding on AIX, does it mean the upcall thunk should be updated accordingly to differentiate AIX from other platform in such case? |
yes, different AGGREGATE_OTHER and size combination presented to the thunk gen ... different thunk code(s) will be generated to fit with the upcall. i hoped the signature coming in on AIX already has the correct padding identified (at least for the time being as it is). |
If so, this is something we should coordinate to ensure the new type (.e.g. |
that sounds unsustainable. you cannot cover all these unlimited possibilities ... C_D, C_C_D, etc etc i remembered you told us there is signature with correct padding available to you. |
You mean the existing definition we already have now in the code? but I can't determine which one defined from openj9/runtime/oti/j9nonbuilder.h Line 6047 in d5a2e02
|
AGGREGATE_OTHER is the suitable/correct one to encode to ... of course need the right size (24 on AIX and 32 on others). |
Then there is no need to add a new definition given the correct struct size is already set to |
With the following struct in the test on AIX:
the printing-out message from the code seems correct in terms of sigArray input. Now the question is why it crashed in there.
|
If a bug, it's not clear how it could be fixed without breaking compatibility with existing binaries. |
it is a bug from expectation perspective (aligned the same as other 64bit platforms). there is no compatibility issue here (i.e. it is consistent always within AIX). |
If the first failure relates to this: static final StructLayout S9 = MemoryLayout.structLayout(
C_CHAR.withName("f0"),
MemoryLayout.paddingLayout(7),
MemoryLayout.sequenceLayout(2, C_DOUBLE).withName("f1"),
C_CHAR.withName("f2"),
MemoryLayout.paddingLayout(7),
S8.withName("f3")
).withName("S9"); then I believe the test is just broken in that it assumes 7 bytes of padding between Alternatively, the checking in |
I had the same expectation, but I don't see we have any alternatives to working with what the compiler does (and I expect must continue to do). Calling it a "bug" suggests it might someday be fixed, but, as I said, I don't think this can be "fixed". |
@ChengJin01 David (@edelsohn ) provided a few informative libffi links as below: To verify if this PR is the same, you can test out this below struct: It is a simplified emulation of this PR's problematic (for libffi) struct. Exactly the same size, padding, and memory layout ... |
That's how we implemented the union in FFI at #18291
This is something I already verified previous by modifying the failing test which crashed in the same way as |
i didn't notice the ending *p. I intended to say testing out this one: |
After changing the java code of
plus the native code as follows:
the failing test for S9 passed without any issue. That being said, the crash occurred only when there is a nested double array in the struct. |
As @zl-wang quoted the libffi documentation link that I sent him, arrays are not first-class objects in libffi. Arrays and nested arrays do not function in libffi reliably. Can the Java code pretend that the pair of doubles are non-nested members of the struct and claim victory? |
Theoretically we could do by changing the array to primitives in the java code that but there is no easy way to determine determine what scenario it should be replaced with simply primitives (rather than a nested array) given |
from the testings we have done (which case succeeded and which failed) and what @edelsohn described arrays and nested-arrays as unreliable for libffi, I think we can claim victory from java perspective for the time being. @edelsohn is this-unreliability happening on other platforms too? plus, the main change (for specific struct) is in C code when you fill in the ffi_type struct in order to use libffi where you can emulate arrays as sequence of scalar in providing struct definition to libffi. the java-side changes are only relevant to calculating field-offset within C struct for java code to access them in the corresponding/reflective MemorySegment. |
If one looks at the Python and libffi file history, it has been tweaked for multiple platforms, but fewer and fewer have the type of ABI quirks as AIX. Because the inlined double members behave as expected and the array doesn't, this probably is related to libffi emulating the array as a nested struct, and a struct with the first member a double has a stricter alignment and padding than an interior double. The "as if" behavior of libffi is not correct in this case and the member position doesn't agree. |
It's not clear that case is broken either, I tried this case in an libffi test. We preform a call which takes the struct by value, increments the members and returns the struct and everything comes back fine. The struct:
How Clang computes layout on AIX:
The libffi type description:
The resulting size and alignment from libffi:
I think to clarify the crashing case, we'd need to be clear on what the ffitype description java passes to fficall looks like. That would help clearly identify if the libffi_call is faulty.
|
The code in there (java & native) is correct to generate the simplified string like Your case is wrong as it is verifying Based on what we observed in the dump, the elements |
Maybe it doesn't matter but we are using |
|
Ah point taken, I've clearly misread that. Yes, while the call works libffi and the compiler do disagree about the size, and when I add the extra parameter I get the crash you are seeing. |
Hi @daltenty, is there any update on your side? |
I've posted a PR to libffi to hopefully correct the layout computation: libffi/libffi#805 I think you can try applying that patch to the openj9 local copy of libffi to test and see if it resolves the issue we're seeing here. |
Many thanks. I will try this patch to in our code to see how it goes. |
I've verified the path offered by @daltenty at libffi/libffi#805 with a recompiled build which works good as expected and all test cases passed in the test suite:
|
The change simply adopts the fix at libffi/libffi#805 to resolve the the issue with the nested struct in libffi on AIX. Related: eclipse-openj9#18287 Signed-off-by: ChengJin01 <jincheng@ca.ibm.com>
Now that #18375 has been merged, is there an excluded set of tests that need to be enabled? |
TestNested seems to be disabled for all platforms for jdk21+: |
We will need to double-check the test suite again before enabling it on all supported platforms. |
This is waiting for libffi/libffi#805 to merge so we can update to the approved content. See #18375 (comment) |
libffi/libffi#805 has merged without further changes. Closing this. |
With my changes (intended for union in downcall & upcall) at ChengJin01@5c3f1e4#diff-710539390ec016913e8ff24f64faf40f7e64cc9de4b1f113e9f0c9820b22908e, two of the subtests in
testNested
at https://github.com/ibmruntimes/openj9-openjdk-jdk21/blob/openj9/test/jdk/java/foreign/nested/TestNested.java failed on AIX with the following exceptions:FYI: @tajila, @pshipton
The text was updated successfully, but these errors were encountered: