Internal test failures in MathLoadTest_autosimd #14755

Open
cjjdespres opened this issue Mar 21, 2022 · 12 comments
Labels
comp:jitserver Artifacts related to JIT-as-a-Service project

Comments

@cjjdespres
Contributor

Observed in the JITAAS pipeline tests:

MathLoadTest_autosimd_CS_5m_0

  • job/Test_openjdk8_j9_sanity.system_x86-64_linux_jit_Personal/292/

MathLoadTest_autosimd_special_5m_29

  • job/Test_openjdk17_j9_special.system_x86-64_linux_jit_Personal/213/
  • job/Test_openjdk8_j9_special.system_x86-64_linux_jit_Personal/291/
  • job/Test_openjdk17_j9_special.system_x86-64_linux_jit_Personal/217/

Console output for MathLoadTest_autosimd_CS_5m_0

MLT >>> Captured test output >>>
MLT Test failed:
MLT java.lang.RuntimeException: test failure
MLT 	at net.adoptopenjdk.test.autosimd.AutoSIMDTestInteger.checkThat(AutoSIMDTestInteger.java:583)
MLT 	at net.adoptopenjdk.test.autosimd.AutoSIMDTestInteger.checkThat(AutoSIMDTestInteger.java:578)
MLT 	at net.adoptopenjdk.test.autosimd.AutoSIMDTestInteger.testSimpleBinary(AutoSIMDTestInteger.java:373)
MLT 	at net.adoptopenjdk.test.autosimd.AutoSIMDTestInteger.testSub(AutoSIMDTestInteger.java:264)
MLT 	at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
MLT 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
MLT 	at java.lang.reflect.Method.invoke(Method.java:498)
MLT 	at net.adoptopenjdk.loadTest.adaptors.ArbitraryJavaAdaptor.executeTest(ArbitraryJavaAdaptor.java:102)
MLT 	at net.adoptopenjdk.loadTest.LoadTestRunner$2.run(LoadTestRunner.java:182)
MLT 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
MLT 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
MLT 	at java.lang.Thread.run(Thread.java:832)
MLT <<<

I haven't been able to reproduce MathLoadTest_autosimd_special_5m_29 in Grinder, but MathLoadTest_autosimd_CS_5m_0 was reproduced with 1/80 failures in job/Grinder/21725/ with JITAAS and 0/200 failures in job/Grinder/21737/ without JITAAS.

@mpirvu added the comp:jitserver label on May 17, 2022
@cjjdespres
Contributor Author

I looked at this a bit more, since this still happens fairly regularly in the nightly tests.

  1. It still seems to happen only with JITServer.
  2. It's reproducible locally, but very inconsistently. This is true even when I use a fixed seed from a nightly test failure as the initial seed for the test.
  3. I thought what was being tested was the optimization pass implemented by TR_SPMDKernelParallelizer; that's the only thing I can see that relates to autosimd, anyway. However, if I pass in -Xjit:disableAutoSIMD the test still fails in grinder, and it still fails even if I use -Xjit:disableSIMD.
  4. The test may be timing-sensitive - out of curiosity, I ran the test locally and tried to get compilation logs for certain methods, but doing that appears to cause the failure to go away. This could also be due to the underlying low failure rate.
  5. The test is failing right at this line in the testSub case in AutoSIMDTestInteger. I've also seen a failure in that same place in AutoSIMDTestLong.

I'll see if I can narrow down the failure in grinder.

@cjjdespres
Contributor Author

The failure appears with -Xjit:limit={n*}, so that's some progress.

@cjjdespres
Contributor Author

I managed to get a failure (out of ~350 attempts) with the following limit file:

+ (hot) net/adoptopenjdk/test/autosimd/AutoSIMDTestFloat.testCopy()V @ 00007FB170C80C38-00007FB170C9410C OrdinaryMethod 4.40% T Q_SZ=12 Q_SZI=9 QW=150 j9m=00000000005A0150 bcsz=425 OSR remote time=3260847us mem=[region=128 system=2048]KB compThreadID=0 CpuLoad=775%(96%avg) JvmCpu=396% queueTime=6369636us
+ (warm) net/adoptopenjdk/test/autosimd/AutoSIMDTestLong.testDiv()V @ 00007FB170C02D18-00007FB170C02DE8 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=6 j9m=00000000005A5DC0 bcsz=11 remote time=12750us mem=[region=128 system=2048]KB compThreadID=3 CpuLoad=726%(90%avg) JvmCpu=623% queueTime=12782us
+ (warm) net/adoptopenjdk/test/autosimd/AutoSIMDTestLong.testCopy()V @ 00007FB170C04578-00007FB170C064E9 OrdinaryMethod - Q_SZ=19 Q_SZI=19 QW=102 j9m=00000000005A5D40 bcsz=425 remote time=108229us mem=[region=128 system=2048]KB compThreadID=1 CpuLoad=358%(44%avg) JvmCpu=139% queueTime=163413us
+ (cold) net/adoptopenjdk/test/autosimd/AutoSIMDTestDouble.sourceData1(I)D @ 00007FB170C03038-00007FB170C03118 OrdinaryMethod - Q_SZ=33 Q_SZI=33 QW=110 j9m=000000000059EA60 bcsz=10 remote time=2996us mem=[region=64 system=2048]KB compThreadID=0 CpuLoad=358%(44%avg) JvmCpu=139% queueTime=72158us
+ (hot) net/adoptopenjdk/test/autosimd/AutoSIMDTestInteger.testSimpleBinary(Lnet/adoptopenjdk/test/autosimd/BinaryOpSIMDInteger;)V @ 00007FB170CD5880-00007FB170CE5F80 OrdinaryMethod 3.10% T Q_SZ=1 Q_SZI=0 QW=60 j9m=000000000059D370 bcsz=832 OSR remote time=2222907us mem=[region=128 system=2048]KB compThreadID=3 CpuLoad=755%(94%avg) JvmCpu=542% queueTime=2223404us
+ (warm) net/adoptopenjdk/test/autosimd/AutoSIMDTestFloat.populate([FLnet/adoptopenjdk/test/autosimd/IntToFloat;II)V @ 00007FB170C00FA8-00007FB170C01484 OrdinaryMethod C Q_SZ=1 Q_SZI=0 QW=78 j9m=00000000005A0310 bcsz=30 remote time=17747us mem=[region=64 system=2048]KB compThreadID=1 CpuLoad=778%(97%avg) JvmCpu=191% queueTime=99554us
+ (cold) net/adoptopenjdk/test/autosimd/AutoSIMDTestDouble.sourceData2(I)D @ 00007FB170C02CD8-00007FB170C02EA8 OrdinaryMethod - Q_SZ=43 Q_SZI=43 QW=130 j9m=000000000059EA80 bcsz=40 remote time=5062us mem=[region=128 system=2048]KB compThreadID=0 CpuLoad=358%(44%avg) JvmCpu=139% queueTime=90317us
+ (warm) net/adoptopenjdk/test/autosimd/AutoSIMDTestInteger.testDiv()V @ 00007FB170C03AB8-00007FB170C03B74 OrdinaryMethod - Q_SZ=8 Q_SZI=7 QW=50 j9m=000000000059D310 bcsz=11 remote time=14862us mem=[region=128 system=2048]KB compThreadID=3 CpuLoad=756%(94%avg) JvmCpu=544% queueTime=2061781us
+ (cold) net/adoptopenjdk/test/autosimd/AutoSIMDTestLong.simdLincombWithTemp([J[J[JJJI)V @ 00007FB170C1AF28-00007FB170C1B099 OrdinaryMethod - Q_SZ=80 Q_SZI=80 QW=198 j9m=00000000005A5CA0 bcsz=55 remote time=6834us mem=[region=64 system=2048]KB compThreadID=0 CpuLoad=575%(71%avg) JvmCpu=385% queueTime=171767us
+ (cold) net/adoptopenjdk/test/autosimd/AutoSIMDTestInteger.simdNeg([I[I[II)V @ 00007FB170C1BEA8-00007FB170C1C005 OrdinaryMethod - Q_SZ=18 Q_SZI=18 QW=64 j9m=000000000059D1D0 bcsz=32 remote time=3418us mem=[region=64 system=2048]KB compThreadID=1 CpuLoad=575%(71%avg) JvmCpu=385% queueTime=68449us
+ (warm) net/adoptopenjdk/test/autosimd/AutoSIMDTestLong.populate([JLnet/adoptopenjdk/test/autosimd/IntToLong;II)V @ 00007FB170C014E8-00007FB170C016CD OrdinaryMethod C Q_SZ=0 Q_SZI=0 QW=66 j9m=00000000005A5F00 bcsz=30 remote time=10501us mem=[region=64 system=2048]KB compThreadID=1 CpuLoad=778%(97%avg) JvmCpu=191% queueTime=13461us
+ (cold) net/adoptopenjdk/test/autosimd/AutoSIMDTestDouble.testFill()V @ 00007FB170C16B18-00007FB170C192E8 OrdinaryMethod - Q_SZ=122 Q_SZI=122 QW=292 j9m=000000000059E900 bcsz=743 remote time=21100us mem=[region=128 system=2048]KB compThreadID=0 CpuLoad=575%(71%avg) JvmCpu=385% queueTime=288337us
+ (hot) net/adoptopenjdk/test/autosimd/AutoSIMDTestLong.populate([JLnet/adoptopenjdk/test/autosimd/IntToLong;II)V @ 00007FB170C01008-00007FB170C013E4 OrdinaryMethod 1.60% T Q_SZ=0 Q_SZI=0 QW=30 j9m=00000000005A5F00 bcsz=30 OSR remote time=50426us mem=[region=64 system=2048]KB compThreadID=3 CpuLoad=737%(92%avg) JvmCpu=621% queueTime=167944us
+ (cold) net/adoptopenjdk/test/autosimd/AutoSIMDTestLong.simdDiv([J[J[JI)V @ 00007FB170C1FF28-00007FB170C200BA OrdinaryMethod - Q_SZ=28 Q_SZI=28 QW=80 j9m=00000000005A5C60 bcsz=42 remote time=3703us mem=[region=64 system=2048]KB compThreadID=2 CpuLoad=575%(71%avg) JvmCpu=385% queueTime=53507us
+ (hot) net/adoptopenjdk/test/autosimd/AutoSIMDTestDouble.testCopy()V @ 00007FB170C5B818-00007FB170C6D6C0 OrdinaryMethod 5.60% T Q_SZ=1 Q_SZI=0 QW=96 j9m=000000000059E920 bcsz=425 OSR remote time=2981576us mem=[region=128 system=2048]KB compThreadID=1 CpuLoad=778%(97%avg) JvmCpu=191% queueTime=3492119us
+ (warm) net/adoptopenjdk/test/autosimd/AutoSIMDTestInteger.simdPartialSums([I[II)V @ 00007FB170C039E8-00007FB170C03A62 OrdinaryMethod - Q_SZ=9 Q_SZI=7 QW=138 j9m=000000000059D250 bcsz=30 remote time=12324us mem=[region=64 system=2048]KB compThreadID=3 CpuLoad=778%(97%avg) JvmCpu=377% queueTime=1100464us

I tried to refine that list by running the test over the weekend, randomly permuting that limit file before each run and limiting to the first half of it. I got no failures in ~3000 runs.

Given where the failure occurs, I'm assuming that it's net/adoptopenjdk/test/autosimd/AutoSIMDTestInteger.testSimpleBinary(Lnet/adoptopenjdk/test/autosimd/BinaryOpSIMDInteger;)V that's the problem, if it is just a single method.

@cjjdespres
Contributor Author

I tried getting compilation logs for AutoSIMDTestInteger.testSimpleBinary. I did get some during a crash, and couldn't see anything wrong with the trees or generated code. However, I'm fairly sure that the logs I got were for a compilation that did not cause the crash. I think that adding log options to one of the methods that can cause the failure (AutoSIMDTestInteger.testSimpleBinary or AutoSIMDTestLong.testSimpleBinary) will make that particular method less likely to fail. So, likely what happened was that AutoSIMDTestLong.testSimpleBinary was the method that was involved in the crash in that run, and I just didn't notice.

So, I tried getting logs for AutoSIMDTestInteger.testSimpleBinary while excluding AutoSIMDTestLong.testSimpleBinary. This time, I actually started getting crashes at the server. One in particular is:

Unhandled exception
Type=Segmentation error vmState=0x000566a0
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007FD36E9CAE30 Handler2=00007FD36E533B70 InaccessibleAddress=00007FD2E0B4ADE8
RDI=00007FD2E0B4ADE8 RSI=00007FD36FD38320 RAX=0000000000000DE8 RBX=00007FD354ADBF70
RCX=0000000000000000 RDX=00007FD354ADC110 R8=0000000000000000 R9=00007FD36FD38220
R10=00007FD2E0B4ADE8 R11=00007FD36FD382A0 R12=0000000000000073 R13=00000000FFFFFFFF
R14=0000000000000000 R15=00007FD354ADC0E8
RIP=00007FD36FC6485C GS=0000 FS=0000 RSP=00007FD354ADBA18
EFlags=0000000000010283 CS=0033 RBP=0000000000000000 ERR=0000000000000004
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=00007FD2E0B4ADE8
xmm0=00007fd222422000 (f: 574758912.000000, d: 6.943623e-310)
xmm1=0000003000000010 (f: 16.000000, d: 1.018558e-312)
xmm2=0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm3=0000000000000005 (f: 5.000000, d: 2.470328e-323)
xmm4=0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm5=3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
xmm6=0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm7=0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm8=ffffffff00000000 (f: 0.000000, d: -nan)
xmm9=195a1403141f392e (f: 337590560.000000, d: 1.498373e-186)
xmm10=121b14195a140314 (f: 1511260928.000000, d: 1.872785e-221)
xmm11=0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm12=0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm13=0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm14=0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm15=0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=/lib64/libc.so.6
Module_base_address=00007FD36FBA8000

Unhandled exception in signal handler. Protected function: writeCrashDataToConsole (0x0)

----------- Stack Backtrace -----------
__strlen_evex+0x1c (0x00007FD36FC6485C [libc.so.6+0xbc85c])
_IO_vfprintf+0x1e6d (0x00007FD36FC1029D [libc.so.6+0x6829d])
vsnprintf+0x94 (0x00007FD36FC369C4 [libc.so.6+0x8e9c4])
_ZN22TR_PrettyPrinterString7appendfEPKcz+0xa5 (0x00007FD36025E7D5 [libj9jit29.so+0x91a7d5])
_ZN8TR_Debug13printNodeInfoEPN2TR4NodeER22TR_PrettyPrinterStringb+0xa15 (0x00007FD360260A65 [libj9jit29.so+0x91ca65])
_ZN8TR_Debug5printEPN2TR11FilePointerEPNS0_4NodeEjb+0xdc (0x00007FD36026474C [libj9jit29.so+0x92074c])
_ZN8TR_Debug5printEPN2TR11FilePointerEPNS0_4NodeEjb+0x5b4 (0x00007FD360264C24 [libj9jit29.so+0x920c24])
_ZN8TR_Debug12printIRTreesEPN2TR11FilePointerEPKcPNS0_20ResolvedMethodSymbolE+0x11b (0x00007FD36025E27B [libj9jit29.so+0x91a27b])
_ZN3OMR11Compilation15dumpMethodTreesEPKcPN2TR20ResolvedMethodSymbolE+0x40 (0x00007FD35FEE4620 [libj9jit29.so+0x5a0620])
_ZN3OMR9Optimizer8optimizeEv+0x236 (0x00007FD3600D76B6 [libj9jit29.so+0x7936b6])
_ZN3OMR11Compilation7compileEv+0xb59 (0x00007FD35FEE7C49 [libj9jit29.so+0x5a3c49])
_ZN2TR28CompilationInfoPerThreadBase7compileEP10J9VMThreadPNS_11CompilationEP17TR_ResolvedMethodR11TR_J9VMBaseP19TR_OptimizationPlanRKNS_16SegmentAllocatorE+0x4a2 (0x00007FD35FAB8DE2 [libj9jit29.so+0x174de2])
_ZN2TR28CompilationInfoPerThreadBase14wrappedCompileEP13J9PortLibraryPv+0x315 (0x00007FD35FAB9E15 [libj9jit29.so+0x175e15])
omrsig_protect+0x2a7 (0x00007FD36E534957 [libj9prt29.so+0x28957])
_ZN2TR28CompilationInfoPerThreadBase7compileEP10J9VMThreadP21TR_MethodToBeCompiledRN2J917J9SegmentProviderE+0x379 (0x00007FD35FAB7869 [libj9jit29.so+0x173869])
_ZN2TR30CompilationInfoPerThreadRemote12processEntryER21TR_MethodToBeCompiledRN2J917J9SegmentProviderE+0xd89 (0x00007FD35FB2B7E9 [libj9jit29.so+0x1e77e9])
_ZN2TR24CompilationInfoPerThread14processEntriesEv+0x387 (0x00007FD35FAB6B27 [libj9jit29.so+0x172b27])
_ZN2TR24CompilationInfoPerThread3runEv+0x42 (0x00007FD35FAB6E42 [libj9jit29.so+0x172e42])
_Z30protectedCompilationThreadProcP13J9PortLibraryPN2TR24CompilationInfoPerThreadE+0x82 (0x00007FD35FAB6EF2 [libj9jit29.so+0x172ef2])
omrsig_protect+0x2a7 (0x00007FD36E534957 [libj9prt29.so+0x28957])
_Z21compilationThreadProcPv+0x174 (0x00007FD35FAB72B4 [libj9jit29.so+0x1732b4])
thread_wrapper+0x162 (0x00007FD36E2FFF12 [libj9thr29.so+0x9f12])
start_thread+0xea (0x00007FD36FF861CA [libpthread.so.0+0x81ca])
clone+0x43 (0x00007FD36FBE18D3 [libc.so.6+0x398d3])
---------------------------------------
JVMDUMP039I Processing dump event "gpf", detail "" at 2024/07/24 20:46:26 - please wait.
JVMDUMP032I JVM requested System dump using '/home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/bin/core.20240724.204626.2025265.0001.dmp' in response to an event
JVMDUMP010I System dump written to /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/bin/core.20240724.204626.2025265.0001.dmp
JVMDUMP032I JVM requested Java dump using '/home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/bin/javacore.20240724.204626.2025265.0002.txt' in response to an event
JVMDUMP012E Error in Java dump: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/bin/javacore.20240724.204626.2025265.0002.txt
JVMDUMP032I JVM requested Snap dump using '/home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/bin/Snap.20240724.204626.2025265.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/bin/Snap.20240724.204626.2025265.0003.trc
JVMDUMP032I JVM requested JIT dump using '/home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/bin/jitdump.20240724.204626.2025265.0004.dmp' in response to an event
JVMDUMP051I JIT dump occurred in 'JIT Compilation Thread-004' thread 0x0000000000D3B400
JVMDUMP049I JIT dump notified all waiting threads of the current method to be compiled
JVMDUMP054I JIT dump is tracing the IL of the method on the crashed compilation thread
JVMDUMP052I JIT dump recursive crash occurred on diagnostic thread
JVMDUMP048I JIT dump method being compiled is an ordinary method
JVMDUMP053I JIT dump is recompiling net/adoptopenjdk/test/autosimd/AutoSIMDTestInteger.testSimpleBinary(Lnet/adoptopenjdk/test/autosimd/BinaryOpSIMDInteger;)V
Assertion failed at /home/despresc/dev/testing/openj9-openjdk-jdk21/openj9/runtime/compiler/env/J9KnownObjectTable.cpp:401: (numOfEntries == endIndex)
VMState: 0x00050080
	The client table size 46 is different from the server table size 30
compiling net/adoptopenjdk/test/autosimd/AutoSIMDTestInteger.testSimpleBinary(Lnet/adoptopenjdk/test/autosimd/BinaryOpSIMDInteger;)V at level: hot
#0: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x8fd0b5) [0x7fd3602410b5]
#1: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x90a5d0) [0x7fd36024e5d0]
#2: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x5e336e) [0x7fd35ff2736e]
#3: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x5e3ecd) [0x7fd35ff27ecd]
#4: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x2209ed) [0x7fd35fb649ed]
#5: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x5a0658) [0x7fd35fee4658]
#6: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x79377a) [0x7fd3600d777a]
#7: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x5bd277) [0x7fd35ff01277]
#8: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x5a32ef) [0x7fd35fee72ef]
#9: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x174de2) [0x7fd35fab8de2]
#10: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x175e15) [0x7fd35fab9e15]
#11: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9prt29.so(+0x28957) [0x7fd36e534957]
#12: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x1739ba) [0x7fd35fab79ba]
#13: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x1e77e9) [0x7fd35fb2b7e9]
#14: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x172b27) [0x7fd35fab6b27]
#15: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x172e42) [0x7fd35fab6e42]
#16: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x172ef2) [0x7fd35fab6ef2]
#17: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9prt29.so(+0x28957) [0x7fd36e534957]
#18: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9jit29.so(+0x1732b4) [0x7fd35fab72b4]
#19: /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/lib/default/libj9thr29.so(+0x9f12) [0x7fd36e2fff12]
#20: /lib64/libpthread.so.0(+0x81ca) [0x7fd36ff861ca]
#21: /lib64/libc.so.6(clone+0x43) [0x7fd36fbe18d3]

JVMDUMP052I JIT dump recursive crash occurred on diagnostic thread
JVMDUMP010I JIT dump written to /home/despresc/dev/testing/openj9-openjdk-jdk21/build/linux-x86_64-server-release/images/jdk/bin/jitdump.20240724.204626.2025265.0004.dmp
JVMDUMP013I Processed dump event "gpf", detail "".

If I understand correctly, the server encountered an inaccessible address while attempting to write the compilation log for the method. Then, while dumping the IL of the failing method, it hit a mismatch in the size of the known object table. Maybe that second crash is expected? I'm not sure whether the server can become out of sync with the client when a crash occurs on a server compilation thread; other crashes have not had this secondary crash on the diagnostic thread. Otherwise, there could be a genuine problem with the known object table.

@cjjdespres
Contributor Author

I looked into the server crash a bit more. It was crashing after printing a node introduced in the SPMDKernelParallelizer. I think it might have been corrupted during the isub -> vsub op conversion, though it's possible that the problem was introduced before that.

Since it's a little long, I'll just attach the compilation log that I got: tracelogInteger.1722517.server.140.zip

What I think is happening is:

  1. We start with block_143, which I'm pretty sure is vectorCombine inlined where it's called on this line, i.e. just before the place this test is failing. (I think the refined array shadows all refer to the same dst array).
  2. After some checks, this block is identified as a loop suitable for autosimd at the line starting with: 143 and piv = 931 collected for Auto-Vectorization.
  3. Judging from the message Checking Loop ==== (iter = -1, unroll = 16, vectorSize = 4) and unroller._spillLoopRequired 1 and the subsequent logs, the optimizer creates a spill loop for loop 143, which then gets unrolled into three single iterations over the three blocks 2115, 2116, 2117 (see the sketch after this list for the general shape of a vectorized main loop plus spill loop).
  4. The SPMDKernelParallelizer then visits the original block 143 and the three iteration blocks, and transforms every isub instruction under the first istorei into a vsub. The three iteration blocks are also fused together into a single block 2115. (I'm not sure that it's valid to perform that transformation on the three iteration blocks, incidentally, but it's hard for me to say.)
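
For reference, here is the general shape of that main-loop-plus-spill-loop transformation as I understand it. This is a standalone C++ sketch, not OpenJ9/OMR code (the real transformation rewrites IL trees, turning isub nodes under istorei into vsub), and the function and variable names are made up for illustration.

#include <cstddef>

// Sketch: auto-vectorization of an element-wise subtraction, assuming a
// vector width of 4 ints (vectorSize = 4 in the log).
void subtractArrays(int *dst, const int *a, const int *b, std::size_t n)
   {
   std::size_t i = 0;
   // Main vectorized loop: 4 subtractions per iteration, conceptually one vsub.
   for (; i + 4 <= n; i += 4)
      {
      for (std::size_t lane = 0; lane < 4; ++lane)
         dst[i + lane] = a[i + lane] - b[i + lane];
      }
   // Scalar "spill" (residual) loop for the leftover 0-3 elements; if I'm
   // reading the log correctly, this is what gets unrolled into the single
   // iterations in blocks 2115, 2116 and 2117.
   for (; i < n; ++i)
      dst[i] = a[i] - b[i];
   }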

After this, the log is slightly corrupted. The isub line in block 143 goes from:

n4382n      isub                                                                              [0x7fc6e87d4970] bci=[11,20,-] rc=1 vc=10258 vn=207 li=- udi=- nc=2

to this vsub line:

n4382n      vsubp<97><93>^CÇ^?                                                                        [0x7fc6e87d4970] bci=[11,20,-] rc=1 vc=15174 vn=- li=- udi=- nc=2

The vsub lines in block 2115 exhibit the same corruption. The server crashes a little while later, just as it starts printing the vsub line of block 2115.

I'm not very familiar with how autosimd, or the node/tree creation and cloning going on here, is supposed to work. Marius mentioned to me that @hzongaro might be able to help, or suggest someone who might be able to, which would be appreciated.

@cjjdespres
Contributor Author

I said in #14755 (comment) that block 143 looked like the vectorCombine inlined for the "all three aliased" case, but in its first appearance (right after inlining) it looks more like one of the earlier out-of-bounds cases.

Also, I looked at the server core (the one dumped during the crash while writing that log file), and the node that the server was attempting to print had an _opCode._opCode of 1681, which corresponds to vector operation 59, which has a table index of 676, and that does correspond to vsub.

However, immediately after printing the opcode name, the log writer tries to print getName(opcode.getVectorResultDataType()), since this is a single-type vector opcode, which should result in the full vector name appearing in the log. But getVectorResultDataType() is:

   static TR::DataType getVectorResultDataType(ILOpCode op)
      {
      TR_ASSERT_FATAL(isVectorOpCode(op), "getVectorResultDataType() can only be called for vector opcode\n");

      TR::ILOpCodes opcode = op._opCode;

      return (opcode < (TR::NumScalarIlOps + TR::NumOneVectorTypeOps)) ?
             (TR::DataTypes)((opcode - TR::NumScalarIlOps) % TR::NumVectorTypes + TR::FirstVectorType) :
             (TR::DataTypes)(((opcode - TR::NumScalarIlOps - TR::NumOneVectorTypeOps) % (TR::NumVectorTypes * TR::NumVectorTypes)) % TR::NumVectorTypes + TR::FirstVectorType);
      }

and we must have gone down the true branch of that conditional expression. But for me, according to gdb,

opcode = 1681
TR::NumScalarIlOps = 617
TR::NumVectorTypes = 18
TR::FirstVectorType = 17
====>
(opcode - TR::NumScalarIlOps) % TR::NumVectorTypes + TR::FirstVectorType = 19

and the inaccessible address that the log writer tried to access is exactly OMRDataTypeNames[19]! So the server was trying to print the vector result data type name when it crashed. Earlier in the log, then, that memory address wasn't inaccessible yet; it was just full of garbage?
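
As a quick sanity check on that index arithmetic, here is a standalone snippet using the constant values I observed in the core dump (the names mirror the OMR enums but are just local constants here):

#include <cstdio>

int main()
   {
   // Values taken from gdb against my core dump; not authoritative definitions.
   const int opcode          = 1681; // TR::ILOpCodes value of the node being printed
   const int NumScalarIlOps  = 617;
   const int NumVectorTypes  = 18;
   const int FirstVectorType = 17;

   // Same expression as the true branch of getVectorResultDataType().
   int index = (opcode - NumScalarIlOps) % NumVectorTypes + FirstVectorType;
   std::printf("OMRDataTypeNames index = %d\n", index); // prints 19
   return 0;
   }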

The vector name initialization routine is here in omr:

https://github.com/eclipse/omr/blob/382516c26daf626cb7da336aaa0628569fd324a2/compiler/il/OMRDataTypes.cpp#L254-L270

In my core, the values of the constants are:

TR::NumVectorTypes = 18
TR::FirstVectorType = 17
TR::LastVectorType = 34

which does seem to check out. When I print the entries of OMRDataTypeNames in gdb, entries 0-8 inclusive ("NoType" to "Aggregate") are fine, entries 9-16 inclusive are all NULL, and the entries starting at 17 (including 19) are all inaccessible. So something odd must have happened with the memory allocated by TR_Memory::jitPersistentAlloc(MAX_TYPE_NAME_LENGTH * TR::NumVectorTypes*sizeof(char)).

@cjjdespres
Contributor Author

It looks like the documentation says that TR_MemoryBase::jitPersistentAlloc isn't thread-safe. I do see the following in getName:

const char *
OMR::DataType::getName(TR::DataType dt)
   {
   if (dt.isVector())
      {
      // to avoid any race conditions, initialize all vector names once,
      // as soon as first one is requested
      static bool staticallyInitialized = initVectorNames();
      TR_ASSERT_FATAL(staticallyInitialized && (OMRDataTypeNames[dt] != NULL), "Vector names should've been initialized");
      }

Could there be some strange race condition where two threads simultaneously call initVectorNames(), and the resulting jitPersistentAllocs conflict with each other? I don't think the static bool there is enough to prevent that, but I could be wrong.

@cjjdespres
Contributor Author

In addition to the possible unsolved race condition, the staticallyInitialized variable isn't even inspected before initVectorNames is called, so initVectorNames() is called every time getName is called.

@cjjdespres
Contributor Author

I'm remembering that static local variables are only initialized once, and this holds even if the relevant function is called simultaneously on multiple threads. So the existing code ought to be fine in that respect. It could still be that the jitPersistentAlloc is somehow conflicting with a simultaneous use of the global allocator elsewhere.
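
For what it's worth, here is a minimal standalone illustration of the guarantee I'm relying on (C++11 "magic statics"): the initializer of a function-local static runs exactly once even when many threads reach it concurrently. The names here are made up; this is not OpenJ9 code.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static std::atomic<int> initCount{0};

static bool initOnce()
   {
   initCount.fetch_add(1);
   return true;
   }

static void getNameLike()
   {
   // C++11 guarantees this initializer runs exactly once; other threads that
   // arrive while it is running block until initialization completes.
   static bool initialized = initOnce();
   (void)initialized;
   }

int main()
   {
   std::vector<std::thread> threads;
   for (int i = 0; i < 16; ++i)
      threads.emplace_back(getNameLike);
   for (auto &t : threads)
      t.join();
   std::printf("initializer ran %d time(s)\n", initCount.load()); // prints 1
   return 0;
   }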

@cjjdespres
Contributor Author

Or there's something weird going on with translation units.

@cjjdespres
Contributor Author

I now know what's actually going wrong with the server logs. The initialization function for the vector names is here, like I said above:

https://github.com/eclipse/omr/blob/382516c26daf626cb7da336aaa0628569fd324a2/compiler/il/OMRDataTypes.cpp#L254-L270

Notice that the memory for the names is allocated with TR_Memory::jitPersistentAlloc(). This allocates from J9::CompilerEnv::persistentMemory(), which is here:

TR_PersistentMemory *
J9::CompilerEnv::persistentMemory()
   {
#if defined(J9VM_OPT_JITSERVER)
   if (J9::PersistentInfo::_remoteCompilationMode == JITServer::SERVER)
      {
      auto compInfoPT = TR::compInfoPT;
      if (compInfoPT && compInfoPT->getPerClientPersistentMemory())
         {
         // Returns per-client persistent memory after enterPerClientPersistentAllocator() is called
         return compInfoPT->getPerClientPersistentMemory();
         }
      }
#endif
   return OMR::CompilerEnv::persistentMemory();
   }

If the JITServer is on a compilation thread and per-client memory is enabled, it uses that per-client memory; otherwise it uses the global persistent memory. But the vector names are initialized on demand within a compilation thread! So we're allocating the vector names using one client's memory, which gets freed when that client disconnects. That's how the vector name pointers in that array can suddenly point to inaccessible memory.
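
To illustrate the failure mode, here is a generic standalone sketch (not OpenJ9 code; the allocator and names are hypothetical): a process-wide name table that is lazily populated from a per-client arena ends up holding dangling pointers once that arena is torn down on disconnect.

#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a per-client persistent allocator.
struct ClientArena
   {
   std::vector<char *> blocks;
   char *alloc(std::size_t size)
      {
      char *p = new char[size];
      blocks.push_back(p);
      return p;
      }
   ~ClientArena()
      {
      for (char *p : blocks)
         delete[] p; // everything allocated for this client goes away on disconnect
      }
   };

// Process-wide table, analogous to OMRDataTypeNames: populated lazily, once.
static const char *typeNames[3] = {nullptr, nullptr, nullptr};

static void initNames(ClientArena &arena)
   {
   static const char *const sources[3] = {"VectorInt32", "VectorInt64", "VectorFloat"};
   for (int i = 0; i < 3; ++i)
      {
      // The bug being illustrated: the name strings live in one client's memory.
      char *buf = arena.alloc(std::strlen(sources[i]) + 1);
      std::strcpy(buf, sources[i]);
      typeNames[i] = buf;
      }
   }

int main()
   {
      {
      ClientArena firstClient;
      initNames(firstClient);            // lazy init on a "compilation thread"
      std::printf("%s\n", typeNames[0]); // fine while the client is connected
      }                                  // client disconnects, arena is freed
   // The table still holds the old pointers: reading them now is undefined
   // behaviour (garbage or a crash), which matches what the log writer hit.
   std::printf("%s\n", typeNames[0]);
   return 0;
   }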

@cjjdespres
Contributor Author

cjjdespres commented Oct 10, 2024

While I was testing this fix, I did another test run and confirmed that the test error still occurs with -Xjit:disableAutoSIMD. Indeed, it occurs even when the only successful compilations of testSimpleBinary have been cold or warm, and the SPMD parallelizer runs at the hot optimization level. The test error also occurs without -Xgc:concurrentScavenge (so with just the default gencon); concurrentScavenge was the only non-default option enabled for this test.
