New Iteration for achieving Multi-Threaded Execution Plans #389
The issue was that the `DataObjectState` was shared across multiple threads. Instead, each execution plan needs a private instance of the object state.
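A minimal sketch of the idea, not TornadoVM's actual code: `ObjectState` and `Plan` below are hypothetical stand-ins, and the sketch assumes each plan keeps its own state table keyed by the host object. Two plans that track the same host array then hold distinct state instances, so an update made by one plan is not visible through the other.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class PrivateStateSketch {

    // Hypothetical stand-in for the per-object state; not TornadoVM's real class.
    static final class ObjectState {
        long deviceBufferId = -1; // -1 means "no device buffer attached"
    }

    // Hypothetical execution plan that owns its state table instead of sharing one.
    static final class Plan {
        private final Map<Object, ObjectState> states = new IdentityHashMap<>();

        ObjectState stateFor(Object hostObject) {
            return states.computeIfAbsent(hostObject, k -> new ObjectState());
        }
    }

    // Returns true when two plans tracking the same host array get distinct state instances.
    static boolean plansHavePrivateState() {
        float[] hostArray = new float[16];
        Plan a = new Plan();
        Plan b = new Plan();
        ObjectState stateA = a.stateFor(hostArray);
        ObjectState stateB = b.stateFor(hostArray);
        stateA.deviceBufferId = 42; // plan A attaches a buffer; plan B must not observe it
        return stateA != stateB
                && stateB.deviceBufferId == -1
                && a.stateFor(hostArray) == stateA;
    }

    public static void main(String[] args) {
        System.out.println("private state per plan: " + plansHavePrivateState());
    }
}
```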
LGTM, thanks
OpenCL works for me, and for PTX I did a few runs without getting any errors.
I get this for OpenCL
This happens both when I run the test individually and through the |
PTX:
|
For SPIR-V the tests pass:
|
In macOS (OpenCL):
tornado-test --jvm="-Dtornado.device.memory=4GB" --debug -V --fast uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans
tornado --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True -Dtornado.debug=True -Dtornado.device.memory=4GB" -m tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner --params "uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans"
WARNING: Using incubator modules: jdk.incubator.vector
Running thread t0Running thread t1Context leak detected, msgtracer returned -1
Context leak detected, msgtracer returned -1
Context leak detected, msgtracer returned -1
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00000001ddb1a170, pid=155, tid=43875
#
# JRE version: OpenJDK Runtime Environment Microsoft-9388422 (21.0.3+9) (build 21.0.3+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Microsoft-9388422 (21.0.3+9-LTS, mixed mode, tiered, jvmci, parallel gc, bsd-aarch64)
# Problematic frame:
# C [OpenCL+0x22170] clLogMessagesToStderrAPPLE+0x2dc
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/thanos/repositories/TornadoVM-stratika/hs_err_pid155.log
[2.208s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
# https://github.com/microsoft/openjdk/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
On Linux with the SPIR-V backend, I get the following exceptions when I run:
tornado -ea --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True -Dtornado.device.memory=4GB" -m tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner --params "uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans"
WARNING: Using incubator modules: jdk.incubator.vector
Exception in thread "Thread-17" java.lang.NullPointerException: Cannot load from object array because "this.commandQueueGroupProperties" is null
at beehive.levelzero.jni@0.1.3/uk.ac.manchester.tornado.drivers.spirv.levelzero.LevelZeroDevice.getCommandQueueGroupProperties(LevelZeroDevice.java:115)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.getCommandQueueOrdinal(SPIRVCommandQueueTable.java:125)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.createCommandQueue(SPIRVCommandQueueTable.java:88)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.get(SPIRVCommandQueueTable.java:73)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable.get(SPIRVCommandQueueTable.java:56)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVLevelZeroContext.getCommandQueueForDevice(SPIRVLevelZeroContext.java:153)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVLevelZeroContext.enqueueWriteBuffer(SPIRVLevelZeroContext.java:558)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVDeviceContext.enqueueWriteBuffer(SPIRVDeviceContext.java:346)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.mm.SPIRVMemorySegmentWrapper.enqueueWrite(SPIRVMemorySegmentWrapper.java:159)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.runtime.SPIRVTornadoDevice.streamIn(SPIRVTornadoDevice.java:398)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.transferHostToDeviceAlways(TornadoVMInterpreter.java:480)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:305)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:855)
at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:1024)
at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoVM.executeInterpreterSingleThreaded(TornadoVM.java:125)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:112)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.scheduleInner(TornadoTaskGraph.java:859)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1366)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1378)
at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TaskGraph.execute(TaskGraph.java:777)
at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.ImmutableTaskGraph.execute(ImmutableTaskGraph.java:49)
at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.lambda$execute$0(TornadoExecutionPlan.java:406)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.execute(TornadoExecutionPlan.java:406)
at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan.execute(TornadoExecutionPlan.java:117)
at tornado.unittests@1.0.4-dev/uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans.compute(TestMultiThreadedExecutionPlans.java:183)
at tornado.unittests@1.0.4-dev/uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans.lambda$test03$4(TestMultiThreadedExecutionPlans.java:193)
at java.base/java.lang.Thread.run(Thread.java:1583)
Exception in thread "Thread-263" java.lang.NullPointerException: Cannot load from object array because "this.commandQueueGroupProperties" is null
at beehive.levelzero.jni@0.1.3/uk.ac.manchester.tornado.drivers.spirv.levelzero.LevelZeroDevice.getCommandQueueGroupProperties(LevelZeroDevice.java:115)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.getCommandQueueOrdinal(SPIRVCommandQueueTable.java:125)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.createCommandQueue(SPIRVCommandQueueTable.java:88)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.get(SPIRVCommandQueueTable.java:73)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable.get(SPIRVCommandQueueTable.java:56)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVLevelZeroContext.getCommandQueueForDevice(SPIRVLevelZeroContext.java:153)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVLevelZeroContext.enqueueWriteBuffer(SPIRVLevelZeroContext.java:558)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVDeviceContext.enqueueWriteBuffer(SPIRVDeviceContext.java:346)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.mm.SPIRVMemorySegmentWrapper.enqueueWrite(SPIRVMemorySegmentWrapper.java:159)
at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.runtime.SPIRVTornadoDevice.streamIn(SPIRVTornadoDevice.java:398)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.transferHostToDeviceAlways(TornadoVMInterpreter.java:480)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:305)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:855)
at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:1024)
at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoVM.executeInterpreterSingleThreaded(TornadoVM.java:125)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:112)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.scheduleInner(TornadoTaskGraph.java:859)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1366)
at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1378)
at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TaskGraph.execute(TaskGraph.java:777)
at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.ImmutableTaskGraph.execute(ImmutableTaskGraph.java:49)
at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.lambda$execute$0(TornadoExecutionPlan.java:406)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.execute(TornadoExecutionPlan.java:406)
at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan.execute(TornadoExecutionPlan.java:117)
at tornado.unittests@1.0.4-dev/uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans.compute(TestMultiThreadedExecutionPlans.java:183)
at tornado.unittests@1.0.4-dev/uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans.lambda$test04$6(TestMultiThreadedExecutionPlans.java:230)
at java.base/java.lang.Thread.run(Thread.java:1583)
Running thread t0Running thread t1Test: class uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans
Running test: test01 ................ [PASS]
Running test: test02 ................ [PASS]
Running test: test03 ................ [PASS]
Running test: test04 ................ [PASS]
Test ran: 4, Failed: 0, Unsupported: 0
However, when I run this test standalone, it does not throw that exception. Has anyone observed similar behaviour?
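The log above shows `NullPointerException`s raised in worker threads while all four tests still report [PASS]. In Java, an uncaught exception in a thread started with `new Thread(...)` terminates only that thread; it does not propagate to the thread that calls `join()`. Whether that is what happens in this test is an assumption. The sketch below uses a hypothetical helper, `runAndCollect`, to show one way a multi-threaded test could surface such failures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class WorkerFailureSketch {

    // Runs each task on its own thread and returns every exception thrown by a worker.
    // Hypothetical helper for illustration; not part of the TornadoVM test suite.
    static List<Throwable> runAndCollect(Runnable... tasks) {
        ConcurrentLinkedQueue<Throwable> failures = new ConcurrentLinkedQueue<>();
        List<Thread> threads = new ArrayList<>();
        for (Runnable task : tasks) {
            Thread t = new Thread(() -> {
                try {
                    task.run();
                } catch (Throwable e) {
                    failures.add(e); // record instead of letting the thread die unnoticed
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return new ArrayList<>(failures);
    }

    public static void main(String[] args) {
        List<Throwable> failures = runAndCollect(
                () -> { },
                () -> { throw new NullPointerException("simulated worker failure"); });
        System.out.println("worker failures: " + failures.size());
    }
}
```

A test built this way could assert that the returned list is empty, so an exception in any worker thread would fail the test instead of only appearing in the log.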
OpenCL on Pop!_OS:
|
I see you are running it with the Microsoft JDK. Can you try a different one?
Sure, I also ran with OpenJDK 21:
tornado-test --jvm="-Dtornado.device.memory=4GB" --debug -V --fast uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans
tornado --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True -Dtornado.debug=True -Dtornado.device.memory=4GB" -m tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner --params "uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans"
WARNING: Using incubator modules: jdk.incubator.vector
Running thread t0Running thread t1Context leak detected, msgtracer returned -1
Context leak detected, msgtracer returned -1
Context leak detected, msgtracer returned -1
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00000001ddb1a170, pid=7096, tid=90907
#
# JRE version: Java(TM) SE Runtime Environment (21.0.2+13) (build 21.0.2+13-LTS-58)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (21.0.2+13-LTS-58, mixed mode, tiered, jvmci, parallel gc, bsd-aarch64)
# Problematic frame:
# C [OpenCL+0x22170] clLogMessagesToStderrAPPLE+0x2dc
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/thanos/repositories/TornadoVM-stratika/hs_err_pid7096.log
[2.732s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Thank you all for the extensive testing. This PR needs to be combined with #387, so let's merge that first and then test again.
Tests are passing in my configuration. Since we merged with #387, can we all check whether this fixes the issue?
LGTM. I also tested TornadoVM-Ray-Tracer and an implementation of KFusion that is compatible with the new API changes.
Great, I will merge.
Improvements
~~~~~~~~~~~~~~~~~~

- [beehive-lab#369](beehive-lab#369): Introduction of Tensor types in TornadoVM API and interoperability with ONNX Runtime.
- [beehive-lab#370](beehive-lab#370): Array concatenation operation for TornadoVM native arrays.
- [beehive-lab#371](beehive-lab#371): TornadoVM installer script ported for Windows 10/11.
- [beehive-lab#372](beehive-lab#372): Add support for ``HalfFloat`` (``Float16``) in vector types.
- [beehive-lab#374](beehive-lab#374): Support for TornadoVM array concatenations from the constructor-level.
- [beehive-lab#375](beehive-lab#375): Support for TornadoVM native arrays using slices from the Panama API.
- [beehive-lab#376](beehive-lab#376): Support for lazy copy-outs in the batch processing mode.
- [beehive-lab#377](beehive-lab#377): Expand the TornadoVM profiler with power metrics for NVIDIA GPUs (OpenCL and PTX backends).
- [beehive-lab#384](beehive-lab#384): Auto-closable Execution Plans for automatic memory management.

Compatibility
~~~~~~~~~~~~~~~~~~

- [beehive-lab#386](beehive-lab#386): OpenJDK 17 support removed.
- [beehive-lab#390](beehive-lab#390): SapMachine OpenJDK 21 supported.
- [beehive-lab#395](beehive-lab#395): OpenJDK 22 and GraalVM 22.0.1 supported.
- TornadoVM tested with Apple M3 chips.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- [beehive-lab#367](beehive-lab#367): Fix for Graal/Truffle languages in which some Java modules were not visible.
- [beehive-lab#373](beehive-lab#373): Fix for data copies of the ``HalfFloat`` types for all backends.
- [beehive-lab#378](beehive-lab#378): Fix free memory markers when running multi-thread execution plans.
- [beehive-lab#379](beehive-lab#379): Refactoring package of vector api unit-tests.
- [beehive-lab#380](beehive-lab#380): Fix event list sizes to accommodate profiling of large applications.
- [beehive-lab#385](beehive-lab#385): Fix code check style.
- [beehive-lab#387](beehive-lab#387): Fix TornadoVM internal events in OpenCL, SPIR-V and PTX for running multi-threaded execution plans.
- [beehive-lab#388](beehive-lab#388): Fix of expected and actual values of tests.
- [beehive-lab#392](beehive-lab#392): Fix installer for using existing JDKs.
- [beehive-lab#389](beehive-lab#389): Fix ``DataObjectState`` for multi-thread execution plans.
- [beehive-lab#396](beehive-lab#396): Fix JNI code for the CUDA NVML library access with OpenCL.
Description
This PR fixes an issue with sharing the `dataObjectState` across multiple Java threads.
Problem description
The problem was that this state must be private to each thread. Otherwise, when a buffer is allocated or removed, the same device buffer might end up being deallocated twice, provoking a segmentation fault on the driver side. This PR solves this issue.
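The failure mode described above can be sketched with a small simulation. It uses hypothetical stand-ins (`FakeDeviceBuffer`, `ObjectState`), not TornadoVM's classes. The racy interleaving is replayed in a fixed order so the result is deterministic: both plans read the buffer reference before either one clears it, then both release what they saw.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DoubleFreeSketch {

    // Hypothetical device buffer that counts how many times it was released.
    static final class FakeDeviceBuffer {
        final AtomicInteger releases = new AtomicInteger();

        void release() {
            releases.incrementAndGet();
        }
    }

    // Hypothetical per-object state holding a reference to a device buffer.
    static final class ObjectState {
        FakeDeviceBuffer buffer;

        ObjectState(FakeDeviceBuffer buffer) {
            this.buffer = buffer;
        }
    }

    // Returns the highest release count observed on any buffer after two plans free their buffers.
    static int maxReleasesPerBuffer(boolean shareState) {
        ObjectState stateA = new ObjectState(new FakeDeviceBuffer());
        ObjectState stateB = shareState ? stateA : new ObjectState(new FakeDeviceBuffer());

        // Step 1: both plans read the buffer before either one clears its state.
        FakeDeviceBuffer seenByA = stateA.buffer;
        FakeDeviceBuffer seenByB = stateB.buffer;

        // Step 2: both plans release what they saw and then clear their state.
        seenByA.release();
        stateA.buffer = null;
        seenByB.release();
        stateB.buffer = null;

        return Math.max(seenByA.releases.get(), seenByB.releases.get());
    }

    public static void main(String[] args) {
        System.out.println("shared state, max releases per buffer:  " + maxReleasesPerBuffer(true));
        System.out.println("private state, max releases per buffer: " + maxReleasesPerBuffer(false));
    }
}
```

With shared state the single buffer is released twice, which in a native driver corresponds to a double free. With private state each buffer is released exactly once.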
Disclaimer: I can run without errors using the OpenCL and SPIR-V backends. However, when running with the PTX backend, I still get random errors on the JNI side, mostly related to events. IMO, these are not related to this fix.
Backend/s tested
Mark the backends affected by this PR.
OS tested
Mark the OS where this PR is tested.
Did you check on FPGAs?
If it is applicable, check your changes on FPGAs.
How to test the new patch?
Additional context