New Iteration for achieving Multi-Threaded Execution Plans #389

Merged: 5 commits, Apr 29, 2024

Conversation

Member

@jjfumero jjfumero commented Apr 24, 2024

Description

This PR fixes an issue sharing the dataObjectState with multiple Java threads.

Problem description

The problem was that this state must be private to each thread. Otherwise, when a buffer is allocated or removed, the same device buffer might end up being deallocated twice, provoking a segfault on the driver side. This PR solves this issue.

Disclaimer: I can run without errors using the OpenCL and SPIR-V backends. However, when running with the PTX backend, I still get random errors on the JNI side, mostly related to events. IMO, these are not related to this fix.
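The double deallocation described above can be illustrated with a small, self-contained sketch. It is not TornadoVM code: `FakeDevice` and `ObjectState` are hypothetical stand-ins invented for this example, and the two "plans" run sequentially so that the outcome is deterministic. It only shows why sharing one state object between two owners leads to the same buffer being freed twice, while a private state per owner does not.

```java
import java.util.HashSet;
import java.util.Set;

class PrivateStateSketch {

    /** Hypothetical stand-in for a device allocator; it only counts invalid frees. */
    static final class FakeDevice {
        private final Set<Long> live = new HashSet<>();
        private long nextHandle = 1;
        private int doubleFrees = 0;

        synchronized long allocate() {
            long handle = nextHandle++;
            live.add(handle);
            return handle;
        }

        synchronized void free(long handle) {
            // A real driver may crash at this point; the sketch just records the event.
            if (!live.remove(handle)) {
                doubleFrees++;
            }
        }

        synchronized int doubleFrees() {
            return doubleFrees;
        }
    }

    /** Hypothetical per-object state that owns one device buffer handle. */
    static final class ObjectState {
        private final FakeDevice device;
        private final long handle;

        ObjectState(FakeDevice device) {
            this.device = device;
            this.handle = device.allocate();
        }

        void release() {
            device.free(handle);
        }
    }

    /** Two owners share one state: the second release frees the same buffer again. */
    static int sharedState() {
        FakeDevice device = new FakeDevice();
        ObjectState shared = new ObjectState(device);
        shared.release(); // owner A cleans up
        shared.release(); // owner B cleans up the same buffer
        return device.doubleFrees();
    }

    /** Each owner has a private state: every buffer is freed exactly once. */
    static int privateState() {
        FakeDevice device = new FakeDevice();
        ObjectState stateA = new ObjectState(device);
        ObjectState stateB = new ObjectState(device);
        stateA.release();
        stateB.release();
        return device.doubleFrees();
    }

    public static void main(String[] args) {
        System.out.println("shared state double frees:  " + sharedState());
        System.out.println("private state double frees: " + privateState());
    }
}
```

In a real multi-threaded run the two releases would race rather than happen in a fixed order, which is presumably why the failure shows up as an intermittent crash instead of a reproducible error.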

Backend/s tested

Mark the backends affected by this PR.

  • OpenCL
  • PTX
  • SPIRV

OS tested

Mark the OS where this PR is tested.

  • Linux
  • OSx
  • Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

  • Yes
  • No

How to test the new patch?

$ make 
$ make tests 
$ tornado-test --jvm="-Dtornado.device.memory=4GB" --debug -V --fast uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans

# Also run the batch processing tests for each backend:
$ tornado-test -V --fast uk.ac.manchester.tornado.unittests.batches.TestBatches

Additional context

  • GAIA kernels are also passing.
  • TornadoVM-RayTracer passes with the OpenCL backend

The issue was that the `DataObjectState` was shared
across multiple threads. Instead, each execution plan
needs a private instance of the object state.
@jjfumero jjfumero added the runtime and fix labels Apr 24, 2024
@jjfumero jjfumero self-assigned this Apr 24, 2024
Member

@mikepapadim mikepapadim left a comment

LGTM, thanks

OpenCL works for me, and for PTX I did a few runs without getting any errors.

@mairooni
Collaborator

I get this for OpenCL

tornado -ea  --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True  -Dtornado.device.memory=4GB"  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans"
WARNING: Using incubator modules: jdk.incubator.vector

Running thread t0Running thread t1[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -4
[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> [TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfoCL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3050 Ti Laptop GPU (Device 0).
 -> Returned: 
-58
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_READ_BUFFER on NVIDIA GeForce RTX 3050 Ti Laptop GPU (Device 0).

[ERROR] clEnqueueReadBuffer, code = -4 n[TornadoVM-OCL-JNI] ERROR : clEnqueueReadBuffer -> Returned: -4
[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -4
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3050 Ti Laptop GPU (Device 0).

[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[ERROR] clEnqueueReadBuffer, code = -4 n[TornadoVM-OCL-JNI] ERROR : clEnqueueReadBuffer -> Returned: -4
[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_READ_BUFFER on NVIDIA GeForce RTX 3050 Ti Laptop GPU (Device 0).

[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -4
[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[JNI] uk.ac.manchester.tornado.drivers.opencl> [TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3050 Ti Laptop GPU (Device 0).
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58

[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[ERROR] clEnqueueReadBuffer, code = -4 n[TornadoVM-OCL-JNI] ERROR : clEnqueueReadBuffer -> Returned: -4
[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_READ_BUFFER on NVIDIA GeForce RTX 3050 Ti Laptop GPU (Device 0).


[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -4
[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3050 Ti Laptop GPU (Device 0).

[ERROR] clEnqueueReadBuffer, code = -4 n[TornadoVM-OCL-JNI] ERROR : clEnqueueReadBuffer -> Returned: -4
[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_READ_BUFFER on NVIDIA GeForce RTX 3050 Ti Laptop GPU (Device 0).

[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -4
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3050 Ti Laptop GPU (Device 0).

[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[ERROR] clEnqueueReadBuffer, code = -4 n[TornadoVM-OCL-JNI] ERROR : clEnqueueReadBuffer -> Returned: [JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
-4[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_READ_BUFFER on NVIDIA GeForce RTX 3050 Ti Laptop GPU (Device 0).
...

This happens both when I run the test individually and through make tests.
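For readers of the log above, the two numeric codes can be mapped back to their OpenCL names. In the OpenCL headers, -4 is `CL_MEM_OBJECT_ALLOCATION_FAILURE` (the notify messages in the log confirm this) and -58 is `CL_INVALID_EVENT`. The sketch below is a hypothetical helper, not part of TornadoVM, and it lists only the codes that appear in this log.

```java
import java.util.Map;

class OpenCLErrorNames {

    // Values as defined in the OpenCL headers (CL/cl.h).
    // Only the codes that appear in the log above are listed.
    private static final Map<Integer, String> NAMES = Map.of(
            0, "CL_SUCCESS",
            -4, "CL_MEM_OBJECT_ALLOCATION_FAILURE",
            -58, "CL_INVALID_EVENT");

    /** Returns the symbolic name for a code, or a placeholder if it is not in the table. */
    static String name(int code) {
        return NAMES.getOrDefault(code, "UNKNOWN(" + code + ")");
    }

    public static void main(String[] args) {
        System.out.println("clEnqueueNDRangeKernel -> " + name(-4));
        System.out.println("clWaitForEvents -> " + name(-58));
    }
}
```

Read this way, the log suggests that the enqueue calls fail to allocate a memory object first, and that the subsequent event calls then fail because no valid event was produced. That ordering is an interpretation of the output, not something confirmed in this thread.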

@mairooni
Collaborator

PTX:

tornado-test --jvm="-Dtornado.device.memory=4GB" --debug -V --fast uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans
tornado --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True -Dtornado.debug=True -Dtornado.device.memory=4GB"  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans"
WARNING: Using incubator modules: jdk.incubator.vector
Running thread t0Running thread t1      [TornadoVM-PTX-JNI] ERROR : cuMemAlloc -> Returned: 2
        [TornadoVM-PTX-JNI] ERROR : cuMemcpyHtoDAsyncMemSeg -> Returned: 1
        [TornadoVM-PTX-JNI] ERROR : cuEventSynchronize -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventElapsedTime -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventElapsedTime -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventSynchronize -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventElapsedTime -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventElapsedTime -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventCreate (beforeEvent) -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventCreate (afterEvent) -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventRecord -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuMemcpyDtoHMemSeg -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventRecord -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventCreate (beforeEvent) -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventCreate (afterEvent) -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventRecord -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuMemcpyDtoHMemSeg -> Returned: 716
        [TornadoVM-PTX-JNI] ERROR : cuEventRecord -> Returned: 716
[thread 6688 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007b707dafe705, pid=6432, tid=6687
#
# JRE version: Java(TM) SE Runtime Environment (21.0+35) (build 21+35-LTS-2513)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (21+35-LTS-2513, mixed mode, tiered, jvmci, parallel gc, linux-amd64)
# Problematic frame:
# C  [libcuda.so.1+0x4fe705]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/mary/Projects/Juan/TornadoVM/core.6432)
#
# An error report file with more information is saved as:
# /home/mary/Projects/Juan/TornadoVM/hs_err_pid6432.log
[5,914s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted (core dumped)
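When a log like the one above contains a long cascade of errors, the first failing call is usually the most informative one. Here it is `cuMemAlloc` returning 2, which is `CUDA_ERROR_OUT_OF_MEMORY` in the CUDA driver API; the later 716 values correspond to `CUDA_ERROR_MISALIGNED_ADDRESS`. The sketch below is a hypothetical log helper, not part of TornadoVM; the regular expression is derived only from the line format visible in this thread.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstJniError {

    /** One parsed "[TornadoVM-<backend>-JNI] ERROR : <call> -> Returned: <code>" entry. */
    static final class JniError {
        final String backend;
        final String call;
        final int code;

        JniError(String backend, String call, int code) {
            this.backend = backend;
            this.call = call;
            this.code = code;
        }
    }

    // Assumed format, taken from the log lines shown in this conversation.
    private static final Pattern LINE = Pattern.compile(
            "\\[TornadoVM-(\\w+)-JNI\\] ERROR : (.+?) -> Returned: (-?\\d+)");

    /** Returns the first JNI error found in the given lines, if any. */
    static Optional<JniError> firstError(List<String> lines) {
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                return Optional.of(
                        new JniError(m.group(1), m.group(2), Integer.parseInt(m.group(3))));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "Running thread t0Running thread t1      [TornadoVM-PTX-JNI] ERROR : cuMemAlloc -> Returned: 2",
                "        [TornadoVM-PTX-JNI] ERROR : cuMemcpyHtoDAsyncMemSeg -> Returned: 1",
                "        [TornadoVM-PTX-JNI] ERROR : cuEventSynchronize -> Returned: 716");
        firstError(log).ifPresent(e ->
                System.out.println(e.backend + ": " + e.call + " returned " + e.code));
    }
}
```

Because output from several threads is interleaved in these logs, "first line" is only an approximation of "first failure"; it is a starting point for reading the log rather than a diagnosis.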

@mairooni
Collaborator

For SPIR-V the tests pass:

Test: class uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans
        Running test: test01                     ................  [PASS] 
        Running test: test02                     ................  [PASS] 
        Running test: test03                     ................  [PASS] 
        Running test: test04                     ................  [PASS] 
Test ran: 4, Failed: 0, Unsupported: 0

Collaborator

@stratika stratika left a comment

On macOS (OpenCL):

tornado-test --jvm="-Dtornado.device.memory=4GB" --debug -V --fast uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans
tornado --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True -Dtornado.debug=True -Dtornado.device.memory=4GB"  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans"
WARNING: Using incubator modules: jdk.incubator.vector
Running thread t0Running thread t1Context leak detected, msgtracer returned -1
Context leak detected, msgtracer returned -1
Context leak detected, msgtracer returned -1
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000001ddb1a170, pid=155, tid=43875
#
# JRE version: OpenJDK Runtime Environment Microsoft-9388422 (21.0.3+9) (build 21.0.3+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Microsoft-9388422 (21.0.3+9-LTS, mixed mode, tiered, jvmci, parallel gc, bsd-aarch64)
# Problematic frame:
# C  [OpenCL+0x22170]  clLogMessagesToStderrAPPLE+0x2dc
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/thanos/repositories/TornadoVM-stratika/hs_err_pid155.log
[2.208s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://github.com/microsoft/openjdk/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

@stratika
Collaborator

On Linux with the SPIR-V backend, when I run make tests, I get:

tornado -ea  --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True  -Dtornado.device.memory=4GB"  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans"
WARNING: Using incubator modules: jdk.incubator.vector
Exception in thread "Thread-17" java.lang.NullPointerException: Cannot load from object array because "this.commandQueueGroupProperties" is null
	at beehive.levelzero.jni@0.1.3/uk.ac.manchester.tornado.drivers.spirv.levelzero.LevelZeroDevice.getCommandQueueGroupProperties(LevelZeroDevice.java:115)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.getCommandQueueOrdinal(SPIRVCommandQueueTable.java:125)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.createCommandQueue(SPIRVCommandQueueTable.java:88)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.get(SPIRVCommandQueueTable.java:73)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable.get(SPIRVCommandQueueTable.java:56)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVLevelZeroContext.getCommandQueueForDevice(SPIRVLevelZeroContext.java:153)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVLevelZeroContext.enqueueWriteBuffer(SPIRVLevelZeroContext.java:558)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVDeviceContext.enqueueWriteBuffer(SPIRVDeviceContext.java:346)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.mm.SPIRVMemorySegmentWrapper.enqueueWrite(SPIRVMemorySegmentWrapper.java:159)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.runtime.SPIRVTornadoDevice.streamIn(SPIRVTornadoDevice.java:398)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.transferHostToDeviceAlways(TornadoVMInterpreter.java:480)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:305)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:855)
	at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:1024)
	at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoVM.executeInterpreterSingleThreaded(TornadoVM.java:125)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:112)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.scheduleInner(TornadoTaskGraph.java:859)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1366)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1378)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TaskGraph.execute(TaskGraph.java:777)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.ImmutableTaskGraph.execute(ImmutableTaskGraph.java:49)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.lambda$execute$0(TornadoExecutionPlan.java:406)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.execute(TornadoExecutionPlan.java:406)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan.execute(TornadoExecutionPlan.java:117)
	at tornado.unittests@1.0.4-dev/uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans.compute(TestMultiThreadedExecutionPlans.java:183)
	at tornado.unittests@1.0.4-dev/uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans.lambda$test03$4(TestMultiThreadedExecutionPlans.java:193)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Exception in thread "Thread-263" java.lang.NullPointerException: Cannot load from object array because "this.commandQueueGroupProperties" is null
	at beehive.levelzero.jni@0.1.3/uk.ac.manchester.tornado.drivers.spirv.levelzero.LevelZeroDevice.getCommandQueueGroupProperties(LevelZeroDevice.java:115)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.getCommandQueueOrdinal(SPIRVCommandQueueTable.java:125)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.createCommandQueue(SPIRVCommandQueueTable.java:88)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable$ThreadCommandQueueTable.get(SPIRVCommandQueueTable.java:73)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVCommandQueueTable.get(SPIRVCommandQueueTable.java:56)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVLevelZeroContext.getCommandQueueForDevice(SPIRVLevelZeroContext.java:153)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVLevelZeroContext.enqueueWriteBuffer(SPIRVLevelZeroContext.java:558)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.SPIRVDeviceContext.enqueueWriteBuffer(SPIRVDeviceContext.java:346)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.mm.SPIRVMemorySegmentWrapper.enqueueWrite(SPIRVMemorySegmentWrapper.java:159)
	at tornado.drivers.spirv@1.0.4-dev/uk.ac.manchester.tornado.drivers.spirv.runtime.SPIRVTornadoDevice.streamIn(SPIRVTornadoDevice.java:398)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.transferHostToDeviceAlways(TornadoVMInterpreter.java:480)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:305)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.interpreter.TornadoVMInterpreter.execute(TornadoVMInterpreter.java:855)
	at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:1024)
	at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoVM.executeInterpreterSingleThreaded(TornadoVM.java:125)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:112)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.scheduleInner(TornadoTaskGraph.java:859)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1366)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.execute(TornadoTaskGraph.java:1378)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TaskGraph.execute(TaskGraph.java:777)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.ImmutableTaskGraph.execute(ImmutableTaskGraph.java:49)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.lambda$execute$0(TornadoExecutionPlan.java:406)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan$TornadoExecutor.execute(TornadoExecutionPlan.java:406)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TornadoExecutionPlan.execute(TornadoExecutionPlan.java:117)
	at tornado.unittests@1.0.4-dev/uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans.compute(TestMultiThreadedExecutionPlans.java:183)
	at tornado.unittests@1.0.4-dev/uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans.lambda$test04$6(TestMultiThreadedExecutionPlans.java:230)
	at java.base/java.lang.Thread.run(Thread.java:1583)

Running thread t0Running thread t1Test: class uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans
	Running test: test01                     ................  [PASS] 
	Running test: test02                     ................  [PASS] 
	Running test: test03                     ................  [PASS] 
	Running test: test04                     ................  [PASS] 
Test ran: 4, Failed: 0, Unsupported: 0

However, when I run this test as standalone, it does not throw that exception. Has anyone observed similar behaviour?
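One detail worth noting in the output above is that all four tests report PASS even though two `NullPointerException`s were thrown. The stack traces end in `Thread.run`, so the exceptions occurred in worker threads started by the tests, and in Java an uncaught exception in a worker thread does not propagate to the thread that started it. The sketch below is a generic illustration of how such failures could be collected and checked after `join()`. It is an assumption about how a test might be hardened, not the approach used in `TestMultiThreadedExecutionPlans`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

class WorkerFailureSketch {

    /** Runs each task in its own thread and returns every uncaught exception. */
    static List<Throwable> runAll(Runnable... tasks) {
        ConcurrentLinkedQueue<Throwable> failures = new ConcurrentLinkedQueue<>();
        List<Thread> threads = new ArrayList<>();
        for (Runnable task : tasks) {
            Thread t = new Thread(task);
            // Record the failure instead of only printing it to stderr.
            t.setUncaughtExceptionHandler((thread, error) -> failures.add(error));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return new ArrayList<>(failures);
    }

    public static void main(String[] args) {
        List<Throwable> failures = runAll(
                () -> { /* worker that succeeds */ },
                () -> { throw new NullPointerException("simulated worker failure"); });
        // A test could assert that this list is empty after all workers finish.
        System.out.println("worker failures: " + failures.size());
    }
}
```

With a check like this, a failure inside a worker would turn into a test failure instead of a stack trace followed by PASS.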

@mikepapadim
Member

OpenCL on Pop!_OS:


tornado -ea  --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True  -Dtornado.device.memory=4GB"  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans"
WARNING: Using incubator modules: jdk.incubator.vector

Running thread t1Running thread t0Test: class uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans
	Running test: test01                     ................  [PASS] 
	Running test: test02                     ................  [PASS] 
	Running test: test03                     ................  [PASS] 
	Running test: test04                     ................  [PASS] 
Test ran: 4, Failed: 0, Unsupported: 0

@mikepapadim
Member

@stratika, I see you ran the macOS (OpenCL) test with the Microsoft JDK; can you try a different one?

@stratika
Collaborator

Sure, I also ran with OpenJDK 21:

tornado-test --jvm="-Dtornado.device.memory=4GB" --debug -V --fast uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans
tornado --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True -Dtornado.debug=True -Dtornado.device.memory=4GB"  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.multithreaded.TestMultiThreadedExecutionPlans"
WARNING: Using incubator modules: jdk.incubator.vector
Running thread t0Running thread t1Context leak detected, msgtracer returned -1
Context leak detected, msgtracer returned -1
Context leak detected, msgtracer returned -1
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000001ddb1a170, pid=7096, tid=90907
#
# JRE version: Java(TM) SE Runtime Environment (21.0.2+13) (build 21.0.2+13-LTS-58)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (21.0.2+13-LTS-58, mixed mode, tiered, jvmci, parallel gc, bsd-aarch64)
# Problematic frame:
# C  [OpenCL+0x22170]  clLogMessagesToStderrAPPLE+0x2dc
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/thanos/repositories/TornadoVM-stratika/hs_err_pid7096.log
[2.732s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

@jjfumero
Member Author

Thank you all for the extensive testing. This PR needs to be combined with #387, so let's merge that first and then test again.

@jjfumero
Member Author

Tests are passing in my configuration. Since we merged with #387, can we all check if this fixes the issue?

Collaborator

@stratika stratika left a comment

LGTM. I also tested TornadoVM-Ray-Tracer and an implementation of KFusion that is compatible with the new API changes.

@jjfumero
Member Author

Great, I will merge.

@jjfumero jjfumero merged commit 1743e10 into beehive-lab:develop Apr 29, 2024
2 checks passed
@jjfumero jjfumero deleted the fix/plans/mt branch April 29, 2024 09:41
jjfumero added a commit to jjfumero/TornadoVM that referenced this pull request Apr 30, 2024
Improvements
~~~~~~~~~~~~~~~~~~

- [beehive-lab#369](beehive-lab#369): Introduction of Tensor types in TornadoVM API and interoperability with ONNX Runtime.
- [beehive-lab#370](beehive-lab#370): Array concatenation operation for TornadoVM native arrays.
- [beehive-lab#371](beehive-lab#371): TornadoVM installer script ported for Windows 10/11.
- [beehive-lab#372](beehive-lab#372): Add support for ``HalfFloat`` (``Float16``) in vector types.
- [beehive-lab#374](beehive-lab#374): Support for TornadoVM array concatenations from the constructor-level.
- [beehive-lab#375](beehive-lab#375): Support for TornadoVM native arrays using slices from the Panama API.
- [beehive-lab#376](beehive-lab#376): Support for lazy copy-outs in the batch processing mode.
- [beehive-lab#377](beehive-lab#377): Expand the TornadoVM profiler with power metrics for NVIDIA GPUs (OpenCL and PTX backends).
- [beehive-lab#384](beehive-lab#384): Auto-closable Execution Plans for automatic memory management.

Compatibility
~~~~~~~~~~~~~~~~~~

- [beehive-lab#386](beehive-lab#386): OpenJDK 17 support removed.
- [beehive-lab#390](beehive-lab#390): SapMachine OpenJDK 21 supported.
- [beehive-lab#395](beehive-lab#395): OpenJDK 22 and GraalVM 22.0.1 supported.
- TornadoVM tested with Apple M3 chips.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- [beehive-lab#367](beehive-lab#367): Fix for Graal/Truffle languages in which some Java modules were not visible.
- [beehive-lab#373](beehive-lab#373): Fix for data copies of the ``HalfFloat`` types for all backends.
- [beehive-lab#378](beehive-lab#378): Fix free memory markers when running multi-thread execution plans.
- [beehive-lab#379](beehive-lab#379): Refactoring package of vector api unit-tests.
- [beehive-lab#380](beehive-lab#380): Fix event list sizes to accommodate profiling of large applications.
- [beehive-lab#385](beehive-lab#385): Fix code check style.
- [beehive-lab#387](beehive-lab#387): Fix TornadoVM internal events in OpenCL, SPIR-V and PTX for running multi-threaded execution plans.
- [beehive-lab#388](beehive-lab#388): Fix of expected and actual values of tests.
- [beehive-lab#392](beehive-lab#392): Fix installer for using existing JDKs.
- [beehive-lab#389](beehive-lab#389): Fix ``DataObjectState`` for multi-thread execution plans.
- [beehive-lab#396](beehive-lab#396): Fix JNI code for the CUDA NVML library access with OpenCL.
Labels
fix Provides a fix runtime
4 participants