[hotfix] Resolving issue with missing CUDA NVML library for OpenCL #396

stratika · 2024-04-29T11:05:58Z

Description

This hotfix resolves a failure that occurs when the NVIDIA OpenCL driver is installed but CUDA is not installed in the default paths. In that case, the JNI functions that query NVML do not work properly, and the following exception is thrown:

Exception in thread "main" java.lang.UnsatisfiedLinkError: 'long uk.ac.manchester.tornado.drivers.opencl.power.OCLNvidiaPowerMetric.clNvmlInit()'
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.power.OCLNvidiaPowerMetric.clNvmlInit(Native Method)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.power.OCLNvidiaPowerMetric.initializePowerLibrary(OCLNvidiaPowerMetric.java:48)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.power.OCLNvidiaPowerMetric.<init>(OCLNvidiaPowerMetric.java:36)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLDeviceContext.<init>(OCLDeviceContext.java:76)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLContext.createDeviceContext(OCLContext.java:209)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLContext.createDeviceContext(OCLContext.java:42)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.graal.OCLHotSpotBackendFactory.createJITCompiler(OCLHotSpotBackendFactory.java:95)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.createOCLJITCompiler(OCLBackendImpl.java:204)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.installDevices(OCLBackendImpl.java:218)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.lambda$discoverDevices$4(OCLBackendImpl.java:225)
	at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
	at java.base/java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:617)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.discoverDevices(OCLBackendImpl.java:223)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.<init>(OCLBackendImpl.java:76)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLTornadoDriverProvider.createBackend(OCLTornadoDriverProvider.java:48)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoCoreRuntime.loadBackends(TornadoCoreRuntime.java:167)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoCoreRuntime.<init>(TornadoCoreRuntime.java:105)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoCoreRuntime.<clinit>(TornadoCoreRuntime.java:79)
	at tornado.drivers.common@1.0.4-dev/uk.ac.manchester.tornado.drivers.TornadoDeviceQuery.main(TornadoDeviceQuery.java:74)
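
For illustration only, here is a minimal Java sketch of one way to keep a missing native symbol from aborting start-up. It is not the change made in this PR, which fixes the JNI code itself. The class `GuardedNativeInit`, the method `initOrDisable`, and the sentinel `UNAVAILABLE` are hypothetical names and not part of TornadoVM. The sketch assumes that skipping power metrics is an acceptable fallback when NVML cannot be linked.

```java
import java.util.function.LongSupplier;

// Hypothetical helper (not part of TornadoVM): wraps a native initialisation call
// so that a missing JNI symbol disables power metrics instead of aborting start-up.
final class GuardedNativeInit {

    /** Sentinel returned when the native function could not be linked. */
    static final long UNAVAILABLE = -1L;

    /**
     * Runs the supplied native initialiser and returns its status code.
     * If the JVM cannot resolve the native function, returns UNAVAILABLE instead.
     */
    static long initOrDisable(LongSupplier nativeInit) {
        try {
            return nativeInit.getAsLong();
        } catch (UnsatisfiedLinkError e) {
            // The native library (for example NVML) was not found or not linked.
            System.err.println("Power metrics disabled: " + e.getMessage());
            return UNAVAILABLE;
        }
    }

    public static void main(String[] args) {
        // Simulates the failure from the stack trace above without needing a GPU.
        long status = initOrDisable(() -> {
            throw new UnsatisfiedLinkError("clNvmlInit()");
        });
        System.out.println(status == UNAVAILABLE
                ? "power metrics unavailable"
                : "power metrics ready");
    }
}
```

Catching `UnsatisfiedLinkError` at the call site keeps device discovery independent of optional profiling libraries. The caller must then check the sentinel before issuing any further native power queries.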

Problem description

If the patch provides a fix for a bug, please describe what was the issue and how to reproduce the issue.

Backend/s tested

Mark the backends affected by this PR.

OpenCL
PTX
SPIRV

OS tested

Mark the OS where this PR is tested.

Linux
OSx
Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

Yes
No

How to test the new patch?

make BACKEND=opencl
tornado --enableProfiler console -m tornado.examples/uk.ac.manchester.tornado.examples.VectorAddInt --params="100000"

make BACKEND=ptx
tornado --enableProfiler console -m tornado.examples/uk.ac.manchester.tornado.examples.VectorAddInt --params="100000"

jjfumero · 2024-04-29T11:09:44Z

I confirm this patch works in the cluster.

mikepapadim

Thanks for the fix

Improvements
~~~~~~~~~~~~~~~~~~

- [beehive-lab#369](beehive-lab#369): Introduction of Tensor types in TornadoVM API and interoperability with ONNX Runtime.
- [beehive-lab#370](beehive-lab#370): Array concatenation operation for TornadoVM native arrays.
- [beehive-lab#371](beehive-lab#371): TornadoVM installer script ported for Windows 10/11.
- [beehive-lab#372](beehive-lab#372): Add support for ``HalfFloat`` (``Float16``) in vector types.
- [beehive-lab#374](beehive-lab#374): Support for TornadoVM array concatenations from the constructor-level.
- [beehive-lab#375](beehive-lab#375): Support for TornadoVM native arrays using slices from the Panama API.
- [beehive-lab#376](beehive-lab#376): Support for lazy copy-outs in the batch processing mode.
- [beehive-lab#377](beehive-lab#377): Expand the TornadoVM profiler with power metrics for NVIDIA GPUs (OpenCL and PTX backends).
- [beehive-lab#384](beehive-lab#384): Auto-closable Execution Plans for automatic memory management.

Compatibility
~~~~~~~~~~~~~~~~~~

- [beehive-lab#386](beehive-lab#386): OpenJDK 17 support removed.
- [beehive-lab#390](beehive-lab#390): SapMachine OpenJDK 21 supported.
- [beehive-lab#395](beehive-lab#395): OpenJDK 22 and GraalVM 22.0.1 supported.
- TornadoVM tested with Apple M3 chips.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- [beehive-lab#367](beehive-lab#367): Fix for Graal/Truffle languages in which some Java modules were not visible.
- [beehive-lab#373](beehive-lab#373): Fix for data copies of the ``HalfFloat`` types for all backends.
- [beehive-lab#378](beehive-lab#378): Fix free memory markers when running multi-thread execution plans.
- [beehive-lab#379](beehive-lab#379): Refactoring package of vector api unit-tests.
- [beehive-lab#380](beehive-lab#380): Fix event list sizes to accommodate profiling of large applications.
- [beehive-lab#385](beehive-lab#385): Fix code check style.
- [beehive-lab#387](beehive-lab#387): Fix TornadoVM internal events in OpenCL, SPIR-V and PTX for running multi-threaded execution plans.
- [beehive-lab#388](beehive-lab#388): Fix of expected and actual values of tests.
- [beehive-lab#392](beehive-lab#392): Fix installer for using existing JDKs.
- [beehive-lab#389](beehive-lab#389): Fix ``DataObjectState`` for multi-thread execution plans.
- [beehive-lab#396](beehive-lab#396): Fix JNI code for the CUDA NVML library access with OpenCL.

[hotfix] Resolving issue with missing CUDA NVML library for OpenCL

efc2de0

stratika requested a review from jjfumero April 29, 2024 11:06

stratika self-assigned this Apr 29, 2024

stratika added bug Something isn't working fix Provides a fix labels Apr 29, 2024

stratika requested a review from mikepapadim April 29, 2024 11:08

jjfumero approved these changes Apr 29, 2024

mikepapadim approved these changes Apr 29, 2024

[hotfix] Added recommended CUDA path for NVML headers and library for…

32baebf

… ARM system

jjfumero merged commit d6618b1 into beehive-lab:master Apr 29, 2024
2 checks passed

stratika deleted the fix/nvidia-power-metric branch April 29, 2024 12:09

jjfumero mentioned this pull request Apr 30, 2024

[release] TornadoVM v1.0.4 #398

Merged
