Skip to content

Latest commit

 

History

History
834 lines (693 loc) · 33.3 KB

README.md

File metadata and controls

834 lines (693 loc) · 33.3 KB

VEDA (VE Driver API) and VERA (VE Runtime API)

VEDA and VERA are a CUDA Driver and Runtime API-like APIs for programming the NEC SX-Aurora. It is based on AVEO. Most of the functionality is identical to the CUDA Driver API and CUDA Runtime API.

Github PyPI License Python Versions Maintenance Maintenance

Sitemap:


Release Notes

VersionComment
v2.2

Added experimental feature to free VEDAdeviceptr within a kernel. This mechanism has some limitations!

The VEDAdeviceptr needs to be allocated using delayed malloc. First call vedaMemAlloc(&vptr, 0) on the host, call your kernel and execute vedaMemAlloc(vptr, size).

To free the VEDAdeviceptr, first call vedaMemRelease(vptr) on the host to take the ownership of the VEDAdeviceptr. Then you can use vedaMemFree within your kernel.

This mechanism doesn't work with non-delayed mallocs as they might be registered to NEC MPI, which cannot be deregistered from within the device.

v2.1.1
  • Shutdown Bugfix
v2.1.0
  • VE3 support
  • NCC 5 Bugfixes
v2.0.2
  • Fixed detection of i64 cblas library.
v2.0.2
  • Fixed CMake to correct find 64Bit BLAS libraries
v2.0.1
  • Bugfixes.
  • Merged changes of VEOS VEDA.
  • Extended Profiler API.
v2.0.0
v1.4.0
  • Overhauled HMEM API
  • Switched to new AVEO async malloc
  • Fixed bug in CMake setting correct C++ standard flags
v1.3.5
  • Bugfix for vedaArgsSet(...) to again accept VEDAptr
v1.3.4
  • Fixed RPM and LOCAL dist types.
  • Automatically downloading Illyrian and Tungl in VEOS and LOCAL builds.
  • Fixed DType issues with in cpp vedaLaunchKernel(...)
v1.3.3
  • Removed polluting VEDA_ERROR_UNKNOWN_CONTEXT log message.
  • Fixed possible memleak when VPTR is already allocated.
  • Fixed possible memleak when using vedaMemAllocOverrideOnce.
  • Synchronizing TUNGL timer with VH.
v1.3.2
  • Added vedaMemAllocOverrideOnce to prevent vedaMemAlloc to allocate new memory and instead return the override value once. This is not meant to be used except when you want to circumvent unncessary memory allocations in opaque data structures that you don't have access to.
  • Fixed RPATH for veda-smi in Python releases
v1.3.1
  • Added compile guard to prevent vedaArgsSet<bool> because bool is defined as 1B on VH and 4B on VE.
  • Merged changes from VEOS 2.11.1 release.
  • Compatibility Bugfix for veda/omp.h NCC ≥ v3.4
v1.3.0
  • changed definition of VEDAdeviceptr to prevent segfaults when passing them to std::ostream
  • added __global__ for device code
  • improved veda_device_omp.h implementations
  • renamed FIND_PACKAGE(VE ...) to FIND_PACKAGE(VEDA ...)
  • added checks for REQUIRED and min versions to FIND_PACKAGE(VEDA ...)
  • renamed VEDA_INCLUDES to VEDA_INCLUDE_DIRS to comply with CMake standard
  • moved CMakeLists.txt into project root
  • VEDA_CXX will now obey CMAKE_CXX_STANDARD when CMAKE_CXX_STANDARD_REQUIRED is set.
  • added device-side support for Tungl
v1.2.0
  • replaced bool checkResult with int64_t* result in vedaLaunchKernelEx to receive return value of kernel
  • added C++ version of vedaLaunchKernelEx
  • added vedaLaunchHostFuncEx that can return the return value of the function
v1.1.2
  • changed behavior of VE_NODE_NUMBER to be only used when _VENODELIST AND VEDA_VISIBLE_DEVICES are not set.
v1.1.1
  • Added support for AVEO's _VENODELIST env to ensure correct behavior in cluster environments.
  • Changed behavior of VEDA_VISIBLE_DEVICES in case of NUMA nodes. It now accepts the direct hardware id in the format of [AVEO_ID].[NUMA_ID]
v1.1.0
  • added vedaMemSwap function to swap the underlying memory buffer between two VEDAdeviceptr.
v1.0.0 First stable release.
  • Improved memset performance, especially for D8 and D16 (up to 150x faster now!).
  • Added vedaMemsetD128 and vedaMemsetD2D128 API.
  • Added ASL_FFTW_LIBRARIES to ASL CMake.
  • Added device code vedaMemset. Enabled to use vedaMemsetD* in device code.
  • Added C++ wrapper to allow directly signed integer and floating point values for vedaMemsetD* functions.
v0.10.6Maintenance release that fixes SegFaults when context has been destroyed before freeing memory. vedaMemFree ignores calls if the context for the particular pointer has already been freed. BugFix for VEDA_CONTEXT_MODE_SCALAR if VE_OMP_NUM_THREADS is not set.
v0.10.5added veda_omp_simd_reduce. MemTrace only get printed when env varVEDA_MEM_TRACE=1 is set. VEDA no longer overrides VEORUN_BIN if already been set by user. Added LICENSE to installation target.
v0.10.4Fixed Identification of VE model.
v0.10.3Filtering negative values from VEDA_VISIBLE_DEVICES.
v0.10.2Correct veda-smi RPATH to work without setting LD_LIBRARY_PATH.
v0.10.1Added aveorun-ftrace. Can be activated using VEDA_FTRACE=1 env var. Renamed RPM packages to only include major version in package name, i.e. veda-0.10.
v0.10.0Renamed and improved VEDAmpiptr to VEDAptr. Removed VEDAdeviceptr->X functions, as they are now part of VEDAptr. Added veda-smi executable.
v0.10.0rc5Added boundary checks for Memcopy and MemSet. Added vedaArgsSetHMEM. Added veda_device_omp.h parallelization primitives for C++. Added experimental VEDAmpiptr for easier usage with VE-MPI. Added/corrected some of the sensor readings, i.e. LLC Cache, Total Device Memory, ...
v0.10.0rc4Increased VEDA offset limit to 128GB. Added VEDAdeviceptr->X functions in C++. Renamed vedaArgsSetPtr to vedaArgsSetVPtr. Added vedaArgsSetPtr to automatically translate VEDAdeviceptr to void*. Fixed VEDA_VISIBLE_DEVICES to obey NUMA mode.
v0.10.0rc3Added AVEO symlinks. Fixed wrong include.
v0.10.0rc2Fixed problem in veda_types.h when compiling with C. Linking against shared AVEO instead of static.
v0.10.0rc1Fixed 0°C core temperatures. Added NUMA support. Each NUMA node becomes a separate VEDAdevice. Added vedaDeviceDistance(float**, VEDAdevice, VEDAdevice) to determine the relationship between two VEDAdevices (0.0 == same device, 0.5 == same physical device but different NUMA node, 1.0 == different physical device). Added vedaMemGetHMEMPointer(void**, VEDAdeviceptr) to translate VEDA pointer to HMEM pointer.
v0.9.5.2Bugfixes
v0.9.5.1Bugfixes
v0.9.5Bugfixes
v0.9.4Bugfixes
v0.9.3Bugfixes
v0.9.2Added FindMPI. Set all CMake vars as advanced.
v0.9.1Added FindBLAS, FindLAPACK, FindASL and FindNCL to CMake.
v0.9Enhanced VEDA CMake Scripts, to also support native NCC compilation.
v0.8.1updated AVEO. Using VE_NODE_NUMBER as fallback if VEDA_VISIBLE_DEVICES is not set.
v0.8Implemented multi-stream support (experimental). Automatic setting of required env vars.
v0.7.1Bugfix release
v0.7initial VERA release
v0.6initial VEDA release

Differences between VEDA and CUDA Driver API:

  1. [VEDA] Additionally to vedaInit(0) in the beginning, vedaExit() needs to be called at the end of the application, to ensure that no dead device processes stay alive.
  2. All function calls start with: [VEDA] veda* instead of cu* and [VERA] vera* instead of cuda*
  3. Objects start with [VEDA] VEDA* instead of CU* and vera* instead of cuda*
  4. VEDA supports asynchronous malloc and free: VEDA supports asynchronous vedaMemAllocAsync and vedaMemFreeAsync. They can be used like the synchronous calls, but don't need to synchronize the execution between device and host.
  5. vedaDeviceGetPower(float* power, VEDAdevice dev) and vedaDeviceGetTemp(float* tempC, const int coreIdx, VEDAdevice dev) allow to fetch the power consumption (in W) and temperature (in C).
  6. As the programming model of the SX-Aurora differs from NVIDIA GPUs, launching kernels looks different:
    // Device Code -------------------------------------------------------------
    extern "C" void my_function(float myFloat, uint8_t myUnsignedChar, float* array) {
    	...
    }
    
    // C -----------------------------------------------------------------------
    float myFloat;
    uint8_t myUnsignedChar;
    VEDAargs args;
    vedaArgsCreate(&args);
    
    // Scheme: vedaArgsSet[TYPE](&args, [PARAM_INDEX], [VARIABLE]);
    vedaArgsSetF32(args, 0, myFloat);
    vedaArgsSetU8(args, 1, myUnsignedChar);
    
    // Copy entire arrays as function parameter
    float array[32];
    vedaArgsSetStack(args, 2, array, VEDA_ARGS_INTENT_INOUT, sizeof(array));
    
    VEDAmodule mod;
    VEDAfunction func;
    vedaModuleLoad(&mod, "mylib.vso");
    vedaModuleGetFunction(&func, mod, "my_function");
    
    // Kernel Call Version 1: allows to reuse VEDAargs object
    VEDAstream stream = 0;
    vedaLaunchKernel(func, stream, args);
    
    // args are not allowed to be destroyed before synchronizing!
    vedaStreamSynchronize(stream);
    vedaArgsDestroy(&args);
    
    // Kernel Call Version 2: automatically destroys VEDAargs object after execution (can't be reused for other calls!)
    vedaLaunchKernelEx(func, stream, args, 1, 0);
    
    // CPP ---------------------------------------------------------------------
    vedaLaunchKernel(func, stream, myFloat, myUnsignedChar, VEDAstack(array, VEDA_ARGS_INTENT_INOUT, sizeof(array)));
  7. VEDAdeviceptr need to be dereferenced first on device side:
    // Host Code ---------------------------------------------------------------
    VEDAdeviceptr ptr;
    vedaMemAllocAsync(&ptr, sizeof(float) * cnt);
    vedaLaunchKernel(func, 0, ptr, cnt);
    vedaMemFreeAsync(ptr);
    
    // Device Code -------------------------------------------------------------
    void mykernel(VEDAdeviceptr vptr, size_t cnt) {
    	float* ptr;
    	vedaMemPtr(&ptr, vptr);
    
    	for(size_t i = 0; i < cnt; i++)
    		ptr[cnt] = ...;
    }
  8. VEDA streams differ from CUDA streams. See chapter "OMP Threads vs Streams" for more details.
  9. VEDA uses the env var VEDA_VISIBLE_DEVICES in contrast to CUDA_VISIBLE_DEVICES. The behavior of VEDA_VISIBLE_DEVICES is slightly different:
    • VEDA_VISIBLE_DEVICES= enables all devices, CUDA_VISIBLE_DEVICES= disables all devices.
    • For enabling VE's in NUMA mode, use {ID}.0 and {ID}.1.
    • VEDA_VISIBLE_DEVICES ids correspond to VE hardware ids, CUDA_VISIBLE_DEVICES corresponds to the CUDA specific ids.

Differences between VERA and CUDA Runtime API:

  1. All function calls start with vera* instead of cuda*
  2. Objects start with vera* instead of cuda*
  3. VERA supports asynchronous malloc and free, see VEDA. VEDA supports asynchronous vedaMemAllocAsync and vedaMemFreeAsync. They can be used like the synchronous calls, but don't need to synchronize the execution between device and host.
  4. vedaDeviceGetPower(float* power, VEDAdevice dev) and vedaDeviceGetTemp(float* tempC, const int coreIdx, VEDAdevice dev) allow to fetch the power consumption (in W) and temperature (in C).
  5. As the programming model of the SX-Aurora differs from NVIDIA GPUs, launching kernels looks different.
  6. Similar to CUDA Runtime API, calls from VEDA and VERA can be mixed!

VEDA/VERA Unique Features:

Delayed Memory Allocation

VEDA does not need to allocate memory from the host, but can do that directly from the device. For this, the host only needs to create an empty VEDAdeviceptr.

// Host Code ---------------------------------------------------------------
VEDAdeviceptr vptr;
vedaMemAllocAsync(&vptr, 0, 0);
vedaLaunchKernel(func, 0, vptr, cnt);
vedaMemcpyDtoHAsync(host, vptr, sizeof(float) * cnt, 0);
vedaMemFreeAsync(vptr, 0);

// Device Code -------------------------------------------------------------
void mykernel(VEDAdeviceptr vptr, size_t cnt) {
	float* ptr;
	vedaMemAllocPtr((void**)&ptr, vptr, cnt * sizeof(float));

	for(size_t i = 0; i < cnt; i++)
		ptr[cnt] = ...;
}

OMP Threads vs Streams (experimental):

In CUDA streams can be used to create different execution queues, to overlap compute with memcopy. VEDA supports two stream modes which differ from the CUDA behavior. These can be defined by vedaCtxCreate(&ctx, MODE, device).

  1. VEDA_CONTEXT_MODE_OMP (default): All cores will be assigned to the default stream (=0). This mode only supports a single stream.
  2. VEDA_CONTEXT_MODE_SCALAR: Every core gets assigned to a different stream. This mode allows to use each core independently with different streams. Use the function vedaCtxStreamCnt(&streamCnt) to determine how many streams are available.

Both methods use the env var VE_OMP_NUM_THREADS to determine the maximal number of cores that get use for either mode. If the env var is not set, VEDA uses all available cores of the hardware.


Advanced VEDA C++ Ptr

When you use C++, you can use the VEDAptr<typename> that gives you more directly control over the VEDAdeviceptr, i.e. you can use vptr.size(), vptr.device(), ... . The typename is used to automatically determine the correct offsets when executing vptr += offset;.


VEDA-NEC MPI integration

The VEO-aware NEC MPI ( https://www.hpc.nec/forums/topic?id=pgmcA8 ) enables to much easier implement hybrid VE applications. For this, so called HMEM pointers have been introduced in VEO. Starting with v1.4.0 VEDA introduced a new HMEM API: vedaHMEM*. See following example:

VEDAhmemptr hmem;
vedaHMemAlloc(&hmem, size);
vedaHMemcpy(hmem, host_ptr, size);
mpi_send(hmem, ...);

NUMA Support

VEDA supports VE NUMA nodes since v0.10. To enable NUMA on your system you need to execute (set -N ? to specific device index):

VCMD="sudo /opt/nec/ve/bin/vecmd -N ?"
$VCMD vconfig set partitioning_mode on
$VCMD state set off
$VCMD state set mnt
$VCMD reset card

VEDA then recognizes each NUMA node as a separate device, i.e. with 2 physical devices in NUMA mode, VEDA would show 4 devices. You can use VEDAresult vedaDeviceDistance(float* distance, VEDAdevice devA, VEDAdevice devB) to determine the relationship of two VEDAdevices.

distance == 0.0; // same device
distance == 0.5; // same physical device, different NUMA node
distance == 1.0; // differeny physical device

VEDA-smi

The executable veda-smi displays available VEDA devices in your system. It uses the VEDA_VISIBLE_DEVICES env var and therefore only shows the devices that your VEDA application would be able to use. Use VEDA_VISIBLE_DEVICES= veda-smi to ensure that you see all installed devices.

╔ veda-smi ═════════════════════════════════════════════════════════════════════╗
║ VEDA Version: 0.10.0     AVEO Version: 0.9.15                                 ║
╚═══════════════════════════════════════════════════════════════════════════════╝

┌── #0  NEC SX-Aurora Tsubasa VE10B ────────────────────────────────────────────┐
  ┌ Physical: 1.0
  ├ AVEO:     0.0
  ├ Clock:    current: 1400 MHz, base: 800 MHz, memory: 1600 MHz
  ├ Firmware: 5399
  ├ Memory:   49152 MiB
  ├ Cache:    LLC: 8192kB, L2: 256kB, L1d: 32kB, L1i: 32kB
  ├ Temp:     56.4°C 56.4°C 57.0°C 56.1°C
  └ Power:    18.0W (11.9V, 1.5A)
└───────────────────────────────────────────────────────────────────────────────┘

┌── #1  NEC SX-Aurora Tsubasa VE10B ────────────────────────────────────────────┐
  ┌ Physical: 1.1
  ├ AVEO:     0.1
  ├ Clock:    current: 1400 MHz, base: 800 MHz, memory: 1600 MHz
  ├ Firmware: 5399
  ├ Memory:   49152 MiB
  ├ Cache:    LLC: 8192kB, L2: 256kB, L1d: 32kB, L1i: 32kB
  ├ Temp:     56.1°C 56.4°C 55.9°C 56.0°C
  └ Power:    18.0W (11.9V, 1.5A)
└───────────────────────────────────────────────────────────────────────────────┘

┌── #2  NEC SX-Aurora Tsubasa VE10B ────────────────────────────────────────────┐
  ┌ Physical: 0.0
  ├ AVEO:     1.0
  ├ Clock:    current: 1400 MHz, base: 800 MHz, memory: 1600 MHz
  ├ Firmware: 5399
  ├ Memory:   49152 MiB
  ├ Cache:    LLC: 16384kB, L2: 256kB, L1d: 32kB, L1i: 32kB
  ├ Temp:     53.8°C 53.5°C 54.1°C 53.8°C 53.8°C 54.1°C 53.2°C 53.5°C
  └ Power:    36.3W (11.9V, 3.1A)
└───────────────────────────────────────────────────────────────────────────────┘

Profiling API

Since v1.5.0 VEDA supports to add a profiling callback using vedaProfilerSetCallback(...). The callback needs to have the signature void (*)(VEDAprofiler_data* data, int enter). If enter is non-zero, the callback got called right before issuing the command. If it's zero, it just ended.

The data provides the following fields:

  1. type: An enum that identifies which kind function got called (kernel, memcpy, ...)
  2. device_id: VEDA device id
  3. stream_id: VEDA stream id
  4. req_id: ID of the request
  5. user_data: void* that allows to store data between enter and exit of the event. This should be deleted by the user when enter==0 to prevent memleaks.

Depending on the type, you can cast the data to one of the following data types to get access to further information.

  1. type in [VEDA_PROFILER_MEM_ALLOC, VEDA_PROFILER_HMEM_ALLOC]: VEDAprofiler_vedaMemAlloc
    1. bytes: number of bytes to be allocated
  2. type in [VEDA_PROFILER_MEM_FREE, VEDA_PROFILER_HMEM_FREE]: VEDAprofiler_vedaMemFree
    1. ptr: pointer to be freed
  3. type in [VEDA_PROFILER_MEM_CPY_HTOD, VEDA_PROFILER_MEM_CPY_DTOH, VEDA_PROFILER_HMEM_CPY]: VEDAprofiler_vedaMemcpy
    1. dst: destination pointer
    2. src: source pointer
    3. bytes: number of bytes transfered
  4. type == VEDA_PROFILER_LAUNCH_KERNEL: VEDAprofiler_vedaLaunchKernel
    1. func: function pointer that gets called
    2. kernel: name of the kernel that gets called
  5. type == VEDA_PROFILER_LAUNCH_HOST: VEDAprofiler_vedaLaunchHostFunc
    1. func: function pointer that gets called
  6. type == VEDA_PROFILER_SYNC: VEDAprofiler_data

C++ API (Experimental!)

Starting with v1.5.0 we introduce a new experimental and lightweight C++ API. This API aims for easier usage of VEDA, with much more comfort in C++ applications.

To include the new API just use #include <veda/cpp/api.h>.

Error Handling

Instead of the C-API, the C++ API uses exceptions, which can be used like this:

try {
	...
} catch(const veda::Exception& e) {
	std::cerr << e.what() << " @ " << e.file() << " (" << e.line() << ")";
}

Fetching a Device handle

To get a handle to a device, just create an instance using:

veda::Device device(0);

In contrast to the C-API, the veda::Device incorporates the VEDAdevice and VEDAcontext into a single object. We use a lazy scheme, which will not boot up the device context until you allocate memory, load a model, or similar.

The device provides the following attributes and metrics: isActive, current, currentEdge, distance, power, temp, voltage, voltageEdge, abi, aveoId, cacheL1d, cacheL1i, cacheL2, cacheLLC, clockBase, clockMemory, clockRate, cores, firmware, model, numaId, physicalId, singleToDoublePerfRatio, streamCnt, vedaId, totalMem, usedMem.

If your application requires to do the CUDA-style programming, where you bind the device to a specific thread, you can use device.pushCurrent(), device.setCurrent() and auto device = Device::getCurrent() or auto device = Device::popCurrent().

To synchronize the execution use device.sync() or device.sync(stream).

Loading Modules

Just do:

auto mod = dev.load("libmymodule.vso");

Memory Buffer Objects

The new C++ API uses buffer objects instead of raw pointers. These can be allocated using dev.alloc<float>(cnt), which will allocate sizeof(T) * cnt bytes of memory.

If you want to use a different stream, just use dev.alloc<float>(cnt, stream).

To allocate HMEM memory, use dev.alloc<float, veda::HMEM>(size).

To copy data between different Buffers, or the host and the VE, just use:

auto VE = dev.alloc<float>(cnt);
auto VH = malloc(sizeof(float) * cnt);

VE.to(VH);              // copies all items from VE to VH
VE.to(VH, 1);           // copies the first item from VE to VH
VE[4].to(VH + 4, 1);    // copies the 5th item from VE to VH
VE.from(VH);            // copies all items from VH to VE

auto V2 = dev.alloc<float>(cnt);
V2.to(VE);              // copies all items from V2 to VE
VE.from(V2);            // copies all items from V2 to VE

To memset data use:

VE.memset(3.1415);      // set all items
VE[5].memset(3.1415);   // set all items starting the 6th
VE[5].memset(3.1415, 1);// set only the 6th item

To cast a buffer object to another type:

auto Float = dev.alloc<float>(cnt);
auto Int32 = Float.cast<int32_t>(); // Float.cnt() == Int32.cnt()
auto Int16 = Float.cast<int16_t>(); // Float.cnt() == Int16.cnt()*2

All buffer objects use shared pointer semantics. When all objects using the same source pointer are destroyed, it will be automatically freed.

To pass on pointers between methods just pass on the buffer object:

veda::Ptr<VEDA, float> func(...) {
	...
	auto ptr = dev.alloc<float>(cnt);
	...
	return ptr;
}

Fetching Functions

For fetching functions we provide three helper functions.

  1. C-style or extern "C" functions:

    // VE
    extern "C" int name(int, float, VEDAdeviceptr);
    
    // VH
    using namespace veda;
    auto func = CFunction::Return<int>(mod, "name");
    auto result = func(0, 3.14f, ptr);
    printf("%i\n", int(result));

    The CFunction::Return<int> returns you an executable object to an C-function on the VE. Whenever you call func(...) it issues a kernel call. By default we use the stream #0, but you can use func[stream](...) to define the stream yourself.

    result is a future object. When you call result.wait() or fetch the result using (TYPE)result or result.get(), it will synchronize the execution and provide the return value.

    The ::Return<...> can be omitted when no return value is expected.

  2. C++-style functions:

    // VE
    int name(int, float, VEDAdeviceptr);
    
    // VH
    using namespace veda;
    auto func = Function::Return<int>::Args<int, float, VEDAdeviceptr>(mod, "name");
    auto result = func(0, 3.14f, ptr);
    printf("%i\n", int(result));

    For C++-style functions use Function instead of CFunction. In this case you also need to provide the types of all arguments using Args<...>.

    Again ::Return<...> can be omitted when no return value is expected.

    Also struct types can be used as arguments:

    // VE + VH
    namespace whatever {
    	template<typename T>
    	struct complex {
    		T x, y;
    	};
    }
    
    // VE
    void name(VEDAdeviceptr, whatever::complex<float>);
    
    // VH
    auto func = Function::Args<VEDAdeviceptr, whatever::complex<float>>(mod, "name");
    whatever::complex<float> x = {3.0f, 4.0f};
    func(ptr, x);
  3. Template functions:

    // VE
    template<typename T, typename D>
    T name(T, float, D);
    
    template int name<int, VEDAdeviceptr>(int, float, VEDAdeviceptr);
    
    // VH
    using namespace veda;
    auto func = Template<int, VEDAdeviceptr>::Return<_0>::Args<_0, float, _1>(mod, "name");

    Last, we also support to fetch templated functions. Here it is important, that in the VE code, the template gets explicitly instantiated using the template ... name<...>(...); syntax. Otherwise the compiler will not generate this specific templated function.

    On the VH, we first define the template parameters using Template<...>. Next, as before the return type. If it is ::Return<void>, it can be omitted. And last the arguments, similar as before for the Function.

    In the code above you see veda::_0 and veda::_1. These correspond to the template parameters, _0 is the 0th, _1 the 1st, and so on. It is necessary to use these template placeholders within Return<...> and Args<...> at the same locations as within the C++ code.

    If your template uses literals, such as:

    template<int i, typename T>
    T name(T a) { return a + i; }
    
    template float name<0>(float);
    template float name<5>(float);
    template int   name<5>(int);

    You can to use the following code on VH:

    auto name_f0 = Template<Literal<0>, float>::Return<_1>::Args<_1>(...);
    auto name_f5 = Template<Literal<5>, float>::Return<_1>::Args<_1>(...);
    auto name_i0 = Template<Literal<5>, int>  ::Return<_1>::Args<_1>(...);

    It's important that the data type you pass to Literal<...> matches the data type you use in your template<...>. I.e., if you use template<char...>, then you need to use Literal('x') or Literal(char(15)).

    Only integer-like types (char, short, ...) can be used as template literals.

For all function fetching methods it's important, that function arguments match exactly the ones you use in your VE C++ code. Otherwise fetching the function will fail at runtime!


SX-Aurora VE3 support

Since v2.1.0 VEDA supports the SX-Aurora VE3. It's important that your libraries are compatible to the used architecture. Use these compile and linking flags:

Architecture Flags File Extension
VE1+2 -march=ve1 -stdlib=libc++ *.vso
VE3 -march=ve3 -stdlib=libc++ *.vso3

To load the library you can just use vedaModuleLoad(&mod, "libsomething.vso") and VEDA will automatically load libsomething.vso for VE1+2 or libsomething.vso3 for VE3.

By default VEDA determines automatically which architecture to use. You can override this behavior by setting the env var VEDA_ARCH=1 or VEDA_ARCH=3. Be warned, you cannot run VEDA_ARCH=3 on a VE1, but you can use VEDA_ARCH=1 on a VE3!

If you are unsure which architecture your library is for, you can use nreadelf -h libsomething.vso | grep 'Flags'. Flags ending with 0 are for VE1, with 1 are for VE3.


Limitations/Known Problems:

  1. VEDA only supports one VEDAcontext per device.
  2. No unified memory space (yet).
  3. VEDA by default uses the current workdirectory for loading modules. This behavior can be changed by using the env var VE_LD_LIBRARY_PATH.
  4. Due to compiler incompatibilities it can be necessary to adjust the CMake variable ${AVEO_NFORT} to another compiler.
  5. The C++ API can only return fundamental (void, int, short, ...) values.
  6. The C++ API cannot compile ...::Args<void>. Use ...:::Args<> instead.

How to build:

git clone https://github.com/SX-Aurora/veda/
mkdir veda/build
cd veda/build

# Build Option 1: Local installation (default: /usr/local/ve (use -DCMAKE_INSTALL_PREFIX=... for other path))
cmake3 -DVEDA_DIST_TYPE=LOCAL ..
cmake3 --build . --target install 

# Build Option 2: VEOS installation
cmake3 -DVEDA_DIST_TYPE=VEOS ..
cmake3 --build . --target install 

# Build Option 3: Python package
pip3 install illyrian tungl
illyrian cmake3 -DVEDA_DIST_TYPE=PYTHON ..
cmake3 --build . --target dist

How to use:

VEDA has an own CMake find script. This supports 3 modes. The script uses the compilers installed in /opt/nec/ve/bin. You can modify the CMAKE_[LANG]_COMPILER flags to change that behavior. See the Hello World examples in the Examples Folder


1. VEDA Hybrid Offloading:

This mode is necessary for VEDA offloading applications. It enables to compile host and device code within the same CMake project. For this it is necessary to use different file extensions for the VE code. All *.vc files get compiled using NCC, *.vcpp using NC++ and *.vf with NFORT.

SET(CMAKE_MODULE_PATH /usr/local/ve/veda/cmake /opt/nec/ve/share/veda/cmake)
FIND_PACKAGE(VEDA)
ENABLE_LANGUAGE(VEDA_C VEDA_CXX)

INCLUDE_DIRECTORIES(${VEDA_INCLUDE_DIRS})
ADD_EXECUTABLE(myApp mycode.vc mycode.vcpp)
TARGET_LINK_LIBRARIES(myApp ${VEDA_LIBRARY})

2. VE Native applications:

This mode enables to compile VE native applications.

SET(CMAKE_MODULE_PATH /usr/local/ve/veda/cmake /opt/nec/ve/share/veda/cmake)
FIND_PACKAGE(VEDA)
ENABLE_LANGUAGE(VEDA_C VEDA_CXX)
ADD_EXECUTABLE(myApp mycode.c mycode.cpp)

3. VE Native Injection:

If you have a CPU application and you don't want to modify the CMake script you can build your project using:

cmake -C /usr/local/ve/veda/cmake/InjectVE.cmake /path/to/your/source

It will replace the CPU C, CXX and Fortran compilers with NCC.