This guide walks through a quick introduction to MatX to get familiar with the types and basic functionality. For more extensive documentation, please look at the following sources:
- Example and API documentation
- Example code in the examples/ directory
- DLI training (if available)
MatX is currently a single header file to include in your project called matx.h. Simply add the matx/include directory to your compiler's include search path, and add the proper #include. All MatX functions are in a top-level matx namespace, while more specific functions have a nested namespace. For example, the visualization pieces of MatX are under matx::viz.
The most fundamental data type in MatX is the tensor (tensor_t). The tensor is used for both viewing and managing any underlying GPU or host memory. While dynamically-typed languages like Python or MATLAB will implicitly allocate and manage data for the user, MatX requires either a one-time explicit memory allocation, or a user-provided device-side buffer. This gives more control over the lifetime of the data, and allows reusing memory regions for different operations.
A tensor is created using the following syntax:
auto t = tensor_t<float, 2>({10, 20});
The constructor takes two required template parameters: the data type and the rank of the tensor. In this case we request a floating point tensor with rank 2 (2D array). The constructor arguments specify the shape of the tensor (10x20), or the size of each dimension. The number of elements in the list must match the rank of the tensor. Currently MatX supports ranks from 0 (scalar) up to 4 (4D array).
To avoid the redundant rank and constructor arguments, the same tensor can be created using the make_tensor function:
auto t = make_tensor<float>({10, 20});
In this case only the type is required as a template parameter, and the rank is deduced from the number of arguments in the dimension list.
After calling the constructor, MatX will allocate CUDA managed memory large enough to accommodate the specified tensor size. Users can also pass their own pointers in a different form of the constructor which will leave allocation and freeing of the memory to the caller (see Creating a view from an existing pointer below). Note that MatX treats the initial tensor view as the owner of the memory. When this view is destructed, whether by going out of scope or manually, all memory is freed. If there were other views still pointing to this memory, it is no longer valid and the behavior is undefined.
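For illustration, here is a minimal sketch of that ownership pitfall, using the tensor's raw-pointer accessor Data() (an assumption here; any non-owning view of the same memory has the same issue):
float *data;
{
  auto owner = make_tensor<float>({10, 20}); // the owner view allocates managed memory
  data = owner.Data();                       // raw pointer into that memory
}                                            // owner destructs here and the memory is freed
// Dereferencing data at this point is undefined behavior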
With our view t created above, we now have managed memory allocated sufficiently large to hold our values, but at this point the data in the tensor is undefined. To set individual values in a view, we can use operator():
t(2,2) = 5.5;
t(4,6) = -10.0f;
The same operator can be used to get values:
float f = t(2,2); // f is now 5.5
operator() takes as many parameters as the rank of the tensor:
auto f0 = t0(); // t0 is a rank-0 tensor (scalar)
auto f1 = t1(5); // t1 is a rank-1 tensor (vector)
Tensors can also be initialized with initializer list syntax using the SetVals function:
auto myT = tensor_t<float, 2>({3, 3});
myT.SetVals({ {1,2,3}, {4,5,6}, {7,8,9} });
In other languages it's very common to initialize a tensor with a set of values on creation (ones, zeros, ranges). This will be covered later in the tutorial when we discuss operators, and it should become clear why we initialize this way.
The dimensions of the tensor are stored internally in a type named tensorShape_t. This tensor shape contains the rank and dimensions of the tensor view, but does not contain any information about type or storage. The shape can be retrieved using the Shape call:
auto shape = t.Shape();
Shape() is similar to NumPy's shape attribute.
The number of dimensions in a tensor can be retrieved using the Rank() member. Since the rank is known at compile time, this function uses the constexpr modifier:
auto r = t.Rank();
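Because the rank is known at compile time, it can also be used in compile-time contexts; a minimal sketch:
constexpr auto rank = t.Rank();                   // evaluated entirely at compile time
static_assert(rank == 2, "expected a 2D tensor"); // checked by the compiler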
The size of each individual dimension can be fetched using Size():
auto t1size = t1.Size(0); // Size of vector t1
auto t2rows = t2.Size(0); // Rows in t2
auto t2cols = t2.Size(1); // Cols in t2
As the name implies, t is a view into a region of memory. When the initial view is created and memory is allocated, the tensor view is of the entire 10x20 contiguous block of memory. Often we don't want to see the entire block of memory, but only want to view a subset of the underlying data. To do this, we use the Slice member function of the view class:
auto tCube = t.Slice({3, 5}, {6, 8}); // Cube of t using rows 3-5 and cols 5-7
auto tRectS = t.Slice({0, 0}, {matxEnd, matxEnd}, {2, 2}); // Rectangle with stride of 2 in both dimensions
auto tCol = t.Slice<1>({0, 4}, {matxEnd, matxDropDim}); // Create a 1D tensor with only column 4
auto tRow = t.Slice<1>({4, 0}, {matxDropDim, matxEnd}); // Create a 1D tensor with only row 4
Slice returns a new view of the tensor using start, stop, and optional stride parameters. Since views are simply light-weight descriptions of memory, none of these variants modify the data; they return an object with new parameters describing how the data is viewed. The resulting variables can be used exactly as the original view above:
auto cubeRows = tCube.Size(0); // 3
auto cubeCols = tCube.Size(1); // 3
auto colSize = tCol.Size(0); // 10 since the original tensor had 10 rows
auto rowSize = tRow.Size(0); // 20 since the original tensor had 20 columns
All view functions can be used on any type of existing view:
auto tCubeP = t.Slice({3, 5}, {6, 8}).Permute({1, 0});
The above code takes the same cube as before, but permutes the cube view by swapping the two dimensions.
Permuting a tensor is done using the Permute member function of a view:
auto t = tensor_t<float, 2>({10, 20});
auto tp = t.Permute({1,0});
tp is now a view into t where the rows and columns are swapped (a transpose). Permute is not limited to matrices, though:
auto t4 = tensor_t<float, 4>({10, 20, 5, 2});
auto tp4 = t4.Permute({1,3,2,0});
tp4 is now a permuted view of the original 4D tensor, but with the dimensions swapped as ordered in the initializer list.
Note that since no data is moved, permuting a tensor can be detrimental to performance, depending on the context. Permuting usually changes the strides of dimensions such that the memory access patterns are no longer optimal, and accessing the permuted view continuously can be very slow. If a permuted view will be accessed repeatedly, it's recommended to copy the permuted view into a new tensor so that the new layout is contiguous. Using the variables from above:
auto t4pc = tensor_t<float, 4>(tp4.Shape());
copy(t4pc, tp4);
t4pc will now contain the permuted data, but in contiguous memory.
Ultimately memory is always laid out linearly regardless of how we choose to view it. We can take advantage of this property by allowing a reshaped view of an existing view. This is commonly done when we want to take a tensor of one rank and view the data as if it were a tensor of a different rank. The product of dimensions in one rank must equal the product of dimensions in the other rank. For example, to take a 1D tensor of size 16 and reshape into a 2D tensor of shape 4x4:
auto t1 = tensor_t<float, 1>({16});
auto t2 = t1.View(tensorShape_t<2>({4, 4}));
t2 is now a view into the same memory as t1, but viewed as a different rank. Any modifications to one tensor will be seen in the other since no data was copied.
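As a small illustration (assuming the default row-major layout), a write through one view is visible through the other:
t2(0, 1) = 42.0f; // writes the second element of the underlying memory
// t1(1) now also reads 42.0f, since both views share the same storage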
Sometimes it's useful to increase the rank of an existing view to match the dimensions of another tensor. For example, to add a vector onto all rows in a matrix, you can clone the tensor to a higher rank to match the other tensor:
auto t1 = tensor_t<float, 1>({16});
auto t2 = tensor_t<float, 2>({16, 16});
// ... Initialize tensors
auto t1c = t1.Clone<2>({16, matxKeepDim});
t1c is now a new tensor view where each row is a replica of the tensor t1. Again, this is just a view and no data was modified or allocated, so modifying a row/column in either of these tensors will affect the other.
The keyword matxKeepDim tells MatX which dimensions should be kept from the original tensor and where they should be in the new tensor. In this example we used it in the columns place of the shape, but we also could have used {matxKeepDim, 16}, and we would have a 2D view where all columns of t1c match t1.
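Putting this together, here is a hedged sketch of the motivating example, adding a vector onto every row of a matrix (using the operator expressions and run covered later in this guide):
auto t1 = tensor_t<float, 1>({16});
auto t2 = tensor_t<float, 2>({16, 16});
// ... Initialize tensors
auto t1c = t1.Clone<2>({16, matxKeepDim}); // every row of t1c replicates t1
(t2 = t2 + t1c).run();                     // element-wise add of the vector to each row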
Note that in some cases MatX's broadcasting feature can be used instead of Clone. This allows an implicit expansion of ranks during an element-wise operation. For example, adding a 4D tensor to a 1D tensor will work as long as the outer dimension of the 4D tensor matches that of the 1D tensor. Broadcasting is covered in the documentation. Clone is much more powerful since it gives more control over which dimensions are cloned instead of assuming the outer dimensions.
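As a hedged sketch of the broadcasting alternative, following the dimension-matching rule stated above (see the broadcasting documentation for the exact rules):
auto t4 = tensor_t<float, 4>({16, 4, 4, 4}); // outer dimension matches the 1D tensor
auto t1 = tensor_t<float, 1>({16});
// ... Initialize tensors
(t4 = t4 + t1).run(); // t1 is implicitly broadcast across the remaining dimensions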
While using tensor views with CUDA managed memory is very convenient, there are situations where managed memory is not ideal. Integrating MatX into an existing codebase, or wanting more control over the memory copies, are both cases where using standard CUDA memory allocations is a better option. All constructors in the tensor_t class also allow a manually-allocated pointer to be passed in. MatX will not attempt to allocate or free any memory when this constructor is used, and it is up to the caller to manage the memory lifecycle:
float *my_device_ptr; // Assume my_device_ptr is allocated somewhere
auto t2 = tensor_t<float, 2>(my_device_ptr, {20,100});
t2(1,1) = 5; // Error! Don't do this!
In the example above, t2 is a new view pointing to the existing device-allocated memory. Unlike with managed memory, operator() cannot be used on t2 from the host side, or the code may crash.
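For completeness, a hedged sketch of the full lifecycle with caller-managed device memory, using standard CUDA runtime calls (error checking omitted):
float *my_device_ptr;
cudaMalloc(&my_device_ptr, 20 * 100 * sizeof(float)); // caller allocates device memory
{
  auto t2 = tensor_t<float, 2>(my_device_ptr, {20, 100});
  // ... use t2 in device-side expressions only
} // t2 destructs here, but MatX does not free caller-provided memory
cudaFree(my_device_ptr); // caller frees the memory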
Tensors aren't much use by themselves if all we can do is view them in various ways. MatX provides two main ways to perform computations on tensor views: operator expressions and executors.
Operator expressions provide a way to use algebraic expressions using tensor views and operators to generate an element-wise GPU kernel at compile-time. For example:
tensor_t<float, 2> a({10, 20});
tensor_t<float, 2> b({10, 20});
tensor_t<float, 2> c({10, 20});
(c = a + b).run();
Ignoring that the data is uninitialized, the first three lines simply create three 2D tensors with the same dimensions, while the last line runs an operator for the equation c = a + b. In MatX terminology, an operator is a type that creates a CUDA kernel at compile-time to perform the element-wise operation c = a + b. Inside an operator expression, the = operator indicates a deferred assignment, as opposed to the regular, immediate assignment operator. The run method takes an optional stream parameter, and executes the operation in the CUDA stream specified. Operators can use expressions of any length, and normal precedence rules apply.
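For illustration, a minimal sketch of launching into an explicitly created stream, using the standard CUDA runtime stream calls:
cudaStream_t stream;
cudaStreamCreate(&stream);      // create a CUDA stream
(c = a + b).run(stream);        // launch the element-wise kernel into that stream
cudaStreamSynchronize(stream);  // block until the work in the stream has finished
cudaStreamDestroy(stream);      // clean up the stream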
Tensor views can be mixed with scalars and operator functions:
auto op = (c = (a*a) + b / 2.0 + abs(a));
This expression squares each element in a, divides each element in b by 2, adds that quotient to the squared values, and finally adds the absolute value of every element in a. The result of the computation is stored in the tensor view c.
Again, the entire expression is generated at compile time and a kernel is stored in the variable op, but the kernel is not launched on the device. To launch the operator in a CUDA stream, we use the run function:
op.run(stream);
run can be thought of as a way to launch the operator/kernel into a CUDA stream, similar to the traditional triple angle bracket notation (<<<>>>). In MatX terminology, this is called an executor since it causes work to be executed on the device. It's often not necessary to store the operator at all if the execution is immediate; the two lines above can be combined:
(c = (a*a) + b / 2.0 + abs(a)).run(stream);
Sometimes the data we are using in an expression can be generated on-the-fly rather than coming from memory. Window functions, diagonal matrices, and the identity matrix are all examples of this. MatX provides "generators" that can be used inside of expressions to generate data:
(c = (a*a) + ones(a.Shape())).run(stream);
The example above uses the ones generator to create a tensor of all 1s matching the shape of a (10x20). ones simply returns the value 1 any time an element of it is requested, and no data is ever loaded from memory.
Implicit in the run call above is a CUDA executor type. As a beta feature, MatX also supports executing code on the host using a different executor. To run the same code on the host, a SingleThreadHostExecutor can be passed into run:
(c = (a*a) + ones(a.Shape())).run(SingleThreadHostExecutor{});
Instead of a CUDA stream, we pass an executor to run that instructs MatX to execute the code on the host instead of the device using a single CPU thread. Unlike CUDA calls, host executors are synchronous, and the line above will block until finished executing.
MatX heavily relies on a deferred or lazy execution model where expressions are not executed at the time of assignment. This allows the library to closely match the programming model of the GPU so that there are no surprises as to when code is executed. To facilitate the asynchronous model, MatX overloads the assignment operator (=) to indicate a deferred execution. The deferred assignment can be executed using the run() method on the expression. A statement as simple as the following:
(A = B).run()
should be viewed as a deferred assignment of tensor B into tensor A (a deep copy) that executes on the device when run() is called. The result of the lazy assignment expression can also be assigned into a temporary variable:
auto op = (A = B + C);
In the code above, the = on the right side indicates lazy assignment, while the = on the left side executes the copy constructor on the new variable op. The pattern above is expected to be used infrequently since expressions are typically executed on the same line as the definition, but sometimes it's useful for debugging purposes to look at the type of the expression. More complex expressions follow the same rules:
IFELSE(A > 5, B = A, C = B).run()
Remember that since the assignment operator is deferred in both cases above, none of these assignments will happen until A > 5 is executed on the device, at which point only one of these assignments will occur.
As mentioned above, it's common in high-level languages to initialize a tensor/array with a known set of values, for example generating a range of linearly-spaced values, all ones, or a diagonal matrix. These are all operations that do not need to be generated and stored in memory before use since they are all generated from a formula. MatX calls these types of operators generators, indicating that they generate data without storage.
Similar to high-level languages, generators can store their values in existing tensors like so:
tensor_t<float,1> t1{{100}};
(t1 = linspace_x(t1.Shape(), 1.0f, 100.0f)).run();
Similar to the expressions above, instead of an algebraic equation we are storing the output of the generator linspace_x into the tensor t1. linspace_x takes 3 parameters: the shape of the tensor (in this case we match t1), the start value, and the stop value. Since there are 100 elements in our tensor, it will generate the sequence 1.0, 2.0, 3.0, etc., and store it in t1.
Why not just make a shorthand version of linspace_x that stores directly into a tensor? The reason is that generators can be used as part of a larger expression and are not limited to simply assigning to a tensor. Expanding on our last example:
tensor_t<float,1> t1{{100}};
(t1 = ones<float>(t1.Shape()) + linspace_x(t1.Shape(), 1.0f, 100.0f) * 5.0).run();
Instead of setting t1 to a range, we multiply the range by 5.0, and add that to a vector of ones using the ones generator. Without any intermediate storage, we combined two generators, a multiply, and an add operator into a single kernel.
As mentioned above, the run function is an executor for launching operators onto the device. run is a special type of executor since it can take either views or operators as inputs and transform them in an element-wise kernel. Often the type of operation we are trying to do cannot be expressed as a MatX element-wise operator, so run cannot be used. Other types of executors exist for this purpose. These executors typically do more complex transformations on the data compared to an element-wise kernel, and often use optimized libraries on the back-end. Some examples are fft (Fast Fourier Transform), matmul (matrix multiply), and sort.
MatX provides an easy-to-use API for executing complex functions, like those mentioned above. These executors currently cannot be part of an operator expression and must be executed as their own statement:
fft(B, A, stream);
The fft executor above performs a 1D FFT on the tensor A, and stores the result in B. All executors use the same calling convention: the outputs are listed first, followed by the inputs, and finally an optional stream. Except for run, executors can only operate on tensor views, and not on generators or operators. For instance, you cannot take an fft of ones().
Unless documented otherwise, executors work on tensors of a specific rank. Matrix multiplies require a 2D tensor (matrix), 1D FFTs require a 1D tensor (vector), etc. If the rank of the tensor is higher than the expected rank, all higher dimensions will be batched. In the FFT call above, if A and B are 4D tensors, the outer 3 dimensions will be batched, launching a batched 1D FFT with no change in syntax.
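For instance, a hedged sketch of the batching behavior (shapes are illustrative, and complex-valued tensors are assumed for the FFT):
auto A = make_tensor<cuda::std::complex<float>>({8, 4, 1024}); // two batch dims, FFT length 1024
auto B = make_tensor<cuda::std::complex<float>>({8, 4, 1024});
// ... Initialize A
fft(B, A, stream); // launches 8 * 4 = 32 batched 1D FFTs with no change in syntax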
As mentioned above, the same tensor views can be used in operator expressions before or after executors:
(a = b + 2).run(stream);
matmul(c, a, b, stream);
The code above executes a kernel to store the result of b + 2 into a, then subsequently performs the matrix multiply c = a * b. Since the operator and the matrix multiply are launched in the same CUDA stream, they will execute serially.
Common reduction executors are also available, such as sum(), mean(), max(), etc.:
tensor_t<float,4> t4{{100, 100, 100, 100}};
tensor_t<float,0> t0{};
sum(t0, t4);
The above code performs an optimized sum reduction of t4 into t0. Currently reduction-type executors can take operators as an input. Please see the documentation for a list of which ones are compatible.
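Because reductions can take operators as inputs, an element-wise expression can be fused into the reduction; a hedged sketch, assuming the squaring expression is among the compatible inputs:
sum(t0, (t4 * t4)); // reduce the element-wise square of t4 without a temporary tensor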
MatX can generate random numbers using the cuRAND library as the backend. Random number generation consumes memory on the device, so the construction is slightly different from the other types above:
tensor_t<float, 2> t({100, 50});
randomGenerator_t<float> randData(t.TotalSize(), 0);
auto randTensor = randData.GetTensorView<2>({100,50}, NORMAL);
The code above constructs a random tensor view, randTensor, that can be used in expressions as a random-valued tensor. The line where the randomGenerator_t type is constructed allocates enough memory on the device to provide random numbers for a 100x50 tensor, and the following line gets a view from the generator. These two steps are separate because you typically want to limit how many generators you create due to their memory consumption, and instead create views from a small set of generators.
Using the random tensor view above in an expression is the same as any other view:
tensor_t<float, 2> t2({100, 50});
(t2 = randTensor*5 + randTensor).run(stream);
Unlike normal views, randTensor will give a new random value every time it is accessed. Not only will every element in the first multiply get a different random number, but when it's accessed again to add to the previous value, a new random number is generated for every element.
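If the same random values must be read more than once, a hedged sketch is to materialize a single draw into a normal tensor first:
tensor_t<float, 2> r({100, 50});
(r = randTensor).run(stream); // store one draw of random values in r
(t2 = r * 5 + r).run(stream); // both reads of r now see the same values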
This quick start guide was intended to give a very brief introduction to the concepts behind MatX, and how these concepts apply to the code. There's a lot more to explore in MatX and far more functions than could be listed here. For more examples we recommend browsing through the examples to see how to perform real tasks using MatX, and the API guide to see an exhaustive list of functions and operators.