This file describes the API of the CLTune auto-tuner for OpenCL and CUDA kernels. Everything is in the `cltune` namespace.
- `Tuner()`: Initializes a new tuner on platform 0 and device 0.
- `Tuner(size_t platform_id, size_t device_id)`: Initializes a new tuner on platform `platform_id` and device `device_id`. For CUDA, `platform_id` should be set to 0.
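For example, a tuner for the first device on the first platform can be created as follows. This is a minimal sketch, assuming the library's public header is included as `cltune.h`:

```cpp
#include "cltune.h"  // CLTune public header (adjust the include path to your installation)

int main() {
  // Platform 0, device 0; for CUDA the platform ID is always 0.
  cltune::Tuner tuner(0, 0);
  // ... add kernels, parameters, and arguments here (see below) ...
  return 0;
}
```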
- `size_t AddKernel(const std::vector<std::string> &filenames, const std::string &kernel_name, const IntRange &global, const IntRange &local)`: Adds a new kernel to the list of tuning-kernels and returns a unique ID (to be used when adding tuning parameters). This loads one or more kernel files given by a vector of `filenames`. The string `kernel_name` gives the entry point of the kernel. The parameters `global` and `local` represent the base 1D, 2D, or 3D thread configuration, with `local` being the size of a work-group/thread-block and `global` being the total number of threads in each dimension.
- `size_t AddKernelFromString(const std::string &source, const std::string &kernel_name, const IntRange &global, const IntRange &local)`: As above, but the kernel is loaded from a string instead of from a file.
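Continuing the sketch above, a kernel could be registered as follows. The file name, entry point, and thread configuration are purely illustrative:

```cpp
// Loads 'conv.opencl' (hypothetical), entry point 'conv', launched over a
// 2048x2048 global grid with 8x8 work-groups as the base configuration.
auto id = tuner.AddKernel({"conv.opencl"}, "conv", {2048, 2048}, {8, 8});

// Equivalent call when the source is already held in a std::string:
// auto id = tuner.AddKernelFromString(conv_source, "conv", {2048, 2048}, {8, 8});
```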
- `void AddParameter(const size_t id, const std::string &parameter_name, const std::vector<size_t> &values)`: Adds a new tuning parameter for the kernel with the given `id`. The parameter is named `parameter_name` and takes a list of tuneable integer values.
- `void MulGlobalSize(const size_t id, const StringRange range)`: Multiplies the global thread configuration of kernel `id` by one of the specified tuning parameters, given as a 1D, 2D, or 3D `range`.
- `void DivGlobalSize(const size_t id, const StringRange range)`: As above, but divides the global thread configuration instead.
- `void MulLocalSize(const size_t id, const StringRange range)`: As above, but multiplies the local thread configuration instead.
- `void DivLocalSize(const size_t id, const StringRange range)`: As above, but divides the local thread configuration instead.
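For example, two hypothetical parameters `TBX` and `TBY` (which would have to match preprocessor definitions used inside the kernel) can drive the work-group size:

```cpp
// Tuneable work-group dimensions, exposed to the kernel as -DTBX=... -DTBY=...
tuner.AddParameter(id, "TBX", {8, 16, 32});
tuner.AddParameter(id, "TBY", {8, 16, 32});

// Scale the base local size by the chosen TBX and TBY values...
tuner.MulLocalSize(id, {"TBX", "TBY"});
// ...and shrink the global size by the same factors, e.g. when each thread
// now processes a TBX-by-TBY tile of the problem.
tuner.DivGlobalSize(id, {"TBX", "TBY"});
```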
- `template <typename T> void AddArgumentInput(const std::vector<T> &source)`, `template <typename T> void AddArgumentOutput(const std::vector<T> &source)`, and `template <typename T> void AddArgumentScalar(const T argument)`: Functions to add kernel arguments: input buffers and output buffers (both given as `std::vector` host arrays) and scalars. These should be called in the order in which the arguments appear in the kernel.
- `void Tune()`: Starts the tuning process after everything is set up. This compiles all kernels and runs them for each permutation of the tuning parameters.
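A sketch of supplying arguments and running the tuner, assuming a hypothetical kernel signature `conv(const int n, const float *in, float *out)`:

```cpp
#include <vector>

std::vector<float> input(2048 * 2048, 1.0f);   // host-side input data
std::vector<float> output(2048 * 2048, 0.0f);  // host-side output buffer

// Arguments must be added in the order in which they appear in the kernel.
tuner.AddArgumentScalar(static_cast<int>(2048));
tuner.AddArgumentInput(input);
tuner.AddArgumentOutput(output);

tuner.Tune();  // compiles and benchmarks every configuration to be explored
```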
- `void AddConstraint(const size_t id, ConstraintFunction valid_if, const std::vector<std::string> &parameters)`: Adds a new constraint (e.g. must be equal to or larger than) to the set of parameters of kernel `id`. The constraint `valid_if` comes in the form of a function object which takes a vector of values for the tuning parameters named in `parameters`; the names are substituted by their actual values when the constraint is evaluated.
- `void SetLocalMemoryUsage(const size_t id, LocalMemoryFunction amount, const std::vector<std::string> &parameters)`: As above, but computes the local memory usage instead. If this method is not called, the local memory usage is assumed to be zero: no configurations will be excluded because of too much local memory.
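A sketch of a constraint and a local-memory estimate, continuing with the hypothetical `TBX` and `TBY` parameters; the lambdas receive the parameter values in the order listed:

```cpp
// Only allow configurations in which TBX is a multiple of TBY.
auto multiple_of = [](std::vector<size_t> v) { return (v[0] % v[1]) == 0; };
tuner.AddConstraint(id, multiple_of, {"TBX", "TBY"});

// Estimate local memory usage as one float per thread in the work-group,
// so that configurations exceeding the device's limit are skipped.
auto local_memory = [](std::vector<size_t> v) { return v[0] * v[1] * sizeof(float); };
tuner.SetLocalMemoryUsage(id, local_memory, {"TBX", "TBY"});
```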
- `void SetReference(const std::vector<std::string> &filenames, const std::string &kernel_name, const IntRange &global, const IntRange &local)`: Sets the reference kernel for automatic verification purposes. Takes the same arguments as the `AddKernel()` method, but since there can be only one reference kernel no ID is returned. Calling this method again overwrites the previous reference kernel.
- `void SetReferenceFromString(const std::string &source, const std::string &kernel_name, const IntRange &global, const IntRange &local)`: As above, but the reference kernel is loaded from a string instead of from a file.
- `void AddParameterReference(const std::string &parameter_name, const size_t value)`: For convenience, a tuning 'parameter' `parameter_name` with a single `value` can also be added to the reference kernel. This is useful when the same kernel source is used both for tuning and as reference and certain values would otherwise remain undefined. It is not necessary to call this function when a separate, fully functional OpenCL or CUDA kernel is supplied as reference.
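A sketch of setting a reference kernel for verification; the file and kernel names are again hypothetical:

```cpp
// A plain, untuned implementation used to verify the output of every configuration.
tuner.SetReference({"conv_reference.opencl"}, "conv_reference", {2048, 2048}, {8, 8});

// If the tuned source itself were reused as the reference, any otherwise-undefined
// parameters could be pinned to fixed values instead:
// tuner.AddParameterReference("TBX", 8);
// tuner.AddParameterReference("TBY", 8);
```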
- `void UseFullSearch()`: Call this method before calling the `Tune()` method. This uses full search, i.e. all configurations are tested on the device and the best result is found by the tuner. This is the default behaviour: it is not necessary to call this method except to override a previously set search method.
- `void UseRandomSearch(const double fraction)`: Call this method before calling the `Tune()` method. This makes the tuner explore only a random subset of all configurations. The size of the subset is given by `fraction`; for example, passing `0.01` explores 1% of the search space.
- `void UseAnnealing(const double fraction, const double max_temperature)`: Call this method before calling the `Tune()` method. This makes the tuner explore only a subset (of size `fraction`) of all configurations according to the simulated annealing algorithm, with a maximum 'temperature' of `max_temperature`. Annealing uses randomly generated numbers, so behaviour will change from run to run.
- `void UsePSO(const double fraction, const size_t swarm_size, const double influence_global, const double influence_local, const double influence_random)`: Call this method before calling the `Tune()` method. This makes the tuner explore only a subset (of size `fraction`) of all configurations according to the particle swarm optimisation (PSO) algorithm, with a swarm size of `swarm_size` and fractional influence values for the global, local, and random search directions. PSO uses randomly generated numbers, so behaviour will change from run to run.
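Selecting a search strategy before calling `Tune()`; the numeric values below are purely illustrative:

```cpp
// Explore 5% of the search space at random:
tuner.UseRandomSearch(0.05);

// ...or use simulated annealing over 5% of the space with a maximum 'temperature' of 4.0:
// tuner.UseAnnealing(0.05, 4.0);

// ...or particle swarm optimisation with a swarm of 8 particles:
// tuner.UsePSO(0.05, 8, 0.4, 0.0, 0.4);
```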
- `void ModelPrediction(const Model model_type, const float validation_fraction, const size_t test_top_x_configurations)`: Call this method after calling the `Tune()` method. Trains a machine-learning model of type `model_type` (`kLinearRegression` or `kNeuralNetwork`) based on the search space explored so far. All missing data-points are then estimated from this model, after which the top `test_top_x_configurations` configurations are tested on the actual device. Training a model is only useful if just a fraction of the search space has been explored, as is the case when for example random search is used.
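A sketch of model-based prediction following a partial search; this assumes the `Model` enumeration is accessed as `cltune::Model::kNeuralNetwork`:

```cpp
tuner.UseRandomSearch(0.05);  // explore only a fraction of the search space...
tuner.Tune();
// ...then estimate the remaining configurations with a neural-network model
// (validation fraction 0.8) and benchmark the predicted top-10 on the device.
tuner.ModelPrediction(cltune::Model::kNeuralNetwork, 0.8f, 10);
```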
- `std::unordered_map<std::string, size_t> GetBestResult()`: Retrieves the parameters of the best tuning result and returns them to the caller as a map from strings (parameter names) to integers (parameter values).
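Retrieving the winning configuration programmatically, for example to feed it into a build system:

```cpp
#include <cstdio>

// Maps parameter names to their best-found values, e.g. best["TBX"] (illustrative).
auto best = tuner.GetBestResult();
for (const auto &entry : best) {
  std::printf("%s = %zu\n", entry.first.c_str(), entry.second);
}
```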
- `void OutputSearchLog(const std::string &filename)`: Outputs the search process to the file `filename`.
- `double PrintToScreen() const`: Prints the results of the tuning to screen (stdout). Returns the best-case execution time in milliseconds.
- `void PrintFormatted() const`: Prints the results of the tuning to screen as a formatted table (stdout).
- `void PrintJSON(const std::string &filename, const std::vector<std::pair<std::string,std::string>> &descriptions) const`: Prints the results of the tuning to the file `filename` in JSON format. Additional key-value input can be given as a vector of pairs through the `descriptions` argument.
- `void PrintToFile(const std::string &filename) const`: Prints the results of the tuning to the file `filename` in plain-text format.
- `void SuppressOutput()`: Disables all further printing to screen (stdout).
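Finally, a sketch of reporting the results in the various supported formats; file names and description pairs are illustrative:

```cpp
double best_ms = tuner.PrintToScreen();  // plain results; returns the best time in milliseconds
tuner.PrintFormatted();                  // tabular output to stdout
tuner.PrintJSON("results.json", {{"device", "GPU0"}, {"kernel", "conv"}});
tuner.PrintToFile("results.txt");
tuner.OutputSearchLog("search_log.txt");
```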