
machine learning pipeline #41

Open

UniqueFool opened this issue Jun 9, 2016 · 9 comments
@UniqueFool

This is related to another discussion currently taking place here: jrprice/Oclgrind#109 (comment)

The idea is to emulate an OpenCL kernel using oclgrind, use this to gather kernel-specific runtime information (think dataflow, variable lifetime), and use this information in the ML pipeline to do more sophisticated transformations based on much more comprehensive, and better, information about the kernel's runtime behavior.

To pull this off, some kind of interface would need to be established between the kernel virtualization and tuner components, even if that just means serializing kernel-specific data to a file on disk and using that for the ML pipeline.
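For concreteness, a minimal sketch of what that file-on-disk interface could look like. The metric names here are purely illustrative assumptions, not anything oclgrind actually emits; the point is just the round trip from emulator output to ML features:

```python
import json

# Hypothetical runtime metrics that an emulator-side plugin could collect
# for one kernel invocation (names are made up for illustration).
metrics = {
    "kernel": "matvec",
    "global_memory_loads": 1048576,
    "global_memory_stores": 4096,
    "branch_divergence_ratio": 0.12,
    "max_live_variables": 18,
}

# Emulator side: dump the gathered data to disk ...
with open("matvec_profile.json", "w") as f:
    json.dump(metrics, f, indent=2)

# ... tuner side: read it back and turn it into a feature vector for the model.
with open("matvec_profile.json") as f:
    profile = json.load(f)

features = [profile["global_memory_loads"],
            profile["branch_divergence_ratio"],
            profile["max_live_variables"]]
print(features)  # → [1048576, 0.12, 18]
```

A flat JSON file is obviously the crudest possible coupling, but it keeps the emulator and the tuner fully decoupled, which matches the "even if that just means serializing to disk" idea above.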

@CNugteren
Owner

Interesting. I'll take a look at oclgrind to get a better understanding of what you want, and I'll get back to you soon. Perhaps the Collective Knowledge framework might also be of some help: https://github.com/ctuning/ck

@UniqueFool
Author

Wow, I wasn't even aware that something like CK existed ... I'll have to do some reading now.

@CNugteren
Owner

CNugteren commented Jun 13, 2016

I found some time and I think I understand your idea. By the way, in your first post you meant "emulate an OpenCL device", not "emulate an OpenCL kernel", right?

I am not sure if CLTune is what you are looking for though. What is your use-case exactly? I can interpret your goal in two ways:

  1. You are trying to machine-learn a kernel optimiser/tuner based on previous kernels it has seen. So you'll need a lot of kernels and some static and run-time information (that's where oclgrind comes into play). Then you can learn what optimisations are a good choice given static and run-time information of a previously unseen kernel.
  2. You are trying to optimise a single kernel but the optimisation-space is too vast. In that case you'll hope that some static and run-time information (oclgrind again here) can help you guide a machine-learned model faster towards a good (or the best) solution.

In the first case CLTune is really not the right choice: it can only perform 'optimisations' that are pre-programmed into a kernel using pre-processor variables. CLTune is a tool to help you explore those options, optionally using machine learning to guide you faster towards a decision. Better to hook this up in the compiler itself, I would say.
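To make "pre-programmed using pre-processor variables" concrete, here is a rough sketch of that style of tuning, independent of CLTune itself. The parameter names are invented for illustration, and this is not the CLTune API; each configuration simply becomes a set of -D flags passed to the OpenCL kernel compiler:

```python
import itertools

# Hypothetical tunable pre-processor parameters baked into a kernel
# (names are made up; a real kernel would use them in #if/#define logic).
parameters = {
    "TILE_SIZE": [8, 16, 32],
    "VECTOR_WIDTH": [1, 2, 4],
    "UNROLL": [0, 1],
}

def build_options(config):
    # Each configuration maps to -D defines for clBuildProgram's options string.
    return " ".join(f"-D{name}={value}" for name, value in config.items())

# The tuner's search space is the cross product of all parameter values.
configs = [dict(zip(parameters, values))
           for values in itertools.product(*parameters.values())]

print(len(configs))               # 3 * 3 * 2 = 18 configurations
print(build_options(configs[0]))  # → -DTILE_SIZE=8 -DVECTOR_WIDTH=1 -DUNROLL=0
```

This also shows why the space explodes quickly: every added parameter multiplies the number of configurations, which is where a model-guided search starts to pay off.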

For the second case it might be a better fit, but I am not so sure whether this extra information will be helpful to train a model. With the extra data we might also need to look at larger models that can capture this new information. Keep in mind that I am currently not even using the static data that is readily available (number of instructions of some sort, number of branches, vector width, architecture details); I am only using the current user-defined 'configuration'. So perhaps it is better to start there, instead of using run-time information from device emulation?

@UniqueFool
Author

Yes, I meant "device", as you said. Your 2) describes the idea pretty well, i.e. it has more to do with kernel-specific runtime information and using that to come up with/guide different transformations.

I will have to do some reading to see if this is really feasible, for all the reasons you mentioned. However, I did reference a few papers that basically describe doing this sort of thing.

So it really is more about narrowing-down and guiding the search space based on kernel-specific information that can be gathered via emulated execution.

@CNugteren
Owner

OK! Which papers are those? I'm interested as well to see what's possible.

@UniqueFool
Author

I basically worked through the referenced paper and its references section: http://arxiv.org/pdf/1506.00842v1.pdf

We have developed and validated a machine learning based auto-tuning framework for OpenCL. The framework measures the performance of several candidate implementations from a parameter configuration space and uses this result to build an artificial neural network, which works as a performance model. This model is then used to find interesting parts of the configuration space, which are explored exhaustively to find good candidate implementations. Our neural network model achieves a mean relative error as low as 6.1% for three different benchmarks executed on three different devices, an Intel i7 3770 CPU, an Nvidia K40 GPU and an AMD Radeon HD 7970. The autotuner is able to find good configurations, at best only 1.3% slower than the best configuration.

Future work includes enhancing the performance of the model, in particular with regard to invalid configurations, evaluating the model on novel hardware architectures beyond just CPUs and GPUs, and integrating problem parameters into the performance model. Incorporating advanced new features specific to a given architecture [39] will remain challenging. However, studying multi-GPU systems [40] and looking into multi-variate analysis [41] may also be interesting avenues of inquiry.
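The workflow the abstract describes — measure a sample, fit a performance model, then exhaustively explore only the region the model flags as promising — can be sketched in miniature. Everything below is a toy stand-in: the cost surface is synthetic (a real tuner would compile and time the kernel), and a nearest-neighbour lookup replaces the paper's neural network:

```python
import random

random.seed(0)

def runtime(tile, unroll):
    # Synthetic cost surface with a minimum near (16, 2) plus some noise;
    # stands in for actually benchmarking a kernel configuration.
    return (tile - 16) ** 2 + 4 * (unroll - 2) ** 2 + random.random()

# Full configuration space: 5 tile sizes x 4 unroll factors = 20 configs.
space = [(t, u) for t in (4, 8, 16, 32, 64) for u in (0, 1, 2, 3)]

# 1) Measure a random sample of the space.
sample = random.sample(space, 8)
measured = {cfg: runtime(*cfg) for cfg in sample}

# 2) Cheap surrogate model: predict a config's runtime from its
#    nearest measured neighbour (a stand-in for the paper's ANN).
def predict(cfg):
    nearest = min(measured,
                  key=lambda m: (m[0] - cfg[0]) ** 2 + (m[1] - cfg[1]) ** 2)
    return measured[nearest]

# 3) Exhaustively benchmark only the most promising quarter of the space.
promising = sorted(space, key=predict)[:len(space) // 4]
best = min(promising, key=lambda cfg: runtime(*cfg))
print(best)
```

The payoff is that only the sample plus the shortlisted quarter ever gets benchmarked for real, which is exactly the "find interesting parts, then explore them exhaustively" idea from the quote.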

@CNugteren
Owner

CNugteren commented Jun 15, 2016

This is exactly what CLTune is also doing. I mainly wrote CLTune because the authors of the other paper did not make any tool available. However, I did not evaluate the machine learning part much; the paper actually doesn't include it at all (http://www.cedricnugteren.nl/downloads/Nugteren2015a.pdf). Perhaps someone should do some more experiments using CLTune and a small neural network?

Or are you perhaps referring to the future work part of the paper:

and integrating problem parameters into the performance model

I am not sure exactly what the authors mean by this, but it could be that this is what you are referring to? In that case I would also contact the authors and see if they haven't already done this.

@bhack

bhack commented Jul 19, 2016

A really interesting thread. /cc @hughperkins
