Autotuning example: SLAMBench
Please do not forget to check the Getting Started Guides to understand CK concepts!
We expect that you have already followed the previous sections of the Getting Started Guide, so you are familiar with the main CK concepts and have installed CK on your machine.
In this part, we will demonstrate how CK is used to reproduce performance/energy exploration of various algorithmic and OpenCL parameters of KFusion, a computer vision OpenCL-based application from the SLAMBench suite developed in the UK PAMELA project.
Obtain shared components from GitHub (source code, datasets, packages):
$ ck pull repo:reproduce-pamela-project
Check that this repository is correctly installed by finding SLAMBench:
$ ck list program:slambench*
or
$ ck search program --tags=slambench
You should see several entries with various implementations of SLAMBench (CPU, CPU with OpenMP, CUDA, OpenCL, OpenCL with fixes for a DragonBoard).
You may also check out our public notes about compilation and execution of these programs via
$ ck wiki program:slambench-1.1-cpu
$ ck wiki program:slambench-1.1-opencl
Note that the above repository contains a few relatively small data sets (~40MB), i.e. 1-4 frames in PNG with a 3D depth file or in raw 3D video format:
$ ck list dataset:slambench*
Much larger datasets in the CK format are available from our Google Drive.
- Medium dataset (~55MB)
- Large dataset (~550MB)
- Large dataset in PNG (~2GB)
If you download a larger dataset as a ZIP file, you need to register it in CK as follows:
$ ck add repo --zip=<full path to ZIP>
The OpenCL version of SLAMBench has the following dependencies:
- Compiler - any standard compiler (we already have a pre-set environment for generic GCC in CK)
- OpenCL - vendor OpenCL library
- TooN - vector library
- xOpenME - our run-time library to expose various parameters to the outside world via JSON files (or to tune internal application parameters); it is a stripped-down and simplified version of our OpenME plugin framework
- hardware- and OS-dependent scripts to change the CPU/GPU frequency and monitor energy
We provide CK packages for all of the above dependencies except OpenCL. We expect that OpenCL is already natively installed on your target machine, so you only have to register its path in CK.
Recall that in CK, rather than hardwiring paths to binaries and libraries, we use env entries that contain env.sh or env.bat scripts pre-setting the OS environment for a given tool and version.
Hence, inside CK, before calling a tool (say, GCC), we find all related env entries by tags (say, compiler,gcc), let the user choose which compiler version to use (in case of multiple environments), set this environment, and only then call GCC.
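To illustrate this mechanism, here is a minimal Python sketch (a simplified illustration, not CK's actual implementation; the env.sh path is hypothetical) that sources an environment script in a subshell and only then invokes the tool:

```python
import subprocess

# Hypothetical path to a CK-generated environment entry for GCC;
# in CK such entries are found by tags (e.g. compiler,gcc).
env_script = "/path/to/ck-repo/env/some-uid/env.sh"

# Source the env script in a subshell, then call the tool with the
# resulting environment - mimicking what CK does before running GCC.
subprocess.run(["bash", "-c", "source {} && gcc --version".format(env_script)],
               check=True)
```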
However, rather than preparing such env entries manually, we created the soft module, which automates this process for many existing and standard tools and libraries.
Therefore, to register your OpenCL library in CK, find the most similar CK software description via:
$ ck list soft:lib.opencl.*
You can register a generic Linux OpenCL library (libOpenCL.so) in CK via:
$ ck setup soft:lib.opencl.linux
If you use ARM Mali GPUs, you can set up the related OpenCL version for Linux (for example, for Chromebooks) via
$ ck setup soft:lib.opencl.mali
or for Android via
$ ck setup soft:lib.opencl.mali --target_os=android19-arm
You will be asked a few questions, including the installation path for this library excluding /lib, i.e. /usr if the library is installed in /usr/lib!
If the OpenCL library is registered correctly, you should be able to compile and run a simple OpenCL program that prints the number of available devices:
$ ck compile program:tool-print-opencl-devices
$ ck run program:tool-print-opencl-devices
Sometimes, you may need to run an OpenCL program as root. In such cases, you can run the above program as root via
$ ck run program:tool-print-opencl-devices --sudo
As for TooN, we prepared and shared a CK package including an archive with version 2.2. You can install it on your machine via:
$ ck install package:lib-toon-2.2
As for our xOpenME run-time library, which exposes various internal application parameters to the outside world, it is not strictly necessary to install it here, since it will be installed during the first compilation of a program that uses it. However, just in case, here is the explicit installation of this package:
$ ck install package:lib-rtl-xopenme
Note that we plan to improve and automate this process further - please keep in touch about the progress!
Some low-level hardware functionality, such as setting the CPU and GPU frequency, obtaining hardware counters or monitoring the energy consumed by various parts of the hardware, may require OS tools and scripts.
We started collecting and unifying such scripts in the platform.init container in the ck-autotuning repository. You can check the available sub-directories in this entry via
$ ck list platform.init
Currently we support the following platforms:
- generic-android
- generic-linux
- generic-odroid ([Hardkernel](http://www.hardkernel.com/main/products/prdt_info.php?g_code=G140448267127))
- chromebook-ubuntu ([Samsung](http://community.arm.com/groups/arm-mali-graphics/blog/2014/12/18/installing-opencl-on-chromebook-2-in-30-minutes))
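For reference, on generic Linux platforms such frequency scripts usually work through the standard sysfs cpufreq interface. The following Python sketch shows what a ck-set-performance-style script might do (an illustration only, not the actual script; it assumes the standard cpufreq sysfs layout and typically requires root):

```python
import glob

# Set the "performance" governor on every CPU core via the standard
# Linux cpufreq sysfs interface (usually requires root privileges).
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
    with open(path, "w") as f:
        f.write("performance")
```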
Before starting experiments, you need to copy (and possibly customize) the scripts from the directory of the closest matching platform to any directory in your PATH. For example, you can create a new directory $HOME/bin, copy all the scripts there and add it to your PATH:
$ export PATH=$HOME/bin:$PATH
Now you can try several scripts (they may require your sudo password):
$ ck-print-cpu-freq
$ ck-print-gpu-freq
Then, you can try to set your system to the maximum frequency:
$ ck-set-performance
$ ck-print-cpu-freq
$ ck-print-gpu-freq
Next, you can try to set your system to power-saving mode:
$ ck-set-powersave
$ ck-print-cpu-freq
$ ck-print-gpu-freq
If something is not working, you may customize your local scripts or even share them for your platform.
Note that if you use an Android-based target, you should copy those scripts to /data/local/tmp.
Also note that the Odroid-XU3 platform has scripts that can measure voltage, current, power and energy for the GPU, memory system, A7 and A15 cores, as well as processor temperature and fan speed.
Now you should be able to compile and run slambench via the preset CK program pipeline.
First, let's try to run the CPU version of slambench while setting the CPU and GPU frequency to the maximum:
$ ck run pipeline:program program_uoa=slambench-1.1-cpu --speed --cpu_freq=max --gpu_freq=max
It should set the CPU/GPU frequency to the maximum, ask you to select a command line (with various algorithmic parameters), ask you to select one data set, run the program 3 times, perform basic statistical analysis of the variation, and output various run-time characteristics.
Note that you should not select cmd_dse as the command line here, since it contains special CK characters (it is used later for algorithm design space exploration).
Typical output of the execution will be as follows:
{ "execution_time": 0.19, "execution_time_kernel_0": 0.19, "execution_time_kernel_X": "@@0.000000,0.000000,0.000000,0.000000", "execution_time_kernel_Y": "@@0.000000,0.000000,0.000000,0.000000", "execution_time_kernel_Z": "@@0.000000,0.000000,0.000000,0.000000", "execution_time_kernel_acquisition": "@@0.009493,0.007024,0.010243,0.000659", "execution_time_kernel_computation": "@@0.019478,0.016749,0.017970,0.032150", "execution_time_kernel_frame": "@@0,1,2,3", "execution_time_kernel_integrated": "@@1,1,1,1", "execution_time_kernel_integration": "@@0.008990,0.007302,0.007298,0.007190", "execution_time_kernel_preprocessing": "@@0.003609,0.003485,0.003482,0.003184", "execution_time_kernel_raycasting": "@@0.000009,0.000003,0.000004,0.017025", "execution_time_kernel_rendering": "@@0.020048,0.016081,0.022103,0.016127", "execution_time_kernel_total": "@@0.049019,0.039854,0.050316,0.048936", "execution_time_kernel_tracked": "@@0,0,0,0", "execution_time_kernel_tracking": "@@0.006869,0.005958,0.007187,0.004751", "fps": 21.262458471760795, "frames": 4, "run_time_state": { "frames": 4, "input_size_x": 640, "input_size_y": 480, "run_time_fps": 35.16001, "run_time_total": 0.113766 }, "total_execution_time_from_kernels": 0.18812500000000001 }
Briefly, these parameters, including run_time_state, are exposed via our light-weight xOpenME instrumentation library, which is described in more detail in another autotuning example.
It exposes the video width and height, accelerator name, number of compute units, number of OpenMP threads, number of executed frames, execution time and FPS (frames per second).
We also expose information about the execution time of all OpenCL kernels (when more than one frame is executed, we save such results in a single string that starts with @@ and separates the per-frame times with commas).
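For example, assuming the JSON output above has been saved to a file (the file name here is hypothetical), the per-frame kernel timings can be recovered with a few lines of Python:

```python
import json

# Load the program output shown above (hypothetical file name).
with open("tmp-slambench-output.json") as f:
    out = json.load(f)

# Values starting with "@@" hold one comma-separated entry per frame.
raw = out["execution_time_kernel_total"]   # "@@0.049019,0.039854,..."
per_frame = [float(x) for x in raw.lstrip("@").split(",")]

print("kernel total per frame:", per_frame)
print("mean:", sum(per_frame) / len(per_frame))
```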
Now, let's run the OpenCL version:
$ ck run pipeline:program program_uoa=slambench-1.1-opencl --speed --cpu_freq=max --gpu_freq=max
It is also possible to run the same experiment while setting the CPU and GPU frequency to the minimum, i.e.:
$ ck run pipeline:program program_uoa=slambench-1.1-opencl --speed --cpu_freq=min --gpu_freq=min
Note that on the Odroid board it is possible to add energy to the measured characteristics simply via the extra flag --energy, as one of the sub-cases of CK-powered universal and multi-objective autotuning (such as balancing performance, energy, speed and code size).
As usual, we collect various information about successful or problematic compilation and execution from the community on the CK wiki via:
$ ck wiki program:slambench-1.1-cpu
Application developers are typically interested in tuning their applications for a given device under various constraints, e.g. maximizing FPS and minimizing energy without losing accuracy.
Therefore, we prepared and shared a scenario script:aggregate-executions where users can compile and run slambench, and share information about their hardware as well as FPS, execution time and energy (if supported) in our public cKnowledge.org repository.
As in previous experiments, you first need to prepare the pipeline:
$ ck find script:_clean_program_pipeline.bat
Go to the above directory, then run:
$ ./_clean_program_pipeline.bat
$ ./_setup_program_pipeline.bat
Now you can run the pipeline and record experiments in the remote repository via
$ ./aggregate_executions_in_remote_freq_max.bat
or if you want to select OpenCL device 1, use the following scripts:
- Windows: ./aggregate_executions_in_remote_freq_max_device1.bat
- Linux: ./aggregate_executions_in_remote_freq_max_device1.sh
You can check and customize the associated .json files for your own needs.
Results of experiments are aggregated in the remote-ck repository, which points to our live demo CK repository at cknowledge.org/repo.
You can check out the current results as an interactive table here (you can click on the headers to sort, etc.). Note that we plan to considerably improve this functionality using standard web packages, possibly with Django and Pandas - help is appreciated!
You can also see an interactive FPS histogram across all devices here.
Finally, you can plot the above graph locally while easily obtaining the data from our remote repository:
$ ck plot graph: @aggregate_executions_in_remote_plot2.json
where aggregate_executions_in_remote_plot2.json has the following format:
{ "remote_repo_uoa":"remote-ck", "experiment_repo_uoa":"upload", "experiment_module_uoa":"experiment", "data_uoa_list":["pamela-slambench-best-aggregated"], "ignore_graph_separation":"yes", "flat_keys_list":[ "##characteristics#run#fps#all" ], "ignore_point_if_none":"yes", "expand_list":"yes", "plot_type":"mpl_1d_histogram", "bins":100, "title":"Powered by Collective Knowledge", "axis_x_desc": "Slambench FPS (sec.)", "axis_y_desc": "FPS Histogram", "plot_grid":"yes", "mpl_image_size_x":"12", "mpl_image_size_y":"6", "mpl_image_dpi":"100" }
We also use this approach to unify and crowdsource benchmarking and tuning across all available platforms and software: P1, P2, P3.
Besides recording various run-time parameters, we also use the OpenME interface in slambench to dump dynamic arrays and to monitor/visualize correct execution of the program. For example, we dump the original video frame as well as the processed ones after executing the kernels depthrender, trackrender and volumerender:
```c
#ifdef XOPENME
  if ((tock()-delay)>XOPENME_DUMPING_TIME)
  {
    if (getenv("XOPENME_DUMP_MEMORY_INPUT_RGB") && (atoi(getenv("XOPENME_DUMP_MEMORY_INPUT_RGB"))==1))
      xopenme_dump_memory("tmp-raw-input-rgb.rgb", inputRGB, sInputRGB);

    if (getenv("XOPENME_DUMP_MEMORY_DEPTHRENDER") && (atoi(getenv("XOPENME_DUMP_MEMORY_DEPTHRENDER"))==1))
      xopenme_dump_memory("tmp-raw-depthrender.rgb", depthRender, sDepthRender);

    if (getenv("XOPENME_DUMP_MEMORY_TRACKRENDER") && (atoi(getenv("XOPENME_DUMP_MEMORY_TRACKRENDER"))==1))
      xopenme_dump_memory("tmp-raw-trackrender.rgb", trackRender, sTrackRender);

    if (getenv("XOPENME_DUMP_MEMORY_VOLUMERENDER") && (atoi(getenv("XOPENME_DUMP_MEMORY_VOLUMERENDER"))==1))
      xopenme_dump_memory("tmp-raw-volumerender.rgb", volumeRender, sVolumeRender);

    xopenme_dump_state();
  }
#endif
```
The original slambench application requires heavy Qt and various other dependencies to visualize the run-time state via a GUI; this works fast but is not always easy to port and use (e.g. on monitor-less platforms).
Using OpenME and dumping the state as raw files allows us to keep slambench very compact, portable and simple, while reusing a standard web browser to visualize its execution.
You can test this functionality via several scripts prepared in program:slambench-1.1-opencl:
$ ck find program:slambench-1.1-opencl
Go to the above directory and then run _compile_speed.bat (Windows) or _compile_run.sh (Linux).
Run the following script in the background (preferably in another CMD interpreter):
$ python continuous_convert_of_images_local_tmp.py
The above script will detect new raw images (arrays) and continuously convert them to PNG or JPEG.
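For reference, the core of such a conversion can be implemented with the Python Imaging Library. The following is a rough sketch of the idea (not the actual script; the file names and the fixed 640x480 RGB geometry are assumptions based on the run-time state shown earlier):

```python
from PIL import Image

# Assumed frame geometry (matching input_size_x/input_size_y above)
# for the raw interleaved RGB dumps produced by xopenme_dump_memory.
WIDTH, HEIGHT = 640, 480

with open("tmp-raw-volumerender.rgb", "rb") as f:
    raw = f.read()

# Interpret the raw byte array as an RGB image and save it as PNG.
img = Image.frombytes("RGB", (WIDTH, HEIGHT), raw[:WIDTH * HEIGHT * 3])
img.save("tmp-raw-volumerender.png")
```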
Start web browser:
$ firefox _visualize_via_browser.html
This page refreshes every second to read the images and the JSON run-time info from OpenME into the browser.
Now you can start slambench in an infinite loop via _run_device0_loop.bat (Windows) or _run_device0_loop.sh (Linux) for OpenCL device 0, or via _run_device1_loop.bat (Windows) or _run_device1_loop.sh (Linux) for OpenCL device 1.
Normally, you should be able to observe the changing run-time state of slambench.
Furthermore, you may use a similar approach to embed such demos into your interactive reports and graphs. To demonstrate this mode, start the CK web service via ck start web, run python continuous_convert_of_images.py (instead of continuous_convert_of_images_local_tmp.py) in a separate CMD interpreter, start slambench in an infinite loop, and open the following page:
$ firefox http://localhost:3344/web.php?wcid=1e348bd6ab43ce8a:3d487daf942ff64b
Don't forget to click on auto-refresh at the bottom of the above page and select a refresh rate!
If you would like to create a similar live report, you can check out the following report entry and its live.html:
$ ck find report:slambench-visualization
Note that running CK and a web browser on the same machine can influence slambench performance. Hence, we mainly use this mode for debugging and demos.
We prepared several scripts to demonstrate slambench autotuning via CK:
$ ck list reproduce-pamela-project:script:
- autotuning-compiler-flags-linux - traditional compiler flag autotuning on Linux.
- autotuning-compiler-flags-windows - traditional compiler flag autotuning on Windows.
- algorithm-exploration - exploring algorithm parameters to balance accuracy versus execution time and energy.
Compiler autotuning is performed similarly to the first example in the Getting Started Guide.
Basically, you need to clean and prepare the program pipeline (to set default parameters and resolve dependencies) via
$ ./_clean_program_pipeline.sh
$ ./_setup_program_pipeline.sh
(On the Odroid-XU3 platform, you can additionally monitor energy consumption by adding '"energy":"yes"' to _setup_program_pipeline.json when setting up the pipeline.)
To perform compiler flag autotuning, run:
$ ./autotune_program_pipeline_base_best.sh
$ ./autotune_program_pipeline_i100.sh
$ ./autotune_program_pipeline_i100_apply_pareto.sh
The last script applies a Pareto filter to the results obtained after autotuning has finished. Alternatively, it is possible to run autotuning with the Pareto filter enabled, leaving only points on the frontier for execution time and code size (or energy and code size; currently the relation between execution time and energy is linear, so they are equivalent, but this may change in the future when hardware reconfiguration or memory-hierarchy optimizations are used):
$ ./autotune_program_pipeline_i100_run_as_pareto.sh
This script replaces autotune_program_pipeline_i100.sh and autotune_program_pipeline_i100_apply_pareto.sh.
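For reference, such a two-objective Pareto filter is conceptually simple: keep only the points that are not dominated in both objectives. A minimal Python sketch of the idea (an illustration with made-up points, not CK's implementation), minimizing execution time and code size:

```python
# Each point: (execution_time_sec, code_size_bytes) for one autotuning iteration.
points = [(1.00, 5000), (0.80, 5200), (0.75, 6000), (0.90, 4800), (0.70, 7000)]

def pareto_frontier(points):
    """Keep points not dominated by any other point (both objectives minimized)."""
    frontier = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

print(pareto_frontier(points))  # (1.00, 5000) is dominated by (0.90, 4800)
```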
We also prepared an example of live visualization of the results with interactive graphs for this autotuning in report:slambench-crowdtuning-via-ck. Just run ck start web and then check it out via:
$ firefox http://localhost:3344/?wcid=1e348bd6ab43ce8a:2b4a2375b24fc461
Note that even with the relatively recent GCC 4.9.x, we can obtain ~30% improvement in execution time and, at the same time, ~15% improvement in code size on both embedded and server platforms (ARM, Intel).
We also provide an even more exciting example of exploring the algorithm design space of KFusion. We cordially thank Bruno Bodin for providing us with algorithmic parameters that can be tuned.
Autotuning the algorithmic parameters changes the algorithm accuracy but can dramatically improve FPS and energy usage. Surprisingly, the design space exploration can find parameters that make the computation both much faster (up to an order of magnitude!) and more accurate!
The script to do algorithmic exploration can be found in
$ ck find script:algorithm-exploration
As usual, you can prepare the pipeline; however, do not forget to choose cmd_dse as the command-line key, since the algorithmic parameters for slambench are currently specified via the command line.
You can customize exploration via algorithm_exploration.json:
{ "choices_order":[ ["##choices#run_cmd_key_c"], ["##choices#run_cmd_key_r"], ["##choices#run_cmd_key_l"], ["##choices#run_cmd_key_m"], ["##choices#run_cmd_key_y1"], ["##choices#run_cmd_key_y2"], ["##choices#run_cmd_key_y3"], ["##choices#run_cmd_key_v"] ], "choices_selection": [ {"type":"random-with-next", "choice":["1","2","4"], "default":"1"}, {"type":"random-with-next", "start":1, "stop":32, "step":1, "default":1}, {"type":"random-with-next", "start":1, "stop":64, "step":0.001, "default":1, "subtype":"float"}, {"type":"random-with-next", "start":0.1, "stop":0.5, "step":0.01, "default":1, "subtype":"float"}, {"type":"random-with-next", "start":1, "stop":16, "step":1, "default":10}, {"type":"random-with-next", "start":1, "stop":8, "step":1, "default":5}, {"type":"random-with-next", "start":1, "stop":8, "step":1, "default":4}, {"type":"random-with-next", "choice":[64,128,256,512], "default":128} ], "seed":12345, "iterations":1000, "repetitions":3, "record":"yes", "record_uoa":"algorithm-exploration-slambench", "features_keys_to_process":["##choices#*"], "record_params": { "search_point_by_features":"yes" } }
Note that the above run_cmd_key_... choices will substitute the respective variables in the cmd_dse command line taken from the SLAMBench CK meta-description:
"run_cmd_main": "$#BIN_FILE#$ -s 5.0 -p 0.34,0.5,0.24 -z 1 -c $#run_cmd_key_c#$ -r $#run_cmd_key_r#$ -l $#run_cmd_key_l#$ -m $#run_cmd_key_m#$ -k 481.2,480,320,240 -y $#run_cmd_key_y1#$,$#run_cmd_key_y2#$,$#run_cmd_key_y3#$ -v $#run_cmd_key_v#$ -i $#dataset_path#$$#dataset_filename#$ -o tmp-output.tmp"
Experimental results during the exploration will be recorded in experiment:algorithm-exploration-slambench. You can visualize them as a table via the CK web front-end (under the Experiment menu), or use the plot_with_variation.bat script to plot all sorted FPS results with variation in a Python MatPlotLib GUI.
This script can be customized via plot_with_variation.json:
{ "experiment_module_uoa":"experiment", "data_uoa_list":["algorithm-exploration-slambench", "algorithm-exploration-slambench-default"], "flat_keys_list":[ "##characteristics#run#fps#center", "##characteristics#run#fps#halfrange" ], "add_x_loop":"yes", "sort_index":0, "ignore_point_if_empty_string":"yes", "plot_type":"mpl_2d_scatter", "display_x_error_bar":"no", "display_y_error_bar":"yes", "title":"Powered by Collective Knowledge", "axis_x_desc":"Experiment No", "axis_y_desc":"FPS (frames per second)", "plot_grid":"yes", "mpl_image_size_x":"12", "mpl_image_size_y":"6", "mpl_image_dpi":"100", "point_style":{"1":{"elinewidth":"5", "color":"#dc3912"}, "0":{"color":"#3366cc"}} }
Unlike compiler flag tuning, such exploration can easily improve FPS by nearly an order of magnitude (on ARM and Intel GPUs we get up to ~8x speedups), but at the cost of very low accuracy or even a failing algorithm. Nevertheless, this exploration demonstrates well our CK top-down analysis, tuning and co-design methodology for software and hardware, similar to physics (see our long-term vision paper): we start by tuning coarse-grain behavior (algorithm level: fast, with high ROI), then move to middle-grain behavior (compiler passes: slow, with medium ROI), and only at the end tune fine-grain behavior (instruction level, such as superoptimization, as "the last mile": very slow and possibly yielding no improvement in the real application if, for example, an out-of-order processor mitigates such reordering).
We usually use predictive analytics to focus the search on the most profitable areas (better execution time, power consumption, code size, reliability). The unified representation of experiments in CK, however, allows us to easily re-target such analytics to other cases.
While performing the above exploration on a new platform, we observed 24 failures out of 100 experiments. Normally, engineers would have to spend a considerable amount of time understanding those failures. Instead, we decided to see whether standard decision trees could help engineers triage them.
We prepared a few sample scripts (CID=script:algorithm-exploration-analyze-failures) and shared experiments (CID=experiment:algorithm-exploration-slambench-failures-demo) to demonstrate this approach. You can find the scripts via
$ ck find script:algorithm-exploration-analyze-failures
In this directory, you can find two scripts:
- model-sklearn-dtc-build.bat - build a model
- model-sklearn-dtc-validate.bat - validate the model
and two JSON input files:
- model-sklearn-dtc.json - describes the model used (a decision tree of depth 2 from scikit-learn)
- model-input.json - describes experiment entries to process, features and objectives
The model-input.json file is as follows:
{ "data_uoa":"algorithm-exploration-slambench-failures-demo", "features_flat_keys_ext":"#min", "features_flat_keys_list":[ "##choices#run_cmd_key_c", "##choices#run_cmd_key_r", "##choices#run_cmd_key_l", "##choices#run_cmd_key_m", "##choices#run_cmd_key_y1", "##choices#run_cmd_key_y2", "##choices#run_cmd_key_y3", "##choices#run_cmd_key_v" ], "features_flat_keys_desc": { "##choices#run_cmd_key_c":{"name":"c"}, "##choices#run_cmd_key_r":{"name":"r"}, "##choices#run_cmd_key_l":{"name":"l"}, "##choices#run_cmd_key_m":{"name":"m"}, "##choices#run_cmd_key_y1":{"name":"y1"}, "##choices#run_cmd_key_y2":{"name":"y2"}, "##choices#run_cmd_key_y3":{"name":"y3"}, "##choices#run_cmd_key_v":{"name":"v"} }, "characteristics_flat_keys_list":[ "##characteristics#run#return_code#min" ], "remove_points_with_none":"yes", "keep_temp_files":"yes" }
Basically, the features are all the exploration dimensions, while the objective is the key that specifies whether SLAMBench failed ("YES", 24 cases) or not ("NO", 76 cases).
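For reference, the heart of such failure modeling is a standard depth-2 decision tree. A minimal scikit-learn sketch (an illustration with made-up sample data, not the actual scripts):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up sample: exploration dimensions (c, r, l, m, y1, y2, y3, v)
# and whether the corresponding SLAMBench run failed.
X = [
    [1,  4, 2.0, 0.2, 10, 5, 4, 512],
    [2,  8, 1.5, 0.3, 12, 6, 3, 128],
    [4,  2, 3.0, 0.4,  8, 4, 5, 512],
    [1, 16, 2.5, 0.1, 14, 7, 2,  64],
]
y = ["YES", "NO", "YES", "NO"]  # failed or not

# Depth-2 tree, as specified in model-sklearn-dtc.json.
model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(model, feature_names=["c", "r", "l", "m", "y1", "y2", "y3", "v"]))
```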
The automatically built decision tree shows that 22 of these failures can be attributed to the exploration parameter "volume size" (v) exceeding 384 (the right child of the root node).
This strongly suggests that, when performing design space exploration on this platform, the "volume size" dimension should not exceed 384 in order to avoid failures.
The remaining 78 samples (the left child of the root node) include only two failures. One occurs when "l ≤ 1.7860" (the leftmost leaf node). The other is an odd one out among the 77 samples for which the built model predicts no failures (for "v ≤ 384" and "l > 1.7860"). Thus, the model's prediction rate is 99% (99 out of 100 samples are classified correctly).
This approach allows engineers investigating the failures to focus on only a handful of samples: perhaps one case where "v > 384", one case where "v ≤ 384" and "l ≤ 1.7860", and one case where "v ≤ 384" and "l > 1.7860".
Furthermore, this model can be shared, validated and improved by others - for example, publicly via our pilot CK server, or privately within work groups (e.g. in companies).
The constraints derived from initial sampling can be used to focus exploration on areas where no failures occur. Similarly, building a model for performance can focus exploration on areas where improvements are more likely to occur.
We plan to use this CK-based approach in several collaborative projects to create adaptive, self-tuning applications that gradually reduce accuracy while keeping a stable FPS in order to save energy when the battery of a mobile device is nearly exhausted. We can also use CK to fix an energy budget and find the best FPS, or to implement any other user scenario that can be shared and reused for other applications, thus enabling collaborative experimentation (see P1, P2, P3, P4)!
We are open to collaborations to move this technology to existing tools and applications!
CK development is coordinated by the non-profit cTuning foundation and dividiti.