spec

A repository that contain some benchmarks for partial evaluation of different GPU application scenarios leveraging AnyDSL framework and CUDA. Benchmarks are available as Python notebooks.

Implementations

Multiple string matching:
- Three naive implementations in CUDA using global, constant and shared memory respectively to store the patterns
- Implementation utilizing partial evaluation to "store" the patterns
Image separable convolution
- Separable convolution taken from NVIDIA's SDK with constant memory to store the filter
- Implementation utilizing partial evaluation to "store" the filter

Partial evaluation

Partial evaluation or program specialization is a well-known optimization technique that given a program and part of its input data, called static, specializes the program with respect to the data, producing another, optimized, program which if given only the remaining part of input data, called dynamic, yeilds the same results as the original program would have produced being executed given both parts of the input data.

Considering optimization of GPU memory accesses, partial evaluation is able to embed statically known array values directly into the code. More precisely, it allows the values to be accessed via instruction cache rather than through issuing load instructions.

For example, the following CUDA kernel:

__global__ void match_multy(char* patterns, int* p_sizes, int p_number, char* text, long text_size, char* result_buf) {

    long t_id = threadId();

    if(t_id < text_size){
        int p_offset = 0;
        int matched = 1;
        
        result_buf[t_id] = 0;

        for(int i = 0; i < p_number; i++) {//for each pattern
            matched = 1;
            if(t_id < text_size - p_sizes[i] + 1) {
                for(int j = 0; j < p_sizes[i]; j++) {
                
                    if(text[t_id + j] != patterns[j+p_offset]) {
                        matched = -1;
                        break;
                    }
                } 
            
                if(matched == 1) {
                    result_buf[t_id] = i+1; // 0 stands for missmatch
                }
            }
            p_offset += p_sizes[i];
        }             
    }
}

compiles to something like this:

However, since *patterns are fixed at runtime, partial evaluation could be applied resulting in the following residual compiled code:

The observed transformation results in the following performance of naive GPU pattern matching (evaluated using 16 file signatures as patterns and raw disk data as a subject string on GTX 1070):

The performance gain occured due to poor access patterns to each memory space aggravated by thread divergence. Poor access patterns induce either sequential loading or fetching redundant memory to satisfy alignment and fetch size. E.g. actual amount of memory transered for GTX 1070 matching with patterns in global memory and for partially evaluated version are depicted below (subject string size is 2.9GB raw disk data, patterns total length is about 160B)

Global memory version:

Partially evaluated one:

Thus, in case when e.g. constant memory achieves its best perforamance, partial evaluation hardly makes the application significantly faster. For example, considering Separable Convolution Filter implementation from NVIDIA's SDK, where filter kernel resides in constant memory and has the best access pattern, the relocation of the kernel performed by partial evaluation has the following effect:

Build instructions

Dependencies

Working installation of AnyDSL framework

AnyDSL should be installed with RUNTIME_JIT:=true option set in config.sh
Working installation of NVIDIA CUDA with all paths set up
Convolution filtering depends on libjpeg, e.g. sudo apt-get install libjpeg-dev
Python 3
Python Jupyter notebook to view benchmarks
Python matplotlib, e.g. pip3 install -r requirements.txt

Build

To build the test applications simply run:

mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DAnyDSL_runtime_DIR="${PATH_TO_ANYDSL_FOLDER}/runtime/build/share/anydsl/cmake/" && make

or

cmake .. -DAnyDSL_runtime_DIR="${PATH_TO_ANYDSL_FOLDER}/runtime/build/share/anydsl/cmake/"

if cmake --version >= 3.11

To run the coresponding benchmarks, move compiled applications to coresponding benchmarking folder, run jupyter notebook -> Cell->Run All. E.g. for multiple pattern matching:

mv build/matching/cpp/match.out ../matching/benchmarking && mv build/matching/impala/match_spec.out ../matching/benchmarking

Then open matching/benchmarking/MatchingBenchmarking.ipynb, choose a file with a subject string and pass it to Runner, e.g. one can provide /dev/sda1 as input: runner = Runner(...,filename='/dev/sda1',offset=0...) and click Cell -> Run All

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
convolution		convolution
haskell		haskell
llvm-spec-examples		llvm-spec-examples
matching		matching
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
CMakeListsPrev.txt		CMakeListsPrev.txt
README.md		README.md
kmp.dump		kmp.dump
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

spec

Implementations

Partial evaluation

Build instructions

Dependencies

Build

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Tiltedprogrammer/spec

Folders and files

Latest commit

History

Repository files navigation

spec

Implementations

Partial evaluation

Build instructions

Dependencies

Build

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages