Skip to content

parallel suffix Array implementation: parallel DC3 & ParallelRange

Notifications You must be signed in to change notification settings

malabz/parallel-suffixArray

 
 

Repository files navigation

Information

**This project is not complete, beacuse of the lack support in modern parallel modules like C++11. Can be enhanced. **

The project used OpenMP for paralleling.

Now the program are available, called parallelKS and parallelRange-2, but failed in parallelRange in large seqenences.

The SA source codes are from https://github.com/snnynhr/ParallelSuffixArrays and https://github.com/jlabeit/parallel-range-lite.

Another useful parallel radix sort is here(ref, ppt).

Usage

KS

Calling functions

This module provides two functions: suffixArray and LCP:

uintT* suffixArray(unsigned char* s, uintT n);
uintT* LCP(unsigned char* s, uintT n);

please include <SA.h> for using this module.

Ubuntu and MAC

cd parallelKS
make

...and link libks.so in your program.

Windows

Firstly, please install Visual Studio 2022, and import vs_config/llvm.vsconfig in Visual Studio Installer. It will install LLVM for compiling the code.

Then, import KS_DLL.vcxproj in your project, it will generate a DLL file libKS.dll.

ParallelRange-2

Calling functions

This module provides only one function: parallelrangelite:

// 32 bit version.
void parallelrangelite(int32_t* ISA, int32_t* SA, int32_t n);
// 64 bit version.
void parallelrangelite(int64_t* ISA, int64_t* SA, int64_t n);

please include <parallel-range.h> for using this module.

Ubuntu and MAC

cd parallelRange_2
make

...and link libprange.so in your program.

Windows

Firstly, please install Visual Studio 2022, and import vs_config/llvm.vsconfig in Visual Studio Installer. It will install LLVM for compiling the code.

Then, import parallelRange_2/libprange.dll.vcxproj in your project, it will generate a DLL file libprange.dll.

RAW parallel-suffixArray README

The directory structure is designed so each benchmark can be used on
its own without any of the rest of the benchmarks.  Since there is a
lot of sharing of code between benchmarks, the main copy of the code
is kept in "common" directories.  This code is then either copied or
linked into the specific benchmark when needed.  This should all be
done by the Makefiles.  Note the infrastructure was all developed
under linux, it will require some more work to make it work under
windows.

Whether you have downloaded the whole suite, or just one benchmark,
you should be able to make and run an implementation of a benchmark by
going to the directory, running make, and then running "./testInputs".
For example to run sampleSort:
  cd comparisonSort/sampleSort
  make
  ./testInputs
To compile with Cilk++ the environment variable CILK needs to be set,
and to compile with Cilk+ the environment variable MKLROOT needs to be set.
Otherwise it will just compile with g++ and run sequentially.  The
code seems to run faster with CILK++.

Below is an outline of how the directories are structured if you
have ALL the benchmarks.

If you have just downloaded one of the benchmarks then you won't have
the TOPLEVEL directory but instead will have one of the <benchmark>
directories.   All common files  should already be copied into 
this benchmark directory.

***********************************
TOPLEVEL
***********************************

common/

This subdirectory includes code and other files that are common across
multiple benchmarks.  The code in this directory is linked into each
of the benchmarks in which it is used (or copied when making a
standalone benchmark).

./runall

Runs all the benchmarks

./clean

Removes all temporary or generated files from all subdirectories.
This can significantly reduce diskspace since the generated data files
can be large.

***********************************
TEST DATA
***********************************

testData/

This directory includes all the data and data generation code used to
generate inputs for the benchmarks.
  
testData/<datatype>

Includes generators for the particular type of data.  Currently this
includes sequenceData, graphData, and geometryData.

testData/<datatype>/data

Includes the actual data files generated by the generators or included
as part of benchmark suite.  Run "Make clean" to remove all the
temporary files.   These can take a lot of space.

testData/<datatype>/data/Makefile
  Includes rules for making various instances of this data type

***********************************
BENCHMARKS
***********************************

<benchmark>/

These directories contain all the files for the benchmarks.  Some of
these files are linked from the COMMON or TEST DATA areas.  The idea
is that including the linked files (copying them instead) these
directories are supposed to be standalone (i.e. the directory can be
distributed and all files compiled on its own).

<benchmark>/<datatype>

This is a link to testdata/<datatype>.  The datatype will depend on
the benchmark (e.g. graphs, sequences or geometric data).  So far no
benchmark includes more than one datatype, but there is no reason not
to.

<benchmark>/common

Code and other files that are common across implementations of the
benchmark, e.g. the check code.

<benchmark>/common/<bnchmrk>Check.C

Code for checking the correctness of the output for the benchmark.
<bnchmrk> is typically an abbreviation for <benchmark>, e.g. "isort" is short 
for "integerSort", and hence the file integerSort/common/isortCheck.C.
Running "make" will typically make the check file.
It is then used in the form 
"<bnchmrk>Check <inputFile> <outputFile>".

<benchmark>/common/testInputs

A script that runs the benchmark on all the inputs.  This file
includes the list of input files that are part of this benchmark.
It is typically copied over to the directory for each implementation.

<benchmark>/common/<bnchmrk>Time.C

This is a driver for running the benchmark.  This can be used if the
benchmark implementation code is written in C or can be linked with C.
Otherwise the benchmark implementation might require its own driver.

***********************************
BENCHMARK IMPLEMENTATIONS
***********************************

<benchmark>/<implementation>/

These directories contain a particular implementation of the benchmark,
for example "comparisonSort/sampleSort/".

<benchmark>/<implementation>/Makefile

Running "make" should make the benchmark code and generate a file
called <bnchmrk>.  This includes linking in files from "common" and
"benchmark/<common>" if needed.

<benchmark>/<implementation>/testInputs

This file might not be in the directory before the "make", but should
be copied over by the make.  It is used to run the benchmark on all
the test inputs and check the results for correctness.  The
benchmark can also be run on its own on a single input without testing by using
<benchmark>/<implementation>/<bnchmrk> <infile>.  Here <bnchmrk> is
the same abbreviation as used in the common directory (e.g. "isort").
For example in the directory "comparisonSort/sampleSort", run:
  ./sort ../sequenceData/data/randomSeq_10000000_double
which will just print out the runtime, and perhaps some statistics,
Using the -o option:
  ./sort -o <fname> ../sequenceData/data/randomSeq_10000000_double
will output the result to the file <fname>, which can then be tested
with the check program:
  ../common/<bnchmrk>Check ../sequenceData/data/randomSeq_10000000_double <fname>
Note, however, that the input file might not exist, in which case
go to the "data" directory and:
   make randomSeq_10000000_double

***********************************
ADDING AN IMPLEMENTATION
***********************************

Within the <benchmark> directory create a new directory for the
implementation and copy over the "testInputs" file from "common".

If your code is linkable with C++ then you should use the benchmark
driver.  To do this you need to implement a function with the
interface given in "common/<bnchmrk>.h".  You then need to copy
"common/<bnchmrk>Time.C" and any files it needs to your implementation
directory.  The files it needs should be listed in
"common/timeRequiredFiles".  Then you can compile
"<bnchmrk>Time.o" and link it with your implementation.
You should now be able to run:
  ./<bnchmrk> -r <nrounds> -o <outfile> <infile>
or "./testInputs".

If your code is NOT linkable with C++ then you need to create a
standalone executable named <bnchmrk> (the short name for the
benchmark).  This needs to run with the command line 
  <bnchmrk> -r <nrounds> -o <outfile> <infile>
where the <outfile> and <infile> are in the appropriate format.  The
-r <nrounds> arugment specifies the number of rounds and running the
program.  The program must output to stdout <nrounds> lines each of 
the form:
  PBBS-time: <time>
giving the time in seconds for one execution of the program on the
input.  Other output can also be printed as long as it does not start
with "PBBS-time".

RAW libprange README

Description

Parallel-Range-Lite is a parallel suffix array construction algorithm for integer alphabets. It is a lightweight implementation based on the range algorithm from the Problem Based Benchmark Suite PBBS. A rough description of the algorithm can be found in the following work.

Julian Labeit, Julian Shun, and Guy E. Blelloch. Parallel Lightweight Wavelet Tree, Suffix Array and FM-Index Construction. DCC 2015.

Installation

The following steps have been tested on Ubuntu 14.04 with gcc 5.3.0 and cmake 2.8.12.

git clone https://github.com/jlabeit/parallel-range-lite.git
cd parallel-range-lite
mkdir build
cd build
cmake ..
make
make install

Note that in the default version the cilkplus implementation by gcc is used for parallelization. To change this setting edit parallelization settings in the CMakeLists.txt file.

Getting Started

An example application can be found in demo/main.cpp. The library provides two basic functions to build the suffix array over a text.

// 32 bit version.
void parallelrangelite(int32_t* ISA, int32_t* SA, int32_t n);
// 64 bit version.
void parallelrangelite(int64_t* ISA, int64_t* SA, int64_t n);

To use the library include the header parallel-range.h, link against the library libprange.

About

parallel suffix Array implementation: parallel DC3 & ParallelRange

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 78.5%
  • Python 11.6%
  • C 6.2%
  • HTML 2.3%
  • Makefile 1.4%