Suffix Array

Suffix array is a useful data structure which stores suffix of a string in alphabetical order. (https://en.wikipedia.org/wiki/Suffix_array)

For example:

raw = abaab
Suffix: abaab, baab, aab, ab, b
Sorted suffix: aab, ab, abaab, b, baab

Null character is considered less than any other character.

Suffix array of raw:

[2, 3, 0, 4, 1]

because suffix starting from position 2(aab) is the smallest, then comes suffix starting from position 3(ab), ..., and suffix starting from position 1 (baab) is the largest.

Doubling algorithm

This algorithm builds suffix array in O(nlogn) time complexity and O(CHAR_MAX + n) space complexity, based on sorting segments with length 1, 2, 4, 8, ..., until all segments are different, which is called "doubling algorithm".

(Paper: http://www.cas.mcmaster.ca/~bill/best/algorithms/07Taxonomy.pdf)

Example

Still take abaab as an example:

(Visualized version: https://web.stanford.edu/~mjkay/NewSuffixArrays.pdf)

Iteration 1

We sort all segments with length 1 in 'raw'(i.e. a b a a b), and relabel the original string.

x = 01001  // 'a' => 0, 'b' => 1. two different segments. Still have repeated segments, continue iteration.
sa = 02314  // because letter 'a' appears at 0, 2, 3, then letter 'b' at 1, 4

Here, x[i] stands for the order of raw[i].

Iteration 2 [Step = 1]

We sort all segments with length 2 in raw(i.e. 01 10 00 01 1$ in old_x, since each element in old_x represents exact one character in raw), and relabel it

x = 13012  // '00' => 0, 01 => 1, 1$ => 2, 10 => 3. 4 different segments. still have repeated segments, continue iteration.
sa = 20341  // 00 at position 2, 01 at position 0 and 3, 1$ at position 4, 10 at position 1

Here, x[i] stands for the order of raw[i...i+1]

Iteration 3 [Step = 2]

We sort all segments with length 4 in x(i.e. 10 31 02 1$ 2$ in old_x, since each element in old_x represents 2 characters in raw. If we want x[i] stands for raw[i...i+3], then we need to let the concatenation of old_x[i](which stands for raw[i...i+1]) and old_x[i+2] (which stands for raw[i+2...i+3]) become the key of length-4 segment)

/----------------new_x[0]-----------------\
/-----old_x[0]------\/-----old_x[2]---------\
.........................................................
|raw[0]  |   raw[1]  |   raw[2]  |   raw[3]  |   raw[4] |
.........................................................

x = 24013  // 02 => 0, 1$ => 1, 10 => 2, 2$ => 3, 31 => 4. All different, done!
sa = 23041  // 02 at position 2, 1$ at position 3, 10 at position 0, 2$ at position 4, 31 at position 1

Here, x[i] stands for the order of raw[i...i+3] of all length-4 substrings of raw.

How to sort key pairs

In each iteration, we need sort key pairs (x[i], x[i + step]) (0 <= i < n). The naive method is to use quicksort, thus we need to use O(nlogn) for each iteration, and O(nlognlogn) for the whole algorithm.

A better way is to look into the range of x[i]/raw[i]: the maximum is bounded by both the length of raw (n) [since our goal is to make each elements of x different] and the character set size CHAR_MAX [for raw[i]]. Therefore, we can use bucket sort.

For 2-keyword bucket sort, the basic idea is(just like radix sort):

Sort elements based on the 2nd keyword
Sort elements based on the 1st keyword (must be stable)

Stable bucket sort

How can we perform a stable bucket sort?

// Assume origin is [1, 2, 2, 0, 0]
for (int item : origin)
  bucket[item]++;
// bucket: [2, 1, 2]
for (size_t i=1; i<bucket_size; ++i)
  bucket[i] += bucket[i-1];
// bucket: [2, 3, 5]
for  (int i = origin.size() - 1; i >= 0; --i)
  result[ --bucket[ origin[i] ] ] = i
// result: [3, 4, 0, 1, 2]

Sort array by 2nd keyword, using existing array of previous iteration

So how can we get the 2nd keyword? Remember that the 2nd keyword for position i is x[i+step] (if i + step < N) or $ (if i + step >= N).

Take a look at the result of the 2nd iteration:

x   =   13012

and the key pairs to sort of the 3rd iteration:

  10 31 02 1$ 2$,

2nd keyword:

  0  1  2  $  $

But the sorted array by the 2nd keyword is already there! Sa of iteration 2:

sa  =   23041
new_sa == { N - step, N - step + 1, ..., N - 1, sa[origin1] - step, sa[origin2] - step, ... }

The smallest step elements must be the last step elements, because the second keyword of them is always $. Sort result of 2nd keyword in iteration 3:

new_sa= 3 4 0 1 2.

Explanation: Step = 2, thus 3 4 are the smallest elements. position 0, 1 are impossible for a second keyword: ignore them from sa, the remaining items in sa (2 3 4) is the order of 2nd keywords of position (0 1 2)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Suffix Array

Doubling algorithm

Example

Iteration 1

Iteration 2 [Step = 1]

Iteration 3 [Step = 2]

How to sort key pairs

Stable bucket sort

Sort array by 2nd keyword, using existing array of previous iteration

Files

README.md

Latest commit

History

README.md

File metadata and controls

Suffix Array

Doubling algorithm

Example

Iteration 1

Iteration 2 [Step = 1]

Iteration 3 [Step = 2]

How to sort key pairs

Stable bucket sort

Sort array by 2nd keyword, using existing array of previous iteration