The centroid of a tree is a node that, when removed, breaks the tree in connected components of size at most half of that of the original tree. By recursing this procedure on the components, one obtains the centroid decomposition of the tree, also known as centroid tree. The centroid tree has logarithmic height and its construction is a powerful pre-processing step in several tree-processing algorithms. The folklore recursive algorithm for computing the centroid tree runs in O(n*log(n))
time. To the best of our knowledge, the only result claiming O(n)
time is unpublished and relies on (dynamic) heavy path decomposition of the original tree. In this repository, we include our implementation for the new linear-time algorithm that we proposed, based on the idea of applying the folklore algorithm to a suitable decomposition of the original tree.
The main scripts are cdstd.cpp
and cdlin.cpp
, which are respectively the standard O(n*log(n))
and our new O(n)
algorithms for computing the centroid decomposition of a tree. We also provide the benchmark.cpp
script that we used to measure the performance of our algorithm over the standard one, and some random tree generators, included in the ./tree_gen/
folder. To compile everything, just run make all
.
Both T
and T''
are internally represented with a vector<uint32_t>
: each node of the two trees occupies a certain portion of this vector.
In T
, the i
-th node occupies the region T[i], ..., T[j]
:
T[i]
is the number ofi
's children, packed with a bitflag (MSB) saying if it is a cover element;T[i+1]
isi
's parent;T[i+2], ..., T[j]
arei
's children. Each one is represented by two integersT[h], T[h+1]
:T[h]
is the child's ID;T[h+1]
is the weight of the subtree rooted atT[h]
.
In T''
, the i
-th node occupies the region T[i], ..., T[k]
:
T''[i]
is the number ofi
's children;T''[i+1]
isi
's parent;T''[i+2]
is the size of the corresponding treelet onT
;T''[i+3]
is the ID of the root of the corresponding treelet onT
(i.e. the alpha function);T''[i+4], ..., T''[k]
arei
's children. Each one is represented by three integersT''[h], ..., T''[h+2]
:T''[h]
is the child's ID;T''[h+1]
andT''[h+2]
are the two deltas on the edge connectingi
toT[h]
.
All datasets have been generated using the code in folder "tree_gen".
Nicola Prezza has been supported by the project Italian MIUR-SIR CMACBioSeq ("Combinatorial methods for analysis and compression of biological sequences") grant n.~RBSI146R5L, PI: Giovanna Rosone. Link: http://pages.di.unipi.it/rosone/CMACBioSeq.html