
perf: optimize tree processing #3386

Merged (9 commits into main, Jun 26, 2024)

Conversation

kolesnikovae (Collaborator) commented Jun 24, 2024

The PR optimizes the way samples are aggregated and symbolized:

  • For large datasets exceeding 16K samples, implement a sparse set to accumulate samples. Due to the "dense" nature of stack trace identifiers, memory usage remains efficient. Additionally, this approach naturally orders the samples by their stack trace identifiers.
  • Instead of building the resulting tree from scratch for large datasets, utilize the stored parent pointer tree as a prototype. This significantly reduces the number of memory accesses, improving overall efficiency. The downside of this approach is that truncation may become more "invasive" at the leaves: the source tree is built from locations, not functions. Therefore, the "other" stub may include siblings' samples. In practice, the impact is moderate and can be mitigated by increasing the max nodes limit (while still being more efficient: compare 8K and 64K benchmarks). A proper solution to this problem is to store function stack traces in addition to the tree of locations.
  • Specialize the min-heap implementation for int64 to avoid the allocations caused by interface conversion.
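To illustrate the first point, here is a minimal sketch of ID-indexed sample accumulation. It relies on the property noted above, that stack trace identifiers are "dense", so a plain slice indexed by ID stays compact and iteration naturally yields samples in ascending ID order. All names (`sampleSet`, `add`, `each`) are hypothetical, not the actual implementation:

```go
package main

import "fmt"

// sampleSet accumulates sample values keyed by stack trace ID.
// Because IDs are dense, indexing the slice directly by ID keeps
// memory usage bounded, and iteration visits IDs in sorted order
// for free. Illustrative sketch only.
type sampleSet struct {
	values []int64 // values[id] holds the running total for that ID
}

func (s *sampleSet) add(id uint32, v int64) {
	if int(id) >= len(s.values) {
		grown := make([]int64, id+1)
		copy(grown, s.values)
		s.values = grown
	}
	s.values[id] += v
}

// each visits non-zero entries in ascending ID order.
func (s *sampleSet) each(fn func(id uint32, v int64)) {
	for id, v := range s.values {
		if v != 0 {
			fn(uint32(id), v)
		}
	}
}

func main() {
	var s sampleSet
	s.add(7, 10)
	s.add(3, 5)
	s.add(7, 1) // accumulates into the existing entry for ID 7
	s.each(func(id uint32, v int64) {
		fmt.Println(id, v) // prints 3 5, then 7 11
	})
}
```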

Benchmark of the SelectMergeByStacktraces query on the real data set that causes performance issues in the current implementation (~1K profiles, ~10K samples each):

    before    │                after                │
    sec/op    │   sec/op     vs base                │
 3599.9m ± 2%   661.8m ± 1%  -81.62% (p=0.000 n=10)

    before     │                after                │
     B/op      │     B/op      vs base               │
 1054.5Mi ± 1%   999.9Mi ± 3%  -5.18% (p=0.000 n=10)

   before    │                after                │
  allocs/op  │  allocs/op   vs base                │
 494.4k ± 0%   421.2k ± 0%  -14.80% (p=0.000 n=10)

Most of the allocations are made in parquet decoding and reconstruction of the symbolic information (locations, functions, mappings, and strings). This is addressed in #3138.

This is quite close to the synthetic benchmarks we have:

                                  │    before    │                after                │
                                  │    sec/op    │   sec/op     vs base                │
_Resolver_ResolveTree_Small/0-10     261.4µ ± 0%   259.2µ ± 1%   -0.85% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10    303.9µ ± 1%   258.7µ ± 1%  -14.86% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10    275.4µ ± 1%   272.2µ ± 1%   -1.17% (p=0.000 n=10)
_Resolver_ResolveTree_Big/0-10       479.4m ± 4%   457.3m ± 2%   -4.61% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     362.16m ± 2%   47.96m ± 1%  -86.76% (p=0.000 n=10)
_Resolver_ResolveTree_Big/16K-10    368.45m ± 1%   68.41m ± 6%  -81.43% (p=0.000 n=10)
_Resolver_ResolveTree_Big/32K-10     372.8m ± 3%   101.6m ± 3%  -72.76% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10     378.9m ± 7%   162.0m ± 4%  -57.26% (p=0.000 n=10)

                                  │    before     │                after                 │
                                  │     B/op      │     B/op      vs base                │
_Resolver_ResolveTree_Small/0-10    148.4Ki ±  0%   148.1Ki ± 0%   -0.19% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10   152.1Ki ±  0%   142.6Ki ± 0%   -6.25% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10   231.9Ki ±  0%   231.6Ki ± 0%   -0.15% (p=0.001 n=10)
_Resolver_ResolveTree_Big/0-10      198.0Mi ± 40%   202.7Mi ± 0%   +2.39% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     90.04Mi ± 88%   46.86Mi ± 0%        ~ (p=0.134 n=10)
_Resolver_ResolveTree_Big/16K-10    90.77Mi ± 87%   55.95Mi ± 0%        ~ (p=0.136 n=10)
_Resolver_ResolveTree_Big/32K-10    92.41Mi ±  0%   70.96Mi ± 0%  -23.21% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10    95.62Mi ±  0%   88.80Mi ± 0%   -7.13% (p=0.001 n=10)

                                  │   before    │                after                │
                                  │  allocs/op  │  allocs/op   vs base                │
_Resolver_ResolveTree_Small/0-10    2.400k ± 0%   2.403k ± 0%   +0.12% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10   3.224k ± 0%   2.036k ± 0%  -36.85% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10   2.400k ± 0%   2.403k ± 0%   +0.12% (p=0.000 n=10)
_Resolver_ResolveTree_Big/0-10      3.047M ± 0%   3.042M ± 0%   -0.15% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     28.85k ± 0%   19.70k ± 0%  -31.72% (p=0.000 n=10)
_Resolver_ResolveTree_Big/16K-10    47.91k ± 0%   39.01k ± 0%  -18.58% (p=0.000 n=10)
_Resolver_ResolveTree_Big/32K-10    87.01k ± 0%   76.53k ± 0%  -12.05% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10    164.9k ± 0%   152.7k ± 0%   -7.42% (p=0.000 n=10)

Note that the optimizations mostly target large data sets (significantly bigger than our ResolveTree_Big). Some of them are not applied when the max nodes limit is unspecified (it defaults to 16K) or is very large, because the increased memory consumption would make them inefficient there. This is why, for example, Resolver_ResolveTree_Big/0 (no truncation) does not perform significantly better.
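The specialized heap mentioned in the third bullet can be sketched as a plain `[]int64` min-heap with inlined sift operations, so values never pass through `heap.Interface` and the `interface{}` boxing that `container/heap` requires. Type and method names are illustrative, not the actual implementation:

```go
package main

import "fmt"

// minHeap is an int64 min-heap with no heap.Interface indirection,
// avoiding per-element interface allocations. Illustrative sketch.
type minHeap []int64

func (h *minHeap) push(v int64) {
	*h = append(*h, v)
	// Sift the new element up until the heap property holds.
	i := len(*h) - 1
	for i > 0 {
		p := (i - 1) / 2
		if (*h)[p] <= (*h)[i] {
			break
		}
		(*h)[p], (*h)[i] = (*h)[i], (*h)[p]
		i = p
	}
}

func (h *minHeap) pop() int64 {
	old := *h
	n := len(old) - 1
	v := old[0]
	old[0] = old[n]
	*h = old[:n]
	// Sift the relocated element down.
	for i := 0; ; {
		s := i
		if l := 2*i + 1; l < n && old[l] < old[s] {
			s = l
		}
		if r := 2*i + 2; r < n && old[r] < old[s] {
			s = r
		}
		if s == i {
			break
		}
		old[i], old[s] = old[s], old[i]
		i = s
	}
	return v
}

func main() {
	// Truncation-style usage: keep the top K node values of a stream.
	// A value replaces the heap minimum only when it is larger.
	const k = 2
	var h minHeap
	for _, v := range []int64{5, 1, 9, 3, 7} {
		if len(h) < k {
			h.push(v)
		} else if v > h[0] {
			h.pop()
			h.push(v)
		}
	}
	fmt.Println(h[0]) // smallest kept value, i.e. the truncation threshold: 7
}
```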

@kolesnikovae kolesnikovae requested a review from a team as a code owner June 24, 2024 13:48
@kolesnikovae kolesnikovae marked this pull request as draft June 24, 2024 13:48
@kolesnikovae kolesnikovae force-pushed the perf/optimize-tree-processing branch from 34dfc61 to 0458bf1 Compare June 25, 2024 08:50
kolesnikovae (Collaborator, Author) commented Jun 25, 2024

The original fixture has random sample values, which leads to integer overflow

@kolesnikovae kolesnikovae changed the title WIP: perf: optimize tree processing perf: optimize tree processing Jun 25, 2024
@kolesnikovae kolesnikovae added the performance If there's anything we have to be really good at it's this label Jun 25, 2024
@kolesnikovae kolesnikovae marked this pull request as ready for review June 25, 2024 12:02
@kolesnikovae kolesnikovae force-pushed the perf/optimize-tree-processing branch from 76086bf to 07b8e32 Compare June 25, 2024 14:59
petethepig (Member) left a comment
Looks awesome. Let's try it

@kolesnikovae kolesnikovae merged commit b62930e into main Jun 26, 2024
16 checks passed
@kolesnikovae kolesnikovae deleted the perf/optimize-tree-processing branch June 26, 2024 02:29