
perf: optimize tree processing #3386

Merged (9 commits into main, Jun 26, 2024)

Conversation

kolesnikovae (Collaborator) commented Jun 24, 2024

The PR optimizes the way samples are aggregated and symbolized:

  • For large datasets exceeding 16K samples, implement a sparse set to accumulate samples. Due to the "dense" nature of stack trace identifiers, memory usage remains efficient. Additionally, this approach naturally orders the samples by their stack trace identifiers.
  • Instead of building the resulting tree from scratch for large datasets, utilize the stored parent pointer tree as a prototype. This significantly reduces the number of memory accesses, improving overall efficiency. The downside of this approach is that truncation may become more "invasive" at the leaves: the source tree is built from locations, not functions. Therefore, the "other" stub may include siblings' samples. In practice, the impact is moderate and can be mitigated by increasing the max nodes limit (while still being more efficient: compare 8K and 64K benchmarks). A proper solution to this problem is to store function stack traces in addition to the tree of locations.
  • Specialize the min-heap implementation for int64 to avoid the allocations caused by interface conversion.
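To illustrate the first point, here is a minimal sketch of ID-indexed sample accumulation. It relies on the property noted above, that stack trace identifiers are "dense", so a plain slice indexed by ID stays compact and iteration naturally yields samples in ascending ID order. All names (`sampleSet`, `add`, `each`) are hypothetical, not the actual implementation:

```go
package main

import "fmt"

// sampleSet accumulates sample values keyed by stack trace ID.
// Because IDs are dense, indexing the slice directly by ID keeps
// memory usage bounded, and iteration visits IDs in sorted order
// for free. Illustrative sketch only.
type sampleSet struct {
	values []int64 // values[id] holds the running total for that ID
}

func (s *sampleSet) add(id uint32, v int64) {
	if int(id) >= len(s.values) {
		grown := make([]int64, id+1)
		copy(grown, s.values)
		s.values = grown
	}
	s.values[id] += v
}

// each visits non-zero entries in ascending ID order.
func (s *sampleSet) each(fn func(id uint32, v int64)) {
	for id, v := range s.values {
		if v != 0 {
			fn(uint32(id), v)
		}
	}
}

func main() {
	var s sampleSet
	s.add(7, 10)
	s.add(3, 5)
	s.add(7, 1) // accumulates into the existing entry for ID 7
	s.each(func(id uint32, v int64) {
		fmt.Println(id, v) // prints 3 5, then 7 11
	})
}
```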

Benchmark of the SelectMergeByStacktraces query on the real data set that causes performance issues in the current implementation (~1K profiles, ~10K samples each):

    before    │                after                │
    sec/op    │   sec/op     vs base                │
 3599.9m ± 2%   661.8m ± 1%  -81.62% (p=0.000 n=10)

    before     │                after                │
     B/op      │     B/op      vs base               │
 1054.5Mi ± 1%   999.9Mi ± 3%  -5.18% (p=0.000 n=10)

   before    │                after                │
  allocs/op  │  allocs/op   vs base                │
 494.4k ± 0%   421.2k ± 0%  -14.80% (p=0.000 n=10)

Most of the allocations are made in parquet decoding and reconstruction of the symbolic information (locations, functions, mappings, and strings). This is addressed in #3138.

This is quite close to the synthetic benchmarks we have:

                                  │    before    │                after                │
                                  │    sec/op    │   sec/op     vs base                │
_Resolver_ResolveTree_Small/0-10     261.4µ ± 0%   259.2µ ± 1%   -0.85% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10    303.9µ ± 1%   258.7µ ± 1%  -14.86% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10    275.4µ ± 1%   272.2µ ± 1%   -1.17% (p=0.000 n=10)
_Resolver_ResolveTree_Big/0-10       479.4m ± 4%   457.3m ± 2%   -4.61% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     362.16m ± 2%   47.96m ± 1%  -86.76% (p=0.000 n=10)
_Resolver_ResolveTree_Big/16K-10    368.45m ± 1%   68.41m ± 6%  -81.43% (p=0.000 n=10)
_Resolver_ResolveTree_Big/32K-10     372.8m ± 3%   101.6m ± 3%  -72.76% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10     378.9m ± 7%   162.0m ± 4%  -57.26% (p=0.000 n=10)

                                  │    before     │                after                 │
                                  │     B/op      │     B/op      vs base                │
_Resolver_ResolveTree_Small/0-10    148.4Ki ±  0%   148.1Ki ± 0%   -0.19% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10   152.1Ki ±  0%   142.6Ki ± 0%   -6.25% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10   231.9Ki ±  0%   231.6Ki ± 0%   -0.15% (p=0.001 n=10)
_Resolver_ResolveTree_Big/0-10      198.0Mi ± 40%   202.7Mi ± 0%   +2.39% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     90.04Mi ± 88%   46.86Mi ± 0%        ~ (p=0.134 n=10)
_Resolver_ResolveTree_Big/16K-10    90.77Mi ± 87%   55.95Mi ± 0%        ~ (p=0.136 n=10)
_Resolver_ResolveTree_Big/32K-10    92.41Mi ±  0%   70.96Mi ± 0%  -23.21% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10    95.62Mi ±  0%   88.80Mi ± 0%   -7.13% (p=0.001 n=10)

                                  │   before    │                after                │
                                  │  allocs/op  │  allocs/op   vs base                │
_Resolver_ResolveTree_Small/0-10    2.400k ± 0%   2.403k ± 0%   +0.12% (p=0.000 n=10)
_Resolver_ResolveTree_Small/1K-10   3.224k ± 0%   2.036k ± 0%  -36.85% (p=0.000 n=10)
_Resolver_ResolveTree_Small/8K-10   2.400k ± 0%   2.403k ± 0%   +0.12% (p=0.000 n=10)
_Resolver_ResolveTree_Big/0-10      3.047M ± 0%   3.042M ± 0%   -0.15% (p=0.000 n=10)
_Resolver_ResolveTree_Big/8K-10     28.85k ± 0%   19.70k ± 0%  -31.72% (p=0.000 n=10)
_Resolver_ResolveTree_Big/16K-10    47.91k ± 0%   39.01k ± 0%  -18.58% (p=0.000 n=10)
_Resolver_ResolveTree_Big/32K-10    87.01k ± 0%   76.53k ± 0%  -12.05% (p=0.000 n=10)
_Resolver_ResolveTree_Big/64K-10    164.9k ± 0%   152.7k ± 0%   -7.42% (p=0.000 n=10)

Note that the optimizations mostly target large data sets (significantly bigger than our ResolveTree_Big). Some of them are not applied when the max nodes limit is unspecified (it defaults to 16K) or is very large, because the increased memory consumption would make them inefficient there. This is why, for example, Resolver_ResolveTree_Big/0 (no truncation) does not perform significantly better.
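The specialized heap mentioned in the third bullet can be sketched as a plain `[]int64` min-heap with inlined sift operations, so values never pass through `heap.Interface` and the `interface{}` boxing that `container/heap` requires. Type and method names are illustrative, not the actual implementation:

```go
package main

import "fmt"

// minHeap is an int64 min-heap with no heap.Interface indirection,
// avoiding per-element interface allocations. Illustrative sketch.
type minHeap []int64

func (h *minHeap) push(v int64) {
	*h = append(*h, v)
	// Sift the new element up until the heap property holds.
	i := len(*h) - 1
	for i > 0 {
		p := (i - 1) / 2
		if (*h)[p] <= (*h)[i] {
			break
		}
		(*h)[p], (*h)[i] = (*h)[i], (*h)[p]
		i = p
	}
}

func (h *minHeap) pop() int64 {
	old := *h
	n := len(old) - 1
	v := old[0]
	old[0] = old[n]
	*h = old[:n]
	// Sift the relocated element down.
	for i := 0; ; {
		s := i
		if l := 2*i + 1; l < n && old[l] < old[s] {
			s = l
		}
		if r := 2*i + 2; r < n && old[r] < old[s] {
			s = r
		}
		if s == i {
			break
		}
		old[i], old[s] = old[s], old[i]
		i = s
	}
	return v
}

func main() {
	// Truncation-style usage: keep the top K node values of a stream.
	// A value replaces the heap minimum only when it is larger.
	const k = 2
	var h minHeap
	for _, v := range []int64{5, 1, 9, 3, 7} {
		if len(h) < k {
			h.push(v)
		} else if v > h[0] {
			h.pop()
			h.push(v)
		}
	}
	fmt.Println(h[0]) // smallest kept value, i.e. the truncation threshold: 7
}
```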

@kolesnikovae kolesnikovae requested a review from a team as a code owner June 24, 2024 13:48
@kolesnikovae kolesnikovae marked this pull request as draft June 24, 2024 13:48
@kolesnikovae kolesnikovae force-pushed the perf/optimize-tree-processing branch from 34dfc61 to 0458bf1 Compare June 25, 2024 08:50
kolesnikovae (Collaborator, Author) commented Jun 25, 2024

The original fixture has random sample values, which leads to integer overflow

@kolesnikovae kolesnikovae changed the title WIP: perf: optimize tree processing perf: optimize tree processing Jun 25, 2024
@kolesnikovae kolesnikovae added the performance If there's anything we have to be really good at it's this label Jun 25, 2024
@kolesnikovae kolesnikovae marked this pull request as ready for review June 25, 2024 12:02
@kolesnikovae kolesnikovae force-pushed the perf/optimize-tree-processing branch from 76086bf to 07b8e32 Compare June 25, 2024 14:59
petethepig (Member) left a comment
Looks awesome. Let's try it

@kolesnikovae kolesnikovae merged commit b62930e into main Jun 26, 2024
16 checks passed
@kolesnikovae kolesnikovae deleted the perf/optimize-tree-processing branch June 26, 2024 02:29