Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: tune async batch iterator #3358

Merged
merged 1 commit into from
Jun 18, 2024
Merged

Conversation

kolesnikovae
Copy link
Collaborator

@kolesnikovae kolesnikovae commented Jun 17, 2024

It's been observed that the async batch iterator we're using for fetching Parquet rows might be using too much memory. Note the query.CloneParquetValues call under iter.(*AsyncBatchIterator[...]).fillBuffer in the flame graph:

image

The problem manifests when the query hits downsampled (aggregated) profiles: a row may contain thousands of values. Another factor is the misalignment of the query split interval and the block duration: each sub-range is processed independently, with its own iterator, thus multiplying the memory requirement.

In practice, a large buffer is not required here, as it is only needed to avoid waiting for fetches from individual columns by reading the data from them ahead of time. In turn, each of the columns has it's own "read ahead" buffer, which should minimaize blocking of the top-level iterator.

One way to solve the problem is to make the iterator work with size in bytes and have a predictable memory footprint. In the PR, I reduce the default buffer size and change the allocation strategy to use the new slices.Grow function.

@kolesnikovae kolesnikovae requested a review from a team as a code owner June 17, 2024 08:50
@kolesnikovae kolesnikovae force-pushed the perf/tune-async-batch-iterator branch from 5a279f5 to a5d3cdc Compare June 17, 2024 08:56
@kolesnikovae kolesnikovae merged commit 96c3860 into main Jun 18, 2024
16 checks passed
@kolesnikovae kolesnikovae deleted the perf/tune-async-batch-iterator branch June 18, 2024 03:35
@kolesnikovae
Copy link
Collaborator Author

The change has helped to reduce the pressure, however, I still think that the memory usage is too wasteful. I'm thinking about removing the buffer altogether – my experiments have shown no significant impact on the performance

image

There are more spots that need to be optmimized:

  1. The map that accumulates samples. We could replace it with a slice and use direct indexing.
  2. The intermediate tree (before truncation). This one is tricky: we need to find a way to trim stack traces before building the tree.
  3. In-memory symdb partitions. We should disable chunking for stack traces.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants