
release-20.1: row: fix intra-query memory leaks in kvFetcher and txnKVFetcher #66171

Merged

Commits on Jun 8, 2021

  1. row: fix intra-query memory leak in kvfetcher

    The KVFetcher is the piece of code that does the first level of decoding
    of a KV batch response, doling out slices of keys and values to higher
    level code that further decodes the key values into formats that the SQL
    engine can operate on.
    
    The KVFetcher uses a slice into the batch response to keep track of
    where it is during the decoding process. Once the slice is empty, it's
    finished until someone asks it for a new batch.
    
    However, the KVFetcher used to keep that empty slice pointer around for
    its lifetime, or until it was asked for a new batch. This kept the
    batch response un-garbage-collectable, since a slice still pointed at
    it, even though the slice was empty.
    
    This caused queries to use up to 2x their accounted-for batch memory,
    since the memory accounting system assumes that once data is transferred
    out of the batch response into the SQL representation, the batch
    response is freed - i.e., that there is just 1 "copy" of this batch
    response memory.
    
    This is especially problematic for long queries (since they will not
    allow that KVFetcher memory to be freed until they're finished).
    
    In effect, this caused 1 extra batch per KVFetcher per query to be
    retained in memory. That doesn't sound too bad, since a batch is of
    fixed size. But the max batch size is 1 megabyte, so with 1024
    concurrent queries, each with 3 KVFetchers in its plan (as in a TPCH
    workload running 1024 concurrent instances of query 18), that's
    1024 queries * 3 KVFetchers * 1MB = 3GB of unaccounted-for memory.
    This is easily enough to push a node over its limit and cause it to
    OOM.
    
    This patch nils out the batch response pointer once the KVFetcher has
    finished decoding it, which allows the response to be garbage-collected
    as soon as possible. In practice, this seems to allow at least a
    single-node, concurrency-1024 TPCH query 18 workload to survive
    indefinitely (all queries return out-of-budget errors) without OOMing.
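    The retention pattern and the fix can be sketched in a few lines of Go. This is a hypothetical, simplified model of the pattern the commit describes, not CockroachDB's actual KVFetcher API: a fetcher advances a sub-slice cursor through a batch response, and the fix is to nil the slice once it is exhausted so the backing array stops being reachable.

    ```go
    package main

    import "fmt"

    // batchFetcher is an illustrative stand-in for the KVFetcher pattern:
    // batchResponse is both the raw batch and the cursor into it, since
    // re-slicing from the front consumes it.
    type batchFetcher struct {
    	batchResponse []byte
    }

    // nextKV pops one (here, one-byte) "kv" off the front of the batch.
    // A zero-length slice still pins its entire backing array, so when the
    // batch is exhausted we nil the field instead of leaving the empty
    // slice in place; the whole batch response then becomes collectable
    // immediately rather than when the next batch overwrites the field.
    func (f *batchFetcher) nextKV() (kv []byte, ok bool) {
    	if len(f.batchResponse) == 0 {
    		f.batchResponse = nil // the fix: drop the reference eagerly
    		return nil, false
    	}
    	kv, f.batchResponse = f.batchResponse[:1], f.batchResponse[1:]
    	return kv, true
    }

    func main() {
    	f := &batchFetcher{batchResponse: make([]byte, 3)}
    	for {
    		if _, ok := f.nextKV(); !ok {
    			break
    		}
    	}
    	// After exhaustion the field is nil, not a zero-length slice that
    	// would keep the backing array alive.
    	fmt.Println(f.batchResponse == nil)
    }
    ```

    The subtlety is that `len(s) == 0` does not imply `s` holds no memory: an empty slice produced by re-slicing still references the original backing array, which is exactly why the 1MB batch stayed resident.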
    
    Release note (bug fix): queries use up to 1MB less actual system memory
    per scan, lookup join, index join, zigzag join, or inverted join in
    their query plans. This will result in improved memory performance for
    workloads with concurrent OLAP-style queries.
    jordanlewis committed Jun 8, 2021
    ba3a6f4
  2. row: release memory more eagerly in txnKVFetcher

    Previously, the txnKVFetcher could leave dangling references to batch
    responses around when it was fetching more than one batch at a time.
    This delayed reclamation of that memory for the lifetime of a given
    query.
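    A minimal sketch of the eager-release idea, with illustrative names rather than CockroachDB's real txnKVFetcher types: when a fetcher prefetches several batch responses and hands them out one at a time, re-slicing the queue alone still pins the consumed entry in the backing array, so the consumed slot is nilled out as well.

    ```go
    package main

    import "fmt"

    // txnFetcher is a hypothetical fetcher holding prefetched batch
    // responses to be handed out one at a time.
    type txnFetcher struct {
    	responses [][]byte
    }

    // popBatch returns the next prefetched batch. Note that
    // f.responses = f.responses[1:] alone would leave the popped pointer
    // in the backing array, a dangling reference that keeps the batch
    // alive; nilling the slot first releases it eagerly, so the batch can
    // be collected as soon as the caller drops it.
    func (f *txnFetcher) popBatch() []byte {
    	if len(f.responses) == 0 {
    		return nil
    	}
    	b := f.responses[0]
    	f.responses[0] = nil // release the fetcher's reference eagerly
    	f.responses = f.responses[1:]
    	return b
    }

    func main() {
    	f := &txnFetcher{responses: [][]byte{make([]byte, 4), make([]byte, 4)}}
    	first := f.popBatch()
    	fmt.Println(len(first), len(f.responses))
    }
    ```

    The same rule of thumb applies as in the first commit: a reference that is merely "behind the cursor" is invisible to the memory accounting system but fully visible to the garbage collector.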
    
    Release note (bug fix): use less memory in some queries, primarily
    lookup joins.
    jordanlewis committed Jun 8, 2021
    3f6d0cc