GeneralPurposeAllocator: Considerably improve worst case performance #17383

Merged (3 commits) on Oct 3, 2023

Commits on Oct 3, 2023

  1. Treap: Add InorderIterator

    squeek502 committed Oct 3, 2023
    da7ecfb
  2. GeneralPurposeAllocator: Considerably improve worst case performance

    Before this commit, GeneralPurposeAllocator could run into incredibly degraded performance in scenarios where the bucket count for a particular size class grew to be large. For example, if exactly `slot_count` allocations of a single size class were performed and then all of them were freed except one, then the bucket for those allocations would have to be kept around indefinitely. If that pattern of allocation were done over and over, then the bucket list for that size class could grow incredibly large.
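
The growth described above can be reproduced with a toy model. This is only a sketch, not the allocator's code: `SLOT_COUNT = 32`, the `SizeClass` and `Bucket` classes, and `buckets_after` are hypothetical names for illustration. It assumes, as the commit message states, that slots within a bucket are never re-used and that a bucket can only be released once it holds no live allocations.

```python
SLOT_COUNT = 32  # hypothetical slots per bucket (the real value depends on the size class)


class Bucket:
    def __init__(self):
        self.alloc_cursor = 0  # index of the next never-used slot; only moves forward
        self.used_count = 0    # number of live allocations in this bucket


class SizeClass:
    def __init__(self):
        self.buckets = []    # every bucket that still holds a live allocation
        self.current = None  # the only bucket that can still hand out slots

    def alloc(self):
        # A full (or missing) current bucket forces a brand-new bucket.
        if self.current is None or self.current.alloc_cursor == SLOT_COUNT:
            self.current = Bucket()
            self.buckets.append(self.current)
        bucket = self.current
        bucket.alloc_cursor += 1
        bucket.used_count += 1
        return bucket  # stands in for the returned pointer

    def free(self, bucket):
        bucket.used_count -= 1
        if bucket.used_count == 0:
            # Only a completely empty bucket can be released.
            self.buckets.remove(bucket)
            if bucket is self.current:
                self.current = None


def buckets_after(rounds, keep_one=True):
    """Allocate SLOT_COUNT items per round, then free all of them (or all but one)."""
    size_class = SizeClass()
    for _ in range(rounds):
        handles = [size_class.alloc() for _ in range(SLOT_COUNT)]
        survivors = 1 if keep_one else 0
        for handle in handles[survivors:]:
            size_class.free(handle)
    return len(size_class.buckets)
```

In this model, `buckets_after(10_000)` leaves 10,000 buckets alive, one per round, while `buckets_after(10_000, keep_one=False)` leaves none. A single surviving allocation per bucket is enough to make the bucket list grow linearly with the number of rounds.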
    
    This allocation pattern has been seen in the wild: Vexu/arocc#508 (comment)
    
    In that case, the length of the bucket list for the `128` size class would grow to tens of thousands of buckets and cause Debug runtime to balloon to ~8 minutes whereas with the c_allocator the Debug runtime would be ~3 seconds.
    
    To address this, there are three different changes happening here:
    
    1. std.Treap is used instead of a doubly linked list for the lists of buckets. This takes the time complexity of searchBucket [used in resize and free] from O(n) to O(log n), but increases the time complexity of insert from O(1) to O(log n) [before, all new buckets would get added to the head of the list]. Note: Any data structure with O(log n) or better search/insert/delete would also work for this use-case.
    2. If the 'current' bucket for a size class is full, the list of buckets is never traversed and instead a new bucket is allocated. Previously, traversing the bucket list could only find a non-full bucket in specific circumstances, and only because of a separate optimization that is no longer needed (before, after any resize/free, the affected bucket would be moved to the head of the bucket list to allow searchBucket to perform better on average). Now, the current_bucket for each size class only changes when either (1) the current bucket is emptied/freed, or (2) a new bucket is allocated (due to the current bucket being full or null). Because each bucket's alloc_cursor only moves forward (i.e. slots within a bucket are never re-used), we can therefore always know that any bucket besides the current_bucket will be full, so traversing the list in the hopes of finding an existing non-full bucket is entirely pointless.
    3. Size + alignment information for small allocations has been moved into the Bucket data instead of keeping it in a separate HashMap. This offers an improvement over the HashMap since whenever we need to get/modify the length/alignment of an allocation it's extremely likely we will already have calculated any bucket-related information necessary to get the data.
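
Changes 1 and 2 can be illustrated together with a small model. This is a sketch under stated assumptions, not the actual implementation: it assumes buckets are keyed by their base address, uses a sorted Python list with `bisect` as a stand-in for `std.Treap`, and invents `PAGE_SIZE`, the fake address source `_next_base`, and the `SizeClass`/`Bucket` classes for illustration. Note that `bisect` only makes the search O(log n); inserting into a Python list is still O(n), whereas a treap gives O(log n) for both.

```python
import bisect

PAGE_SIZE = 4096                     # hypothetical bucket page size in bytes
SLOT_SIZE = 128                      # the size class from the arocc case
SLOT_COUNT = PAGE_SIZE // SLOT_SIZE  # 32 slots per bucket in this model


class Bucket:
    def __init__(self, base):
        self.base = base       # address of the first slot
        self.alloc_cursor = 0  # next never-used slot; only moves forward
        self.used_count = 0    # live allocations in this bucket


class SizeClass:
    def __init__(self, first_base=0x10000):
        self.bases = []    # sorted base addresses (stand-in for the treap keys)
        self.by_base = {}  # base address -> Bucket
        self.current = None
        self._next_base = first_base  # fake page source for the sketch

    def _new_bucket(self):
        bucket = Bucket(self._next_base)
        self._next_base += PAGE_SIZE
        bisect.insort(self.bases, bucket.base)
        self.by_base[bucket.base] = bucket
        return bucket

    def alloc(self):
        # Change 2: never scan for a non-full bucket. Every bucket other
        # than `current` is already full, so just allocate a new one.
        if self.current is None or self.current.alloc_cursor == SLOT_COUNT:
            self.current = self._new_bucket()
        bucket = self.current
        addr = bucket.base + bucket.alloc_cursor * SLOT_SIZE
        bucket.alloc_cursor += 1
        bucket.used_count += 1
        return addr

    def search_bucket(self, addr):
        # Change 1: binary search over ordered keys instead of a linear walk.
        i = bisect.bisect_right(self.bases, addr) - 1
        if i < 0:
            return None
        bucket = self.by_base[self.bases[i]]
        return bucket if addr < bucket.base + PAGE_SIZE else None

    def free(self, addr):
        # No double-free detection here; the sketch trusts its caller.
        bucket = self.search_bucket(addr)
        if bucket is None:
            raise ValueError("address does not belong to this size class")
        bucket.used_count -= 1
        if bucket.used_count == 0:
            self.bases.pop(bisect.bisect_left(self.bases, bucket.base))
            del self.by_base[bucket.base]
            if bucket is self.current:
                self.current = None
```

With tens of thousands of buckets, each `free` in this model inspects roughly 15 keys instead of walking the whole list, which is the effect the first change is after.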
    
    The first change is the most relevant and accounts for most of the benefit here. Also note that the overall functionality of GeneralPurposeAllocator is unchanged.
    
    In the degraded `arocc` case, these changes bring Debug performance from ~8 minutes to ~20 seconds.
    
    Benchmark 1: test-master.bat
      Time (mean ± σ):     481.263 s ±  5.440 s    [User: 479.159 s, System: 1.937 s]
      Range (min … max):   477.416 s … 485.109 s    2 runs
    
    Benchmark 2: test-optim-treap.bat
      Time (mean ± σ):     19.639 s ±  0.037 s    [User: 18.183 s, System: 1.452 s]
      Range (min … max):   19.613 s … 19.665 s    2 runs
    
    Summary
      'test-optim-treap.bat' ran
       24.51 ± 0.28 times faster than 'test-master.bat'
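
As a sanity check, the summary line can be recomputed from the two means and standard deviations above. This sketch assumes the spread of the ratio follows first-order error propagation for a quotient; `speedup` is a hypothetical helper name, not part of any benchmarking tool.

```python
import math


def speedup(slow_mean, slow_sd, fast_mean, fast_sd):
    """Return (ratio, spread) of slow/fast using first-order error propagation."""
    ratio = slow_mean / fast_mean
    relative = math.sqrt((slow_sd / slow_mean) ** 2 + (fast_sd / fast_mean) ** 2)
    return ratio, ratio * relative


# Means and standard deviations copied from the two benchmark runs above.
ratio, spread = speedup(481.263, 5.440, 19.639, 0.037)
print(f"{ratio:.2f} ± {spread:.2f}")
```

With the figures above this prints `24.51 ± 0.28`, matching the reported summary. Almost all of the spread comes from the slower run, whose relative deviation is about 1.1%.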
    
    Note: Much of the time taken on Windows in this particular case is related to gathering stack traces. With `.stack_trace_frames = 0` the runtime goes down to 6.7 seconds, which is a little more than 2.5x slower compared to when the c_allocator is used.
    
    These changes may or may not introduce a slight performance regression in the average case.
    
    Here are the standard library tests on Windows in Debug mode:
    
    Benchmark 1 (10 runs): std-tests-master.exe
      measurement          mean ± σ            min … max           outliers         delta
      wall_time          16.0s  ± 30.8ms    15.9s  … 16.1s           1 (10%)        0%
      peak_rss           42.8MB ± 8.24KB    42.8MB … 42.8MB          0 ( 0%)        0%
    Benchmark 2 (10 runs): std-tests-optim-treap.exe
      measurement          mean ± σ            min … max           outliers         delta
      wall_time          16.2s  ± 37.6ms    16.1s  … 16.3s           0 ( 0%)        💩+  1.3% ±  0.2%
      peak_rss           42.8MB ± 5.18KB    42.8MB … 42.8MB          0 ( 0%)          +  0.1% ±  0.0%
    
    And on Linux:
    
    Benchmark 1: ./test-master
      Time (mean ± σ):     16.091 s ±  0.088 s    [User: 15.856 s, System: 0.453 s]
      Range (min … max):   15.870 s … 16.166 s    10 runs
     
    Benchmark 2: ./test-optim-treap
      Time (mean ± σ):     16.028 s ±  0.325 s    [User: 15.755 s, System: 0.492 s]
      Range (min … max):   15.735 s … 16.709 s    10 runs
     
    Summary
      './test-optim-treap' ran
        1.00 ± 0.02 times faster than './test-master'
    squeek502 committed Oct 3, 2023
    cf3572a
  3. 95f4c15