Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(array): use-after-free in optimized String array #5377

Merged
merged 4 commits into from
Feb 1, 2023

Conversation

mhilton
Copy link
Contributor

@mhilton mhilton commented Jan 31, 2023

The optimized string array, which is used whan all column values are identical, was caching the string value that was passed to it. If this value came from an arrow String array then the string's data would be in the buffer backing that array. Once the array is finished with and freed the new string array containing the reference to the data buffer remains and prevents the entire buffer from being garbage collected.

The changes to the executetest package show that this is the case by scribbling over a buffer's memory when it is reported as free. This caused a number of tests to fail with strings that contain the data pattern rather than the expected result.

This could cause a correctness error if one uses a more sophisticated allocator that maintains a backing buffer. This would be a legitimate design choice in order to relieve pressure on the garbage collector.

In order to correct this the StringBuilder has been enhanced to always copy the input string into a new buffer whether it is in the optimised mode, or not. This buffer is allocated from the memory allocator. Once the builder has been converted into a string array retrieving a string from this array always makes a string copy of the data, this time allocated by the standard runtime, so that the string can be used with other structures without the possibility that it will keep a larger data buffer from being garbage collected.

These changes are likely to have an impact on the performance of certain operations. Increasing the amount of memory copying will mean that some transform heavy workloads take longer. I don't know how to ascertain the impact yet. In order to alleviate this slightly a new byte array based method (ValueBytes) has been added to the string array. This does not copy the data but instead returns a slice of the data backing the array. This is to be used when the caller knows that they will be using the data without storing a reference, or immediately copying it to a different location.

Checklist

Dear Author 👋, the following checks should be completed (or explicitly dismissed) before merging.

  • ✏️ Write a PR description, regardless of triviality, to include the value of this PR
  • 🔗 Reference related issues
  • 🏃 Test cases are included to exercise the new code
  • 🧪 If new packages are being introduced to stdlib, link to Working Group discussion notes and ensure it lands under experimental/
  • 📖 If language features are changing, ensure docs/Spec.md has been updated

Dear Reviewer(s) 👋, you are responsible (among others) for ensuring the completeness and quality of the above before approval.

Test that potential use after free bugs can be exercised through
the flux language by modifying executetest.ProcessTestHelper2 to
use a memory allocator that scrubs the memory buffer when freeing.
The optimised string arrays did not always copy values before
freeing the backing data, leading to potentially unpredictable
behaviour.

Because the default allocator, which is presumably used in most
circumstances, uses the go runtime allocator underneath it is likely
that this wouldn't normally affect the correctness of operations.
The more likely problem is that large memory blocks that the execution
engine thinks has been freed are not being garbage collected because
of refences to strings within the memory block.

The array.StringBuilder is changed to always copy the value into
an internal memory buffer when a value is appended, The internal
buffer will always be allocated from the the memory allocator. When
retriving a string value from array.String the returned value will
be a copy from the internal buffer. This makes the string much
easier to reason about as it fits with the standard go string
semantics. This string is allocated by the go runtime rather than
the memory allocator.

The above changes make the system safer, but are likely to introduce
a reduction in performance due to the increase in copying of data
in memory. In order to provide a faster path some byte slice based
methods have been added to allow for less copying in places where
it can easily be determined that values will be finished with before
the memory is freed.
Create custom string addition (concatenation) functions that limit
the amount of copying required for the operation.
In places where it is clearly safe to do so use the byte-slice
oriented string column functions to minimize the amount of data
copies that are made when processing string columns.
@mhilton mhilton self-assigned this Jan 31, 2023
@mhilton mhilton requested a review from a team as a code owner January 31, 2023 15:24
Copy link
Contributor

@brettbuddin brettbuddin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all seems reasonable, @mhilton. We'll need to cut a pre-release build to canary in Cloud 2 so we can assess the performance impact.

@gavincabbage
Copy link
Contributor

@brettbuddin I spoke with @mhilton today and general plan, as you've mentioned, will be to canary in staging tomorrow morning and then in the region affected by the EAR Friday morning (EU morning). Then GA next week if that all goes well.

@mhilton mhilton merged commit 3d8cd33 into master Feb 1, 2023
@jacobmarble jacobmarble deleted the mh-string-column-use-after-free branch January 4, 2024 17:05
mhilton added a commit that referenced this pull request Jun 4, 2024
Undo the changes introduced in #5377. This is a trade-off where it
is considerd better to have as much memory as possible allocated
by the allocator, with the risk that it will be freed when still
in use. Than to have significant amounts of memory allocated outside
of the allocator.

It should be noted that because of Go's garbage collector it is
perfectly safe for memory to be "freed" when still in use. It just
mean that any accounting done by the allocator will be inaccurate.
mhilton added a commit that referenced this pull request Jun 5, 2024
* feat: stop making string copies

Undo the changes introduced in #5377. This is a trade-off where it
is considerd better to have as much memory as possible allocated
by the allocator, with the risk that it will be freed when still
in use. Than to have significant amounts of memory allocated outside
of the allocator.

It should be noted that because of Go's garbage collector it is
perfectly safe for memory to be "freed" when still in use. It just
mean that any accounting done by the allocator will be inaccurate.

* feat: remove unnecessary StringRef type

Remove the recently added StringRef type as it is now unnecessary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants