Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shrink_to_fit method for the bucket backend is unsound #46

Closed
jedel1043 opened this issue Jun 10, 2022 · 4 comments · Fixed by #66
Closed

shrink_to_fit method for the bucket backend is unsound #46

jedel1043 opened this issue Jun 10, 2022 · 4 comments · Fixed by #66

Comments

@jedel1043
Copy link

Sorry! I forgot to open an issue when I talked about this problem. Better late than never I guess haha

The StringInterner::shrink_to_fit method causes undefined behaviour when using the bucket backend.

A simple test demonstrates this problem:

#[test]
fn shrink_to_fit_works() {
    let mut interner = StringInterner::new();
    // Insert 3 unique strings:
    let aa = interner.get_or_intern("aa").to_usize();
    let bb = interner.get_or_intern("bb").to_usize();
    let cc = interner.get_or_intern("cc").to_usize();

    interner.shrink_to_fit();

    assert_eq!(
        interner.get_or_intern("aa").to_usize(),
        aa,
        "'aa' did not produce the same symbol",
    );
    assert_eq!(
        interner.get_or_intern("bb").to_usize(),
        bb,
        "'bb' did not produce the same symbol",
    );
    assert_eq!(
        interner.get_or_intern("cc").to_usize(),
        cc,
        "'cc' did not produce the same symbol",
    );
    assert_eq!(interner.len(), 3);
}

This fails with the following message:

thread 'bucket_backend::shrink_to_fit_works' panicked at 'assertion failed: `(left == right)`
  left: `3`,
 right: `2`: 'cc' did not produce the same symbol', tests/tests.rs:410:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Additionally, running cargo +nightly miri test errors with:

    |
381 |         unsafe { &*self.as_ptr() }
    |                  ^^^^^^^^^^^^^^^ pointer to alloc219428 was dereferenced after this allocation got freed
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
    ... backtrace

I'll try to work on a fix shortly.

Explanation (taken from #42 (comment))

Explanation

The definition of the BucketBackend is as follows:

pub struct BucketBackend<S = DefaultSymbol> {
spans: Vec<InternedStr>,
head: FixedString,
full: Vec<String>,
marker: PhantomData<fn() -> S>,
}

This backend works by creating several FixedString buffers of a fixed size, storing them in a Vec<String> and creating NonNull<str> references (abstracted by InternedString) to each of them in order to obtain the interned strs.

However, FixedString must always ensure that it won't be moved to another memory location, otherwise the previously created NonNull<str> references will point to unallocated memory (use after free).

Unfortunately, when we call the shrink_to_fit method from the backend, it internally calls the shrink_to_fit method from FixedString:

/// Shrink capacity to fit the contents exactly.
pub fn shrink_to_fit(&mut self) {
self.contents.shrink_to_fit();
}

And calling shrink_to_fit on a String CAN move its contents to another memory location to reduce its allocated space. From the definition of Vec::shrink_to_fit:

    pub fn shrink_to_fit(&mut self) {
        // The capacity is never less than the length, and there's nothing to do when
        // they are equal, so we can avoid the panic case in `RawVec::shrink_to_fit`
        // by only calling it with a greater capacity.
        if self.capacity() > self.len {
            self.buf.shrink_to_fit(self.len);
        }
    }

This calls RawVec::shrink_to_fit:

    pub fn shrink_to_fit(&mut self, cap: usize) {
        handle_reserve(self.shrink(cap));
    }

Which calls RawVec::shrink:

    fn shrink(&mut self, cap: usize) -> Result<(), TryReserveError> {
        assert!(cap <= self.capacity(), "Tried to shrink to a larger capacity");


        let (ptr, layout) = if let Some(mem) = self.current_memory() { mem } else { return Ok(()) };


        let ptr = unsafe {
            // `Layout::array` cannot overflow here because it would have
            // overflowed earlier when capacity was larger.
            let new_layout = Layout::array::<T>(cap).unwrap_unchecked();
            self.alloc
                .shrink(ptr, layout, new_layout)
                .map_err(|_| AllocError { layout: new_layout, non_exhaustive: () })?
        };
        self.set_ptr_and_cap(ptr, cap);
        Ok(())
    }
}

Which calls Allocator::shrink, and in the implementation of shrink for Global, the default global allocator, in the branch corresponding to the case where new_layout.size() < old_layout.size():

            new_size => unsafe {
                let new_ptr = self.allocate(new_layout)?;
                ptr::copy_nonoverlapping(ptr.as_ptr(), new_ptr.as_mut_ptr(), new_size);
                self.deallocate(ptr, old_layout);
                Ok(new_ptr)
            },

Oops, the old memory is deallocated! This causes an use after free and results in undefined behaviour.

@Robbepop
Copy link
Owner

Hey, sorry for not responding for a long time.
Thanks a lot for the issue and the PR.
I actually was already thinking for quite a while to remove the BucketBackend since it is pretty complex compared to the other backends but does not really provide any significant performance improvements besides fewer deallocations.

What is your opinion? Do you see a valid use case for keeping BucketBackend?

@jedel1043
Copy link
Author

Hey, sorry for not responding for a long time. Thanks a lot for the issue and the PR. I actually was already thinking for quite a while to remove the BucketBackend since it is pretty complex compared to the other backends but does not really provide any significant performance improvements besides fewer deallocations.

What is your opinion? Do you see a valid use case for keeping BucketBackend?

As the description says, this backend is very useful when you need to intern a lot of static strings, which is very common to do on interpreters and compilers, so I definitely see a valid use case for it :)

@Robbepop
Copy link
Owner

Robbepop commented Aug 30, 2022

Hey, sorry for not responding for a long time. Thanks a lot for the issue and the PR. I actually was already thinking for quite a while to remove the BucketBackend since it is pretty complex compared to the other backends but does not really provide any significant performance improvements besides fewer deallocations.
What is your opinion? Do you see a valid use case for keeping BucketBackend?

As the description says, this backend is very useful when you need to intern a lot of static strings, which is very common to do on interpreters and compilers, so I definitely see a valid use case for it :)

Are we talking about >(10^3) static strings or just like <(10^2)? Because I guess it won't matter then. Have you got benchmarks?

@jedel1043
Copy link
Author

Are we talking about >(10^3) static strings or just like <(10^2)?

Hmm, probably less than 100, but that would depend on the compiler, I think.

Because I guess it won't matter then. Have you got benchmarks?

No benchmarks, unfortunately. However, that backend has the advantage of consuming very little memory when the majority of interned strings are static, which is always very appreciated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants