Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Subslice-offset - Get the offset of references into a slice. #2796

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

m-ou-se
Copy link
Member

@m-ou-se m-ou-se commented Oct 28, 2019

Rendered

(This idea was mentioned in the 'Future possibilities' section of #2791)


## Raw Pointers

By requiring a `&T` or a `&[T]` in these functions, we needlessly restrict them
Copy link

@hanna-kruppe hanna-kruppe Oct 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This restriction to dereferenceable pointers side-steps potential problems with pointer comparisons: compiler optimizations can change the results of pointer comparisons to be different from the order of the runtime addresses when the pointers are dangling. See for example section 4.6 of this paper.

It is not immediately apparent to what degree LLVM performs such optimizations (especially on < and >, as opposed to ==), but it seems risky to assume that it never will. LLVM docs say icmp on pointers works "as if [the pointers] were integers" but this is imprecise and apparently contradicted by the fact that some comparisons which can be true at runtime under -O0 are folded to false at -O2.

cc @rust-lang/wg-unsafe-code-guidelines

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback!

I mention somthing similar briefly, but it's a bit hidden at the end of 'option 4':

The downside is that we lose the guarantees of a &[T], and can no longer make assumptions such as start <= end or about the maximum size of a slice (which is needed to safely use pointer::offset_from).

It's part of the reason why I'm proposing the versions with references, and only mention the pointers as alternatives. I should probably mention this more clearly right at the start of the raw pointers section.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did miss that part, thanks for pointing it out.

Those problems you list seem of a different nature, though. Not knowing that start <= end and that the slice spans no more than isize::MAX bytes requires a different, slightly less efficient implementation (e.g., some more comparisons). In contrast, it is not clear to me whether the problem of non-deterministic pointer comparison can be circumvented at all, since e.g. ptr->int casts do not currently stop the LLVM optimization I showcased earlier. That may have to change anyway to make LLVM's memory model consistent, but at present I don't have confidence the raw pointer version can be implemented correctly, at least according to the naive specification of comparing the pointer addresses numerically.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkruppe unfortunately I don't think restricting to dereferencable pointers entirely side-steps the problems here when ZST are involved. ZST pointers are nominally "dangling" as far as LLVM is concerned (or at least, they might be). But then, as @HeroicKatora points out, there are larger problems with ZST.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may have to change anyway to make LLVM's memory model consistent, but at present I don't have confidence the raw pointer version can be implemented correctly, at least according to the naive specification of comparing the pointer addresses numerically.

Indeed LLVM is currently in a sad state where it is not possible to reliably compare the integer addresses of two pointers. Casting to integers first and then comparing should work but LLVM incorrectly "optimizes" that to comparing at pointer type instead. We could conceivably hack around that by "obfuscating" the comparison enough, which however would likely be catastrophic for performance. Also see these LLVM bugs: [1], [2].

@jeekobu
Copy link

jeekobu commented Oct 30, 2019

I feel these would be useful. I've implemented the following less-generic analog before and was surprised it didn't exist.

fn offset_within(parent: &str, child: &str) -> Option<usize> { ... }

Copy link

@HeroicKatora HeroicKatora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feature might be confusing for slices of ZSTs. In particular, all elements are virtually indistinguishable from their pointers alone as they have the same address. The only useful return values of index_of seem Some(0) or None.

impl<T> [T] {
pub fn range_of(&self, subslice: &[T]) -> Option<Range<usize>>
}
```
Copy link

@HeroicKatora HeroicKatora Oct 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation may be more tricky for range_of in the ZST case. One should be able to expect that if range_of(slice).is_some() then the returned range has the same length as the subslice given. Finding the index of the first and last element of the subslice each with index_of will return a wrong result from this point of view.

let element = element as *const _;
let range = self.as_ptr_range();
if range.contains(&element) {
unsafe { Some(element.offset_from(range.start) as usize) }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation panics for ZSTs (but would otherwise be unsound anyways). The requirements of offset_from state:

Both the starting and other pointer must be either in bounds or one byte past the end of the same allocated object. Note that in Rust, every (stack-allocated) variable is considered a separate allocated object.
[...]
Panics
This function panics if T is a Zero-Sized Type ("ZST").

References to ZSTs are always dangling pointers and thus offset_from never valid as they never point into any allocation. The non-ZST case could be improved with an explicit comment detailing soundness justifications.

Also note that even if having a special case implementation, slices of ZST will return Some(0) in many more cases than the caller might realize, sometimes depending on optimization levels.

let not_an_allocation = vec![(); 16];
let some_new = &();
// Entirely indetermined, might hold or not depending on compiler
assert!(not_an_allocation.index_of(&some_new).is_some());

pub fn range_of(&self, subslice: &[T]) -> Option<Range<usize>> {
let range = self.as_ptr_range();
let subrange = subslice.as_ptr_range();
if subrange.start >= range.start && subrange.end <= range.end {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this does not consider the ZST case. As outlined above, pointers may compare equal even though they logically refer to different regions in the slice. A sensible choice if the operation should succeed could be to assume a subslice starting at index 0 and having a length of subslice.len().

This shows that the implementation is entirely non-trivial. This is a major motivation for having such an interface in std, in my opinion.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it is unsound in general.

Consider that the subslice may be empty. In that case, the start and end pointer can fulfil the pointer comparison but not in fact be part of the same allocation. Note that this may be the case even for standard/non-ZST types. Then the offset_from calls are UB and do not panic. The crucial difference is that unlike the single element case even the start pointer may be a one-past-the-end pointer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed -- if subslice is empty this should short-circuit and return None (or maybe Some(0..0)?).
Otherwise it should use index_of(subrange.start) and then add subslice.len to get the end index.

@dtolnay dtolnay added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Jan 16, 2020
@KodrAus KodrAus added A-slice Slice related proposals & ideas Libs-Tracked Libs issues that are tracked on the team's project board. labels Jul 29, 2020
@steffahn
Copy link
Member

steffahn commented Feb 4, 2021

There’s also UB in the proposed implementation, if it is called with slices like this (playground).

@RustyYato
Copy link

RustyYato commented Feb 4, 2021

Note: The UB comes from (quoting the docs)

If any of the following conditions are violated, the result is Undefined Behavior:

  • Both the starting and other pointer must be either in bounds or one byte past the end of the same allocated object. Note that in Rust, every (stack-allocated) variable is considered a separate allocated object.

  • Both pointers must be derived from a pointer to the same object. (See below for an example.)

  • The distance between the pointers, in bytes, cannot overflow an isize.

  • The distance between the pointers, in bytes, must be an exact multiple of the size of T.

  • The distance being in bounds cannot rely on "wrapping around" the address space.

Emphasis mine. The first point fundamentally can't be checked, the second point could be checked, but isn't right now.

This UB can be alleviated by using wrapping_offset_from instead, if that existed! (I'm surprised it doesn't exist). You'll have to convert the pointers to integers and do the arithmetic yourself to be safe.

Comment on lines +231 to +235
- In the first case, a `&[T]` can be used just like 'before', as it is coerced
to a `*const [T]` automatically. However, there's no proper way to construct
such a 'fat pointer' manually, making it not very useful for anything other
than taking a `&[T]`, except that it lifts the borrowing/lifetime
requirements.
Copy link
Contributor

@SimonSapin SimonSapin Aug 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s possible that this RFC pre-dates it, but std::ptr::slice_from_raw_parts is that proper way.

That said I think it’s fine to have these methods only on slices, at least initially. It avoids many of the soundness questions, and if those questions are eventually resolved one of the other options could be picked.


See the [Raw Pointers section](#raw-pointers) above.

- Should this be added for `&str` as well?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. There are already methods that would be trivial to replace with as_bytes + [u8] methods but are still available on str for convenience.

```rust
impl<T> [T] {
pub fn index_of(&self, element: &T) -> Option<usize>;
pub fn range_of(&self, subslice: &[T]) -> Option<Range<usize>>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning a range instead of just the start offset seems not strictly necessary since its length should always be subslice.len(), but on further thought it’s nice to have and makes the meaning of the return value obvious. So 👍

@SimonSapin
Copy link
Contributor

SimonSapin commented Aug 9, 2021

Below is another proposed implementation based on the feedback here. The main changes are explicit handling of ZSTs, and "manually" doing arithmetic on pointer addresses instead of using <*const T>::offset_of to avoid its potential UB.

In order to handle the case mentioned #2796 (comment) (reproduced in code comments) we need to compute the distance in bytes and check its modulo. At that point we’ve already done half of the job of offset_of so we might as well do the division on that integer.

To be added to the doc-comment of index_of:

Zero-sized types

When size_of::<T>() == 0 the memory address of &T is not meaningful, so index_of can only return None or Some(0). It may spuriously return Some with an element created from outside the self slice.

To be added to the doc-comment of range_of:

If a range is returned, its length is always subslice.len().

Zero-sized types

When size_of::<T>() == 0 the memory address of &T or &[T] is not meaningful, so range_of can only return None or Some(0..subslice.len()).

Zero-sized subslice

When size_of::<T>() == 0 or subslice.is_empty(), range_of may spuriously return Some for a subslice created from outside the self slice.

Proposed implementation
use std::mem;
use std::ops::Range;

trait SliceExt<T> {
    fn index_of(&self, element: &T) -> Option<usize>;
    fn range_of(&self, subslice: &[T]) -> Option<Range<usize>>;
}

impl<T> SliceExt<T> for [T] {
    fn index_of(&self, element: &T) -> Option<usize> {
        let size_of = mem::size_of::<T>();
        if size_of > 0 {
            let range = as_address_range(self);
            let element_address = element as *const T as usize;
            if range.contains(&element_address) {
                let bytes_offset = element_address - range.start;
                if bytes_offset % size_of == 0 {
                    Some(bytes_offset / size_of)
                } else {
                    // This could happen with arguments like this:
                    //
                    // ```
                    // use bytemuck::cast_slice;
                    // let x: &[[i32; 2]] = &[[1,2],[3,4]];
                    // let y: &[[i32; 2]] = cast_slice(&cast_slice::<_, i32>(x)[1..3]);
                    // let index = x.index_of(&y[0]);
                    // dbg!(x, y, index);
                    // ```
                    //
                    // … which prints:
                    //
                    // ```
                    // x: [[1,2],[3,4]]
                    // y: [[2,3]]
                    // index: None
                    // ```
                    None
                }
            } else {
                None
            }
        } else {
            // All items of a slice of ZST share the same address
            if element as *const T == self.as_ptr() {
                Some(0)
            } else {
                None
            }
        }
    }

    fn range_of(&self, subslice: &[T]) -> Option<Range<usize>> {
        let size_of = mem::size_of::<T>();
        if size_of > 0 {
            let range = as_address_range(self);
            let subrange = as_address_range(subslice);
            if subrange.start >= range.start && subrange.end <= range.end {
                let bytes_offset = subrange.start - range.start;
                if bytes_offset % size_of == 0 {
                    let start = bytes_offset / size_of;
                    let end = start + subslice.len();
                    Some(start..end)
                } else {
                    // This could happen with arguments like this:
                    //
                    // ```
                    // use bytemuck::cast_slice;
                    // let x: &[[i32; 2]] = &[[1,2],[3,4]];
                    // let y: &[[i32; 2]] = cast_slice(&cast_slice::<_, i32>(x)[1..3]);
                    // let range = x.range_of(y);
                    // dbg!(x, y, range);
                    // ```
                    //
                    // … which prints:
                    //
                    // ```
                    // x: [[1,2],[3,4]]
                    // y: [[2,3]]
                    // range: None
                    // ```
                    None
                }
            } else {
                None
            }
        } else {
            // All items of a slice of ZST share the same address
            if subslice.as_ptr() == self.as_ptr() {
                Some(0..subslice.len())
            } else {
                None
            }
        }
    }
}

fn as_address_range<T>(slice: &[T]) -> Range<usize> {
    let ptr_range = slice.as_ptr_range();
    let start = ptr_range.start as usize;
    let end = ptr_range.end as usize;
    start..end
}

@SimonSapin
Copy link
Contributor

I’ve edited my proposed doc-comment additions to cover ZSTs and empty subslices.

@Rua
Copy link

Rua commented Sep 11, 2021

I was going to propose index_of myself, so I'm happy to see this is already proposed and being worked on.

@kpreid
Copy link
Contributor

kpreid commented May 2, 2022

I think this RFC is a good idea, but the specific choice of name index_of() concerns me, because it sounds too much like an innocuous value-comparison operation, not one which rightfully has the warning “Note that this does not look at the value, but only at the reference” in its documentation. Generally Rust keeps pointer-equality in items with ptr in the name, like Arc::ptr_eq(), and everything else is by value.

Unfortunately, these particular methods don't take pointers but putting ptr in the name would sound like they do. Maybe index_of_ref() and range_of_ref() would be adequate to flag “something unusual is here”? Or more verbosely, index_of_contained_ref()?

@m-ou-se
Copy link
Member Author

m-ou-se commented May 4, 2022

Unfortuantely, I won't have the time to update this RFC any time soon. If someone wants to push this feature forward without waiting for me to update it, please feel free to open a new RFC taking whatever you need from this one. :)

@joshtriplett
Copy link
Member

This came up in a recent discussion.

Potential alternative: a function on a slice that takes another slice and computes the offset from one to the other. So, for instance, a function on a &str that accepts another &str and returns a usize. This would be useful, for instance, after splitting a string, to get the offset of the split-out string in the original string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-slice Slice related proposals & ideas Libs-Tracked Libs issues that are tracked on the team's project board. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.