Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Option<TinyAsciiStr> be the same size as TinyAsciiStr #2083

Closed
sffc opened this issue Jun 16, 2022 · 26 comments · Fixed by #2430
Closed

Make Option<TinyAsciiStr> be the same size as TinyAsciiStr #2083

sffc opened this issue Jun 16, 2022 · 26 comments · Fixed by #2430
Assignees
Labels
A-performance Area: Performance (CPU, Memory) C-data-infra Component: provider, datagen, fallback, adapters help wanted Issue needs an assignee S-medium Size: Less than a week (larger bug fix or enhancement) T-enhancement Type: Nice-to-have but not required

Comments

@sffc
Copy link
Member

sffc commented Jun 16, 2022

Previously, the tinystr crate used NonZeroU32 and friends to optimize Option<TinyStr4>. We have not yet replicated this optimization after the rewrite to TinyAsciiStr.

@Manishearth had some suggestions on how to go about this.

@sffc sffc added T-enhancement Type: Nice-to-have but not required A-performance Area: Performance (CPU, Memory) C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) labels Jun 16, 2022
@Manishearth
Copy link
Member

Did I? I don't recall having such suggestions, I don't think it's currently possible for the generic type.

@sffc
Copy link
Member Author

sffc commented Jun 17, 2022

I think the naive solution is to just make the first byte be NonZeroU8, such that we can't represent empty strings, but it creates the niche for Option to use.

A fancier solution is to recognize that 0xFF is never a valid ASCII byte (or UTF-8 byte) and use that as the niche somehow.

@sffc
Copy link
Member Author

sffc commented Jun 17, 2022

Here's how it works with the NonZeroU8 lead byte (it might be necessary to reduce the const N by 1, but this is OK especially if we make a new type for it): Playground

@sffc
Copy link
Member Author

sffc commented Jun 17, 2022

For the fancier solution, I think the safe way is to make an enum with a value for each valid ASCII char, and the internal representation is an array of those enums instead of an array of raw u8: Playground

@Manishearth
Copy link
Member

Manishearth commented Jun 17, 2022

Oh, huh. The N-1 thing was something I was trying earlier but there were const problems, I guess that works now

Honestly I'd prefer the enum solution since it keeps it clean. Might need to be careful about the repr (C can't use niches, unsure if u8 can, Rust is UB to use this way)

@robertbastian
Copy link
Member

The const problem is still there, you cannot evaluate const generics, so you cannot write the type NonEmptyTinyAsciiStr<N-1>. playground

@robertbastian
Copy link
Member

I tried to find some documentation on how to define a niche, but it seems like the niche optimizations are entirely compiler internals are thus not reference-guaranteed behaviour.

For the enum option, what's the plan wrt to getting a [u8; N] to work with? Transmute? Big match statement?

@Manishearth
Copy link
Member

You cannot define niches beyond making your own enums.

Yes, transmute. If we can guarantee the repr we can transmute.

@sffc
Copy link
Member Author

sffc commented Jun 17, 2022

Right, the NonZeroU8 option involves a new type with the size offset by one. TinyAsciiStr<4> becomes NonEmptyTinyAsciiStr<3>, a new type to avoid confusion.

@robertbastian
Copy link
Member

You cannot define niches beyond making your own enums.

But even that doesn't seem to be guaranteed iiuc

@robertbastian
Copy link
Member

a new type to avoid confusion.

I think it will be very confusing to use N-1 in the public type.

@Manishearth
Copy link
Member

But even that doesn't seem to be guaranteed iiuc

well, nothing in repr(Rust) is

Also yes, I agree about N-1

@Manishearth
Copy link
Member

Manishearth commented Jun 17, 2022

Also, uh, the other problem with N-1 is that (NonZero, [u8; N]) doesn't have a defined layout, and you will lose niche optimizations with repr(C). That's another reason why I abandoned that approach. Might work with packed.

@sffc
Copy link
Member Author

sffc commented Jun 17, 2022

Also, uh, the other problem with N-1 is that (NonZero, [u8; N]) doesn't have a defined layout, and you will lose niche optimizations with repr(C). That's another reason why I abandoned that approach. Might work with packed.

The playground link I posted earlier demonstrates that it works (with packed)

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=ce619eb4e274017e6004a38ca211d4f3

@Manishearth
Copy link
Member

If it works with packed that should be okay, packed is defined.

@sffc
Copy link
Member Author

sffc commented Jun 17, 2022

OK, should we vote?

  • Option 1: NonZeroU8 with repr(packed) and N offset by 1
  • Option 2: Internal repr(u8) enum with values 0x00 to 0x7F (or 0xFE), transmuting to [u8]

Note: This is intended as a future feature; I just want to record the conclusion so that it's ready for someone to get started.

@sffc
Copy link
Member Author

sffc commented Jun 17, 2022

Question: is there a code size risk with either of these choices? Will we not know until it is implemented? Would the enum compile down to a zero-cost abstraction?

@Manishearth
Copy link
Member

Manishearth commented Jun 17, 2022

Well, with the enum i suspect we won't ever access it as the enum, we'll always be transmuting

I hae a preference for Option 2

@robertbastian
Copy link
Member

2 >> 1 because the N-1 is such a footgun. It also removes the distinction between empty string and no string because we use 0 as the niche, whereas with option two we use something that is actually a niche, the highest-order bit.

@Manishearth
Copy link
Member

Also uh sorry my preference is for Option 2 I mistyped 😄

Also am 2>>1.

@sffc sffc added the help wanted Issue needs an assignee label Jun 17, 2022
@robertbastian
Copy link
Member

@zbraniecki found https://github.com/ParkMyCar/compact_str, which uses nonmax::NonMaxU8 to achieve the niche. Might be an alternative to the enum.

@sffc
Copy link
Member Author

sffc commented Jun 21, 2022

@zbraniecki found https://github.com/ParkMyCar/compact_str, which uses nonmax::NonMaxU8 to achieve the niche. Might be an alternative to the enum.

Looking at the implementation, it's storing values as NonZeroU8 and XORing them with ^ $primitive::max_value() in the .get() function. This suggests we most likely wouldn't be able to transmute the bytes to a string, since the byte representation is incorrect.

@robertbastian
Copy link
Member

Oh I was hoping it's doing something cleverer

@sffc
Copy link
Member Author

sffc commented Jun 23, 2022

@Manishearth Do we need to define every value in the enum in order to prevent it from being used as the niche, or is setting a start and end range sufficient?

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=f6c980d29eb4f2f057469ce9470d3c11

@CAD97
Copy link
Contributor

CAD97 commented Jun 23, 2022

👋 I pushed the niche optimization for CompactString over the finish line. Stumbled into this at a useful time, it seems.

I don't see a get function in compact_str at all? @sffc (Though the implementation has changed a lot over the previous few versions, and I only hopped on board for the previous couple.) And yes, transmuting any integer not used as a discriminant to an enum in Rust is UB. (Try running the playground under miri: it shows the UB as "encountered 0x68, but expected a valid enum tag".)

The important part of CompactString w.r.t. inlining is the following:

#[cfg(target_pointer_width = "64")]
const MAX_SIZE: usize = 24;
#[cfg(target_pointer_width = "32")]
const MAX_SIZE: usize = 12;

const LENGTH_MASK: u8 = 0b11000000;

#[repr(C)]
#[derive(Debug, Copy, Clone)]
pub struct InlineString {
    buffer: [u8; MAX_SIZE],
}

impl InlineString {
    #[inline]
    pub fn new(text: &str) -> Self {
        debug_assert!(text.len() <= MAX_SIZE);

        let len = text.len();
        let mut buffer = [0u8; MAX_SIZE];

        // set the length
        buffer[MAX_SIZE - 1] = len as u8 | LENGTH_MASK;

        // copy the string
        //
        // note: in the case where len == MAX_SIZE, we'll overwrite the len, but that's okay because
        // when reading the length we can detect that the last byte is part of UTF-8 and return a
        // length of MAX_SIZE
        unsafe { std::ptr::copy_nonoverlapping(text.as_ptr(), buffer.as_mut_ptr(), len) };


        InlineString { buffer }
    }

CompactString stores length in the last byte (which is the NonMaxU8), mapping length values 0x0..0x18 to 0xC0..D8, which are not valid UTF-8 terminating bytes. The max length of 0x19 (2410) is mapped to any ASCII byte 0x0..0xC0 to achieve maximum length inline strings, as the "length byte" is the final UTF-8 byte. compact_str's observation here is that the final byte of a UTF-8 string cannot be any value in the range 0xC0..=0xFF, and the simplest way to niche the inline length byte into that available range is to set/mask the most significant two bits. Final bytes in 0xD8..0xFF are used to mark an outlined string, and 0xFF is made unrepresentable so that it is used as a niche.

This representation should be 100% compatible with true-zero-copy usage.

Oh, our NonMaxU8 is just said enum, not the nonmax crate.

@Manishearth
Copy link
Member

@sffc every undefined value is a niche, regardless of whether rustc currently uses it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-performance Area: Performance (CPU, Memory) C-data-infra Component: provider, datagen, fallback, adapters help wanted Issue needs an assignee S-medium Size: Less than a week (larger bug fix or enhancement) T-enhancement Type: Nice-to-have but not required
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants