Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converting to VarULE #1179

Closed
sffc opened this issue Oct 16, 2021 · 13 comments · Fixed by #1199
Closed

Converting to VarULE #1179

sffc opened this issue Oct 16, 2021 · 13 comments · Fixed by #1199
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-core Type: Required functionality
Milestone

Comments

@sffc
Copy link
Member

sffc commented Oct 16, 2021

Part of #1082

Converting From VarULE

From VarULE is fairly easy. From works fine when the target type allocates memory. If we want to support cases where the target type borrows from the VarULE, FromVarULE can work (playground):

trait FromVarULE<'a, U: ?Sized>: 'a {
    fn from_var_ule(ule: &'a U) -> Self;
}

impl FromVarULE<'_, str> for String {
    fn from_var_ule(ule: &str) -> Self {
        ule.to_string()
    }
}

impl<'a> FromVarULE<'a, str> for Cow<'a, str> {
    fn from_var_ule(ule: &'a str) -> Self {
        Cow::Borrowed(ule)
    }
}

Converting To VarULE

The other direction is more complicated because VarULE is unsized.

Box-based solutions

A simple trait would be something like the following, but we should not consider it because it requires allocation (playground):

trait IntoBoxedVarULE<U: ?Sized> {
    fn into_var_ule(&self) -> Box<U>;
}

impl IntoBoxedVarULE<str> for String {
    fn into_var_ule(&self) -> Box<str> {
        self.clone().into_boxed_str()
    }
}

impl IntoBoxedVarULE<str> for Cow<'_, str> {
    fn into_var_ule(&self) -> Box<str> {
        self.to_string().into_boxed_str()
    }
}

An interesting option would be a Box-based solution that uses a custom allocator such that it does not need to actually allocate memory. A trait definition such as the following would be cool, but I can't get this to compile because MaybeUninit requires a Sized type (which seems silly to me, since the main reason you allocate in Rust is for unsized things):

// error[E0277]: the size for values of type `U` cannot be known at compilation time
trait IntoAllocBoxedVarULE<U: ?Sized> {
    fn into_var_ule<A: Allocator>(
        &self,
        allocf: impl FnOnce(usize) -> Box<MaybeUninit<U>, A>
    ) -> Box<U, A>;
}

A workaround would be to pass an uninitialized buffer into the function. This results in a safe trait (I think), although implementing it requires unsafe code and unstable features (playground):

trait IntoAllocBufferVarULE<U: ?Sized> {
    fn into_var_ule<A: Allocator>(
        &self,
        allocf: impl FnOnce(usize) -> Box<[MaybeUninit<u8>], A>
    ) -> Box<U, A>;
}

Buffer-based solutions

@Manishearth proposed the following in #1173:

pub unsafe trait EncodeAsVarULE {
    type VarULE: VarULE + ?Sized;
    fn encode_var_ule<R>(&self, cb: impl FnOnce(&[&[u8]]) -> R) -> R;
}

unsafe impl EncodeAsVarULE for String {
    type VarULE = str;
    fn encode_var_ule<R>(&self, cb: impl FnOnce(&[&[u8]]) -> R) -> R {
        cb(&[self.as_bytes()])
    }
}

The advantage of this type of approach, which I'll call a "buffer-based approach" since it returns a [u8] instead of a U, is that it is easy to reason about and doesn't require unstable features or additional memory allocations. The disadvantage is that it requires the trait to be unsafe.

An issue with the above trait is that it requires creating the &[&[u8]], which is easy if the outer slice has only a single element, but may require allocating if multiple slices are required (such as a Vec<String>). An alternative would be something such as:

trait AppendableBuffer {
    fn push_bytes(&mut self, bytes: &[u8]);
}

unsafe trait AppendAsVarULE<U: ?Sized> {
    fn encode_var_ule<R, A: AppendableBuffer>(
        &self,
        appendable: &mut A
    ) -> R;
}

The above solution could also use std::io::Write, but that trait requires the std feature.

We could also pre-allocate the whole buffer:

unsafe trait WriteToBufferAsVarULE<U: ?Sized> {
    fn encode_var_ule<'a, E>(
        &self,
        get_buffer: impl FnOnce(usize) -> &'a mut [u8]
    ) -> Result<&'a U, E>;
}

An advantage of WriteToBufferAsVarULE is that the safety constraint is a bit simpler: the only requirement is that the pointer returned by encode_var_ule is equal in in location, size, and alignment to the pointer returned by get_buffer. Validating that the buffer is a valid VarULE is done internally in the function.

CC @zbraniecki

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting S-small Size: One afternoon (small bug fix or enhancement) labels Oct 16, 2021
@Manishearth
Copy link
Member

So a problem with the Appendable thing is that you often need to know the size of the buffer beforehand, for example when calling .insert() on VZV.

I'm not sure if Vec<String> fields should be supported for this: you should use a VZV instead and directly yield the buffer. At least, that's the intent of the design; I don't think it should ever be the case that you allocate the buffer passed through to the encode function. It should always be a fixed size stack array.

@Manishearth
Copy link
Member

This will be explained a bit better in the EncodeAsVarULE docs once I get to them :)

@sffc
Copy link
Member Author

sffc commented Oct 17, 2021

What is the use case you had in mind where the slice-of-slices has more than one element?

@Manishearth
Copy link
Member

Manishearth commented Oct 17, 2021

What is the use case you had in mind where the slice-of-slices has more than one element?

Zibi needs this.

something like:

struct Foo<'a> {
   x: u32,
   y: char,
   vec: ZeroVec<'a, u32>
}

struct FooULE {
    x: u32::ULE,
    y: char::ULE,
    vec: [u32::ULE]
}

fn encode_as_ule(...) {
    cb(&[x.as_byte_slice().as_unaligned(), y.as_unaligned().as_byte_slice(), vec.as_bytes()])
}

The idea is that you do this once per field, and since there's a constant number of fields this will work without allocation.

Note that FooULE-style types won't work if you need multiple ZeroVecs since Rust does not support multi-DST-field DSTs. However you can still define a custom FooULE type with a single [u8] at the end which it can split into multiple ZeroVecs based on a length field.

The only time you should have to allocate for an encode_() impl is if the type you are encoding does not use ZeroFoo (or String) for dynamically sized fields; and IMO if you have such a type you have already chosen an allocating path.

@sffc
Copy link
Member Author

sffc commented Oct 17, 2021

Okay. So this is basically a variadic function, and using a slice seems to be Rust convention for that. Works for me

@Manishearth
Copy link
Member

Yeah should have opened with that 😄 . I actually considered trying to make this work with generic tuples instead (because then you could just feed it ULE types directly) but then you need an extra trait and you have to manually implement it on all kinds of tuples.

@sffc sffc mentioned this issue Oct 17, 2021
2 tasks
@sffc sffc closed this as completed Oct 17, 2021
@sffc
Copy link
Member Author

sffc commented Oct 20, 2021

I'm reopening this issue because we haven't fully litigated the pros and cons of WriteToBufferAsVarULE.

An unfortunate side-effect of EncodeAsVarULE is that it requires calling the function twice, once to get the length and then later to actually copy the data.

@sffc sffc reopened this Oct 20, 2021
@Manishearth
Copy link
Member

Only for .insert() and .replace(), which are already a slow path. Bear in mind that 99% of the time that first call for the length will compile down into something super simple (should be something of the form constant1 + len / constant2)

If it's that concerning we could add a second encoded_length() method that's defaulted but can be overridden to ensure perf characteristics.

@sffc
Copy link
Member Author

sffc commented Oct 20, 2021

I think I'm in favor of a second method on the trait, like Writeable uses. I don't think it should be defaulted. Most impls of the method should be #[inline] and will likely be of the form N + self.value.len() where N is a constant and self.value is the variable-length byte field.

@Manishearth
Copy link
Member

Hmmm. So an important thing to note is that due to the use of generics it's getting inlined anyway, and most likely optimized; using a defaulted function can be a helpful way to hint (and reduces duplication) but I'd rather not require people to implement it manually.

@sffc
Copy link
Member Author

sffc commented Oct 20, 2021

We still haven't discussed the pros and cons of

unsafe trait WriteToBufferAsVarULE<U: ?Sized> {
    fn encode_var_ule<'a, E>(
        &self,
        get_buffer: impl FnOnce(usize) -> &'a mut [u8]
    ) -> Result<&'a U, E>;
}

@sffc
Copy link
Member Author

sffc commented Oct 21, 2021

Discussion from today: we solve almost all of our known problems with

unsafe trait EncodeAsVarULE<U: ?Sized> {
    fn encode_var_ule<'a, E>(
        &self,
        get_buffer: impl FnOnce(usize) -> &'a mut [u8]
    ) -> Result<&'a U, E>;

    fn encode_len(&self) -> usize;

    fn encode_write<'a, E>(&self, &'a mut [u8]) -> Result<&'a U, E>;
}

where encode_len and encode_write are auto-implemented to call encode_var_ule, but they can be overridden. Meanwhile, clients such as VarZeroVec::insert() will only call encode_len and encode_write and never call encode_var_ule except indirectly via those methods.

@Manishearth
Copy link
Member

#1199 addresses that issue.

@Manishearth Manishearth changed the title Converting to and from VarULE Converting to VarULE Oct 21, 2021
@Manishearth Manishearth added this to the ICU4X 0.4 milestone Oct 21, 2021
@Manishearth Manishearth self-assigned this Oct 21, 2021
@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Jan 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants