Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move error_value (and optionally default_value) to the CPT header? #1879

Closed
2 of 3 tasks
sffc opened this issue May 12, 2022 · 4 comments · Fixed by #2301
Closed
2 of 3 tasks

Move error_value (and optionally default_value) to the CPT header? #1879

sffc opened this issue May 12, 2022 · 4 comments · Fixed by #2301
Assignees
Labels
C-unicode Component: Props, sets, tries good first issue Good for newcomers S-small Size: One afternoon (small bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt

Comments

@sffc
Copy link
Member

sffc commented May 12, 2022

See previous discussion in #1183.

The thread #1183 got derailed into a discussion on trust boundaries. I moved that conversation over to #1290. Our work on CrabBake is our solution to that side of the problem and allows us to reason about "invalid serialized data" in the serde case.

I increasingly feel that the option most consistent with how our data model works is to move these two fields, especially error_value, into the CPT header. This would help us eliminate the problematic DATA_GET_ERROR_VALUE.

This should be an easy change: the error value is always the last value in the data array, so at datagen time, we just remove the last value from the array and save it into the header.

Responses to the pushback on this idea from the previous thread:

@markusicu: Between UTrie2 and the newer CodePointTrie, I moved the error value from the header struct into the data array in order to make the code simpler and more uniform.

In Rust, I do not believe that this will increase code complexity. It would involve adding a type parameter to CodePointTrieHeader, but this parameter would just be bubbled down from CodePointTrie, which already has a type parameter.

It may require adding #[derive(Deserialize)] on TrieType types that don't already have it.

@hsivonen: This would make ICU4X compute bogus results if externally-traveled data actually has been corrupted. In terms of surprise and what it takes to understand what's going wrong, this seems worse than any of the other options. However, in terms of usage ergonomics, I'd rather settle on this option than option 2 if we can't agree on panicking on malformed data at the time of making lookups from the data.

We are moving from one type of GIGO to another type of GIGO. The new type of GIGO is in fact slightly more robust, because it allows us to automatically fail in the constructor if the error value field is not present in the header, something we can't as easily do when it is stored in the data array. Discussion of when or when not to rely on GIGO is the topic of #1290 and I do not want to derail this thread again on that separate, very important issue.

Needs approval from:

@sffc sffc added C-unicode Component: Props, sets, tries T-techdebt Type: ICU4X code health and tech debt S-small Size: One afternoon (small bug fix or enhancement) needs-approval One or more stakeholders need to approve proposal labels May 12, 2022
@sffc sffc added this to the ICU4X 1.0 milestone May 12, 2022
@Manishearth
Copy link
Member

This makes sense to me.

@echeran
Copy link
Contributor

echeran commented May 26, 2022

It seems fine to me, it seems to be an incremental improvement on code clarity and safety without actually changing any behavior.

If you're so inclined, since this value should be the last element in the CPT's data array (according to this constant), you could truncate the data array by 1 element. But I don't know if that's beneficial.

@sffc sffc added good first issue Good for newcomers help wanted Issue needs an assignee and removed needs-approval One or more stakeholders need to approve proposal labels May 28, 2022
@Manishearth
Copy link
Member

@pandusonu2 is working on this

The bug is to add an error_value field to this type, of type T

data: ZeroVec<'trie, T>,

And we can parse it in from the last element of the data array here:

impl<T: TrieValue> TryFrom<&CodePointTrieToml> for CodePointTrie<'static, T> {
type Error = Error;
fn try_from(cpt_data: &CodePointTrieToml) -> Result<CodePointTrie<'static, T>, Self::Error> {
use CodePointDataSlice::*;
let header = CodePointTrieHeader::try_from(cpt_data)?;
let index: ZeroVec<u16> = ZeroVec::alloc_from_slice(&cpt_data.index);
let data: Result<ZeroVec<'static, T>, T::TryFromU32Error> = match cpt_data.data_slice()? {
U8(s) => s.iter().map(|i| T::try_from_u32(*i as u32)).collect(),
U16(s) => s.iter().map(|i| T::try_from_u32(*i as u32)).collect(),
U32(s) => s.iter().map(|i| T::try_from_u32(*i as u32)).collect(),
};
let data = data.map_err(|_| Error::FromDeserialized {
reason: "Could not parse data array to typed array",
})?;
CodePointTrie::<T>::try_new(header, index, data)
}
}

And then replace all uses of the following constant with it:

const DATA_GET_ERROR_VALUE: Self;

We also probably need a pub error_value() function that returns this.

@sffc sffc removed the help wanted Issue needs an assignee label Jun 9, 2022
@sffc
Copy link
Member Author

sffc commented Jul 28, 2022

@pandusonu2 What is the status of this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries good first issue Good for newcomers S-small Size: One afternoon (small bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt
Projects
None yet
4 participants