Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict Script (and ScriptWithExt) to specific range #1404

Open
1 of 3 tasks
sffc opened this issue Dec 15, 2021 · 6 comments
Open
1 of 3 tasks

Restrict Script (and ScriptWithExt) to specific range #1404

sffc opened this issue Dec 15, 2021 · 6 comments
Labels
blocked A dependency must be resolved before this is actionable C-unicode Component: Props, sets, tries help wanted Issue needs an assignee S-small Size: One afternoon (small bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt

Comments

@sffc
Copy link
Member

sffc commented Dec 15, 2021

Currently both Script and ScriptWithExt use type ULE = PlainOldULE<2>. However, ScriptWithExt assumes that Script is < 1024. We should consider strengthening the requirements on Script and ScriptWithExt.

One way to do this would be to:

  1. Make the u16 field private, and ensure it is always in range when constructing
  2. Add a common ULE type BoundedBytesULE<const N: usize, const L: usize, const U: usize> that bounds a ULE integer to a specific inclusive range between L and U
  3. Make Script and ScriptWithExt use that new common ULE type

Advantages of making this change:

  • More correct and type-safe
  • Prevents errors when converting between Script and ScriptWithExt

Disadvantages:

  • May be considered overdesigning, since the additional types have limited utility
  • Adds a ULE constrait that may impact the deserialization speed of CPTs containing Script

Needs feedback from:

@sffc sffc added C-unicode Component: Props, sets, tries v1 T-techdebt Type: ICU4X code health and tech debt S-small Size: One afternoon (small bug fix or enhancement) labels Dec 15, 2021
@sffc sffc self-assigned this Dec 15, 2021
@echeran
Copy link
Contributor

echeran commented Dec 15, 2021

FYI for others, ScriptWithExt is being introduced in #1353. It is a newtype for the values in a code point trie that, together with a (nested) array of Script values, can represent the property data for both Script and Script_Extensions efficiently space-wise. This space savings is due to the minimal extra information beyond Script that is needed for Script_Extensions. The trie in this data structure is almost the same as a regular trie for the Script proeprty, except that for each trie value, 2 of the higher-order bits are set to allow short-circuiting some of the lookups for Script and/or Script_Extensions values.

ScriptWithExt isn't a property, but because it occurs in a CPT, it is written similarly and needs impls of the same traits that other enums/newtypes need when they occur as values within a CPT.

I agree that the proposal above could address the problem, but I wonder what the utility of this would be -- in other words, how many other "pseudo-property" types are we creating to help support our data structure optimizations by representing bit-fiddly versions of real data? Could such a new bound be applied in other scenarios? -- I can't think of any because newtypes are defined to not have a known upper bound (beyond the max value of the wrapper type) in order to allow for the possibility that future [enumerated, numeric, etc.] values are defined in the future for the property that we can't yet know about in the present.

So if we only have ScriptWithExt and maybe one or two other situations, it seems like a lot of effort to be type-safe but with little payoff if it is only used in a handful of situations that are mostly internal details anyways.

@Manishearth
Copy link
Member

Manishearth commented Dec 16, 2021

My understanding is that more script codes can be introduced in the future, right? Or are you considering introducing bounds at the 1024 level which is (i guess?) the absolute max available.

I'm kinda ambivalent, I like the idea of having stricter validation here but I don't want to make things harder to use. I'm not really sure if this makes things harder to use, though.

@sffc
Copy link
Member Author

sffc commented Dec 16, 2021

@markusicu says there is an intrinsic maximum of around 500 script codes that could possibly be encoded, and that 1024 is more than we should ever need.

@sffc sffc added the needs-approval One or more stakeholders need to approve proposal label Jan 6, 2022
@iainireland
Copy link
Contributor

2 bits for the metadata plus 10 bits for the script code is only 12 out of 16 bits, so we have 4 bits of leeway here.

Given that our current margin of safety is already 2x the intrinsic maximum, I don't think there's a huge return on investment for adding type-system restrictions here. At most, I would consider moving the 2 bits of metadata up to the highest bits and officially reserving the 4 bits in between, so that they can be used as either script code or metadata if the future surprises us.

@sffc
Copy link
Member Author

sffc commented Jan 9, 2022

Okay, fair enough.

I think a better approach here anyway may be to apply bitpacking, which I filed as a separate issue, #1487.

I'm going to move this issue to the backlog and block it on #1487.

@sffc sffc added backlog blocked A dependency must be resolved before this is actionable and removed needs-approval One or more stakeholders need to approve proposal labels Jan 9, 2022
@sffc sffc removed their assignment Jan 9, 2022
@sffc sffc added the help wanted Issue needs an assignee label Jan 9, 2022
@markusicu
Copy link
Member

@echeran deliberately used only the lower 12 bits of the u16 value, so that if in the future we add a 12-bit value width to the code point trie, the ScriptWithExt values need not be re-encoded. So unless there is a specific benefit, I wouldn't move the special bits up.

A CPT with 12-bit value width would be relatively easy, but trading off size vs. performance because reading a data value would require reading/combining two bytes and applying some bit masking/shifting, rather than just reading a u8/u16/u32.

@sffc sffc removed the v1 label Apr 1, 2022
@sffc sffc added this to the Backlog milestone Dec 22, 2022
@sffc sffc removed the backlog label Dec 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked A dependency must be resolved before this is actionable C-unicode Component: Props, sets, tries help wanted Issue needs an assignee S-small Size: One afternoon (small bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt
Projects
None yet
Development

No branches or pull requests

5 participants