Implement data transformer for Script_Extensions map data #1353

Merged: 58 commits, Dec 16, 2021
1de0ca5
Add initial code for data struct and related types for Script_Extensions
echeran Dec 1, 2021
5dbdc99
Random minor Rust doc string fixes
echeran Dec 1, 2021
bf53c8f
Add initial code for data provider for Script / Script_Extensions
echeran Dec 2, 2021
abe3ad4
Attempt to satisfy ULE trait impl for Script newtype/enum
echeran Dec 2, 2021
8418325
Revert Script AsULE impl; Use the associated ULE type directly in the…
echeran Dec 2, 2021
4a35625
Code from Iain to implement .get for a VZV<[T]> that returns ZV<T>
echeran Dec 2, 2021
4e80ac4
Add ule module to zerovec
Manishearth Dec 3, 2021
2fceaff
Add ZeroVecULE
Manishearth Dec 3, 2021
52323b3
Add VarULE impl and ser/de
Manishearth Dec 3, 2021
7ea5b5f
Add other impls
Manishearth Dec 3, 2021
11f6675
Add Ord impls
Manishearth Dec 3, 2021
255b01e
Move VarULE impls over to ZeroVecULE
Manishearth Dec 3, 2021
4bf4824
Add Eq impl to VarZeroVec
Manishearth Dec 3, 2021
05ac46a
docs
Manishearth Dec 3, 2021
2a977a8
docs
Manishearth Dec 3, 2021
b56308e
import
Manishearth Dec 3, 2021
a64c7b2
update blob data provider
Manishearth Dec 3, 2021
f009e86
Merge branch 'zv-ule' into script-ext-data
echeran Dec 3, 2021
76e87b1
Use VZV<ZeroVecULE<Script>> for Script_Extensions
echeran Dec 3, 2021
54ff899
Apply cargo fmt changes
echeran Dec 3, 2021
ed30f01
Create skeleton of data provider code; perform some helpful refactorings
echeran Dec 4, 2021
a33c62a
Fill out the source data -> VZV conversion for ScriptExtensions
echeran Dec 4, 2021
888cf26
Remove the .get_as_zv(usize) for VZV after being superseded by new Ze…
echeran Dec 4, 2021
6722ead
Add function to get Script value from Script_Extensions data struct v…
echeran Dec 5, 2021
cdf3565
Apply cargo fmt changes
echeran Dec 5, 2021
12cf565
Fix bit shifting logic for ScriptWithExt
echeran Dec 5, 2021
646be8d
Apply cargo fmt changes
echeran Dec 6, 2021
64764e0
Move impl AsULE for ScriptWithExt back to ule.rs to be consistent wit…
echeran Dec 6, 2021
ad410dd
Add getter for script extensions array
echeran Dec 6, 2021
ddce4e1
Typo fix
echeran Dec 6, 2021
6dd5ca3
Apply suggestions from code review
echeran Dec 8, 2021
24bba86
Make fields of ScriptExtensions struct private
echeran Dec 9, 2021
695783e
Combine duplicate branch body code in nested if/else
echeran Dec 9, 2021
3c51411
Merge branch 'main' into script-ext-data
echeran Dec 10, 2021
c87c93b
Use default for DataResponseMetadata in provider response
echeran Dec 10, 2021
63f2f0c
Add .get_ule() method to CodePointTrie and refactor .get()
echeran Dec 10, 2021
90bff89
Hoist data provider for Script / Script_Extensions for export within …
echeran Dec 10, 2021
140ef0b
Fix typo
echeran Dec 10, 2021
a250d0a
Fix property newtype name used to do data transform
echeran Dec 10, 2021
ad00ecf
Replace debug_assert! with return of Err(MissingResourceKey)
echeran Dec 10, 2021
6b5db19
Apply suggestions from code review
echeran Dec 13, 2021
3bafe8e
Revert changes after eager merge of PR branch for #1357 that were rem…
echeran Dec 13, 2021
a7d828b
Removed unneeded line in examples code
echeran Dec 13, 2021
f004b71
Rename provider_uprops serde module for Script_Extensions
echeran Dec 13, 2021
4009be3
Merge branch 'main' into script-ext-data
echeran Dec 14, 2021
f1c6000
Attempt to replace ZeroVecULE with ZeroSlice, but unsure how to pass …
echeran Dec 14, 2021
1618f47
Attempt to improve VarZeroVec construction from Vec<Vec<T>>
echeran Dec 14, 2021
dc3543b
Convert from Script::ULE to ScriptWithExt::ULE directly since it is t…
echeran Dec 15, 2021
0b3eef9
Fix code to make get_script_extensions_val() return a ZeroSlice ref
echeran Dec 16, 2021
94aceba
Apply cargo fmt changes
echeran Dec 16, 2021
8d7b803
Revert attempt to simplify / make more type-safe the data transformer…
echeran Dec 16, 2021
fff3be9
Fix unit tests, apply cargo fmt and clippy changes
echeran Dec 16, 2021
9ee05fa
Add TODO for transformer simplification
echeran Dec 16, 2021
a53b78f
Fix Script_Extensions source data file after running icuexportdata wi…
echeran Dec 16, 2021
19b9f4d
WIP (find a From impl for VZV to match Vec<Vec<T>>)
echeran Dec 16, 2021
0462c73
Simplify construction of VZV<ZeroSlice<Script>> from Vec<Vec<u16>>
echeran Dec 16, 2021
c12315f
Impl Default for ZeroSlice; clean up get_script_extensions_val()
echeran Dec 16, 2021
4137848
Merge branch 'main' into script-ext-data
echeran Dec 16, 2021
1 change: 1 addition & 0 deletions components/properties/src/lib.rs
@@ -74,6 +74,7 @@ mod error;
pub mod maps;
mod props;
pub mod provider;
pub mod script;
pub mod sets;
mod trievalue;
mod ule;
2 changes: 1 addition & 1 deletion components/properties/src/maps.rs
@@ -98,7 +98,7 @@ where
}

/// Return a [`CodePointTrie`] for the East_Asian_Width Unicode enumerated
- /// property. See [`East_Asian_Width`].
+ /// property. See [`EastAsianWidth`].
///
/// # Example
///
28 changes: 28 additions & 0 deletions components/properties/src/provider.rs
@@ -6,6 +6,7 @@
//!
//! Read more about data providers: [`icu_provider`]

use crate::script::ScriptExtensions;
use icu_codepointtrie::{CodePointTrie, TrieValue};
use icu_provider::yoke::{self, *};
use icu_uniset::UnicodeSet;
@@ -331,6 +332,16 @@ pub mod key {
        (SENTENCE_BREAK_V1, "SB"),

    );

    define_resource_keys!(ALL_SCRIPT_EXTENSIONS_KEYS; 1;
        //
        // Script_Extensions + Script data
        //

        // ResourceKey subcategory string is the short alias of Script_Extensions

        (SCRIPT_EXTENSIONS_V1, "scx"),
    );
}

//
@@ -408,3 +419,20 @@ pub struct UnicodePropertyMapV1Marker<T: TrieValue> {
impl<T: TrieValue> icu_provider::DataMarker for UnicodePropertyMapV1Marker<T> {
    type Yokeable = UnicodePropertyMapV1<'static, T>;
}

//
// Script_Extensions
//

/// A data structure efficiently storing `Script` and `Script_Extensions` property data.
#[icu_provider::data_struct]
#[derive(Debug, Eq, PartialEq)]
#[cfg_attr(
    feature = "provider_serde",
    derive(serde::Serialize, serde::Deserialize)
)]
pub struct ScriptExtensionsPropertyV1<'data> {
    /// A special data structure for `Script` and `Script_Extensions`.
    #[cfg_attr(feature = "provider_serde", serde(borrow))]
    pub data: ScriptExtensions<'data>,
}
193 changes: 193 additions & 0 deletions components/properties/src/script.rs
@@ -0,0 +1,193 @@
// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

//! Data and APIs for supporting both Script and Script_Extensions property
//! values in an efficient structure.

use crate::error::PropertiesError;
use crate::props::Script;

use icu_codepointtrie::{CodePointTrie, TrieValue};
use icu_provider::yoke::{self, *};
use zerovec::{VarZeroVec, ZeroSlice};

#[cfg(feature = "serde")]
use serde::{Deserialize, Serialize};

const SCRIPT_X_SCRIPT_VAL: u16 = 0x03FF;
const SCRIPT_VAL_LENGTH: u16 = 10;

/// An internal-use only pseudo-property that represents the values stored in
/// the trie of the special data structure [`ScriptExtensions`].
///
/// Note: The values will assume a 12-bit layout. The 2 higher order bits in positions
/// 11..10 will indicate how to deduce the Script value and Script_Extensions,
/// and the lower 10 bits 9..0 indicate either the Script value or the index
/// into the `extensions` structure.
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[repr(transparent)]
pub struct ScriptWithExt(pub u16);

#[allow(missing_docs)] // These constants don't need individual documentation.
#[allow(non_upper_case_globals)]
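impl ScriptWithExt {
    pub const Unknown: ScriptWithExt = ScriptWithExt(0);
}

impl ScriptWithExt {
    /// Returns true if the 2 high-order bits (11..10) equal 1, meaning
    /// Script=Common and the lower 10 bits index into `extensions`.
    pub fn is_common(&self) -> bool {
        self.0 >> SCRIPT_VAL_LENGTH == 1
    }

    /// Returns true if the 2 high-order bits (11..10) equal 2, meaning
    /// Script=Inherited and the lower 10 bits index into `extensions`.
    pub fn is_inherited(&self) -> bool {
        self.0 >> SCRIPT_VAL_LENGTH == 2
    }

    /// Returns true if the 2 high-order bits (11..10) equal 3, meaning the
    /// Script value is the first element of the indexed `extensions` sub-array.
    pub fn is_other(&self) -> bool {
        self.0 >> SCRIPT_VAL_LENGTH == 3
    }
}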

/// A data structure that represents the data for both Script and
/// Script_Extensions properties in an efficient way. This structure matches
/// the data and data structures that are stored in the corresponding ICU data
/// file for these properties.
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[derive(Debug, Eq, PartialEq, Yokeable, ZeroCopyFrom)]
pub struct ScriptExtensions<'data> {
    /// Note: The `ScriptWithExt` values in this array will assume a 12-bit layout. The 2
    /// higher order bits 11..10 will indicate how to deduce the Script value and
    /// Script_Extensions value, nearly matching the representation
    /// [in ICU](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/uprops.h):
    ///
    /// | High order 2 bits value | Script                                                 | Script_Extensions                                              |
    /// |-------------------------|--------------------------------------------------------|----------------------------------------------------------------|
    /// | 3                       | First value in sub-array, index given by lower 10 bits | Sub-array excluding first value, index given by lower 10 bits  |
    /// | 2                       | Script=Inherited                                       | Entire sub-array, index given by lower 10 bits                 |
    /// | 1                       | Script=Common                                          | Entire sub-array, index given by lower 10 bits                 |
    /// | 0                       | Value in lower 10 bits                                 | `[ Script value ]` single-element array                        |
    ///
    /// When the lower 10 bits of the value are used as an index, that index is
    /// used for the outer-level vector of the nested `extensions` structure.
    #[cfg_attr(feature = "serde", serde(borrow))]
    trie: CodePointTrie<'data, ScriptWithExt>,

    /// This companion structure stores Script_Extensions values, which are
    /// themselves arrays / vectors. This structure only stores the values for
    /// cases in which `scx(cp) != [ sc(cp) ]`. Each sub-vector is distinct. The
    /// sub-vector represents the Script_Extensions array value for a code point,
    /// and may also indicate Script value, as described for the `trie` field.
    #[cfg_attr(feature = "serde", serde(borrow))]
    extensions: VarZeroVec<'data, ZeroSlice<Script>>,
}

impl<'data> ScriptExtensions<'data> {
    /// Creates a [`ScriptExtensions`] from its `trie` and `extensions` parts.
    pub fn try_new(
        trie: CodePointTrie<'data, ScriptWithExt>,
        extensions: VarZeroVec<'data, ZeroSlice<Script>>,
    ) -> Result<ScriptExtensions<'data>, PropertiesError> {
        // TODO: do validation here

        Ok(ScriptExtensions { trie, extensions })
    }

    /// Returns the `Script` property value for the given code point.
    pub fn get_script_val(&self, code_point: u32) -> Script {
        let sc_with_ext = self.trie.get(code_point);

        if sc_with_ext.is_other() {
            let ext_idx = sc_with_ext.0 & SCRIPT_X_SCRIPT_VAL;
            let scx_val = self.extensions.get(ext_idx as usize);
            let scx_first_sc = scx_val.and_then(|scx| scx.get(0));

            let default_sc_val = <Script as TrieValue>::DATA_GET_ERROR_VALUE;

            scx_first_sc.unwrap_or(default_sc_val)
        } else if sc_with_ext.is_common() {
            Script::Common
        } else if sc_with_ext.is_inherited() {
            Script::Inherited
        } else {
            let script_val = sc_with_ext.0 & SCRIPT_X_SCRIPT_VAL;
            Script(script_val)
        }
    }

    /// Returns the `Script_Extensions` property value for the given code point.
    pub fn get_script_extensions_val(&self, code_point: u32) -> &ZeroSlice<Script> {
        let sc_with_ext = self.trie.get(code_point);

        if sc_with_ext.is_other() {
            let ext_idx = sc_with_ext.0 & SCRIPT_X_SCRIPT_VAL;
            let ext_subarray = self.extensions.get(ext_idx as usize);
            // In the OTHER case, where the 2 higher-order bits of the
            // `ScriptWithExt` value in the trie don't indicate the Script value,
            // the Script value is copied/inserted into the first position of the
            // `extensions` array. So we must remove it to return the actual scx array val.
            let scx_slice = ext_subarray
                .and_then(|zslice| zslice.as_ule_slice().get(1..))
                .unwrap_or_default();
            ZeroSlice::from_ule_slice(scx_slice)
        } else if sc_with_ext.is_common() || sc_with_ext.is_inherited() {
            let ext_idx = sc_with_ext.0 & SCRIPT_X_SCRIPT_VAL;
            let scx_val = self.extensions.get(ext_idx as usize);
            scx_val.unwrap_or_default()
        } else {
            // Plain case: the trie value itself is the Script value, so return
            // it as a single-element slice.
            let script_with_ext_ule = self.trie.get_ule(code_point);
            let script_ule_slice = script_with_ext_ule
                .map(|swe| core::slice::from_ref(swe))
                .unwrap_or_default();
            ZeroSlice::from_ule_slice(script_ule_slice)
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_is_common() {
        assert!(ScriptWithExt(0x04FF).is_common());
        assert!(ScriptWithExt(0x0400).is_common());

        assert!(!ScriptWithExt(0x08FF).is_common());
        assert!(!ScriptWithExt(0x0800).is_common());

        assert!(!ScriptWithExt(0x0CFF).is_common());
        assert!(!ScriptWithExt(0x0C00).is_common());

        assert!(!ScriptWithExt(0xFF).is_common());
        assert!(!ScriptWithExt(0x0).is_common());
    }

    #[test]
    fn test_is_inherited() {
        assert!(!ScriptWithExt(0x04FF).is_inherited());
        assert!(!ScriptWithExt(0x0400).is_inherited());

        assert!(ScriptWithExt(0x08FF).is_inherited());
        assert!(ScriptWithExt(0x0800).is_inherited());

        assert!(!ScriptWithExt(0x0CFF).is_inherited());
        assert!(!ScriptWithExt(0x0C00).is_inherited());

        assert!(!ScriptWithExt(0xFF).is_inherited());
        assert!(!ScriptWithExt(0x0).is_inherited());
    }

    #[test]
    fn test_is_other() {
        assert!(!ScriptWithExt(0x04FF).is_other());
        assert!(!ScriptWithExt(0x0400).is_other());

        assert!(!ScriptWithExt(0x08FF).is_other());
        assert!(!ScriptWithExt(0x0800).is_other());

        assert!(ScriptWithExt(0x0CFF).is_other());
        assert!(ScriptWithExt(0x0C00).is_other());

        assert!(!ScriptWithExt(0xFF).is_other());
        assert!(!ScriptWithExt(0x0).is_other());
    }
}
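
The 12-bit layout documented on the `trie` field in `script.rs` can be modeled without the `icu_codepointtrie` or `zerovec` crates. The following is a minimal sketch, not the PR's implementation: it takes the already-looked-up trie value as a plain `u16`, uses `&[Vec<u16>]` in place of `VarZeroVec<ZeroSlice<Script>>`, and returns `None` or an empty vector where the real code falls back to a default. The names `pack`, `script_of`, `script_extensions_of`, `LOW_MASK`, `LOW_BITS`, `COMMON`, and `INHERITED` are hypothetical, and the numeric codes for Common and Inherited are assumptions made for illustration.

```rust
// Sketch only: a std-only model of the 12-bit ScriptWithExt layout.
// All names here are hypothetical stand-ins, not ICU4X API.

/// Mask for the lower 10 bits (mirrors `SCRIPT_X_SCRIPT_VAL` in the diff).
const LOW_MASK: u16 = 0x03FF;
/// Number of low bits below the 2-bit tag (mirrors `SCRIPT_VAL_LENGTH`).
const LOW_BITS: u16 = 10;

/// Assumed script codes for this sketch.
const COMMON: u16 = 0;
const INHERITED: u16 = 1;

/// Packs a 2-bit tag (bits 11..10) and a 10-bit payload (bits 9..0).
fn pack(tag: u16, payload: u16) -> u16 {
    ((tag & 0b11) << LOW_BITS) | (payload & LOW_MASK)
}

/// Script value for a trie value, following the table's "Script" column.
fn script_of(trie_value: u16, extensions: &[Vec<u16>]) -> Option<u16> {
    let payload = trie_value & LOW_MASK;
    match trie_value >> LOW_BITS {
        // Tag 3: the Script is the first element of the indexed sub-array.
        3 => extensions.get(payload as usize)?.first().copied(),
        1 => Some(COMMON),
        2 => Some(INHERITED),
        // Otherwise the payload is the Script value itself.
        _ => Some(payload),
    }
}

/// Script_Extensions for a trie value, following the table's last column.
fn script_extensions_of(trie_value: u16, extensions: &[Vec<u16>]) -> Vec<u16> {
    let payload = trie_value & LOW_MASK;
    let sub_array = extensions
        .get(payload as usize)
        .cloned()
        .unwrap_or_default();
    match trie_value >> LOW_BITS {
        // Tag 3: drop the leading Script value to get the real scx array.
        3 => sub_array.into_iter().skip(1).collect(),
        // Tags 1 and 2: the whole sub-array is the scx array.
        1 | 2 => sub_array,
        // Otherwise scx is the single-element array [Script].
        _ => vec![payload],
    }
}
```

With `extensions = [[10, 20], [30, 40, 50]]`, a trie value of `pack(3, 1)` would yield Script 30 and extensions `[40, 50]` in this model, while `pack(1, 0)` would yield Common with extensions `[10, 20]`.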
10 changes: 10 additions & 0 deletions components/properties/src/trievalue.rs
@@ -2,6 +2,7 @@
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

use crate::script::ScriptWithExt;
use crate::{
CanonicalCombiningClass, EastAsianWidth, GeneralCategory, GraphemeClusterBreak, LineBreak,
Script, SentenceBreak, WordBreak,
@@ -41,6 +42,15 @@ impl TrieValue for Script {
    }
}

impl TrieValue for ScriptWithExt {
    const DATA_GET_ERROR_VALUE: ScriptWithExt = ScriptWithExt::Unknown;
    type TryFromU32Error = TryFromIntError;

    fn try_from_u32(i: u32) -> Result<Self, Self::TryFromU32Error> {
        u16::try_from(i).map(Self)
    }
}

impl TrieValue for EastAsianWidth {
    const DATA_GET_ERROR_VALUE: EastAsianWidth = EastAsianWidth::Neutral;
    type TryFromU32Error = TryFromIntError;
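
The `TrieValue` impl for `ScriptWithExt` above relies on `u16::try_from(u32)` rejecting values that do not fit in 16 bits. Here is a small std-only sketch of that behavior. The trait `FromTrieU32`, the type `Packed`, and the helper `decode_or_default` are hypothetical stand-ins; in particular, falling back to the error value on a failed conversion is an assumption made for illustration and is not taken from the PR.

```rust
use std::convert::TryFrom;
use std::num::TryFromIntError;

/// Hypothetical stand-in for `ScriptWithExt`.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct Packed(u16);

/// Hypothetical, simplified mirror of the `TrieValue` trait shape in the diff.
trait FromTrieU32: Sized {
    const ERROR_VALUE: Self;
    fn try_from_u32(i: u32) -> Result<Self, TryFromIntError>;
}

impl FromTrieU32 for Packed {
    const ERROR_VALUE: Packed = Packed(0);

    fn try_from_u32(i: u32) -> Result<Self, TryFromIntError> {
        // Fails for any input above u16::MAX instead of truncating.
        u16::try_from(i).map(Packed)
    }
}

/// Assumed fallback policy for this sketch: use the error value on failure.
fn decode_or_default<T: FromTrieU32>(raw: u32) -> T {
    T::try_from_u32(raw).unwrap_or(T::ERROR_VALUE)
}
```

Using a checked conversion here means an out-of-range source value surfaces as an error rather than silently losing its high bits.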
15 changes: 15 additions & 0 deletions components/properties/src/ule.rs
@@ -2,6 +2,7 @@
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

use crate::script::ScriptWithExt;
use crate::{
CanonicalCombiningClass, EastAsianWidth, GeneralCategory, GraphemeClusterBreak, LineBreak,
Script, SentenceBreak, WordBreak,
@@ -78,6 +79,20 @@ impl AsULE for Script {
    }
}

impl AsULE for ScriptWithExt {
    type ULE = PlainOldULE<2>;

    #[inline]
    fn as_unaligned(self) -> Self::ULE {
        PlainOldULE(self.0.to_le_bytes())
    }

    #[inline]
    fn from_unaligned(unaligned: Self::ULE) -> Self {
        ScriptWithExt(u16::from_le_bytes(unaligned.0))
    }
}

impl AsULE for EastAsianWidth {
    type ULE = u8;
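
The `AsULE` impl for `ScriptWithExt` above stores the `u16` as two little-endian bytes so it can be read from any byte offset. The round trip can be sketched with the standard library alone. `RawBytes2` stands in for `PlainOldULE<2>`, and `PackedValue` and `read_at` are hypothetical names used only for this illustration.

```rust
/// Hypothetical stand-in for `PlainOldULE<2>`: two bytes, alignment 1.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct RawBytes2([u8; 2]);

/// Hypothetical stand-in for `ScriptWithExt`.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct PackedValue(u16);

impl PackedValue {
    /// Converts to the unaligned little-endian form, as in the diff.
    fn as_unaligned(self) -> RawBytes2 {
        RawBytes2(self.0.to_le_bytes())
    }

    /// Converts back from the unaligned little-endian form.
    fn from_unaligned(unaligned: RawBytes2) -> Self {
        PackedValue(u16::from_le_bytes(unaligned.0))
    }
}

/// Reads the `index`-th 2-byte element of `bytes`, at any byte alignment.
fn read_at(bytes: &[u8], index: usize) -> Option<PackedValue> {
    let start = index.checked_mul(2)?;
    let end = start.checked_add(2)?;
    let chunk = bytes.get(start..end)?;
    Some(PackedValue::from_unaligned(RawBytes2([chunk[0], chunk[1]])))
}
```

Because the stored form is a byte array, `read_at` works on a slice that starts at an odd offset, which is the property that lets serialized data be borrowed without copying.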
