CodePointTrie data provider #1167

Merged
merged 32 commits into from
Oct 21, 2021
Changes from 9 commits
Commits
d1ca58c
Rename TrieTypeEnum to TrieType
iainireland Oct 7, 2021
2731299
Implement Yokeable/ZeroCopyFrom for CodePointTrie and data struct
iainireland Oct 7, 2021
3519819
Cargo fmt + minor fixes
iainireland Oct 7, 2021
9142762
Add CPT struct to icu_provider_uprops data source struct
echeran Oct 7, 2021
7ce722e
Renames data providers for UnicodeSet data ahead of introducing one f…
echeran Oct 8, 2021
7784c1d
Matches CPT version to project/sub-crates, adds CPT as dep to provide…
echeran Oct 11, 2021
1f6d7e7
Add WIP code for data provider for CodePointTrie data
echeran Oct 11, 2021
3a890f5
More WIP code for CodePointTrie data provider implementation
echeran Oct 12, 2021
888edc5
Fix error
Manishearth Oct 12, 2021
4b6c986
Merge branch 'main' into cpt-data-transformer
echeran Oct 12, 2021
1186512
Simplify constructing ZeroVec using ZV's new FromIterator impl
echeran Oct 12, 2021
b67d033
Merge branch 'main' into cpt-data-transformer
echeran Oct 19, 2021
4abb8a4
Merge current snapshot of PR #1153 (refactor properties to separate c…
echeran Oct 19, 2021
ee4e8e9
Update path to uniset crate in CI job for benchmarking
echeran Oct 19, 2021
69fc6e1
Merge branch 'main' into cpt-data-transformer
echeran Oct 20, 2021
c543200
Implement TrieValue for GeneralSubcategory
echeran Oct 12, 2021
40381be
Implement TrieValue for Script
echeran Oct 20, 2021
d4bbcbd
Rename TrieValue trait's associate type for Result errors
echeran Oct 20, 2021
997db93
Remove unneeded dependency
echeran Oct 20, 2021
28b51c0
Revert version number of icu_codepointtrie
echeran Oct 20, 2021
e10eb66
Move data structs for UnicodePropertyMap from icu_codepointtrie to ic…
echeran Oct 20, 2021
5b32a4d
Error message rewording
echeran Oct 20, 2021
499b817
Finish reverting unneeded renaming/refactoring in icu_properties
echeran Oct 20, 2021
7139775
Add docstrings for the uprops data providers
echeran Oct 20, 2021
fd0f748
Add test for Script using data provider for CodePointTrie data
echeran Oct 20, 2021
4ce245c
Export CPT data provider symbol publicly
echeran Oct 20, 2021
b8aada4
Merge branch 'main' into cpt-data-transformer
echeran Oct 20, 2021
8aca45a
Declare no_std for icu_codepointtrie
echeran Oct 20, 2021
a39dd49
Add `extern crate...` to import alloc libs
echeran Oct 20, 2021
3689a11
Remove unused custom code for string -> enum conversion
echeran Oct 21, 2021
f24108d
Replace icu_provider dep with yoke, remove std feature in icu_codepoi…
echeran Oct 21, 2021
0a32faf
Add derive feature to yoke dependency in icu_codepointtrie
echeran Oct 21, 2021
6 changes: 5 additions & 1 deletion Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 16 additions & 2 deletions components/uniset/src/enum_props.rs
@@ -5,17 +5,31 @@
//! A collection of enums for enumerated properties.

use num_enum::{TryFromPrimitive, UnsafeFromPrimitive};
use tinystr::TinyStr16;

/// Selection constants for Unicode properties.
/// These constants are used to select one of the Unicode properties.
/// See UProperty in ICU4C.
#[derive(Clone, PartialEq, Debug)]
#[derive(Clone, PartialEq, Debug, TryFromPrimitive)]
#[allow(missing_docs)] // TODO(#1030) - Add missing docs.
#[non_exhaustive]
#[repr(i32)]
pub enum EnumeratedProperty {
GeneralCategory = 0x1005,
Script = 0x100A,
ScriptExtensions = 0x7000,
ScriptExtensions = 0x7000, // TODO(#1160) - this is a Miscellaneous property, not Enumerated
InvalidCode = -1, // TODO(#1160) - taken from ICU4C UProperty::UCHAR_INVALID_CODE
}
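
The added `TryFromPrimitive` derive (from the `num_enum` crate) gives the enum a fallible conversion from its `i32` representation. A rough hand-written sketch of that behaviour, using only the standard library, might look like the following. The enum is re-declared locally for illustration, and the plain `i32` error type is a simplification; `num_enum` uses its own error type.

```rust
use std::convert::TryFrom;

// Local re-declaration of the enum from the diff, for illustration only.
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(i32)]
pub enum EnumeratedProperty {
    GeneralCategory = 0x1005,
    Script = 0x100A,
    ScriptExtensions = 0x7000,
    InvalidCode = -1,
}

// Hand-written approximation of a derived fallible conversion.
// Assumption: the error simply carries back the rejected integer.
impl TryFrom<i32> for EnumeratedProperty {
    type Error = i32;

    fn try_from(value: i32) -> Result<Self, Self::Error> {
        match value {
            0x1005 => Ok(EnumeratedProperty::GeneralCategory),
            0x100A => Ok(EnumeratedProperty::Script),
            0x7000 => Ok(EnumeratedProperty::ScriptExtensions),
            -1 => Ok(EnumeratedProperty::InvalidCode),
            other => Err(other),
        }
    }
}
```

Unknown discriminants are reported as errors rather than being silently mapped to a sentinel variant.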

impl From<&TinyStr16> for EnumeratedProperty {
fn from(prop_short_alias: &TinyStr16) -> Self {
Review comment (Member):

Do this one of two ways:

  1. Match on a &str rather than a &TinyStr16
  2. Match on a TinyStr16 by value, as described in "Recommendation for pattern matching" (zbraniecki/tinystr#22)
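
Option 1 could be sketched roughly as follows. This is a minimal, self-contained illustration, not the code that landed in the PR: the enum is re-declared locally with the variants shown in the diff, and the `From<&str>` impl is an assumption about what matching on a `&str` would look like.

```rust
// Local re-declaration of the enum from the diff, for illustration only.
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(i32)]
pub enum EnumeratedProperty {
    GeneralCategory = 0x1005,
    Script = 0x100A,
    ScriptExtensions = 0x7000,
    InvalidCode = -1,
}

// Assumed shape of option 1: take a plain &str so the match arms can be
// string literals, with no TinyStr16 in the signature.
impl From<&str> for EnumeratedProperty {
    fn from(prop_short_alias: &str) -> Self {
        match prop_short_alias {
            "gc" => EnumeratedProperty::GeneralCategory,
            "sc" => EnumeratedProperty::Script,
            "scx" => EnumeratedProperty::ScriptExtensions,
            _ => EnumeratedProperty::InvalidCode,
        }
    }
}
```

A caller that holds a `TinyStr16` would pass `alias.as_str()`, the same call the diff already uses inside the impl.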

Review comment (Member):

Also, please confirm that this impl is actually used. If it can be deleted, please delete it: when we implement string-to-property parsing, it should be data-driven, not hard-coded in this impl.

Review comment (Member):

Please fix.

match prop_short_alias.as_str() {
"gc" => EnumeratedProperty::GeneralCategory,
"sc" => EnumeratedProperty::Script,
"scx" => EnumeratedProperty::ScriptExtensions,
_ => EnumeratedProperty::InvalidCode,
}
}
}

/// Enumerated Unicode general category types.
2 changes: 1 addition & 1 deletion docs/tutorials/writing_a_new_data_struct.md
@@ -60,7 +60,7 @@ Examples of source data providers include:
- [`PluralsProvider`](https://unicode-org.github.io/icu4x-docs/doc/icu_provider_cldr/transform/struct.PluralsProvider.html)
- [`DateSymbolsProvider`](https://unicode-org.github.io/icu4x-docs/doc/icu_provider_cldr/transform/struct.DateSymbolsProvider.html)
- [&hellip; more examples](https://unicode-org.github.io/icu4x-docs/doc/icu_provider_cldr/transform/index.html)
- `BinaryPropertiesDataProvider`
- `BinaryPropertyUnicodeSetDataProvider`
- [`HelloWorldProvider`](https://unicode-org.github.io/icu4x-docs/doc/icu_provider/hello_world/struct.HelloWorldProvider.html)

Source data providers must implement the following traits:
2 changes: 2 additions & 0 deletions provider/uprops/Cargo.toml
@@ -28,10 +28,12 @@ all-features = true

[dependencies]
displaydoc = { version = "0.2.3", default-features = false }
icu_codepointtrie = { version = "0.3", path = "../../utils/codepointtrie", features = ["serde"] }
icu_provider = { version = "0.3", path = "../../provider/core", features = ["provider_serde"] }
icu_uniset = { version = "0.3", path = "../../components/uniset", features = ["provider_serde"] }
serde = { version = "1.0", features = ["derive"] }
toml = { version = "0.5" }
zerovec = { version = "0.3", path = "../../utils/zerovec", features = ["serde", "yoke"] }

[dev-dependencies]
icu_testdata = { version = "0.3", path = "../../provider/testdata" }
Original file line number Diff line number Diff line change
@@ -11,14 +11,14 @@ use icu_uniset::UnicodeSetBuilder;
use std::fs;
use std::path::PathBuf;

pub struct BinaryPropertiesDataProvider {
pub struct BinaryPropertyUnicodeSetDataProvider {
root_dir: PathBuf,
}

/// A data provider reading from .toml files produced by the ICU4C icuwriteuprops tool.
impl BinaryPropertiesDataProvider {
impl BinaryPropertyUnicodeSetDataProvider {
pub fn new(root_dir: PathBuf) -> Self {
BinaryPropertiesDataProvider { root_dir }
BinaryPropertyUnicodeSetDataProvider { root_dir }
}
fn get_toml_data(&self, name: &str) -> Result<uprops_serde::binary::Main, Error> {
let mut path: PathBuf = self.root_dir.clone().join(name);
@@ -28,7 +28,7 @@ impl BinaryPropertiesDataProvider {
}
}

impl<'data> DataProvider<'data, UnicodePropertyV1Marker> for BinaryPropertiesDataProvider {
impl<'data> DataProvider<'data, UnicodePropertyV1Marker> for BinaryPropertyUnicodeSetDataProvider {
fn load_payload(
&self,
req: &DataRequest,
@@ -54,11 +54,11 @@ impl<'data> DataProvider<'data, UnicodePropertyV1Marker> for BinaryPropertiesDat
}
}

icu_provider::impl_dyn_provider!(BinaryPropertiesDataProvider, {
icu_provider::impl_dyn_provider!(BinaryPropertyUnicodeSetDataProvider, {
_ => UnicodePropertyV1Marker,
}, SERDE_SE, 'data);

impl IterableDataProviderCore for BinaryPropertiesDataProvider {
impl IterableDataProviderCore for BinaryPropertyUnicodeSetDataProvider {
fn supported_options_for_key(
&self,
_resc_key: &ResourceKey,
@@ -74,7 +74,7 @@ fn test_basic() {
use std::convert::TryInto;

let root_dir = icu_testdata::paths::data_root().join("uprops");
let provider = BinaryPropertiesDataProvider::new(root_dir);
let provider = BinaryPropertyUnicodeSetDataProvider::new(root_dir);

let payload: DataPayload<'_, UnicodePropertyV1Marker> = provider
.load_payload(&DataRequest {
121 changes: 121 additions & 0 deletions provider/uprops/src/enum_codepointtrie.rs
@@ -0,0 +1,121 @@
// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

use crate::error::Error;
use crate::uprops_serde;
use crate::uprops_serde::enumerated::EnumeratedPropertyCodePointTrie;

use icu_codepointtrie::codepointtrie::{CodePointTrie, CodePointTrieHeader, TrieType, TrieValue};
use icu_codepointtrie::provider::{UnicodePropertyMapV1, UnicodePropertyMapV1Marker};
use icu_provider::prelude::*;
use icu_uniset::enum_props::EnumeratedProperty; // TODO(#1160) - Refactor property definitions out of UnicodeSet
use zerovec::ZeroVec;

use core::convert::TryFrom;

use std::fs;
use std::path::PathBuf;

pub struct EnumeratedPropertyCodePointTrieProvider {
root_dir: PathBuf,
}

impl EnumeratedPropertyCodePointTrieProvider {
pub fn new(root_dir: PathBuf) -> Self {
EnumeratedPropertyCodePointTrieProvider { root_dir }
}

fn get_toml_data(&self, name: &str) -> Result<uprops_serde::enumerated::Main, Error> {
let mut path: PathBuf = self.root_dir.clone().join(name);
path.set_extension("toml");
let toml_str = fs::read_to_string(&path).map_err(|e| Error::Io(e, path.clone()))?;
toml::from_str(&toml_str).map_err(|e| Error::Toml(e, path))
}
}
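
`get_toml_data` derives the file path by joining the root directory with the requested name and then setting the extension to `toml`. A minimal standalone sketch of just that path step is shown here. The free function `toml_path_for_property` is a hypothetical name introduced for illustration and is not part of the PR.

```rust
use std::path::{Path, PathBuf};

// Hypothetical helper mirroring the path logic in `get_toml_data`:
// <root_dir>/<prop_short_alias>.toml
fn toml_path_for_property(root_dir: &Path, prop_short_alias: &str) -> PathBuf {
    let mut path = root_dir.join(prop_short_alias);
    // `set_extension` replaces any existing extension, so an alias that
    // contains a dot would lose everything after its last dot.
    path.set_extension("toml");
    path
}
```

Because `set_extension` replaces an existing extension rather than appending, this approach relies on property short aliases such as `gc` or `sc` not containing a dot.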

impl<T: TrieValue> TryFrom<uprops_serde::enumerated::EnumeratedPropertyCodePointTrie>
for UnicodePropertyMapV1<'static, T>
{
type Error = DataError;

fn try_from(
cpt_data: EnumeratedPropertyCodePointTrie,
) -> Result<UnicodePropertyMapV1<'static, T>, DataError> {
let trie_type_enum: TrieType =
TrieType::try_from(cpt_data.trie_type_enum_val).map_err(DataError::new_resc_error)?;
let header = CodePointTrieHeader {
high_start: cpt_data.high_start,
shifted12_high_start: cpt_data.shifted12_high_start,
index3_null_offset: cpt_data.index3_null_offset,
data_null_offset: cpt_data.data_null_offset,
null_value: cpt_data.null_value,
trie_type: trie_type_enum,
};
let index: ZeroVec<u16> = ZeroVec::clone_from_slice(&cpt_data.index);
// TODO: make data have type ZeroVec<T>
//
let data: Result<Vec<T::ULE>, String> = if let Some(data_8) = cpt_data.data_8 {
data_8
.iter()
.map(|i| *i as u32)
.map(|i| T::parse_from_u32(i).map(|i| i.as_unaligned()))
.collect()
} else if let Some(data_16) = cpt_data.data_16 {
data_16
.iter()
.map(|i| *i as u32)
.map(|i| T::parse_from_u32(i).map(|i| i.as_unaligned()))
.collect()
} else if let Some(data_32) = cpt_data.data_32 {
data_32
.iter()
.map(|i| *i as u32)
.map(|i| T::parse_from_u32(i).map(|i| i.as_unaligned()))
.collect()
} else {
return Err(DataError::new_resc_error(
icu_codepointtrie::error::Error::FromDeserialized {
reason: "Cannot deserialize data array for CodePointTrie in TOML",
},
));
};

let data = ZeroVec::Owned(data.map_err(DataError::new_resc_error)?);
let trie = CodePointTrie::<T>::try_new(header, index, data)
.map_err(DataError::new_resc_error);
trie.map(|t| UnicodePropertyMapV1 { codepoint_trie: t })
}
}
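
The three branches in `try_from` above differ only in the element width of the source array: each widens its elements to `u32` and then parses them. One possible way to factor out the shared step is a generic helper, sketched here under stated assumptions. `FromU32` is a hypothetical stand-in for the `TrieValue` trait, `widen_and_parse` and `parse_data` are hypothetical names, and the ULE conversion and `ZeroVec` construction from the diff are omitted.

```rust
// Hypothetical stand-in for `TrieValue::parse_from_u32` in the diff.
trait FromU32: Sized {
    fn parse_from_u32(i: u32) -> Result<Self, String>;
}

// Example implementation for u8: reject values that do not fit.
impl FromU32 for u8 {
    fn parse_from_u32(i: u32) -> Result<Self, String> {
        if i <= u8::MAX as u32 {
            Ok(i as u8)
        } else {
            Err(format!("value {} does not fit in u8", i))
        }
    }
}

// Widen each element to u32, then parse it; stops at the first error.
fn widen_and_parse<T, I>(values: &[I]) -> Result<Vec<T>, String>
where
    T: FromU32,
    I: Copy + Into<u32>,
{
    values
        .iter()
        .map(|&v| T::parse_from_u32(v.into()))
        .collect()
}

// Pick whichever data array is present, as the diff does.
fn parse_data<T: FromU32>(
    data_8: Option<Vec<u8>>,
    data_16: Option<Vec<u16>>,
    data_32: Option<Vec<u32>>,
) -> Result<Vec<T>, String> {
    if let Some(d) = data_8 {
        widen_and_parse(&d)
    } else if let Some(d) = data_16 {
        widen_and_parse(&d)
    } else if let Some(d) = data_32 {
        widen_and_parse(&d)
    } else {
        Err("no data array present".to_string())
    }
}
```

Using `Into<u32>` instead of an `as` cast keeps the widening lossless by construction, since the standard library only provides that conversion for integer types that always fit in a `u32`.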

impl<'data, T: TrieValue> DataProvider<'data, UnicodePropertyMapV1Marker<T>>
for EnumeratedPropertyCodePointTrieProvider
{
fn load_payload(
&self,
req: &DataRequest,
) -> Result<DataResponse<'data, UnicodePropertyMapV1Marker<T>>, DataError> {
// For data resource keys that represent the CodePointTrie data for an enumerated
// property, the ResourceKey sub-category string will just be the short alias
// for the property.
let prop_name = &req.resource_path.key.sub_category;

let toml_data: uprops_serde::enumerated::Main = self
.get_toml_data(prop_name)
.map_err(DataError::new_resc_error)?;

let prop_enum: EnumeratedProperty = EnumeratedProperty::from(prop_name);

let source_cpt_data: uprops_serde::enumerated::EnumeratedPropertyCodePointTrie =
toml_data.enum_property.data.code_point_trie;

let data_struct = UnicodePropertyMapV1::<T>::try_from(source_cpt_data)?;

Ok(DataResponse {
metadata: DataResponseMetadata {
data_langid: req.resource_path.options.langid.clone(),
},
payload: Some(DataPayload::from_owned(data_struct)),
})
}
}
Original file line number Diff line number Diff line change
@@ -11,14 +11,14 @@ use icu_uniset::UnicodeSetBuilder;
use std::fs;
use std::path::PathBuf;

pub struct EnumeratedPropertiesDataProvider {
pub struct EnumeratedPropertyUnicodeSetDataProvider {
root_dir: PathBuf,
}

/// A data provider reading from .toml files produced by the ICU4C icuwriteuprops tool.
impl EnumeratedPropertiesDataProvider {
impl EnumeratedPropertyUnicodeSetDataProvider {
pub fn new(root_dir: PathBuf) -> Self {
EnumeratedPropertiesDataProvider { root_dir }
EnumeratedPropertyUnicodeSetDataProvider { root_dir }
}
fn get_toml_data(&self, name: &str) -> Result<uprops_serde::enumerated::Main, Error> {
let mut path: PathBuf = self.root_dir.clone().join(name);
@@ -61,7 +61,9 @@ fn expand_groupings<'a>(prop_name: &str, prop_val: &'a str) -> Vec<&'a str> {
}
}

impl<'data> DataProvider<'data, UnicodePropertyV1Marker> for EnumeratedPropertiesDataProvider {
impl<'data> DataProvider<'data, UnicodePropertyV1Marker>
for EnumeratedPropertyUnicodeSetDataProvider
{
fn load_payload(
&self,
req: &DataRequest,
@@ -104,11 +106,11 @@ impl<'data> DataProvider<'data, UnicodePropertyV1Marker> for EnumeratedPropertie
}
}

icu_provider::impl_dyn_provider!(EnumeratedPropertiesDataProvider, {
icu_provider::impl_dyn_provider!(EnumeratedPropertyUnicodeSetDataProvider, {
_ => UnicodePropertyV1Marker,
}, SERDE_SE, 'data);

impl IterableDataProviderCore for EnumeratedPropertiesDataProvider {
impl IterableDataProviderCore for EnumeratedPropertyUnicodeSetDataProvider {
fn supported_options_for_key(
&self,
_resc_key: &ResourceKey,
Expand All @@ -124,7 +126,7 @@ fn test_general_category() {
use std::convert::TryInto;

let root_dir = icu_testdata::paths::data_root().join("uprops");
let provider = EnumeratedPropertiesDataProvider::new(root_dir);
let provider = EnumeratedPropertyUnicodeSetDataProvider::new(root_dir);

let payload: DataPayload<'_, UnicodePropertyV1Marker> = provider
.load_payload(&DataRequest {
@@ -152,7 +154,7 @@ fn test_script() {
use std::convert::TryInto;

let root_dir = icu_testdata::paths::data_root().join("uprops");
let provider = EnumeratedPropertiesDataProvider::new(root_dir);
let provider = EnumeratedPropertyUnicodeSetDataProvider::new(root_dir);

let payload: DataPayload<'_, UnicodePropertyV1Marker> = provider
.load_payload(&DataRequest {
@@ -181,7 +183,7 @@ fn test_gc_groupings() {

fn get_uniset_payload<'data>(key: ResourceKey) -> DataPayload<'data, UnicodePropertyV1Marker> {
let root_dir = icu_testdata::paths::data_root().join("uprops");
let provider = EnumeratedPropertiesDataProvider::new(root_dir);
let provider = EnumeratedPropertyUnicodeSetDataProvider::new(root_dir);
let payload: DataPayload<'_, UnicodePropertyV1Marker> = provider
.load_payload(&DataRequest {
resource_path: ResourcePath {
@@ -293,7 +295,7 @@ fn test_gc_surrogate() {
use std::convert::TryInto;

let root_dir = icu_testdata::paths::data_root().join("uprops");
let provider = EnumeratedPropertiesDataProvider::new(root_dir);
let provider = EnumeratedPropertyUnicodeSetDataProvider::new(root_dir);

let payload: DataPayload<'_, UnicodePropertyV1Marker> = provider
.load_payload(&DataRequest {
9 changes: 5 additions & 4 deletions provider/uprops/src/lib.rs
@@ -18,10 +18,11 @@
//! [`StaticDataProvider`]: ../icu_provider_blob/struct.StaticDataProvider.html
//! [`PropertiesDataProvider`]: binary::PropertiesDataProvider

mod binary;
mod enumerated;
mod bin_uniset;
mod enum_codepointtrie;
mod enum_uniset;
mod error;
mod provider;
mod uprops_serde;

pub use provider::PropertiesDataProvider;
pub use bin_uniset::BinaryPropertyUnicodeSetDataProvider;
pub use enum_uniset::EnumeratedPropertyUnicodeSetDataProvider;