Connect properties provider to the icu4x_datagen exporter tool #1204

Merged on Oct 29, 2021 (20 commits)
Commits
bfd4cb8
Initial code for connecting properties provider to the icu4x_datagen …
echeran Oct 21, 2021
82e9699
Connect the data providers for sets and maps of property data to the …
echeran Oct 26, 2021
24da958
Merge branch 'main' into uprops-datagen
echeran Oct 26, 2021
19be0f7
Add JSON and bincode versions of uprops source data TOML files into t…
echeran Oct 26, 2021
9d049bd
Update crate level docstrings for icu::properties and icu_properties
echeran Oct 27, 2021
34f07f4
Fix icu_properties docs test compilation, make minor adjustment to tests
echeran Oct 27, 2021
b739ce1
Fix datagen tests to use renamed CLI args. Use canonical formatting f…
echeran Oct 27, 2021
80c28b1
Make serde imports/derives conditional on feature being enabled
echeran Oct 27, 2021
a414e5b
Move icu_testdata to dev-dependency
sffc Oct 27, 2021
b75ffdc
Update Rust docstring examples
echeran Oct 27, 2021
29467f1
Update a portion of the docstring examples for property UnicodeSet ge…
echeran Oct 28, 2021
00137b6
Complete the docstring examples for property UnicodeSet getter APIs
echeran Oct 28, 2021
bd1c49a
Apply formatter changes
echeran Oct 28, 2021
1d240b2
Fix docstring link
echeran Oct 28, 2021
93e923b
Adjust wording for crate-level docstrings for properties
echeran Oct 28, 2021
8c45a66
Adjust wording for crate-level docstrings for uniset
echeran Oct 28, 2021
2e4a5ae
Remove unneeded bincode versions of testdata
echeran Oct 28, 2021
38e34b4
Rename field in provider data struct for code point maps
echeran Oct 29, 2021
764d7fc
Merge branch 'main' into uprops-datagen
echeran Oct 29, 2021
92782f2
Simplify nested if+match statement logic
echeran Oct 29, 2021
4 changes: 4 additions & 0 deletions Cargo.lock


9 changes: 8 additions & 1 deletion components/icu/Cargo.toml
@@ -77,13 +77,20 @@ path = "../../utils/fixed_decimal"
default-features = false

[dev-dependencies]
icu_codepointtrie = { version = "0.2", path = "../../utils/codepointtrie" }
icu_provider = { version = "0.3", path = "../../provider/core" }
icu_testdata = { version = "0.3", path = "../../provider/testdata", features = ["static"] }
icu_uniset = { version = "0.3", path = "../../utils/uniset" }
writeable = { version = "0.2", path = "../../utils/writeable" }

[features]
std = ["icu_datetime/std", "icu_locid/std", "icu_plurals/std", "icu_properties/std", "fixed_decimal/std"]
std = [
"icu_datetime/std",
"icu_locid/std",
"icu_plurals/std",
"icu_properties/std",
"fixed_decimal/std"
]
default = ["provider_serde"]
serde = [
"icu_locid/serde"
62 changes: 58 additions & 4 deletions components/icu/src/lib.rs
@@ -368,17 +368,71 @@ pub mod plurals {
}

pub mod properties {
//! `icu_properties` is a utility crate of the [`ICU4X`] project.
//! Unicode properties
//!
//! This component provides definitions of [Unicode Properties] and APIs for
//! This API provides definitions of [Unicode Properties] and functions for
//! retrieving property data in an appropriate data structure.
//!
//! Currently, only binary property APIs are supported, with APIs that return
//! a [`UnicodeSet`]. See the [`sets`] module for more details.
//! APIs that return a [`UnicodeSet`] exist for binary properties and certain enumerated
//! properties. See the [`sets`] module for more details.
//!
//! APIs that return a [`CodePointTrie`] exist for certain enumerated properties. See the
//! [`maps`] module for more details.
//!
//! # Examples
//!
//! ## Property data as `UnicodeSet`s
//!
//! ```
//! use icu::properties::{sets, GeneralCategory};
//!
//! let provider = icu_testdata::get_provider();
//!
//! // A binary property as a `UnicodeSet`
//!
//! let payload =
//! sets::get_emoji(&provider)
//! .expect("The data should be valid");
//! let data_struct = payload.get();
//! let emoji = &data_struct.inv_list;
//!
//! assert!(emoji.contains('🎃')); // U+1F383 JACK-O-LANTERN
//! assert!(!emoji.contains('木')); // U+6728
//!
//! // An individual enumerated property value as a `UnicodeSet`
//!
//! let payload =
//! sets::get_for_general_category(&provider, GeneralCategory::LineSeparator)
//! .expect("The data should be valid");
//! let data_struct = payload.get();
//! let line_sep = &data_struct.inv_list;
//!
//! assert!(line_sep.contains_u32(0x2028));
//! assert!(!line_sep.contains_u32(0x2029));
//! ```
//!
//! ## Property data as `CodePointTrie`s
//!
//! ```
//! use icu::properties::{maps, Script};
//!
//! let provider = icu_testdata::get_provider();
//!
//! let payload =
//! maps::get_script(&provider)
//! .expect("The data should be valid");
//! let data_struct = payload.get();
//! let script = &data_struct.code_point_trie;
//!
//! assert_eq!(script.get('🎃' as u32), Script::Common); // U+1F383 JACK-O-LANTERN
//! assert_eq!(script.get('木' as u32), Script::Han); // U+6728
//! ```
//!
//! [`ICU4X`]: ../icu/index.html
//! [Unicode Properties]: https://unicode-org.github.io/icu/userguide/strings/properties.html
//! [`UnicodeSet`]: ../../icu_uniset/struct.UnicodeSet.html
//! [`sets`]: sets
//! [`CodePointTrie`]: ../../icu_codepointtrie/codepointtrie/struct.CodePointTrie.html
//! [`maps`]: maps
pub use icu_properties::*;
}
1 change: 1 addition & 0 deletions components/properties/Cargo.toml
@@ -42,6 +42,7 @@ zerovec = { version = "0.4", path = "../../utils/zerovec", features = ["serde"]

[dev-dependencies]
icu = { path = "../../components/icu", default-features = false }
icu_testdata = { version = "0.3", path = "../../provider/testdata" }

[lib]
bench = false # This option is required for Benchmark CI
59 changes: 56 additions & 3 deletions components/properties/README.md
@@ -1,16 +1,69 @@
# icu_properties [![crates.io](https://img.shields.io/crates/v/icu_properties)](https://crates.io/crates/icu_properties)

`icu_properties` is a utility crate of the [`ICU4X`] project.
`icu_properties` is one of the [`ICU4X`] components.

This component provides definitions of [Unicode Properties] and APIs for
retrieving property data in an appropriate data structure.

Currently, only binary property APIs are supported, with APIs that return
a [`UnicodeSet`]. See the [`sets`] module for more details.
APIs that return a [`UnicodeSet`] exist for binary properties and certain enumerated
properties. See the [`sets`] module for more details.

APIs that return a [`CodePointTrie`] exist for certain enumerated properties. See the
[`maps`] module for more details.

## Examples

### Property data as `UnicodeSet`s

```rust
use icu::properties::{sets, GeneralCategory};

let provider = icu_testdata::get_provider();

// A binary property as a `UnicodeSet`

let payload =
sets::get_emoji(&provider)
.expect("The data should be valid");
let data_struct = payload.get();
let emoji = &data_struct.inv_list;

assert!(emoji.contains('🎃')); // U+1F383 JACK-O-LANTERN
assert!(!emoji.contains('木')); // U+6728

// An individual enumerated property value as a `UnicodeSet`

let payload =
sets::get_for_general_category(&provider, GeneralCategory::LineSeparator)
.expect("The data should be valid");
let data_struct = payload.get();
let line_sep = &data_struct.inv_list;

assert!(line_sep.contains_u32(0x2028));
assert!(!line_sep.contains_u32(0x2029));
```
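
The `inv_list` field in the example above suggests that the set is stored as an inversion list. As a rough illustration of that idea (a standalone sketch, not the `icu_uniset` implementation; `InversionList` and `from_ranges` are hypothetical names), membership can be tested by counting how many range boundaries lie at or below the code point:

```rust
/// A minimal inversion list: sorted range boundaries
/// `[start0, end0, start1, end1, ...]`, each range half-open.
/// Hypothetical type for illustration only.
pub struct InversionList {
    bounds: Vec<u32>,
}

impl InversionList {
    /// Builds a list from sorted, non-overlapping, half-open ranges.
    pub fn from_ranges(ranges: &[(u32, u32)]) -> Self {
        let mut bounds = Vec::with_capacity(ranges.len() * 2);
        for &(start, end) in ranges {
            bounds.push(start);
            bounds.push(end);
        }
        InversionList { bounds }
    }

    /// Returns true if `cp` falls inside one of the ranges.
    pub fn contains_u32(&self, cp: u32) -> bool {
        // Count the boundaries that are <= cp with a binary search.
        // An odd count means a range start has been passed but not its end.
        let passed = self.bounds.partition_point(|&b| b <= cp);
        passed % 2 == 1
    }

    /// Convenience wrapper for `char` input.
    pub fn contains(&self, c: char) -> bool {
        self.contains_u32(c as u32)
    }
}
```

With the single range `(0x2028, 0x2029)`, this sketch reports U+2028 as a member and U+2029 as a non-member, mirroring the `LineSeparator` assertions above.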

### Property data as `CodePointTrie`s

```rust
use icu::properties::{maps, Script};

let provider = icu_testdata::get_provider();

let payload =
maps::get_script(&provider)
.expect("The data should be valid");
let data_struct = payload.get();
let script = &data_struct.code_point_trie;

assert_eq!(script.get('🎃' as u32), Script::Common); // U+1F383 JACK-O-LANTERN
assert_eq!(script.get('木' as u32), Script::Han); // U+6728
```
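
To give a feel for what a code-point-to-value map does on lookup, here is a deliberately simplified two-stage table. It is not the `CodePointTrie` layout, which is more compact and more elaborate; `TwoStageMap`, `from_fn`, and the 256-entry block size are assumptions made for this sketch.

```rust
const SHIFT: u32 = 8;
const BLOCK_LEN: usize = 1 << SHIFT;
const MAX_CODE_POINT: u32 = 0x10FFFF;

/// Simplified two-stage lookup table (hypothetical, for illustration).
pub struct TwoStageMap<T> {
    /// Stage 1: block number (`cp >> SHIFT`) to position in `blocks`.
    index: Vec<usize>,
    /// Stage 2: deduplicated blocks of `BLOCK_LEN` values each.
    blocks: Vec<[T; BLOCK_LEN]>,
    /// Returned for inputs above U+10FFFF.
    default: T,
}

impl<T: Copy + PartialEq> TwoStageMap<T> {
    /// Builds the table by evaluating `f` for every code point.
    pub fn from_fn(default: T, f: impl Fn(u32) -> T) -> Self {
        let mut index = Vec::new();
        let mut blocks: Vec<[T; BLOCK_LEN]> = Vec::new();
        for block_no in 0..=(MAX_CODE_POINT >> SHIFT) {
            let mut block = [default; BLOCK_LEN];
            for (offset, slot) in block.iter_mut().enumerate() {
                *slot = f((block_no << SHIFT) | offset as u32);
            }
            // Reuse an identical block if one was already stored.
            let pos = match blocks.iter().position(|b| *b == block) {
                Some(pos) => pos,
                None => {
                    blocks.push(block);
                    blocks.len() - 1
                }
            };
            index.push(pos);
        }
        TwoStageMap { index, blocks, default }
    }

    /// Looks up the value for `cp` with two array accesses.
    pub fn get(&self, cp: u32) -> T {
        match self.index.get((cp >> SHIFT) as usize) {
            Some(&block) => self.blocks[block][(cp as usize) & (BLOCK_LEN - 1)],
            None => self.default,
        }
    }

    /// Number of distinct blocks actually stored.
    pub fn block_count(&self) -> usize {
        self.blocks.len()
    }
}
```

Because long runs of code points share the same value, most blocks are identical and are stored once. That sharing is the main reason trie-like tables stay small while still offering constant-time lookup.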

[`ICU4X`]: ../icu/index.html
[Unicode Properties]: https://unicode-org.github.io/icu/userguide/strings/properties.html
[`UnicodeSet`]: icu_uniset::UnicodeSet
[`CodePointTrie`]: icu_codepointtrie::codepointtrie::CodePointTrie
[`sets`]: crate::sets

## More Information
59 changes: 56 additions & 3 deletions components/properties/src/lib.rs
@@ -2,17 +2,70 @@
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

//! `icu_properties` is a utility crate of the [`ICU4X`] project.
//! `icu_properties` is one of the [`ICU4X`] components.
//!
//! This component provides definitions of [Unicode Properties] and APIs for
//! retrieving property data in an appropriate data structure.
//!
//! Currently, only binary property APIs are supported, with APIs that return
//! a [`UnicodeSet`]. See the [`sets`] module for more details.
//! APIs that return a [`UnicodeSet`] exist for binary properties and certain enumerated
//! properties. See the [`sets`] module for more details.
//!
//! APIs that return a [`CodePointTrie`] exist for certain enumerated properties. See the
//! [`maps`] module for more details.
//!
//! # Examples
//!
//! ## Property data as `UnicodeSet`s
//!
//! ```
//! use icu::properties::{sets, GeneralCategory};
//!
//! let provider = icu_testdata::get_provider();
//!
//! // A binary property as a `UnicodeSet`
//!
//! let payload =
//! sets::get_emoji(&provider)
//! .expect("The data should be valid");
//! let data_struct = payload.get();
//! let emoji = &data_struct.inv_list;
//!
//! assert!(emoji.contains('🎃')); // U+1F383 JACK-O-LANTERN
//! assert!(!emoji.contains('木')); // U+6728
//!
//! // An individual enumerated property value as a `UnicodeSet`
//!
//! let payload =
//! sets::get_for_general_category(&provider, GeneralCategory::LineSeparator)
//! .expect("The data should be valid");
//! let data_struct = payload.get();
//! let line_sep = &data_struct.inv_list;
//!
//! assert!(line_sep.contains_u32(0x2028));
//! assert!(!line_sep.contains_u32(0x2029));
//! ```
//!
//! ## Property data as `CodePointTrie`s
//!
//! ```
//! use icu::properties::{maps, Script};
//!
//! let provider = icu_testdata::get_provider();
//!
//! let payload =
//! maps::get_script(&provider)
//! .expect("The data should be valid");
//! let data_struct = payload.get();
//! let script = &data_struct.code_point_trie;
//!
//! assert_eq!(script.get('🎃' as u32), Script::Common); // U+1F383 JACK-O-LANTERN
//! assert_eq!(script.get('木' as u32), Script::Han); // U+6728
//! ```
//!
//! [`ICU4X`]: ../icu/index.html
//! [Unicode Properties]: https://unicode-org.github.io/icu/userguide/strings/properties.html
//! [`UnicodeSet`]: icu_uniset::UnicodeSet
//! [`CodePointTrie`]: icu_codepointtrie::codepointtrie::CodePointTrie
//! [`sets`]: crate::sets

#![cfg_attr(not(any(test, feature = "std")), no_std)]
34 changes: 34 additions & 0 deletions components/properties/src/maps.rs
@@ -44,6 +44,23 @@ where

/// Return a [`CodePointTrie`] for the General_Category Unicode enumerated property. See [`GeneralCategory`].
///
/// # Example
///
/// ```
/// use icu::properties::{maps, GeneralSubcategory};
/// use icu_codepointtrie::codepointtrie::CodePointTrie;
///
/// let provider = icu_testdata::get_provider();
///
/// let payload =
/// maps::get_general_category(&provider)
/// .expect("The data should be valid");
/// let data_struct = payload.get();
/// let gc = &data_struct.code_point_trie;
/// assert_eq!(gc.get('木' as u32), GeneralSubcategory::OtherLetter); // U+6728
/// assert_eq!(gc.get('🎃' as u32), GeneralSubcategory::OtherSymbol); // U+1F383 JACK-O-LANTERN
/// ```
///
/// [`CodePointTrie`]: icu_codepointtrie::codepointtrie::CodePointTrie
pub fn get_general_category<'data, D>(provider: &D) -> CodePointMapResult<'data, GeneralSubcategory>
where
@@ -54,6 +71,23 @@ where

/// Return a [`CodePointTrie`] for the Script Unicode enumerated property. See [`Script`].
///
/// # Example
///
/// ```
/// use icu::properties::{maps, Script};
/// use icu_codepointtrie::codepointtrie::CodePointTrie;
///
/// let provider = icu_testdata::get_provider();
///
/// let payload =
/// maps::get_script(&provider)
/// .expect("The data should be valid");
/// let data_struct = payload.get();
/// let script = &data_struct.code_point_trie;
/// assert_eq!(script.get('木' as u32), Script::Han); // U+6728
/// assert_eq!(script.get('🎃' as u32), Script::Common); // U+1F383 JACK-O-LANTERN
/// ```
///
/// [`CodePointTrie`]: icu_codepointtrie::codepointtrie::CodePointTrie
pub fn get_script<'data, D>(provider: &D) -> CodePointMapResult<'data, Script>
where
5 changes: 4 additions & 1 deletion components/properties/src/props.rs
@@ -5,6 +5,8 @@
//! A collection of enums for enumerated properties.

use num_enum::{TryFromPrimitive, UnsafeFromPrimitive};
#[cfg(feature = "serde")]
use serde::{Deserialize, Serialize};

/// Selection constants for Unicode properties.
/// These constants are used to select one of the Unicode properties.
@@ -27,6 +29,7 @@ pub enum EnumeratedProperty {
/// GeneralSubcategory only supports specific subcategories (eg `UppercaseLetter`).
/// It does not support grouped categories (eg `Letter`). For grouped categories, use [`GeneralCategory`].
#[derive(Copy, Clone, PartialEq, Eq, Debug, TryFromPrimitive, UnsafeFromPrimitive)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
Member

Question: why do we need these serde impls now when we didn't need them before?

Contributor Author

Deserialize is needed for unit tests that need to load data as a UnicodePropertyMap data struct (which in turn carries a CodePointMap, which takes these enums as parameters). In the reverse direction, serialization happens when generating testdata.

This is the error from cargo test when I comment out these serde impls:

error[E0277]: the trait bound `GeneralSubcategory: serde::de::Deserialize<'_>` is not satisfied
  --> src/maps.rs:56:5
   |
9  |     maps::get_general_category(&provider)
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `serde::de::Deserialize<'_>` is not implemented for `GeneralSubcategory`
   |
   = note: required because of the requirements on the impl of `for<'de> serde::de::Deserialize<'_>` for `UnicodePropertyMapV1<'de, GeneralSubcategory>`
   = note: 1 redundant requirements hidden
   = note: required because of the requirements on the impl of `for<'de> serde::de::Deserialize<'de>` for `yoke::trait_hack::YokeTraitHack<UnicodePropertyMapV1<'de, GeneralSubcategory>>`
   = note: required because of the requirements on the impl of `icu_provider::data_provider::DataProvider<'_, UnicodePropertyMapV1Marker<GeneralSubcategory>>` for `icu_provider_fs::fs_data_provider::FsDataProvider`
note: required by a bound in `get_general_category`
  --> /usr/local/google/home/elango/oss/icu4x/components/properties/src/maps.rs:67:8
   |
67 |     D: DataProvider<'data, UnicodePropertyMapV1Marker<GeneralSubcategory>> + ?Sized,
   |        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ required by this bound in `get_general_category`

Similarly, this is the error from cargo make testdata:

error[E0277]: the trait bound `GeneralSubcategory: Serialize` is not satisfied
   --> provider/uprops/src/enum_codepointtrie.rs:116:1
    |
116 | / icu_provider::impl_dyn_provider!(EnumeratedPropertyCodePointTrieProvider, {
117 | |     key::GENERAL_CATEGORY_V1 => UnicodePropertyMapV1Marker<GeneralSubcategory>,
118 | |     key::SCRIPT_V1 => UnicodePropertyMapV1Marker<Script>,
119 | | }, SERDE_SE, 'data);
    | |____________________^ the trait `Serialize` is not implemented for `GeneralSubcategory`
    |
    = note: required because of the requirements on the impl of `for<'a> Serialize` for `UnicodePropertyMapV1<'a, GeneralSubcategory>`
    = note: 1 redundant requirements hidden
    = note: required because of the requirements on the impl of `for<'a> Serialize` for `&'a UnicodePropertyMapV1<'a, GeneralSubcategory>`
    = note: required because of the requirements on the impl of `UpcastDataPayload<'_, UnicodePropertyMapV1Marker<GeneralSubcategory>>` for `SerdeSeDataStructMarker`
note: required by `upcast`
   --> /usr/local/google/home/elango/oss/icu4x/provider/core/src/dynutil.rs:40:5
    |
40  | /     fn upcast(
41  | |         other: crate::prelude::DataPayload<'data, M>,
42  | |     ) -> crate::prelude::DataPayload<'data, Self>;
    | |__________________________________________________^
    = note: this error originates in the macro `$crate::impl_dyn_provider` (in Nightly builds, run with -Z macro-backtrace for more info)

Member

Okay, so I think we do actually need the serde impls here, because human_readable serialization of the ZeroVec needs those impls on T.
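
Both E0277 errors quoted above follow the same pattern: a generic container implements a trait only when its element type does, so the bound propagates down to the enum. A minimal standalone sketch of that pattern follows. It uses a hand-written `Encode` trait as a stand-in for `serde::Serialize`, and `Category` and `PropertyMap` are hypothetical stand-ins for `GeneralSubcategory` and `UnicodePropertyMapV1`.

```rust
/// Hypothetical stand-in for `serde::Serialize`.
pub trait Encode {
    fn encode(&self, out: &mut String);
}

/// Hypothetical stand-in for a property value enum.
#[derive(Copy, Clone, Debug, PartialEq)]
#[repr(u8)]
pub enum Category {
    Unassigned = 0,
    UppercaseLetter = 1,
}

// Removing this impl makes `PropertyMap<Category>: Encode` unsatisfiable,
// which is the same shape of failure as the errors in the thread.
impl Encode for Category {
    fn encode(&self, out: &mut String) {
        out.push_str(&format!("{:?}", self));
    }
}

/// Hypothetical stand-in for a generic data struct.
pub struct PropertyMap<T> {
    pub values: Vec<T>,
}

// The container is encodable only when its element type is.
impl<T: Encode> Encode for PropertyMap<T> {
    fn encode(&self, out: &mut String) {
        out.push('[');
        for (i, value) in self.values.iter().enumerate() {
            if i > 0 {
                out.push(',');
            }
            value.encode(out);
        }
        out.push(']');
    }
}

/// Encodes any `Encode` value into a fresh string.
pub fn to_text<T: Encode>(value: &T) -> String {
    let mut out = String::new();
    value.encode(&mut out);
    out
}
```

In the PR itself the derives sit behind `cfg_attr(feature = "serde", ...)`, so builds without the `serde` feature do not pay for them.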

#[repr(u8)]
pub enum GeneralSubcategory {
/// A reserved unassigned code point or a noncharacter
@@ -244,9 +247,9 @@ impl From<GeneralSubcategory> for GeneralCategory {
/// Script_Extensions set for Dogra, Kaithi, and Mahajani.
///
/// For more information, see UAX #24: <http://www.unicode.org/reports/tr24/>.
///
/// See `UScriptCode` in ICU4C.
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[repr(transparent)]
pub struct Script(pub u16);
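
One plausible reason to model `Script` as a transparent newtype over `u16` rather than a closed enum is that any 16-bit code, including one introduced by newer Unicode data, remains representable. The sketch below illustrates the pattern with a hypothetical `ScriptCode` type. The numeric values are assumed to follow `UScriptCode` in ICU4C and should be treated as illustrative.

```rust
/// Hypothetical newtype over a 16-bit script code, for illustration.
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
#[repr(transparent)]
pub struct ScriptCode(pub u16);

#[allow(non_upper_case_globals)]
impl ScriptCode {
    // Assumed values, believed to match `UScriptCode` in ICU4C.
    pub const Common: ScriptCode = ScriptCode(0);
    pub const Han: ScriptCode = ScriptCode(17);
}

/// Returns a short label, falling back for codes this sketch does not know.
pub fn label(script: ScriptCode) -> &'static str {
    // Associated constants of a type deriving `PartialEq` and `Eq`
    // can be used as match patterns.
    match script {
        ScriptCode::Common => "Common",
        ScriptCode::Han => "Han",
        _ => "(unknown to this sketch)",
    }
}
```

Because of `#[repr(transparent)]`, the wrapper has the same size and layout as a bare `u16`, so values can sit in compact tables without conversion overhead.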
