Initial implementation of UCD Segmentation properties #166

behnam · 2017-10-02T02:31:15Z

Add UCD Segmentation source data to /data/, implement conversion from new files to property map data tables.

Add unic-ucd-segment component with initial implementation of three
main segmentation-related properties:

Grapheme_Cluster_Break,
Word_Break, and
Sentence_Break.

Current implementation uses char_property!() macro for
EnumeratedCharProperty implementation, which only supports
TotalCharProperty.

Since the Other (abbr: XX) value in all these properties are notions
of non-existance of breaking property, we want to switch to
PartialCharProperty domain type and use Option<enum>. This is left
as a separate step because it needs changes to the macro.

Tracker: #135

Generate tables for three segmentation-related enumerated properties: * Grapheme_Cluster_Break * Word_Break * Sentence_Break

behnam · 2017-10-02T04:02:03Z

Uh... looks like rust 1.17 doesn't like code-blocks in doc-block metas. Need to find a way to get around this now... :|

CAD97 · 2017-10-02T04:05:47Z

This looks like the issue with $(#[:meta])* $:ident macro matching again.

Basically, 1.17 can't tell the difference between a #[:meta] and a $:ident, so gives up on macro parsing. The solution is to add a explicit separator between the repeating capture and the ident. It's really annoying that you cannot have a repeating capture followed by an ident in macros before 1.20.

behnam · 2017-10-02T04:07:13Z

Right, @CAD97! Looks like multiple lines is the problem. I'm working on a fix now. Thanks for the reminder!

Add `unic-ucd-segment` component with initial implementation of three main segmentation-related properties: * `Grapheme_Cluster_Break`, * `Word_Break`, and * `Sentence_Break`.

behnam · 2017-10-02T07:36:58Z

Okay, updated this WIP with the changes suggested in #170. I was so focused on some other parts of the code that I didn't realize the Total implementation is local and can be replaced with Partial.

Now going to fix the macro syntax issue. (Which apparently is only fixed in 1.20.0, so we need a syntax that we tolerate for some time...)

CAD97 · 2017-10-02T07:46:23Z

Suggestion: case Extend ($(#[$variant_meta:meta])+ case $variant:ident). It's kind of silly to introduce case here since Rust doesn't use a case keyword for its enums, but at least it makes sense and would fix the repeating capture followed by ident? Otherwise, I'd just suggest a ; and dealing with ;Extend. Other likely single-char options: E#@$|

Proof of concept https://play.rust-lang.org/?gist=af53821eae5868c2ca8284e3a6fbeeb8

Because of a bug in Rust compiler before `1.20.0` release, the old sytanx won't allow having multi-line doc-blocks on enum variants. Since we don't want to update our minimum-supported-rustc-version at the moment, need to update the macro syntax to mitigate the bug. The new syntax is loosly based on the `match` syntax, plus the newly accepted RFC to allow extra `VERTICAL LINE` prefixes.

CAD97 · 2017-10-02T07:51:50Z

Don't forget to update the example, the # Limitations block, and the // TODO to drop the symbol in the macro documentation. Other than that, I don't hate your choice here. (Oh, and make the meta repeat a * not a + since we can do that now. It was only + so it would work with a singular for the time being.)

behnam · 2017-10-02T07:54:25Z

I think I came up with a more rust-y syntax based on the recent developments on the language side:

$(#[$variant_meta:meta])+
| $variant:ident {
    // ...
}

That's an extra VERTICAL LINE char before the variant name ident.

Now, one open question for this last part is: do we want to only support one syntax for this and update all the call-sites, or should we make it an optional syntax and only use it in new places needed? I like the first (and have done that in the last update) because stability is not a big deal for the macro (being considered an unstable feature at the moment) and keeping the implementation slim is a goal for these parts.

What do you think, @CAD97?

behnam · 2017-10-02T07:57:18Z

Here's the RFC, btw: rust-lang/rfcs#1745

Just a simple search and replace to improve readability of internal implementations, specially to distinguish between names for properties themselves vs. names for property values.

Following the guideline for naming, plus not using a keyword.

behnam · 2017-10-02T08:33:43Z

Okay, 287d950 should address all the issues here. Any other last comments before we land?

CAD97

Looks good to me, barring a few comment updates.

CAD97 · 2017-10-02T13:40:14Z

gen/src/writer/ucd/mod.rs

-    name::generate(&clean_dir("unic/ucd/name/tables"));
-    normal::generate(&clean_dir("unic/ucd/normal/tables"));
-    ident::generate(&clean_dir("unic/ucd/ident/tables"));
+    if false {


Is this left over from debugging?

Oops! Thanks for catching it. Will fix and bors.

CAD97 · 2017-10-02T13:44:13Z

unic/char/property/src/macros.rs

@@ -26,7 +26,7 @@
 ///         human => "Human-Readable Property Name";
 ///
 ///         /// Exactly one attribute


Update this to something like "any number of doc comments"; it's not accurate like this anymoe

CAD97 · 2017-10-02T13:46:07Z

unic/char/property/tests/enum_property_macro.rs

@@ -23,21 +23,21 @@ char_property! {
        human => "My Property";

        /// Required


These aren't required anymore; they should be made significant or deleted.

Also update format script to be callable from any working directory.

behnam · 2017-10-02T18:26:33Z

bors: r+

166: Initial implementation of UCD Segmentation properties r=behnam a=behnam Add UCD Segmentation source data to `/data/`, implement conversion from new files to property map data tables. Add `unic-ucd-segment` component with initial implementation of three main segmentation-related properties: * `Grapheme_Cluster_Break`, * `Word_Break`, and * `Sentence_Break`. Current implementation uses `char_property!()` macro for `EnumeratedCharProperty` implementation, which only supports `TotalCharProperty`. Since the `Other` (abbr: `XX`) value in all these properties are notions of non-existance of breaking property, we want to switch to `PartialCharProperty` domain type and use `Option<enum>`. This is left as a separate step because it needs changes to the macro.

bors · 2017-10-02T18:49:08Z

Build succeeded

continuous-integration/travis-ci/push

behnam added 3 commits September 29, 2017 17:27

[data/ucd] Add Segmentation-related auxiliary files

c744fc5

[etc] Pass arguments down to run commands

4bc6248

[gen] Add ucd-segment writer

5acc65d

Generate tables for three segmentation-related enumerated properties: * Grapheme_Cluster_Break * Word_Break * Sentence_Break

behnam added C: segmentation Unicode Text Segmentation C: ucd Unicode Character Database labels Oct 2, 2017

behnam added this to the UNIC-0.7 milestone Oct 2, 2017

behnam force-pushed the ucd-seg branch from 97b9a1c to 71b53e7 Compare October 2, 2017 02:50

CAD97 mentioned this pull request Oct 2, 2017

Replace SentenceBreak::Other with Option::None #170

Closed

behnam force-pushed the ucd-seg branch from 71b53e7 to 2b27bc6 Compare October 2, 2017 07:33

[ucd] Add new segment component

bf5a946

Add `unic-ucd-segment` component with initial implementation of three main segmentation-related properties: * `Grapheme_Cluster_Break`, * `Word_Break`, and * `Sentence_Break`.

behnam force-pushed the ucd-seg branch from 2b27bc6 to bf5a946 Compare October 2, 2017 07:38

behnam added 2 commits October 2, 2017 01:16

[char/property] Rename macro internals

a06430d

Just a simple search and replace to improve readability of internal implementations, specially to distinguish between names for properties themselves vs. names for property values.

[char/property] Rename bool() to as_bool()

bdd7e5a

Following the guideline for naming, plus not using a keyword.

CAD97 approved these changes Oct 2, 2017

View reviewed changes

[char/property] Update docs and relax syntax

6c21611

behnam force-pushed the ucd-seg branch from 287d950 to 6c21611 Compare October 2, 2017 17:26

Format

ba4fd09

Also update format script to be callable from any working directory.

behnam self-assigned this Oct 2, 2017

bors bot merged commit ba4fd09 into master Oct 2, 2017

behnam deleted the ucd-seg branch October 2, 2017 20:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial implementation of UCD Segmentation properties #166

Initial implementation of UCD Segmentation properties #166

behnam commented Oct 2, 2017 •

edited

Loading

behnam commented Oct 2, 2017

CAD97 commented Oct 2, 2017

behnam commented Oct 2, 2017

behnam commented Oct 2, 2017

CAD97 commented Oct 2, 2017 •

edited

Loading

CAD97 commented Oct 2, 2017 •

edited

Loading

behnam commented Oct 2, 2017 •

edited

Loading

behnam commented Oct 2, 2017

behnam commented Oct 2, 2017

CAD97 left a comment

CAD97 Oct 2, 2017

behnam Oct 2, 2017

CAD97 Oct 2, 2017

CAD97 Oct 2, 2017

behnam commented Oct 2, 2017

bors bot commented Oct 2, 2017

		@@ -23,21 +23,21 @@ char_property! {
		human => "My Property";

		/// Required

Initial implementation of UCD Segmentation properties #166

Initial implementation of UCD Segmentation properties #166

Conversation

behnam commented Oct 2, 2017 • edited Loading

behnam commented Oct 2, 2017

CAD97 commented Oct 2, 2017

behnam commented Oct 2, 2017

behnam commented Oct 2, 2017

CAD97 commented Oct 2, 2017 • edited Loading

CAD97 commented Oct 2, 2017 • edited Loading

behnam commented Oct 2, 2017 • edited Loading

behnam commented Oct 2, 2017

behnam commented Oct 2, 2017

CAD97 left a comment

Choose a reason for hiding this comment

CAD97 Oct 2, 2017

Choose a reason for hiding this comment

behnam Oct 2, 2017

Choose a reason for hiding this comment

CAD97 Oct 2, 2017

Choose a reason for hiding this comment

CAD97 Oct 2, 2017

Choose a reason for hiding this comment

behnam commented Oct 2, 2017

bors bot commented Oct 2, 2017

Build succeeded

behnam commented Oct 2, 2017 •

edited

Loading

CAD97 commented Oct 2, 2017 •

edited

Loading

CAD97 commented Oct 2, 2017 •

edited

Loading

behnam commented Oct 2, 2017 •

edited

Loading