Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of UCD Segmentation properties #166

Merged
merged 9 commits into from
Oct 2, 2017
Merged

Conversation

behnam
Copy link
Member

@behnam behnam commented Oct 2, 2017

Add UCD Segmentation source data to /data/, implement conversion from new files to property map data tables.

Add unic-ucd-segment component with initial implementation of three
main segmentation-related properties:

  • Grapheme_Cluster_Break,
  • Word_Break, and
  • Sentence_Break.

Current implementation uses char_property!() macro for
EnumeratedCharProperty implementation, which only supports
TotalCharProperty.

Since the Other (abbr: XX) value in all these properties are notions
of non-existance of breaking property, we want to switch to
PartialCharProperty domain type and use Option<enum>. This is left
as a separate step because it needs changes to the macro.

Tracker: #135

Generate tables for three segmentation-related enumerated properties:
* Grapheme_Cluster_Break
* Word_Break
* Sentence_Break
@behnam behnam added C: segmentation Unicode Text Segmentation C: ucd Unicode Character Database labels Oct 2, 2017
@behnam behnam added this to the UNIC-0.7 milestone Oct 2, 2017
@behnam
Copy link
Member Author

behnam commented Oct 2, 2017

Uh... looks like rust 1.17 doesn't like code-blocks in doc-block metas. Need to find a way to get around this now... :|

@CAD97
Copy link
Collaborator

CAD97 commented Oct 2, 2017

This looks like the issue with $(#[:meta])* $:ident macro matching again.

Basically, 1.17 can't tell the difference between a #[:meta] and a $:ident, so gives up on macro parsing. The solution is to add a explicit separator between the repeating capture and the ident. It's really annoying that you cannot have a repeating capture followed by an ident in macros before 1.20.

@behnam
Copy link
Member Author

behnam commented Oct 2, 2017

Right, @CAD97! Looks like multiple lines is the problem. I'm working on a fix now. Thanks for the reminder!

Add `unic-ucd-segment` component with initial implementation of three
main segmentation-related properties:
* `Grapheme_Cluster_Break`,
* `Word_Break`, and
* `Sentence_Break`.
@behnam
Copy link
Member Author

behnam commented Oct 2, 2017

Okay, updated this WIP with the changes suggested in #170. I was so focused on some other parts of the code that I didn't realize the Total implementation is local and can be replaced with Partial.

Now going to fix the macro syntax issue. (Which apparently is only fixed in 1.20.0, so we need a syntax that we tolerate for some time...)

@CAD97
Copy link
Collaborator

CAD97 commented Oct 2, 2017

Suggestion: case Extend ($(#[$variant_meta:meta])+ case $variant:ident). It's kind of silly to introduce case here since Rust doesn't use a case keyword for its enums, but at least it makes sense and would fix the repeating capture followed by ident? Otherwise, I'd just suggest a ; and dealing with ;Extend. Other likely single-char options: E#@$|

Proof of concept https://play.rust-lang.org/?gist=af53821eae5868c2ca8284e3a6fbeeb8

Because of a bug in Rust compiler before `1.20.0` release, the old
sytanx won't allow having multi-line doc-blocks on enum variants. Since
we don't want to update our minimum-supported-rustc-version at the
moment, need to update the macro syntax to mitigate the bug.

The new syntax is loosly based on the `match` syntax, plus the newly
accepted RFC to allow extra `VERTICAL LINE` prefixes.
@CAD97
Copy link
Collaborator

CAD97 commented Oct 2, 2017

Don't forget to update the example, the # Limitations block, and the // TODO to drop the symbol in the macro documentation. Other than that, I don't hate your choice here. (Oh, and make the meta repeat a * not a + since we can do that now. It was only + so it would work with a singular for the time being.)

@behnam
Copy link
Member Author

behnam commented Oct 2, 2017

I think I came up with a more rust-y syntax based on the recent developments on the language side:

$(#[$variant_meta:meta])+
| $variant:ident {
    // ...
}

That's an extra VERTICAL LINE char before the variant name ident.

Now, one open question for this last part is: do we want to only support one syntax for this and update all the call-sites, or should we make it an optional syntax and only use it in new places needed? I like the first (and have done that in the last update) because stability is not a big deal for the macro (being considered an unstable feature at the moment) and keeping the implementation slim is a goal for these parts.

What do you think, @CAD97?

@behnam
Copy link
Member Author

behnam commented Oct 2, 2017

Here's the RFC, btw: rust-lang/rfcs#1745

behnam added 2 commits October 2, 2017 01:16
Just a simple search and replace to improve readability of internal
implementations, specially to distinguish between names for properties
themselves vs. names for property values.
Following the guideline for naming, plus not using a keyword.
@behnam
Copy link
Member Author

behnam commented Oct 2, 2017

Okay, 287d950 should address all the issues here. Any other last comments before we land?

Copy link
Collaborator

@CAD97 CAD97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, barring a few comment updates.

name::generate(&clean_dir("unic/ucd/name/tables"));
normal::generate(&clean_dir("unic/ucd/normal/tables"));
ident::generate(&clean_dir("unic/ucd/ident/tables"));
if false {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this left over from debugging?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Thanks for catching it. Will fix and bors.

@@ -26,7 +26,7 @@
/// human => "Human-Readable Property Name";
///
/// /// Exactly one attribute
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this to something like "any number of doc comments"; it's not accurate like this anymoe

@@ -23,21 +23,21 @@ char_property! {
human => "My Property";

/// Required
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't required anymore; they should be made significant or deleted.

Also update format script to be callable from any working directory.
@behnam
Copy link
Member Author

behnam commented Oct 2, 2017

bors: r+

bors bot added a commit that referenced this pull request Oct 2, 2017
166: Initial implementation of UCD Segmentation properties r=behnam a=behnam

Add UCD Segmentation source data to `/data/`, implement conversion from new files to property map data tables.

Add `unic-ucd-segment` component with initial implementation of three
main segmentation-related properties:
* `Grapheme_Cluster_Break`,
* `Word_Break`, and
* `Sentence_Break`.

Current implementation uses `char_property!()` macro for
`EnumeratedCharProperty` implementation, which only supports
`TotalCharProperty`.

Since the `Other` (abbr: `XX`) value in all these properties are notions
of non-existance of breaking property, we want to switch to
`PartialCharProperty` domain type and use `Option<enum>`. This is left
as a separate step because it needs changes to the macro.
@behnam behnam self-assigned this Oct 2, 2017
@bors
Copy link
Contributor

bors bot commented Oct 2, 2017

Build succeeded

@bors bors bot merged commit ba4fd09 into master Oct 2, 2017
@behnam behnam deleted the ucd-seg branch October 2, 2017 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: segmentation Unicode Text Segmentation C: ucd Unicode Character Database
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants