xid_start/xid_continue lexer classes and the Unicode subsystem refactoring (#16)

- `$xid_start` and `$xid_continue` Lexer regex classes added to address an issue with Unicode identifier parsing.
- Support for classes with combined Unicode properties has been introduced. Users can now write combined classes using the `${...}` syntax in the Token macro rules: for example, `${alpha | num}` matches any alphabetic or numeric character.
- The choice between individual Unicode classes is now forbidden. Programmers can no longer write `$alpha | $num` expressions (but they can write `${alpha | num}`). This syntax was allowed in the previous version, but it didn't work properly because the corresponding classes intersected in their code-point subsets. In future versions, I will consider partially relaxing this restriction.
- The behavior of `$alpha` has been fixed in this pull request. Previously, it was interpreted as `${upper | lower}`, which does not fit the UCD specification.
- The `lexis::Char` and `lexis::CharProperties` types have been introduced in the main crate. These types allow users to test Unicode properties of characters based on UCD data. These changes also make it easier to introduce new lexer classes into the Token macro regex syntax.
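
The overlap mentioned in the notes above can be seen with the standard library alone. The following is a minimal sketch, not code from this commit: it assumes that the `char::is_*` predicates from `std` agree with the crate's UCD-based classes for the sample characters, and the helper names (`in_alpha_and_num`, `alpha_without_case`, `alpha_or_num`) are hypothetical.

```rust
// Illustration only. The std `char` predicates stand in for the crate's
// `$alpha`, `$num`, `$upper` and `$lower` classes (an assumption), and all
// helper names below are hypothetical.

/// True if `ch` belongs to both the alphabetic and the numeric set,
/// i.e. a choice like `$alpha | $num` would be ambiguous for it.
fn in_alpha_and_num(ch: char) -> bool {
    ch.is_alphabetic() && ch.is_numeric()
}

/// True if `ch` is alphabetic but neither uppercase nor lowercase,
/// i.e. a character that `${upper | lower}` would miss.
fn alpha_without_case(ch: char) -> bool {
    ch.is_alphabetic() && !ch.is_uppercase() && !ch.is_lowercase()
}

/// A model of the combined class `${alpha | num}`: a single set that is
/// the union of both properties, so no ambiguity arises.
fn alpha_or_num(ch: char) -> bool {
    ch.is_alphabetic() || ch.is_numeric()
}

fn main() {
    // U+2167 ROMAN NUMERAL EIGHT has general category Nl, which is both
    // numeric and part of the Alphabetic derived property.
    assert!(in_alpha_and_num('\u{2167}'));
    assert!(!in_alpha_and_num('a'));
    assert!(!in_alpha_and_num('7'));

    // U+4E2D is a caseless letter: alphabetic, but not upper or lower.
    assert!(alpha_without_case('中'));
    assert!(!alpha_without_case('a'));

    // The union class accepts all of the samples above.
    for ch in ['\u{2167}', 'a', '7', '中'] {
        assert!(alpha_or_num(ch));
    }
}
```

Characters such as U+2167 show why a branch between two separate property classes cannot be resolved unambiguously, while the caseless letter shows why treating `$alpha` as `${upper | lower}` was too narrow.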
Eliah-Lakhin authored Jul 4, 2024
1 parent cc36876 commit e909fbb
Showing 15 changed files with 3,733 additions and 266 deletions.
28 changes: 14 additions & 14 deletions work/crates/derive/src/lib.rs
@@ -64,15 +64,13 @@
//!
//! Copyright (c) 2024 Ilya Lakhin (Илья Александрович Лахин). All rights reserved.
extern crate core;
extern crate proc_macro;
#[macro_use]
extern crate quote;

#[macro_use]
extern crate syn;

extern crate core;
extern crate proc_macro;

use std::str::FromStr;

use proc_macro2::TokenStream;
@@ -295,20 +293,22 @@ mod utils;
/// The inverted version of the previous operator that matches any character
/// outside of the specified set.
///
/// - Uppercase character: `$upper`. Match any Unicode uppercase character
/// as specified in the [char::is_uppercase] function.
/// - Any Unicode uppercase character: `$upper`.
///
/// - Any Unicode lowercase character: `$lower`.
///
/// - Any Unicode numeric character: `$num`.
///
/// - Any Unicode whitespace character: `$space`.
///
/// - Lowercase character: `$lower`. Match any Unicode lowercase character
/// as specified in the [char::is_lowercase] function.
/// - Any Unicode alphabetic character: `$alpha`.
///
/// - Numeric character: `$num`. Match any Unicode numeric character
/// as specified in the [char::is_numeric] function.
/// - Any Unicode identifier's start character: `$xid_start`.
///
/// - Whitespace character: `$span`. Match any Unicode whitespace character
/// as specified in the [char::is_whitespace] function.
/// - Any Unicode identifier's continuation character: `$xid_continue`.
///
/// - Alphabetic character: `$alpha`. Match any Unicode alphabetic character
/// as specified in the [char::is_alphabetic] function.
/// - A class of the character property combinations: `${alpha | num | space}`.
/// The property names can be any combination of the names listed above.
///
/// - A concatenation of the rules: `<expr1> & <expr2>` or just `<expr1> <expr2>`.
/// Matches `<expr1>`, then matches `<expr2>`. The concatenation expression
67 changes: 67 additions & 0 deletions work/crates/derive/src/token/automata.rs
@@ -34,6 +34,7 @@

use std::fmt::{Display, Formatter};

use proc_macro2::Span;
use syn::Result;

use crate::{
@@ -165,12 +166,78 @@ impl AutomataImpl for TokenAutomata {

        Ok(products)
    }

    fn check_property_conflicts(&self, span: Span) -> Result<()> {
        for (_, outgoing) in self.transitions().view() {
            let mut other = false;
            let mut props = Vec::new();

            for (through, to) in outgoing {
                let Terminal::Class(class) = through else {
                    continue;
                };

                match class {
                    Class::Char(_) => continue,
                    Class::Props(through) => {
                        props.push((through, to));
                    }
                    Class::Other => {
                        other = true;
                    }
                }

                if other {
                    if let Some((through, _)) = props.first() {
                        return Err(error!(
                            span,
                            "Char properties choice ambiguity.\n\
                            Choice branching in form of \"{through} | .\" or \
                            \"{through} | ^[...]\" is forbidden.",
                        ));
                    }
                }

                if props.len() > 1 {
                    let (first_props, first_state) = &props[0];
                    let (second_props, second_state) = &props[1];

                    return match first_state == second_state {
                        true => {
                            let union = first_props.union(**second_props);

                            Err(error!(
                                span,
                                "Char properties choice ambiguity.\n\
                                Choice branching in form of \"{first_props} | \
                                {second_props}\" is forbidden.\n\
                                Consider introducing union property class \
                                instead: {union}.",
                            ))
                        }

                        false => Err(error!(
                            span,
                            "Char properties choice ambiguity.\n\
                            Choice branching between two distinct property \
                            classes ({first_props} and {second_props}) \
                            is forbidden.",
                        )),
                    };
                }
            }
        }

        Ok(())
    }
}

pub(super) trait AutomataImpl {
    fn merge(&mut self, scope: &mut Scope, variants: &Variants) -> Result<()>;

    fn filter_out(&mut self, variants: &Variants) -> Result<ProductMap>;

    fn check_property_conflicts(&self, span: Span) -> Result<()>;
}

pub(super) struct Scope {