rewrite the regex-syntax crate #452

Merged on Mar 8, 2018 (10 commits)
34 changes: 34 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,37 @@
TBD
===
This release includes a ground-up rewrite of the regex-syntax crate, which has
been in development for over a year.

New features:

* Error messages for invalid regexes have been greatly improved. You get these
automatically; you don't need to do anything. In addition to better
formatting, error messages will now explicitly call out the use of look
around. When regex 1.0 is released, this will happen for backreferences as
well.
* Full support for intersection, difference and symmetric difference of
character classes. These can be used via the `&&`, `--` and `~~` binary
operators within classes.
* A Unicode Level 1 conformant implementation of `\p{..}` character classes.
Things like `\p{scx:Hira}`, `\p{age:3.2}` or `\p{Changes_When_Casefolded}`
now work. All property name and value aliases are supported, and properties
are selected via loose matching. e.g., `\p{Greek}` is the same as
`\p{G r E e K}`.
* A new `UNICODE.md` document has been added to this repository that
exhaustively documents support for UTS#18.
* Empty sub-expressions are now permitted in most places. That is, `()+` is
now a valid regex.
* Almost everything in regex-syntax now uses constant stack space, even when
performing analysis that requires structural induction. This reduces the risk
of a user-provided regular expression causing a stack overflow.
* [FEATURE #174](https://github.com/rust-lang/regex/issues/174):
The `Ast` type in `regex-syntax` now contains span information.
* [FEATURE #424](https://github.com/rust-lang/regex/issues/424):
Support `\u`, `\u{...}`, `\U` and `\U{...}` syntax for specifying code points
in a regular expression.


0.2.6 (2018-02-08)
==================
Bug fixes:
6 changes: 4 additions & 2 deletions Cargo.toml
@@ -18,7 +18,9 @@ travis-ci = { repository = "rust-lang/regex" }
appveyor = { repository = "rust-lang-libs/regex" }

[workspace]
members = ["bench", "regex-capi", "regex-debug", "regex-syntax"]
members = [
"bench", "regex-capi", "regex-debug", "regex-syntax",
]

[dependencies]
# For very fast prefix literal matching.
@@ -28,7 +30,7 @@ memchr = "2.0.0"
# For managing regex caches quickly across multiple threads.
thread_local = "0.3.2"
# For parsing regular expressions.
regex-syntax = { path = "regex-syntax", version = "0.4.1" }
regex-syntax = { path = "regex-syntax", version = "0.5.0" }
# For accelerating text search.
simd = { version = "0.2.1", optional = true }
# For compiling UTF-8 decoding into automata.
15 changes: 8 additions & 7 deletions README.md
@@ -15,8 +15,8 @@ by [RE2](https://github.com/google/re2).
### Documentation

[Module documentation with examples](https://docs.rs/regex).
The module documentation also include a comprehensive description of the syntax
supported.
The module documentation also includes a comprehensive description of the
syntax supported.

Documentation with examples for the various matching functions and iterators
can be found on the
@@ -192,12 +192,13 @@ assert!(matches.matched(6));
### Usage: a regular expression parser

This repository contains a crate that provides a well tested regular expression
parser and abstract syntax. It provides no facilities for compilation or
execution. This may be useful if you're implementing your own regex engine or
otherwise need to do analysis on the syntax of a regular expression. It is
otherwise not recommended for general use.
parser, abstract syntax and a high-level intermediate representation for
convenient analysis. It provides no facilities for compilation or execution.
This may be useful if you're implementing your own regex engine or otherwise
need to do analysis on the syntax of a regular expression. It is otherwise not
recommended for general use.

[Documentation for `regex-syntax` with examples](https://docs.rs/regex-syntax).
[Documentation for `regex-syntax`.](https://docs.rs/regex-syntax)

# License

250 changes: 250 additions & 0 deletions UNICODE.md
@@ -0,0 +1,250 @@
# Unicode conformance

This document describes the regex crate's conformance to Unicode's
[UTS#18](http://unicode.org/reports/tr18/)
report, which lays out 3 levels of support: Basic, Extended and Tailored.

Full support for Level 1 ("Basic Unicode Support") is provided with two
exceptions:

1. Line boundaries are not Unicode aware. Namely, only the `\n`
(`END OF LINE`) character is recognized as a line boundary.
2. The compatibility properties specified by
[RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)
are ASCII-only definitions.

Little to no support is provided for either Level 2 or Level 3. For the most
part, this is because the features are either hard to implement outright or
very difficult to implement without sacrificing performance. For example,
tackling canonical equivalence such that matching worked as one would expect
regardless of normalization form would be a significant undertaking. This is
at least partially because this regex engine is based on finite automata,
which admit less of the flexibility normally associated with backtracking
implementations.
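
To make the canonical equivalence problem concrete, here is a minimal sketch
using only the Rust standard library. The names `PRECOMPOSED`, `DECOMPOSED` and
`code_points` are hypothetical and exist only for this illustration. Both
strings render as "é", but they are different sequences of code points, so an
engine that matches code point by code point treats them as different text.

```rust
/// "é" as a single precomposed code point (LATIN SMALL LETTER E WITH ACUTE).
const PRECOMPOSED: &str = "\u{00E9}";

/// "é" as 'e' followed by COMBINING ACUTE ACCENT.
const DECOMPOSED: &str = "e\u{0301}";

/// Lists the code points of `s` in `U+XXXX` notation, for inspection.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}
```

Comparing `code_points(PRECOMPOSED)` with `code_points(DECOMPOSED)` shows one
code point versus two, which is why a pattern written in one normalization form
does not match text written in the other.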


## RL1.1 Hex Notation

[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)

Hex Notation refers to the ability to specify a Unicode code point in a regular
expression via its hexadecimal code point representation. This is useful in
environments that have poor Unicode font rendering or if you need to express a
code point that is not normally displayable. All forms of hexadecimal notation
are supported:

    \x7F        hex character code (exactly two digits)
    \x{10FFFF}  any hex character code corresponding to a Unicode code point
    \u007F      hex character code (exactly four digits)
    \u{7F}      any hex character code corresponding to a Unicode code point
    \U0000007F  hex character code (exactly eight digits)
    \U{7F}      any hex character code corresponding to a Unicode code point

Briefly, the `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent ways
of expressing hexadecimal code points. Any number of digits can be written
within the brackets. In contrast, `\xNN`, `\uNNNN`, `\UNNNNNNNN` are all
fixed-width variants of the same idea.

Note that when Unicode mode is disabled, any non-ASCII Unicode code point is
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when Unicode
mode is disabled. That is, the regex `\xFF` matches the Unicode code point
U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF` matches
the literal byte `\xFF`.
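
The difference between a code point and a byte can be checked with the standard
library alone. The sketch below is not the crate's parser; `parse_hex_code_point`
and `utf8_bytes` are hypothetical helpers that turn the digits of a braced
escape into a `char` and show how that `char` is encoded in UTF-8.

```rust
/// Interprets the hex digits found inside `\x{...}`, `\u{...}` or `\U{...}`.
/// Returns `None` for input that is not hex, is out of range, or is a
/// surrogate, since none of those are Unicode scalar values.
fn parse_hex_code_point(digits: &str) -> Option<char> {
    let value = u32::from_str_radix(digits, 16).ok()?;
    char::from_u32(value)
}

/// Returns the UTF-8 encoding of `c` as a vector of bytes.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

For example, `parse_hex_code_point("FF")` yields U+00FF, and `utf8_bytes` of
that code point is the two-byte sequence `C3 BF`, not the single byte `FF`.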


## RL1.2 Properties

[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)

Full support for Unicode property syntax is provided. Unicode properties
provide a convenient way to construct character classes of groups of code
points specified by Unicode. The regex crate does not support every Unicode
property, but covers a useful subset. In particular:

* [General categories](http://unicode.org/reports/tr18/#General_Category_Property)
* [Scripts and Script Extensions](http://unicode.org/reports/tr18/#Script_Property)
* [Age](http://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
[RL1.2](http://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores. Property name aliases can be found in Unicode's
[`PropertyAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
[`PropertyValueAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.
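
As a rough illustration of loose matching, the sketch below compares names
after dropping whitespace and underscores and ignoring ASCII case, which are the
three things named above. `normalize_loose` and `loosely_equal` are hypothetical
helpers; the crate's actual normalization may handle more cases than this.

```rust
/// Normalizes a property name or value for loose comparison: whitespace and
/// underscores are removed and ASCII letters are lowercased.
fn normalize_loose(name: &str) -> String {
    name.chars()
        .filter(|c| !c.is_whitespace() && *c != '_')
        .map(|c| c.to_ascii_lowercase())
        .collect()
}

/// Returns true if two names are the same under loose comparison.
fn loosely_equal(a: &str, b: &str) -> bool {
    normalize_loose(a) == normalize_loose(b)
}
```

Under this comparison, `Greek` and `G r E e K` are the same name, as are
`Changes_When_Casefolded` and `changeswhencasefolded`.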

The syntax supported is also consistent with the UTS#18 recommendation:

* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
`\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
`\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
`Script_Extensions` (or `scx` for short).
* `\p{age:3.2}` selects all code points in Unicode 3.2.
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
via `\p{alpha}` (for example).
* Single letter variants for properties with single letter abbreviations.
For example, `\p{Letter}` can be equivalently written as `\pL`.

The following is a list of all properties supported by the regex crate (starred
properties correspond to properties required by RL1.2):

* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
* `Script` \*
* `Script_Extensions` \*
* `Age`
* `ASCII_Hex_Digit`
* `Alphabetic` \*
* `Bidi_Control`
* `Case_Ignorable`
* `Cased`
* `Changes_When_Casefolded`
* `Changes_When_Casemapped`
* `Changes_When_Lowercased`
* `Changes_When_Titlecased`
* `Changes_When_Uppercased`
* `Dash`
* `Default_Ignorable_Code_Point` \*
* `Deprecated`
* `Diacritic`
* `Extender`
* `Grapheme_Base`
* `Grapheme_Extend`
* `Hex_Digit`
* `IDS_Binary_Operator`
* `IDS_Trinary_Operator`
* `ID_Continue`
* `ID_Start`
* `Join_Control`
* `Logical_Order_Exception`
* `Lowercase` \*
* `Math`
* `Noncharacter_Code_Point` \*
* `Pattern_Syntax`
* `Pattern_White_Space`
* `Prepended_Concatenation_Mark`
* `Quotation_Mark`
* `Radical`
* `Regional_Indicator`
* `Sentence_Terminal`
* `Soft_Dotted`
* `Terminal_Punctuation`
* `Unified_Ideograph`
* `Uppercase` \*
* `Variation_Selector`
* `White_Space` \*
* `XID_Continue`
* `XID_Start`


## RL1.2a Compatibility Properties

[UTS#18 RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the
[compatibility properties documented in UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided
at all). This is because it seems to be consistent with most other regular
expression engines, and in particular, because these are often referred to as
"ASCII" or "POSIX" character classes.

Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
Their traditional ASCII definition can be used by disabling Unicode. That is,
`[[:word:]]` and `(?-u)\w` are equivalent.


## RL1.3 Subtraction and Intersection

[UTS#18 RL1.3](http://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
operations on arbitrary character classes.

For example, to match all non-ASCII letters, you could use either
`[\p{Letter}--\p{Ascii}]` (difference) or `[\p{Letter}&&[^\p{Ascii}]]`
(intersecting the negation).
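
The equivalence of those two spellings can be sketched with ordinary set
operations. The code below is an illustration only: it works over a tiny
universe (U+0000 through U+00FF), represents classes as `BTreeSet<char>` rather
than ranges, and uses `char::is_alphabetic` as a stand-in for `\p{Letter}`. All
function names are hypothetical.

```rust
use std::collections::BTreeSet;

type Class = BTreeSet<char>;

/// The small universe used by this sketch: U+0000 through U+00FF.
fn universe() -> Class {
    ('\u{0}'..='\u{FF}').collect()
}

/// Stand-in for `\p{Letter}` within the universe (assumption: Alphabetic).
fn letters() -> Class {
    universe().into_iter().filter(|c| c.is_alphabetic()).collect()
}

/// Stand-in for `\p{Ascii}` within the universe.
fn ascii() -> Class {
    universe().into_iter().filter(|c| c.is_ascii()).collect()
}

/// `[^class]`: everything in the universe that is not in `class`.
fn negate(class: &Class) -> Class {
    universe().into_iter().filter(|c| !class.contains(c)).collect()
}

/// `[a&&b]`
fn intersection(a: &Class, b: &Class) -> Class {
    a.intersection(b).copied().collect()
}

/// `[a--b]`
fn difference(a: &Class, b: &Class) -> Class {
    a.difference(b).copied().collect()
}

/// `[a~~b]`: members of exactly one of the two classes.
fn symmetric_difference(a: &Class, b: &Class) -> Class {
    a.symmetric_difference(b).copied().collect()
}
```

With these definitions, `difference(&letters(), &ascii())` and
`intersection(&letters(), &negate(&ascii()))` produce the same set.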


## RL1.4 Simple Word Boundaries

[UTS#18 RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions. A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
boundary assertion corresponds to a zero-width match, where its adjacent
characters correspond to word and non-word, or non-word and word characters.

Conformance in this case chooses to define word character in the same way that
the `\w` character class is defined: a code point that is a member of one of
the following classes:

* `\p{Alphabetic}`
* `\p{Join_Control}`
* `\p{gc:Mark}`
* `\p{gc:Decimal_Number}`
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the
[prescription given in RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)
but is permissible according to
[UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
one another.

Finally, Unicode word boundaries can be disabled, which will cause ASCII word
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
if performance is important, since the implementation of Unicode word
boundaries is currently sub-optimal on non-ASCII text.
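
The ASCII-only variant is simple enough to sketch directly. The code below is
not the crate's implementation; `is_ascii_word_byte` and
`is_ascii_word_boundary` are hypothetical helpers that assume the ASCII word
characters are `[0-9A-Za-z_]` and that positions are byte offsets.

```rust
/// Returns true if `b` is an ASCII word byte: `[0-9A-Za-z_]`.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Returns true if an ASCII word boundary holds at byte offset `at`.
/// The caller must ensure `at <= haystack.len()`.
fn is_ascii_word_boundary(haystack: &[u8], at: usize) -> bool {
    // A position outside the haystack counts as a non-word character.
    let before = at > 0 && is_ascii_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_ascii_word_byte(haystack[at]);
    // A boundary exists exactly when the two sides disagree.
    before != after
}
```

In `ab cd`, the boundaries fall at offsets 0, 2, 3 and 5. A haystack made up
only of non-ASCII letters, such as `δ`, has no ASCII word boundary at all, which
is the behavioral cost of trading Unicode awareness for speed.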


## RL1.5 Simple Loose Matches

[UTS#18 RL1.5](http://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
"simple" mapping was chosen because of a key convenient property: every
"simple" mapping is a mapping from exactly one code point to exactly one other
code point. This makes case insensitive matching of character classes, for
example, straightforward to implement.

When case insensitive mode is enabled, all character classes are case folded
as well. For example, `(?i)[a]` is equivalent to `a|A`.
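
To show why a one-to-one mapping keeps class folding simple, here is a sketch
restricted to ASCII, where the mapping is just an offset of 32 between `A-Z` and
`a-z`. It is not the crate's implementation, and `ascii_case_fold` and
`full_uppercase` are hypothetical names. `full_uppercase` uses the standard
library's full case mapping to show the situation the "simple" mapping avoids:
a single code point mapping to more than one.

```rust
/// Case folds a class given as inclusive byte ranges by adding the other-case
/// counterpart of every ASCII letter, then sorting and merging the ranges.
fn ascii_case_fold(ranges: &[(u8, u8)]) -> Vec<(u8, u8)> {
    let mut out = ranges.to_vec();
    for &(start, end) in ranges {
        // The part of the range inside a-z maps onto A-Z.
        let (lo, hi) = (start.max(b'a'), end.min(b'z'));
        if lo <= hi {
            out.push((lo - 32, hi - 32));
        }
        // The part of the range inside A-Z maps onto a-z.
        let (lo, hi) = (start.max(b'A'), end.min(b'Z'));
        if lo <= hi {
            out.push((lo + 32, hi + 32));
        }
    }
    out.sort();
    // Merge ranges that overlap or touch.
    let mut merged: Vec<(u8, u8)> = Vec::new();
    for (start, end) in out {
        if let Some(last) = merged.last_mut() {
            if u16::from(start) <= u16::from(last.1) + 1 {
                last.1 = last.1.max(end);
                continue;
            }
        }
        merged.push((start, end));
    }
    merged
}

/// Full (not simple) uppercase mapping from the standard library.
fn full_uppercase(c: char) -> String {
    c.to_uppercase().collect()
}
```

Folding the class `[a]` gives the ranges for `A` and `a`. By contrast,
`full_uppercase('ß')` is the two-character string `SS`, which could not be
represented as a member of a character class.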


## RL1.6 Line Boundaries

[UTS#18 RL1.6](http://unicode.org/reports/tr18/#Line_Boundaries)

The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
character as a line boundary. This choice was made mostly for implementation
convenience, and to avoid performance cliffs that Unicode word boundaries are
subject to.

Ideally, it would be nice to at least support `\r\n` as a line boundary as
well, and in theory, this could be done efficiently.
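
The `\n`-only rule can be sketched as two small predicates over byte offsets.
These are hypothetical helpers written for this document, not the crate's code,
and they assume `at <= haystack.len()`.

```rust
/// Returns true if a line starts at `at`: the start of the haystack or the
/// position immediately after a `\n`.
fn is_line_start(haystack: &[u8], at: usize) -> bool {
    at == 0 || haystack[at - 1] == b'\n'
}

/// Returns true if a line ends at `at`: the end of the haystack or the
/// position immediately before a `\n`.
fn is_line_end(haystack: &[u8], at: usize) -> bool {
    at == haystack.len() || haystack[at] == b'\n'
}
```

For the haystack `a\r\nb`, the first line ends at offset 2, just before the
`\n`, so the `\r` at offset 1 is treated as part of the line's content rather
than as part of the line terminator.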


## RL1.7 Code Points

[UTS#18 RL1.7](http://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.

Given Rust's strong ties to UTF-8, the following guarantees are also provided:

* All matches are reported on valid UTF-8 code unit boundaries. That is, any
match range returned by the public regex API is guaranteed to successfully
slice the string that was searched.
* As a consequence of the above, it is impossible to match surrogate code points.
No support for UTF-16 is provided, so this is never necessary.
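
The slicing guarantee can be expressed with the standard library's
`str::is_char_boundary`. The helper below, `is_sliceable`, is a hypothetical
name used only to show what "valid UTF-8 code unit boundaries" means for a
match range.

```rust
/// Returns true if `start..end` can be used to slice `haystack` without
/// panicking, i.e. both offsets fall on UTF-8 character boundaries.
fn is_sliceable(haystack: &str, start: usize, end: usize) -> bool {
    start <= end
        && haystack.is_char_boundary(start)
        && haystack.is_char_boundary(end)
}
```

In the string `aδb`, the letter `δ` occupies bytes 1 and 2, so the range `1..3`
is sliceable while `1..2` is not. The guarantee above says a reported match
range is always of the first kind.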

Note that when Unicode mode is disabled, the fundamental atom of matching is
no longer a code point but a single byte, and many Unicode features are
disabled as well. For example, `(?-u)\pL` is not a valid regex, but
`\pL(?-u)\xFF` (which matches any Unicode `Letter` followed by the literal
byte `\xFF`) is.
2 changes: 1 addition & 1 deletion bench/Cargo.toml
@@ -19,7 +19,7 @@ onig = { version = "3", optional = true }
libpcre-sys = { version = "0.2", optional = true }
memmap = "0.6"
regex = { version = "0.2.0", path = "..", features = ["simd-accel"] }
regex-syntax = { version = "0.4.0", path = "../regex-syntax" }
regex-syntax = { version = "0.5.0", path = "../regex-syntax" }
serde = "1"
serde_derive = "1"
