unicode: add ASCII optimization for grapheme segmenter
This helps quite a bit on text that is mostly ASCII, and probably
doesn't hurt too much in text that is mostly non-ASCII.

We should re-litigate this once regex-automata 0.3 is out. The new API
will have more knobs we can turn.
BurntSushi committed Sep 7, 2022
1 parent 635e0f6 commit 6df1c9d
Showing 1 changed file with 16 additions and 0 deletions.
16 changes: 16 additions & 0 deletions src/unicode/grapheme.rs
@@ -195,6 +195,22 @@ impl<'a> DoubleEndedIterator for GraphemeIndices<'a> {
pub fn decode_grapheme(bs: &[u8]) -> (&str, usize) {
    if bs.is_empty() {
        ("", 0)
    } else if bs.len() >= 2
        && bs[0].is_ascii()
        && bs[1].is_ascii()
        && !bs[0].is_ascii_whitespace()
    {
        // FIXME: It is somewhat sad that we have to special case this, but it
        // leads to a significant speed up in predominantly ASCII text. The
        // issue here is that the DFA has a bit of overhead, and running it for
        // every byte in mostly ASCII text results in a bit of a slowdown. We
        // should re-litigate this once regex-automata 0.3 is out, but it might
        // be hard to avoid the special case. A DFA is always going to at least
        // require some memory access.

        // Safe because all ASCII bytes are valid UTF-8.
        let grapheme = unsafe { bs[..1].to_str_unchecked() };
        (grapheme, 1)
    } else if let Some(end) = GRAPHEME_BREAK_FWD.find(bs) {
        // Safe because a match can only occur for valid UTF-8.
        let grapheme = unsafe { bs[..end].to_str_unchecked() };
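
The fast-path condition can be looked at in isolation. The following is a minimal sketch, not code from this commit: `ascii_fast_path` is a hypothetical helper name, and it uses the safe `std::str::from_utf8` instead of the crate's `to_str_unchecked`. It shows which inputs take the one-byte shortcut and which fall through to the DFA. The whitespace check appears to exist so that "\r\n" is never split (the Unicode grapheme rules keep CR LF together), and the check on the second byte ensures the next character is not a non-ASCII character, such as a combining mark, that could extend the cluster.

```rust
/// Hypothetical helper for illustration only (not part of the library).
/// Returns the first byte as a one-byte grapheme when the ASCII shortcut
/// applies, or `None` when the caller should fall back to the full segmenter.
fn ascii_fast_path(bs: &[u8]) -> Option<(&str, usize)> {
    if bs.len() >= 2
        && bs[0].is_ascii()
        && bs[1].is_ascii()
        && !bs[0].is_ascii_whitespace()
    {
        // A single ASCII byte is always valid UTF-8, so this conversion
        // cannot fail; the safe API is used here to avoid `unsafe`.
        std::str::from_utf8(&bs[..1]).ok().map(|s| (s, 1))
    } else {
        // Too short, non-ASCII nearby, or leading whitespace (which
        // includes '\r'): leave the decision to the slower path.
        None
    }
}
```

Under these assumptions, `b"ab"` takes the shortcut and yields `("a", 1)`, while `b"\r\n"`, a lone `b"a"`, and `"a\u{0301}"` (an ASCII letter followed by a combining accent) all return `None` and would be handled by the DFA. The check is conservative: input such as `b" a"` also falls through, even though a break after the space would be correct.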