forked from meilisearch/charabia
Commit
To improve recall and to be consistent with the snake_case and kebab-case splitting that is already in place, make the Latin Segmenter split words on camelCase boundaries.

Define a camelCase boundary as the position between a lowercase letter and the uppercase letter that directly follows it. This handles most cases and avoids common pitfalls such as ALL_CAPS. What is not handled, though, is an abbreviation within a longer word. Especially in code, it is common to write e.g. "meiliAPIClient"; with this implementation it is split into ["meili", "APIClient"].

Leverage the Unicode General Categories (https://en.wikipedia.org/wiki/Unicode_character_property#General_Category) and their support in the Regex crate for matching lowercase and uppercase letters.

Put the logic into a separate module and expose an API similar to UnicodeSegmentation's to keep the call site in latin.rs clean and concise.

Closes meilisearch#129.
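The boundary rule described in the commit message can be sketched with the standard library alone. This is not the commit's implementation: the function name `split_camel_case` is hypothetical, and `char::is_lowercase` / `char::is_uppercase` are swapped in for the regex classes `\p{Ll}` / `\p{Lu}`. They test the Unicode `Lowercase` / `Uppercase` derived properties, which are slightly broader than the General Categories, so treat this as an approximation.

```rust
/// Hypothetical helper (not part of the commit): splits `s` at every position
/// where a lowercase letter is directly followed by an uppercase one.
/// Assumption: `is_lowercase` / `is_uppercase` approximate `\p{Ll}` / `\p{Lu}`.
fn split_camel_case(s: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0; // byte offset where the current part begins
    let mut prev_is_lower = false;

    for (idx, ch) in s.char_indices() {
        // A boundary sits between a lowercase letter and an uppercase letter.
        if prev_is_lower && ch.is_uppercase() {
            parts.push(&s[start..idx]);
            start = idx;
        }
        prev_is_lower = ch.is_lowercase();
    }

    // The part after the last boundary (or the whole input if there was none).
    parts.push(&s[start..]);
    parts
}
```

Under these assumptions, "camelCase" splits into two parts, "ALL_CAPS" stays whole because no lowercase letter precedes an uppercase one, and "meiliAPIClient" shows the limitation noted above: the abbreviation stays attached to the word that follows it.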
Showing 4 changed files with 70 additions and 4 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
use once_cell::sync::Lazy; | ||
use regex::Regex; | ||
|
||
pub(crate) trait CamelCaseSegmentation { | ||
/// Returns an iterator over substrings of `self` separated on camelCase boundaries. | ||
/// For instance, "camelCase" is split into ["camel", "Case"]. | ||
/// A camelCase boundary is the position between a lowercase letter and the uppercase letter directly following it, | ||
/// where lowercase and uppercase letters are defined by the Unicode General Categories `Ll` and `Lu`. | ||
fn split_camel_case_bounds(&self) -> CamelCaseParts<'_>; | ||
} | ||
|
||
pub(crate) struct CamelCaseParts<'t> { | ||
state: State<'t>, | ||
} | ||
|
||
enum State<'t> { | ||
InProgress { remainder: &'t str }, | ||
Exhausted, | ||
} | ||
|
||
impl CamelCaseSegmentation for str { | ||
fn split_camel_case_bounds(&self) -> CamelCaseParts<'_> { | ||
CamelCaseParts { state: State::InProgress { remainder: self } } | ||
} | ||
} | ||
|
||
/// Matches a lower-case letter followed by an upper-case one and captures | ||
/// the boundary between them with a group named "boundary". | ||
static CAMEL_CASE_BOUNDARY_REGEX: Lazy<Regex> = | ||
Lazy::new(|| Regex::new(r"\p{Ll}(?P<boundary>)\p{Lu}").unwrap()); | ||
|
||
impl<'t> Iterator for CamelCaseParts<'t> { | ||
type Item = &'t str; | ||
|
||
fn next(&mut self) -> Option<Self::Item> { | ||
match self.state { | ||
State::Exhausted => None, | ||
State::InProgress { remainder } => { | ||
match CAMEL_CASE_BOUNDARY_REGEX.captures(remainder) { | ||
None => { | ||
// All boundaries processed. Mark `self` as exhausted. | ||
self.state = State::Exhausted; | ||
// But don't forget to yield the part of the string remaining after the last boundary. | ||
Some(remainder) | ||
} | ||
Some(captures) => { | ||
// The regex always captures this (empty) group when it matches, so this `unwrap` cannot panic. | ||
let boundary = captures.name("boundary").unwrap().start(); | ||
self.state = State::InProgress { remainder: &remainder[boundary..] }; | ||
Some(&remainder[..boundary]) | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} |
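The diff above pairs an extension trait on `str` with a lazy iterator. A minimal sketch of how such an API might compose at a call site follows. Every name in it (`SplitCamelCase`, `CamelParts`, `first_boundary`, `tokenize`) is hypothetical, the contents of latin.rs are not shown here, and the boundary search uses `char::is_lowercase` / `char::is_uppercase` rather than the commit's regex so that the example needs no external crates.

```rust
/// Hypothetical stand-in for the commit's `CamelCaseSegmentation` trait.
trait SplitCamelCase {
    fn camel_case_parts(&self) -> CamelParts<'_>;
}

/// Lazy iterator over parts; `None` plays the role of `State::Exhausted`.
struct CamelParts<'t> {
    remainder: Option<&'t str>,
}

impl SplitCamelCase for str {
    fn camel_case_parts(&self) -> CamelParts<'_> {
        CamelParts { remainder: Some(self) }
    }
}

/// Byte offset of the first lowercase-to-uppercase boundary, if any.
/// Assumption: std's case predicates approximate `\p{Ll}` / `\p{Lu}`.
fn first_boundary(s: &str) -> Option<usize> {
    let mut prev_is_lower = false;
    for (idx, ch) in s.char_indices() {
        if prev_is_lower && ch.is_uppercase() {
            return Some(idx);
        }
        prev_is_lower = ch.is_lowercase();
    }
    None
}

impl<'t> Iterator for CamelParts<'t> {
    type Item = &'t str;

    fn next(&mut self) -> Option<Self::Item> {
        let remainder = self.remainder?;
        match first_boundary(remainder) {
            Some(boundary) => {
                // Yield the part before the boundary, keep the rest for later.
                let (head, tail) = remainder.split_at(boundary);
                self.remainder = Some(tail);
                Some(head)
            }
            None => {
                // No boundary left: yield the tail once, then stop.
                self.remainder = None;
                Some(remainder)
            }
        }
    }
}

/// Hypothetical call site: split on whitespace first, then on camelCase.
fn tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace()
        .flat_map(|word| word.camel_case_parts())
        .collect()
}
```

Because the iterator borrows from the input and yields `&str` slices, it can be chained with `flat_map` without allocating per part, which is one plausible reason for modelling the API on an iterator-returning trait method.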