-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Latin script: Segmenter should split camelCased words #129
Comments
@ManyTheFish Can I take this up? |
Hello @adithyaakrishna Thanks for your interest in this project 🔥 You are definitely more than welcome to open a PR for this! FYI, we prefer not assigning people to our issues because sometimes people ask to be assigned and never come back, which discourages the volunteer contributors from opening a PR to fix this issue. We are looking forward to reviewing your PR 😊 |
@adithyaakrishna you still doing this otherwise I'd try this next week |
Hey @adithyaakrishna, |
Hey @adithyaakrishna, |
160: Latin Segmenter split camelCased words r=ManyTheFish a=rashmibharambe # Pull Request ## Related issue Fixes #129 ## What does this PR do? - Latin Segmenter - to split camelCased words ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: rashmibharambe <93034034+rashmibharambe@users.noreply.github.com>
Hi, @ManyTheFish! I have a working patch for this. What's not clear to me is whether we want to keep the whole word or not. I.e. if I'd expect the former because if a user searches for Let me know what's your stance on this and I can submit a PR. |
The goal is to split "camelCase" into ["camel", "Case"] and even "snake_case" into ["snake", "_", "case"]. |
To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries. Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (Or the position between them, to be precise.) This treats most cases and avoids the common pitfalls like eg. ALL_CAPS. What is not handled though, are abbreviations within a longer word. Especially in code it's common to write eg. "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters. Closes meilisearch#129.
To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries. Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (Or the position between them, to be precise.) This treats most cases and avoids the common pitfalls like eg. ALL_CAPS. What is not handled though, are abbreviations within a longer word. Especially in code it's common to write eg. "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters. Put the logic into a separate module and expose API similar to UnicodeSegmentation's to keep the call-side in latin.rs clean and concise. Closes meilisearch#129.
To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries. Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (Or the position between them, to be precise.) This treats most cases and avoids the common pitfalls like eg. ALL_CAPS. What is not handled though, are abbreviations within a longer word. Especially in code it's common to write eg. "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters. Put the logic into a separate module and expose API similar to UnicodeSegmentation's to keep the call-side in latin.rs clean and concise. Closes meilisearch#129.
To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries. Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (Or the position between them, to be precise.) This treats most cases and avoids common pitfalls like eg. ALL_CAPS. What is not handled though, are abbreviations within a longer word. Especially in code, it's common to write eg. "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters. Put the logic into a separate module and expose API similar to UnicodeSegmentation's to keep the call-site in latin.rs clean and concise. Closes meilisearch#129.
To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries. Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (Or the position between them, to be precise.) This treats most cases and avoids common pitfalls like eg. ALL_CAPS. What is not handled though, are abbreviations within a longer word. Especially in code, it's common to write eg. "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters. Put the logic into a separate module and expose API similar to UnicodeSegmentation's to keep the call-site in latin.rs clean and concise. Closes meilisearch#129.
To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries. Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (Or the position between them, to be precise.) This treats most cases and avoids common pitfalls like eg. ALL_CAPS. Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters. Put the logic into a separate module and expose API similar to UnicodeSegmentation's to keep the call-site in latin.rs clean and concise. Caveats: - Especially in code, it's common to write abbreviations in all caps. For instance "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. - The implementation is not grapheme-cluster aware. E.g. it won't split on boundaries that are intermitted by combining characters like u{0306}. Closes meilisearch#129.
To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries. Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (Or the position between them, to be precise.) This treats most cases and avoids common pitfalls like eg. ALL_CAPS. Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters. Put the logic into a separate module and expose API similar to UnicodeSegmentation's to keep the call-site in latin.rs clean and concise. Caveats: - Especially in code, it's common to write abbreviations in all caps. For instance "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. - The implementation is not grapheme-cluster aware. E.g. it won't split on boundaries that are intermitted by combining characters like u{0306}. Closes meilisearch#129.
181: Split camelCase in Latin segmenter r=ManyTheFish a=goodhoko ## Related issue #129 ## What does this PR do? To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries. Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (The position between them, to be precise.) This treats most cases and avoids the common pitfalls like eg. ALL_CAPS. What is not handled though, are abbreviations within a longer word. Especially in code it's common to write eg. "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. Let me know if that's a blocker. Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters. Put the logic into a separate module and expose API similar to UnicodeSegmentation's to keep the call-site in latin.rs clean and concise. Fixes #129. ## Performance I benchmarked this locally on my Apple-silicon MB Air with `cargo bench` and saw some significant regressions. The `Latin/Fra` and `Latin/Eng` (in both `132` and `363` variants) regressed some 40% compared to a3eab30. (Here's the full [report](https://github.com/meilisearch/charabia/files/10559955/report.zip)) I'm not sure how big of a problem this is. There definitely are many optimization opportunities but I wanted to propose the simplest solution first and eventually iterate from there. --- PS: This is my very first rust contribution. Sorry for missing any conventions or leaving in rough edges. Co-authored-by: Jen Tak <goodhoko@gmail.com>
181: Split camelCase in Latin segmenter r=ManyTheFish a=goodhoko ## Related issue #129 ## What does this PR do? To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries. Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (The position between them, to be precise.) This treats most cases and avoids the common pitfalls like eg. ALL_CAPS. What is not handled though, are abbreviations within a longer word. Especially in code it's common to write eg. "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. Let me know if that's a blocker. Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters. Put the logic into a separate module and expose API similar to UnicodeSegmentation's to keep the call-site in latin.rs clean and concise. Fixes #129. ## Performance I benchmarked this locally on my Apple-silicon MB Air with `cargo bench` and saw some significant regressions. The `Latin/Fra` and `Latin/Eng` (in both `132` and `363` variants) regressed some 40% compared to a3eab30. (Here's the full [report](https://github.com/meilisearch/charabia/files/10559955/report.zip)) I'm not sure how big of a problem this is. There definitely are many optimization opportunities but I wanted to propose the simplest solution first and eventually iterate from there. --- PS: This is my very first rust contribution. Sorry for missing any conventions or leaving in rough edges. Co-authored-by: Jen Tak <goodhoko@gmail.com>
Today, Meilisearch is splitting
snake_case
,SCREAMING_CASE
andkebab-case
properly but doesn't splitPascalCase
norcamelCase
.drawback
Meilisearch doesn't completely support code documentation.
enhancement
Make Latin Segmenter split camelCased/PascalCase words:
"camelCase"
->["camel", "Case"]
"PascalCase"
->["Pascal", "Case"]
"IJsland"
->["IJsland"]
(Language trap)"CASE"
->["CASE"]
(another trap)Files expected to be modified
The text was updated successfully, but these errors were encountered: